Tuesday, January 31, 2017

Lab 4 - Compiled C Lab

This assignment deals with the experimentation of some of the many argument options available when compiling C code with GCC. A simple "Hello World" program will be used to test the different options. In order to analyze the binary executable, the following arguments will be used:

    -g              #enable debugging information
    -O0             #(Letter 'o' followed by zero) do not optimize
    -fno-builtin    #do not use builtin function optimization


The source code will be compiled under a Linux x86_64 environment and a ELF (Executable and Linkable Format) file is the resulting binary.

For analyzing the executable in a readable way, we can use the 'objdump' and 'readelf' commands. With 'objdump' we can use an argument for showing the header information for the file (-f), display summary information of the file divided by section (-s) or disassemble the executable (-d). Having the debug flag when compiling also helps when using the option to show the C source code (--source), along with the instructions in assembly (-d implied).

$ objdump --source hello
========================


00000000004003e0 <printf@plt-0x10>:
  4003e0:       ff 35 22 0c 20 00       pushq  0x200c22(%rip)        # 601008 <_GLOBAL_OFFSET_TABLE_+0x8>
  4003e6:       ff 25 24 0c 20 00       jmpq   *0x200c24(%rip)        # 601010 <_GLOBAL_OFFSET_TABLE_+0x10>
  4003ec:       0f 1f 40 00             nopl   0x0(%rax)

00000000004003f0 <printf@plt>:
  4003f0:       ff 25 22 0c 20 00       jmpq   *0x200c22(%rip)        # 601018 <_GLOBAL_OFFSET_TABLE_+0x18>
  4003f6:       68 00 00 00 00          pushq  $0x0
  4003fb:       e9 e0 ff ff ff          jmpq   4003e0 <_init+0x18>
.
.
.
00000000004004f6 <main>:
#include <stdio.h>

int main() {
  4004f6:       55                      push   %rbp
  4004f7:       48 89 e5                mov    %rsp,%rbp
            printf("Hello World!\n");
  4004fa:       bf a0 05 40 00          mov    $0x4005a0,%edi
  4004ff:       b8 00 00 00 00          mov    $0x0,%eax
  400504:       e8 e7 fe ff ff          callq  4003f0 <printf@plt>
  400509:       b8 00 00 00 00          mov    $0x0,%eax
}
  40050e:       5d                      pop    %rbp
  40050f:       c3                      retq



On the main section, we find the start of our C code, although a lot was already done by the compiler previously when analyzing the assembly code. On the left , we can see the location on the heap memory where the program is running, and the following bytes (each represented as pairs of hexadecimal digits) indicating the instruction its and arguments. On the right side, 'objdump' conveniently disassembled each series of bytes. The left side could also be obtained by using the '-s' argument with 'object' dump.

$ objdump -s hello
===================

Contents of section .rodata:
 400590 01000200 00000000 00000000 00000000  ................
 4005a0 48656c6c 6f20576f 726c6421 0a00      Hello World!..

With this output, we can see the contents of the memory used throughout the whole program. Here we can see the "Hello World!" string on the 4005a0 address, under the read-only data section. Note that this address was not shown on the previous command, although it is manually moved to the 'edi' register to be used as a destination address. Afterwards, we notice the program probably preparing a 'sys_read' syscall from standard input, when zero is assigned to the 'eax' register.

The program then calls the 'printf' function through PLT( procedure linkage table), which basically helps link the executable with the C libraries on the system.

Now let's take a look on the header section for the file.

$ readelf -h hello
===================

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400400
  Start of program headers:          64 (bytes into file)
  Start of section headers:          8752 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         35
  Section header string table index: 32


There is some useful information here. More importantly the start of section headers number. Let's save it for later.

Performing the 'du -h hello' command, I get that it occupies 11 Kilobytes on disk.

'-static' argument

Now including the 'static' flag on the gcc command, the compiler is creating an standalone application by copying its entire library includes into the executable. This makes for a very portable application that does not depend on libraries available on the system, but also immensely increases the file size. Our simple "Hello World!" program has the whole standard C library in it, but it is using only one function of it. The executable now has almost 900Kilobytes, opposed to the previous 11K.


Using built in function optimizations

The compiler utilizes some common optimizations, by default. When we use the '-fno-builtin', we are telling the compiler not to use them. Analyzing the optimized executable, we get basically the same 'main' section. A important difference is the use of 'puts' instead of 'printf', which has formatting capabilities. Since we're only using one argument as a single string, the compiler calls for 'puts', which simply echos it to standard output.


No debugging enabled

The compiler has the option to include debugging information into the executable, so it becomes easier to analyze its assembly code. On the object dump from the original 'hello' program, it echoed the C source along with the instructions. Much of the C source is then included into the executable, therefor making it a larger file. Without the debug option, we can observe a decrease in the file size from 11K to 8.4K, as well as a decrease of section headers.

Using arguments on 'printf'

Now we're going to modify the program a little, so we can compare the changes done to it in assembly.

#include <stdio.h>

int main() {
            printf("Hello World!\n %d %d %d %d %d %d %d %d %d %d",
                                   1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

When compiled and disassembled, we have the following:

00000000004004f6 <main>:
  4004f6:       55                      push   %rbp
  4004f7:       48 89 e5                mov    %rsp,%rbp
  4004fa:       48 83 ec 08             sub    $0x8,%rsp
  4004fe:       6a 0a                   pushq  $0xa
  400500:       6a 09                   pushq  $0x9
  400502:       6a 08                   pushq  $0x8
  400504:       6a 07                   pushq  $0x7
  400506:       6a 06                   pushq  $0x6
  400508:       41 b9 05 00 00 00       mov    $0x5,%r9d
  40050e:       41 b8 04 00 00 00       mov    $0x4,%r8d
  400514:       b9 03 00 00 00          mov    $0x3,%ecx
  400519:       ba 02 00 00 00          mov    $0x2,%edx
  40051e:       be 01 00 00 00          mov    $0x1,%esi
  400523:       bf d0 05 40 00          mov    $0x4005d0,%edi
  400528:       b8 00 00 00 00          mov    $0x0,%eax
  40052d:       e8 be fe ff ff          callq  4003f0 <printf@plt>
  400532:       48 83 c4 30             add    $0x30,%rsp
  400536:       b8 00 00 00 00          mov    $0x0,%eax
  40053b:       c9                      leaveq
  40053c:       c3                      retq
  40053d:       0f 1f 00                nopl   (%rax)

Here we can see that the 'main' section continues pretty much the same logic. What is very noticeable is the assignment of the digits to the registers and stack. Up to five arguments can be assigned to the registers to be used by the 'printf' function, but if more than that is needed, it will be pushed to the stack. The arguments that are stored on stack are assigned in the opposite order than in the program so that 'printf' pops the top value first. In this case, after the values on the registers substitute the '%d' in memory, the top value would be 6.

After 'printf' is executed, there arguments values are still on the stack. 'leaveq' takes care of that by assigning the stack pointer as base pointer, so the next values written on the stack will overwrite the previous ones.

Output function

By moving 'printf' to a separate function and calling that function in 'main', the 'main' section now calls the address of a new section called 'output', and in it 'printf' is called. With the default options, the compiler would optimize the instructions so that it functions like our original C code, but still there would be a new section for 'output'.

Full Optimization

For full optimization on compilation, the compiler must receive the '-O3' option, which means to fully optimize it, even if the output executable is not stable. Compiling our original "Hello World!" program we get this:
0000000000400400 <main>:
  400400:       48 83 ec 08             sub    $0x8,%rsp
  400404:       bf b0 05 40 00          mov    $0x4005b0,%edi
  400409:       31 c0                   xor    %eax,%eax
  40040b:       e8 e0 ff ff ff          callq  4003f0 <printf@plt>
  400410:       31 c0                   xor    %eax,%eax
  400412:       48 83 c4 08             add    $0x8,%rsp
  400416:       c3                      retq
  400417:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40041e:       00 00

Compared to the original code, it seems more efficient, but also harder to read. One simple optimization made is setting 'eax' to zero by using the Exclusive OR operation with itself, instead of moving a zero value to the register.

Conclusion

The compiler offers many debugging and optimization options, although for today's programs those optimizations are pretty much done automatically. As programs get larger and more complex, there's is only so much the compiler can optimize. There is a need for optimization for programmers, as they have the context necessary to determine the common processing of a program that the compiler couldn't.

Sunday, January 29, 2017

Lab 3 - Assembler Lab

This assignment requires the use of the Assembly language to display and increment a value, using a loop. The code should exist for both, the x86_64 and Aarch64 architectures. A sample initial loop was provided in GNU Assembler (GAS) and when compiled and executed, displays the following output:

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop

The next objective is to modify the code so that this output is displayed:
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

Our group's solution is the following:

The first step was to create the address locations containing the data we want to display (Lines 45-54). For our solution, we created three values we want to display. The 'msg' value contains the string 'Loop: ' and its length is specified by " . (dot) - (minus) msg", which means, this current address minus the length of msg. The 'digit' value will be used to display the increasing digit. It is declared as an empty space, but will be changed in the code for a single digit. The 'nl' value is simply to add a newline character after each iteration. Each of these values correspond to an address in memory that will later be accessed.

Afterwards, we can analyze  the section that will print the 'msg' value, but before we get into the code, we need to understand 'syscall' a little better. Syscall is what is used to perform system calls while executing (e.g writing, reading, exiting, etc). It uses the registers rdi, rsi, rdx, rcx, r8d, r9d, in this specific order, when executing a system call specified by rax. In this assignment, only the 'sys_write' system call will be used and it has a code '1'. The rdi register is the destination index and it should point to the standard output on the console, therefor it should hold a value of 1. The rsi contains the data to be used as source when writing, so it should point to one of the message values we specified earlier. The last argument 'sys_write' needs is the length of the message, which was declared as len, len1 and len2, and can be stored on rdx. The final system call should be something similar to this:

                sys_write( 1 ,"Loop: ", len)

Of course, the assembly language isn't so simple as to just throw some arguments into a function. Each argument has to be placed in its right registry. Analyzing the lines 14 to 19, we start by allocating the length of msg to rdx. Then we assign 'msg' to rsi, so it can be used as source data. Next, we change rdi to 1, so we write to stdout on the console, and change rax to 1, so we call 'sys_write' on the pre-assembled (!!) registers. After we have all registers assigned, 'syscall' on line 19 will perform the system call and write "Loop: " once to the console.

The next block of code (21-28) has the logic for the single digit. The writing process is the same as the previous block of code. What is interesting about this block is how the digit to be written changes on each iteration of the loop. The first change we notice is on line 24. As we know from the sample loop, there is a loop counter on register r15. It starts at zero and is incremented by one at the end of each iteration. To display the values from 0 to 9, we simply need to assign each current value of the loop counter to rsi and convert it to ASCII. The 'movb' instruction moves the lower byte of r15 to rsi and, on the next line, we convert it to ASCII by adding 48 to it. We then write to the console.

So this concludes the first part of the assignment. The next is to display the numbers from 0 to 30, suppressing the leading zero. Here's my solution:

Now there are two digits to be displayed and if the leading digit is zero, it should be omitted. In order to know when 10 is reached, we'll divide the counter by ten. The quotient is used as the leading digit and the remainder is used as the following digit. The 'div' instruction stores the quotient in the rax register and the remainder on the rdx register, but for each division, rdx must be zeroed out as shown on line 21. After dividing the counter by ten, we than store the quotient on r13 and the remainder on r14, as those registers preserve their values.

As 'msg' is already printed, we now need to print the leading character, if not zero, so we use the 'cmp' instruction to compare the quotient with zero (32). As flags are set with the last instruction, we use the 'je' instruction (jump if equal) to jump to 'lsd' (least significant digit) if the leading digit is equal to zero, therefore not going through the process of writing it.

After changing 'max' on line 5, to 31, we get the following output:
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30

Now that we have the x86_64 Assembly code figured out, let's analyze the equivalent code for Aarch64:

Right on line 8 we can see the first difference. The instructions are quite different for this architecture. We can see that the register r28, the first argument, receives 'start'. One instruction quite different is the division instruction. On x86_64, it provides the quotient and the remainder on specific registers. On Aarch64, the udiv requires three arguments: a register to store the quotient, a register with the dividend and a register with the divisor. The remainder isn't provided. Instead, to get the remainder there is the need to use a separate instruction called 'msub':
            msub r0,r3,r2,r1    ->    r0 = r1 - ( r2 * r3 )

And the remainder can be calculated:
            remainder = dividend - ( quotient * divisor )

Another difference is when sending the data to be written to the 'digit' address. On the x86_64, 'movb' was used to send a single byte, but on Aarch64, 'str' was used to store a value on a place in memory, and its leading bits are zeroed out.


Debugging


While Aarch64 seemed to be more efficient and have various seemingly useful instructions, I found that x86_64 is more intuitive, but requires more manual handling. Plus I found that it had much more online resources available.

It is very hard to be efficient in assembly debugging in general without the use of a tool that allows for stepping and viewing of values during run time (e.g gdb). I had to rely on methodology a lot, and doing the steps on paper. A tool for debugging would be absolutely essential for larger projects.

Conclusion

Assembly code is the closest to the metal as it gets, unless you're willing to program in hex. It is extremely efficient for its direct communication with the CPU, but also a simple program can take pages of code and be very hard to debug. It can be very powerful and useful on certain places, but mostly embedded into low level programming. It also requires a lot of repetitive coding, as it is not portable to other architectures and there might even be different solutions that function better on each of them. Aarch64 can still get a lot of optimization for source code that was built specifically for x86_64.


Friday, January 20, 2017

Lab 2

For this assignment, two open source packages with different licenses are to be obtained, built and executed in a Linux environment.

Emacs

The legendary text editor is GPLv3 licensed and its source can be downloaded via the git repository with the following command:

git clone -b master git://git.sv.gnu.org/emacs.git

For the installation, I'll be using the 'betty' AArch64 system. Once downloaded, The first file I opened was the README file, where it stated that the INSTALL file should have the instructions necessary for installation. The INSTALL file states that the installation starts by first running the 'configure' executable, where it tries to deduce the correct values for various system-dependent variables and features. To generate the 'configure' executable, I executed 'autogen.sh'. When running './configure', it stated that I didn't have a package called 'makeinfo' with a version of 4.13 or higher, but that I could rerun the command with the argument '--without-makeinfo' and not have the info pages for Emacs. For the purpose of this article, I ran with the argument.

With the configuration done, the next step is to run 'make', to build the software executable into the 'src' directory. Once done (which took about twenty minutes), Emacs can be executed from its directory. To install it on the system, 'make install' should be run, although superuser privileges are needed.

Ncdu

This is a command line program that analyses disk usage (du) and conveniently displays it with the Ncurses (nc) library. It's a relatively small package, but can be very useful on terminal-only systems (e.g ssh connections). Ncdu is MIT licensed.

Once cloned into betty, I'm ready to start the build. The README file stated that the build should start by running the command 'autoreconf -i' when building directly from the git repository. The rest of the installation is really straight forward. Just running the commands './configure --prefix=/usr' followed by 'make', built the package executable. The build didn't take more than 10 seconds.

Afterthoughts

Building software can take a while, depending on its complexity and size. Emacs is quite a large software compared to other terminal applications (although it might even already come with graphical support out of the box) and it took quite a while for generating an executable. It is a project that has been in development for decades and it only tends to get larger, as its almost religious users keep contributing to its development.

Ncdu is a good example of a lightweight and useful utility. It is intended to do one thing and do it well. It is unlikely that it will become a large application, especially being command line focused, so its build process won't take more than a few seconds.

Lab 1

The focus of this post is to compare the development process two open source projects with different licenses. More specifically, the way each community accepts code into their so beloved project.

 Atom

Built on Electron, Atom is a fully hackable text editor with focus on development and customization. It is a strong alternative to the (in)famous Sublime Text, where customization can sometimes get in the way of productivity and the free version will pop a window once in a while to offer the full version. Atom is open source and for anyone. Atom and its core packages are MIT licensed and hosted at Github.

The MIT license is the shortest and most liberal of the open source licenses. Anyone can modify it, as long as the project creator is credited, and there are no liabilities regarding the quality of the software. That means that it is really straight forward to contribute. All that is necessary is the effort to improve the project.

The project creators use Git as the version control. To contribute, the new assets must be sent by performing a pull request. The request is then reviewed according to some guidelines and after thorough review, the request is accepted into the core project. Although there is a detailed review, the process is relatively fast. Here is an example of a pull request for a bug fix, which got through to the core code after three days.

Mozilla

The giant Mozilla is the other open source project I've picked for this assignment. It is truly makes an example of open source development done well and really, it's what saved it from not existing these days (formerly Netscape).

It uses the MPL license. According to Mozilla "the MPL fills a useful space in the spectrum of free and open source software licenses, sitting between the Apache license, which does not require modifications to be shared, and the GNU family of licenses, which requires modifications to be shared under a much broader set of circumstances than the MPL."

Mozilla is a huge deal in the open source world, and so the contribution process is longer and stricter than most. That said, the people at Mozilla want your contribution, so they help by walking you through the process in a detailed manner. The patches take more steps to get through and there is a lot of feedback and support from other developers. This patch had took about 10 months to get through since its entry.