Assembly Language to C

If you are looking for an automated software system to take your body of assembly language and rewrite it into C equivalents, keep looking. I do not think that anyone is willing to spend the time to write that. The result depends too much on the idiosyncrasies of the programmer who wrote the original assembly code. Working every "programming trick" into a general tool is not something you could easily code.

But, if you have a large investment in Assembler source and need to move that onto another processor, what I've done may interest you.


Some Background Info

I have a customer of long standing who never made the transition to writing their products' software in a high-level language (C). Despite numerous opportunities and offers of help, they persisted in expanding their body of assembly language, all in the interest of "I don't have time for this right now"...

This code has been in use for nearly 30 years. It started out on a National Semiconductor 8070 processor. When the 8070 went EOL, they bought a goodly supply of them to continue their business. Eventually, the customer's supply of 8070 CPUs dwindled and I moved them over to the 80188 processor.

The move to the 80188 was done circa 1992, when perl and other languages were unknown to me. I ended up writing what was effectively a YACC, which transcribed each 8070 opcode into an equivalent series of 80188 opcodes. This wasn't too difficult, as the 8070 resembled a 6502 with a 16-bit accumulator.

Fast forward to the present: they have now run out of code space on the present CPU board and have nowhere to go. The customer has no choice but to move from the 80188 platform to something else. Now, you would probably snicker and say something to the effect that if they had taken the time (pain) to move to the C language, this transition would be a fairly quick one. That is arguable, given the non-modular way in which this code is written.

But the point of this article is to describe a solution that I found for moving them off a restricted environment into something larger. Would something larger offer them a means of correcting the error of their ways? Probably not; it would give them a way to "keep going" into more assembler. Heh.


How It's Done

With perl, of course. Also, emulation of the 8070 processor and the hardware environment.

This bugged me for a long time, as I knew this day would come: that they would transition the assembly source rather than rewrite it. I've estimated that a rewrite of this code could take 5 man-years.

A number of issues prevented me from using the YACC approach again; mainly it had to do with the hardware of the system itself. I also rapidly understood that the program data was a huge issue! More on that later...

Here is an 8070 opcode to place an immediate value into the accumulator:
LD A,#0
The 80188 equivalent would expand to:
MOV AL,#0
We won't go into what this means:
LD A,5,@P2

OK, so how do we make 'LD A,#0' work with a C compiler? Of all the solutions I looked at, perl was the one! I've come to love programming in perl; it is a wonderful way to manipulate text (regex). The solution to this conversion problem rested in using perl to build a "preprocessor" to munge the assembly statements (and data!) into C expressions. So, in the example of the 8070 register load, the C equivalent is:

#define LD_A_IMMED(x) r.b.A = (byte)(x);

As simple as it looks, it is not. I've broken the conversion process out into three major steps (with a bunch of supporting scripts):

1. CleanseCode - clean up the 8070 source of assembler statements and other useless statements.

2. TokenizeFile - Take each line and describe it fully in a series of "tokens". For example, encountering a label becomes "LABEL`NAME_OF_LABEL", an end of line becomes "ENDLINE`" and so on.

3. GenerateCode - Take all those tokens and create the C source file.
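
To give a feel for the TokenizeFile step, here is a rough sketch of turning one source line into backtick-delimited tokens. The real scripts are perl; this is C only to stay in one language with the macro shown earlier. Only the label and end-of-line token kinds are described above. The OPCODE and OPERAND kinds, the function name tokenize_line, the trailing colon on labels, and the ';' comment character are all assumptions made for the illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Append "KIND`text\n" to out, never writing past cap bytes. */
static void emit(char *out, size_t cap, const char *kind,
                 const char *text, size_t len)
{
    size_t used = strlen(out);
    if (used < cap)
        snprintf(out + used, cap - used, "%s`%.*s\n", kind, (int)len, text);
}

/* Tokenize one assembler line, e.g. "LOOP: LD A,#0 ; comment". */
void tokenize_line(const char *line, char *out, size_t cap)
{
    const char *p = line;
    const char *end = strchr(line, ';');   /* assumed comment character */
    int seen_opcode = 0;

    if (cap == 0)
        return;
    out[0] = '\0';
    if (end == NULL)
        end = line + strlen(line);

    while (p < end) {
        const char *start;

        /* Skip blanks and the comma between operands. */
        while (p < end && (isspace((unsigned char)*p) || *p == ','))
            p++;
        if (p >= end)
            break;
        start = p;
        while (p < end && !isspace((unsigned char)*p) && *p != ',')
            p++;

        if (!seen_opcode && p[-1] == ':') {
            /* Assumed label syntax: a leading word ending in a colon. */
            emit(out, cap, "LABEL", start, (size_t)(p - start) - 1);
        } else if (!seen_opcode) {
            emit(out, cap, "OPCODE", start, (size_t)(p - start));
            seen_opcode = 1;
        } else {
            emit(out, cap, "OPERAND", start, (size_t)(p - start));
        }
    }
    emit(out, cap, "ENDLINE", "", 0);
}
```

A later pass can then walk the tokens one line at a time and pick the matching C macro without ever looking at the raw assembler text again.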

The idea is simple, really: create a virtual 8070 CPU and have C macros alter the contents of the virtual CPU registers to mimic what the real 8070 processor would do. This is possible because the 8070 has only 129 permutations of its instruction set! This processor was one of the very early ones.
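
A minimal sketch of what such a virtual register file and a couple of macros might look like follows. Only the r.b.A access path and the LD_A_IMMED name come from the macro shown earlier. The union layout, the E and EA members, ADD_A_IMMED, and run_fragment are assumptions for illustration, and the macros are written as expressions without a trailing semicolon. Status flags are ignored entirely, which a real emulation could not do.

```c
#include <stdint.h>

typedef uint8_t  byte;
typedef uint16_t word;

/* Hypothetical register file. Only the r.b.A access path appears in the
   text above; the E and EA names and the union layout are assumptions.
   On a little-endian host, A overlays the low byte of EA. */
typedef union {
    struct { byte A, E; } b;   /* 8-bit views */
    struct { word EA; } w;     /* 16-bit view */
} cpu_regs;

static cpu_regs r;

/* Expression-style macros, so generated code supplies its own semicolon. */
#define LD_A_IMMED(x)   (r.b.A = (byte)(x))
#define ADD_A_IMMED(x)  (r.b.A = (byte)(r.b.A + (x)))   /* flags ignored */

/* What the generated C for two assembly lines might look like. */
byte run_fragment(void)
{
    LD_A_IMMED(0xFF);    /* load an immediate into A          */
    ADD_A_IMMED(2);      /* add 2; wraps to 0x01 in 8 bits    */
    return r.b.A;
}
```

The casts to byte are what keep the arithmetic at the register's real width, so the C code overflows and wraps the same way the 8-bit register would.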


Data Is Key

It quickly became apparent that the program data was going to be the single most difficult aspect of this project, in two ways. First, the data was position dependent: not only was it byte aligned (packed), but some programming "tricks" relied upon data following other data. Second, this system used a memory banking scheme to get past the 64K limit on RAM access.

Solving the problem of data alignment was not that difficult. The GNU C compiler allows attributes to be assigned to data, and these are honored at link time. So the attribute '__attribute__((aligned(1)))' neatly compressed all the empty space out of the data structures and concatenated each group of data to the previous.

The memory bank issue was a huge one to resolve. There was no easy solution for this, as you could not just unwind the memory banks into contiguous memory space. The 8070 was a processor with a 16-bit address register; it could only access 64K. Simply expanding the virtual address register to more bits would not work, as the assembly language computed the addresses of data elements at various places in the code. Expanding the accumulator math from 16-bit results to more bits would have an adverse impact on other aspects of the program!
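
As one illustration of how wider math could break things, consider code that steps an address backwards by adding a two's complement offset. The article does not say which idioms the code actually relies on, so treat this as an assumed example; addr16 and addr32 are hypothetical names.

```c
#include <stdint.h>

/* Address math as the 8070 sees it: the sum is truncated to 16 bits. */
uint16_t addr16(uint16_t base, uint16_t offset)
{
    return (uint16_t)(base + offset);
}

/* The same math in a widened register: the carry out of bit 15 survives. */
uint32_t addr32(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

Adding 0xFFFE to 0x8000 gives 0x7FFE in 16 bits, which is "base minus 2". In 32 bits the same addition gives 0x17FFE, which points somewhere else entirely.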

The solution was to keep the memory banks. This was accomplished by using the OVERLAY keyword in the linker script and assigning a section attribute to each data item. So, data in the lower 32K of RAM would have an attribute of '__attribute__((section(".ram_07"),aligned(1)))', data in one of the banked memory slices would have '__attribute__((section(".ram_U3S5"),aligned(1)))', and the linker would sort all this out.
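
On the C side, the generated data declarations might look something like the sketch below. The two section names are the ones quoted above; the IN_SECTION shorthand and the variable names are made up for the example. Without the matching linker script these are just ordinary named sections, so this shows only the declaration half of the scheme, and it assumes a GNU-style toolchain targeting ELF.

```c
#include <stdint.h>

/* Hypothetical shorthand for the attribute pair quoted above. */
#define IN_SECTION(name) __attribute__((section(name), aligned(1)))

/* Data that lives in the unbanked lower 32K (names are invented). */
uint8_t  flags      IN_SECTION(".ram_07") = 0x01;
uint16_t tick_count IN_SECTION(".ram_07") = 0x1234;

/* Data that lives in one banked slice: no initial values, just space. */
uint8_t  rx_queue[64] IN_SECTION(".ram_U3S5");
```

Because every item names its section explicitly, the generator script decides where each piece of data lands, and the linker script only has to say where each section goes.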


Can You Say Humongous?

Heh, the resulting binary executable was well over 100Meg! This was the result of all those overlay sections plus their init sections. Aside from the overlay space the linker laid down for these memory banks, there was RAM set aside for the banking manager. The banking manager was a C function that would swap RAM out of the slice windows (where the 8070 saw a memory bank), put the old RAM values aside, then bring a new bank of memory into the window.
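
A stripped-down sketch of that kind of swap is shown here, for a single window. The window address, slice size, bank count, and all names are hypothetical; the real manager handled more than one window and used 384K of RAM.

```c
#include <stdint.h>
#include <string.h>

#define WINDOW_BASE 0x8000u   /* hypothetical start of the slice window */
#define WINDOW_SIZE 0x2000u   /* hypothetical slice size (8K)           */
#define NUM_BANKS   4u        /* hypothetical bank count                */

static uint8_t mem[0x10000];                    /* what the virtual 8070 sees */
static uint8_t store[NUM_BANKS][WINDOW_SIZE];   /* banks parked outside it    */
static unsigned current_bank;                   /* bank now in the window     */

/* Park the window's contents, then bring the requested bank in. */
void select_bank(unsigned bank)
{
    if (bank >= NUM_BANKS || bank == current_bank)
        return;
    memcpy(store[current_bank], &mem[WINDOW_BASE], WINDOW_SIZE);
    memcpy(&mem[WINDOW_BASE], store[bank], WINDOW_SIZE);
    current_bank = bank;
}
```

The emulated code keeps using 16-bit addresses inside the window and never knows that the bytes behind them were copied in from somewhere else.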

Nothing could be done about the 384K of RAM that was part of the banking manager; that usage had been planned for. But the excess produced by the overlays in the linker had to be resolved. The solution was simple. The overlay initializations were not needed, as no values were given inside those memory slices. This was just available RAM used for queues and stacks. So, a shell script was run after the binary executable was linked that would use objcopy to remove those unnecessary overlay init sections.

The executable dropped dramatically down to less than 10Meg in size. And, when stripped of debugging symbols, it was just over 5Meg. Keep in mind, there are over 130 thousand lines of assembly source. The 80188 code is about 500K for the executable. So, 5Meg of emulation binary is acceptable.