Disassemblers play a pivotal role in reverse engineering. To reverse engineer a binary successfully, it helps to understand how disassembly works and where it can fall short, so that you are prepared for situations where your disassembler fails.
A Basic Disassembly Algorithm
This post walks through a basic disassembly process, one that takes machine language as input and produces assembly language as output, in order to expose the challenges, assumptions and compromises that underlie automated disassembly.
Step 1
You first need to identify a region of code to disassemble. This is not always easy, because instructions may be mixed with data and it's important to distinguish between the two.
The most common case is disassembling an executable file that conforms to a common format, such as PE (Portable Executable) on Windows or ELF (Executable and Linking Format) on Unix.
These formats generally provide mechanisms for locating the sections of the file that contain code and the entry points into that code.
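As a rough illustration of how a file format exposes this information, the sketch below reads the fixed-size header of a 64-bit little-endian ELF image and pulls out the entry point and the location of the section header table. The function name `read_elf64_header` and the returned dictionary keys are made up for this example, and the sketch assumes ELF64 little-endian only; PE files and other ELF variants would need their own layouts.

```python
import struct

# Layout of the ELF64 file header (64 bytes), little-endian, no padding.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def read_elf64_header(data: bytes) -> dict:
    """Return the entry point and section-table location of an ELF64 image."""
    if len(data) < ELF64_HEADER.size or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[4] is the file class (2 = 64-bit), e_ident[5] the byte order (1 = little).
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    (_ident, _type, machine, _version, entry, _phoff, shoff, _flags,
     _ehsize, _phentsize, _phnum, shentsize, shnum, _shstrndx) = \
        ELF64_HEADER.unpack_from(data)
    return {
        "machine": machine,      # target architecture code (e_machine)
        "entry": entry,          # virtual address where execution starts
        "shoff": shoff,          # file offset of the section header table
        "shentsize": shentsize,  # size of one section header entry
        "shnum": shnum,          # number of section header entries
    }
```

With a real file you would pass in the bytes read from disk, then walk the section header table starting at `shoff` to find the sections that are marked as executable.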
Step 2
Given the address of the first instruction, we read the value stored at that address (or file offset) and perform a table lookup to match the binary opcode value to its assembly language mnemonic. Depending on the instruction set, this may be trivial, or it may involve additional work such as interpreting prefixes that modify an instruction's behavior and determining the instruction's operands.
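To make the table lookup concrete, here is a minimal sketch that decodes a handful of 32-bit x86 encodings: three one-byte instructions, `push`/`pop` with the register encoded in the low three bits of the opcode, and `mov reg, imm`. It also handles the `0x66` operand-size prefix to show how a prefix changes the operand width. The function name `decode_one` and the shape of its return value are assumptions of this example, and the table is deliberately far from complete.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
NO_OPERAND = {0x90: "nop", 0xC3: "ret", 0xCC: "int3"}


def decode_one(code: bytes, offset: int = 0):
    """Decode one instruction at `offset`; return (length, mnemonic, operands)."""
    pos = offset
    wide16 = False
    if code[pos] == 0x66:            # operand-size override prefix
        wide16 = True
        pos += 1
    opcode = code[pos]
    pos += 1
    regs = REG16 if wide16 else REG32

    if opcode in NO_OPERAND and not wide16:
        return pos - offset, NO_OPERAND[opcode], ""
    if 0x50 <= opcode <= 0x57:       # push reg, register in the low 3 bits
        return pos - offset, "push", regs[opcode & 7]
    if 0x58 <= opcode <= 0x5F:       # pop reg
        return pos - offset, "pop", regs[opcode & 7]
    if 0xB8 <= opcode <= 0xBF:       # mov reg, imm16 or imm32
        size, fmt = (2, "<H") if wide16 else (4, "<I")
        (imm,) = struct.unpack_from(fmt, code, pos)
        return pos + size - offset, "mov", f"{regs[opcode & 7]}, {imm:#x}"
    raise ValueError(f"opcode {opcode:#04x} is outside this sketch's table")
```

For example, `decode_one(bytes.fromhex("b878563412"))` decodes five bytes as `mov eax, 0x12345678`, while the same opcode behind a `0x66` prefix (`66 b8 34 12`) decodes four bytes as `mov ax, 0x1234`.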
Step 3
After an instruction has been fetched and operands decoded, its assembly language equivalent is formatted and output.
Step 4
Following the output, we move to the next instruction and repeat the same process until we have disassembled every instruction in the file.
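Steps 3 and 4 can be sketched as a loop that formats each decoded instruction and then advances by its length. The decoder here, `toy_decode`, is a hypothetical stand-in that knows only three one-byte x86 opcodes; emitting undecodable bytes as `db` data and the column layout of the output are choices made for this example, not something a real disassembler is required to do.

```python
from typing import Callable, Iterator, Optional, Tuple

Decoded = Tuple[int, str]  # (instruction length in bytes, formatted text)


def toy_decode(code: bytes, offset: int) -> Optional[Decoded]:
    """Hypothetical decoder: knows three one-byte x86 opcodes and nothing else."""
    table = {0x90: "nop", 0xC3: "ret", 0xCC: "int3"}
    text = table.get(code[offset])
    return (1, text) if text else None


def disassemble(
    code: bytes,
    base: int,
    decode: Callable[[bytes, int], Optional[Decoded]] = toy_decode,
) -> Iterator[str]:
    """Yield one formatted line per instruction, walking the bytes in order."""
    offset = 0
    while offset < len(code):
        decoded = decode(code, offset)
        if decoded is None:
            # Unknown byte: show it as data and skip a single byte.
            length, text = 1, f"db {code[offset]:#04x}"
        else:
            length, text = decoded
        raw = code[offset:offset + length].hex()
        yield f"{base + offset:08x}  {raw:<16}  {text}"
        offset += length  # Step 4: move on to the next instruction


for line in disassemble(bytes([0x90, 0xFF, 0xC3]), base=0x1000):
    print(line)
```

Note that the loop trusts every byte in the region to be code. That is exactly the assumption that breaks down when data is mixed in with instructions, as described in Step 1.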
Various well-known disassembly algorithms exist:
1. Linear Sweep Disassembly
2. Recursive Descent Disassembly
Both of these algorithms will be covered in forthcoming posts.