Friday, February 3, 2017

Remember the Registers (or, the ABCs of x86)

I have a friend who is about to learn assembly language on x86. This means understanding how processor registers are used. So, I'm sharing some mnemonics for remembering the registers and their roles on x86. Maybe they will help you, too.

A register, by the way, is just a small storage location that is directly (and quickly!) accessible by the processor. It usually lives in the processor circuitry itself rather than at a memory address (although this depends on the hardware).

[Image] Oh. Okay, so not one of these. Got it.


I'll completely omit floating point (and MMX/SSE) stuff, but I'll briefly mention amd64 for the sake of example.

A, B, C, and D

There are four general purpose registers whose names are basically A, B, C, and D. On old 16-bit Intel processors, they were named AX, BX, CX, and DX. From each of these, you could also access the low eight bits (e.g. AL, as in "A low") and the high eight bits (e.g. AH, as in "A high"), and you still can.

When 32-bit came along, they prefixed them all with an E (meaning "extended"), so now they are EAX, EBX, ECX, and EDX.

And when AMD devised their derivative amd64 architecture, they prefixed them all with an R, presumably meaning "really-friggin'-extended". So, on amd64, they are named RAX, RBX, RCX, and RDX.
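
If the nesting of those names is hard to picture, here is a minimal sketch in Python (not assembly, and not how the hardware is built). It treats one 64-bit value as the "A" register and applies the masks that each name implies. The function name views is made up for this illustration.

```python
# Toy model: one 64-bit value, seen through the bit ranges
# that the different names of the "A" register refer to.
def views(rax):
    """Return the sub-register views of a 64-bit RAX value."""
    return {
        "RAX": rax & 0xFFFFFFFFFFFFFFFF,  # all 64 bits (amd64 only)
        "EAX": rax & 0xFFFFFFFF,          # low 32 bits
        "AX": rax & 0xFFFF,               # low 16 bits
        "AH": (rax >> 8) & 0xFF,          # bits 8..15, "A high"
        "AL": rax & 0xFF,                 # bits 0..7, "A low"
    }


for name, value in views(0x1122334455667788).items():
    print(f"{name:>3} = {value:#x}")
```

The same picture applies to B, C, and D.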

The "A" register has a few special roles, but most importantly, under the common calling conventions, it holds the return value (i.e. the result) when functions return to their callers.

The "C" register sometimes has a special role, too, that I'll describe next.

So much for A, B, C, and D.

String

Several string instructions (e.g. cmps, movs, stos, and lods) have implied operands:
  • ESI = Source
  • EDI = Destination
  • ECX = Counter (when the instruction is repeated with a rep prefix)
  • EAX = Value to write or read (sometimes applicable, sometimes not)
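
To make those roles concrete, here is a rough Python sketch of what a repeated byte copy (rep movsb) does with them. It is a toy model, not real assembly: the function name rep_movsb and the bytearray standing in for memory are inventions for this example, and it assumes the direction flag is clear so that both pointers count upward.

```python
def rep_movsb(memory, esi, edi, ecx):
    """Copy ecx bytes from memory[esi] to memory[edi], one byte per step.

    Assumes the direction flag is clear, so both pointers count upward.
    Returns the final (esi, edi, ecx) values.
    """
    while ecx != 0:
        memory[edi] = memory[esi]  # movsb: copy one byte, source to destination
        esi += 1                   # advance the source pointer
        edi += 1                   # advance the destination pointer
        ecx -= 1                   # rep: count down and stop at zero
    return esi, edi, ecx


memory = bytearray(b"hello" + b"." * 11)  # 16 bytes of pretend RAM
print(rep_movsb(memory, esi=0, edi=8, ecx=5))
print(memory)
```

Note that none of the three registers is named in the instruction itself; that is what "implied operands" means here.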

Stack

The stack starts at a high address in memory and grows down to lower addresses as things are added to it. The processor keeps track of the top with a stack pointer, and functions can additionally use a "base" pointer to mark the boundary between the caller's part of the stack (plus two other pieces of data: the return address and the caller's saved base pointer) and their own local variables.
  • ESP = Stack Pointer
  • EBP = Base Pointer
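
Here is a small Python sketch of that idea, assuming 32-bit (4-byte) stack slots. The Stack class and all of the addresses and values are hypothetical, chosen only to show which way the pointers move on a typical function entry.

```python
class Stack:
    """Toy model of a downward-growing stack of 32-bit slots."""

    def __init__(self, top=0x1000):
        self.esp = top    # the stack pointer starts at a high address
        self.memory = {}  # address -> value, standing in for RAM

    def push(self, value):
        self.esp -= 4                  # grow down first...
        self.memory[self.esp] = value  # ...then store at the new top

    def pop(self):
        value = self.memory[self.esp]  # read the top slot...
        self.esp += 4                  # ...then shrink back up
        return value


stack = Stack(top=0x1000)
stack.push(0x00401000)  # return address, pushed by the call (made-up value)
stack.push(0x2000)      # the caller's EBP, saved by the callee (made-up value)
ebp = stack.esp         # EBP now marks the boundary
stack.esp -= 8          # reserve room for two local variables below it
print(hex(ebp), hex(stack.esp))
```

In this model, everything at or above EBP is the saved base pointer, the return address, and the caller's data, while the locals sit below it.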

Instruction Pointer

So special that it's all alone under its own heading:
  • EIP = Instruction Pointer
This register points to the instruction that is about to be executed.

Extra Credit: Segment Registers

Remember in Windows 95 when blue screens used to report "CS:EIP = xxxxxxx"? Neither do I; let's pretend I didn't admit that. Anyway, that CS is a segment selector indicating the location of the code segment. Modern operating systems use a flat model, so the segment registers mostly end up describing the same address space.

  • CS = Code Segment
  • DS = Data Segment
  • ES = "Extra" Segment (meh)
  • SS = Stack Segment
  • FS & GS = Even more extra segments, using the letters that follow C, D, and E, namely F and G Segments
These registers are 16 bits long and contain integers called segment selectors; the CPU uses each selector to look up a segment descriptor in the Global or the Local Descriptor Table (the GDT or the LDT), and that descriptor gives the base address and size of the memory segment. Operating systems commonly use a "flat" model instead of a segmented model, so these may all describe the same range of memory. Since segmentation is largely a non-issue now, the extra segment registers are sometimes used to point to interesting structures. For example: FS => Thread Environment Block (TEB) in userspace Windows 32-bit applications.
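
For the curious, a 16-bit selector in protected mode packs three fields: a table index in the upper 13 bits, one bit choosing between the GDT and the LDT, and a 2-bit requested privilege level. Here is a short Python sketch of pulling them apart. The function name decode_selector is made up for this example, and 0x23 is just a sample value.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into (index, table, rpl)."""
    index = selector >> 3                         # bits 3..15: entry in the table
    table = "LDT" if selector & 0b100 else "GDT"  # bit 2: table indicator
    rpl = selector & 0b11                         # bits 0..1: requested privilege level
    return index, table, rpl


print(decode_selector(0x23))
```

The index is what the CPU uses to find the descriptor; the base address and size come from the descriptor, not from the selector itself.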

Summary

In review:

  • EAX, EBX, ECX, EDX = A, B, C, D; note that the 'A' register holds function return values
  • ESI, EDI = Source, Destination (for string operations) - ECX may be the counter and EAX may be used, too.
  • ESP, EBP = Stack Pointer, Base Pointer
  • EIP = Instruction Pointer
  • CS, DS, SS, ES, FS, GS = Code, Data, Stack, and Extra segments, followed by F and G Segments

For the authoritative reference on all of this, see the Intel processor manuals.

Happy hackin' :)