
Undocumented 8086 instructions, explained by the microcode

What happens if you give the Intel 8086 processor an instruction that doesn't exist? A modern microprocessor (80186 and later) will generate an exception, indicating that an illegal instruction was executed. However, early microprocessors didn't include the circuitry to detect illegal instructions, since the chips didn't have transistors to spare. Instead these processors would do something, but the results weren't specified.1

The 8086 has a number of undocumented instructions. Most of them are simply duplicates of regular instructions, but a few have unexpected behavior, such as revealing the values of internal, hidden registers. In the 8086, most instructions are implemented in microcode, so examining the 8086's microcode can explain why these instructions behave the way they do.

The photo below shows the 8086 die under a microscope, with the important functional blocks labeled. The metal layer is visible, while the underlying silicon and polysilicon wiring is mostly hidden. The microcode ROM and the microcode address decoder are in the lower right. The Group Decode ROM (upper center) is also important, as it performs the first step of instruction decoding.

The 8086 die under a microscope, with main functional blocks labeled. Click on this image (or any other) for a larger version.

Microcode and 8086 instruction decoding

You might think that machine instructions are the basic steps that a computer performs. However, instructions usually require multiple steps inside the processor. One way of expressing these multiple steps is through microcode, a technique dating back to 1951. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task.

The 8086's microcode ROM holds 512 micro-instructions, each 21 bits wide. Each micro-instruction performs two actions in parallel. First is a move between a source and a destination, typically registers. Second is an operation that can range from an arithmetic (ALU) operation to a memory access. The diagram below shows the structure of a 21-bit micro-instruction, divided into six types.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

When executing a machine instruction, the 8086 performs a decoding step. Although the 8086 is a 16-bit processor, its instructions are based on bytes. In most cases, the first byte specifies the opcode, which may be followed by additional instruction bytes. In other cases, the byte is a "prefix" byte, which changes the behavior of the following instruction. The first byte is analyzed by something called the Group Decode ROM. This circuit categorizes the first byte of the instruction into about 35 categories that control how the instruction is decoded and executed. One category is "1-byte logic"; this indicates a one-byte instruction or prefix that is simple and implemented by logic circuitry in the 8086, with no microcode involved. The remaining instructions are implemented in microcode. Many of these instructions are in the "two-byte ROM" category, indicating that the instruction has a second byte that also needs to be decoded by microcode. This second byte, called the ModR/M byte, specifies the memory addressing mode or registers that the instruction uses.

The next step is the microcode's address decoder circuit, which determines where to start executing microcode based on the opcode. Conceptually, you can think of the microcode as stored in a ROM, indexed by the instruction opcode and a few sequence bits. However, since many instructions can use the same microcode, it would be inefficient to store duplicate copies of these routines. Instead, the microcode address decoder permits multiple instructions to reference the same entries in the ROM. This decoding circuitry is similar to a PLA (Programmable Logic Array), matching bit patterns to determine a particular starting point. This turns out to be important for undocumented instructions: an undocumented instruction often matches the bit pattern for a "real" instruction, making the undocumented instruction an alias.

The 8086 has several internal registers that are invisible to the programmer but are used by the microcode. Memory accesses use the Indirect (IND) and Operand (OPR) registers; the IND register holds the address in the segment, while the OPR register holds the data value that is read or written. Although these registers are normally not accessible by the programmer, some undocumented instructions provide access to these registers, as will be described later.

The Arithmetic/Logic Unit (ALU) performs arithmetic, logical, and shift operations in the 8086. The ALU uses three internal registers: tmpA, tmpB, and tmpC. An ALU operation requires two micro-instructions. The first micro-instruction specifies the operation (such as ADD) and the temporary register that holds one argument (e.g. tmpA); the second argument is always in tmpB. A following micro-instruction can access the ALU result through the pseudo-register Σ (sigma).

The ModR/M byte

A fundamental part of the 8086 instruction format is the ModR/M byte, a byte that specifies addressing for many instructions. The 8086 has a variety of addressing modes, so the ModR/M byte is somewhat complicated. Normally it specifies one memory address and one register. The memory address is specified through one of eight addressing modes (below) along with an optional 8- or 16-bit displacement in the instruction. Instead of a memory address, the ModR/M byte can also specify a second register. For a few opcodes, the ModR/M byte selects what instruction to execute rather than a register.

The 8086's addressing modes, from the MCS-86 Assembly Language Reference Guide.
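To make the field layout concrete, here is a minimal Python sketch (an illustration, not 8086 hardware or microcode) that splits a ModR/M byte into its three fields:

def decode_modrm(b):
    mod = (b >> 6) & 0b11    # 00, 01, 10: memory addressing modes; 11: register
    reg = (b >> 3) & 0b111   # register (or opcode extension for a few opcodes)
    rm  = b & 0b111          # memory addressing mode, or a second register
    return mod, reg, rm

print(decode_modrm(0xC3))    # (3, 0, 3): mod=3 selects the register form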

The implementation of the ModR/M byte plays an important role in the behavior of undocumented instructions. Support for this byte is implemented in both microcode and hardware. The various memory address modes above are implemented by microcode subroutines, which compute the appropriate memory address and perform a read if necessary. The subroutine leaves the memory address in the IND register, and if a read is performed, the value is in the OPR register.

The hardware hides the ModR/M byte's selection of memory versus register, by making the value available through the pseudo-register M, while the second register is available through N. Thus, the microcode for an instruction doesn't need to know if the value was in memory or a register, or which register was selected. The Group Decode ROM examines the first byte of the instruction to determine if a ModR/M byte is present, and if a read is required. If the ModR/M byte specifies memory, the Translation ROM determines which micro-subroutines to call before handling the instruction itself. For more on the ModR/M byte, see my post on Reverse-engineering the ModR/M addressing microcode.

Holes in the opcode table

The first byte of the instruction is a value from 00 to FF in hex. Almost all of these opcode values correspond to documented 8086 instructions, but there are a few exceptions, "holes" in the opcode table. The table below shows the 256 first-byte opcodes for the 8086, from hex 00 to FF. Valid opcodes for the 8086 are in white; the colored opcodes are undefined and interesting to examine. Orange, yellow, and green opcodes were given meaning in the 80186, 80286, and 80386 respectively. The purple opcode is unusual: it was implemented in the 8086 and later processors but not documented.2 In this section, I'll examine the microcode for these opcode holes.

This table shows the 256 opcodes for the 8086, where the white ones are valid instructions. Click for a larger version.

D6: SALC

The opcode D6 (purple above) performs a well-known but undocumented operation that is typically called SALC, for Set AL to Carry. This instruction sets the AL register to 0 if the carry flag is 0, and sets the AL register to FF if the carry flag is 1. The curious thing about this undocumented instruction is that it exists in all x86 CPUs, but Intel didn't mention it until 2017. Intel probably put this instruction into the processor deliberately as a copyright trap. The idea is that if a company created a copy of the 8086 processor and the processor included the SALC instruction, this would prove that the company had copied Intel's microcode and thus had potentially violated Intel's copyright on the microcode. This came to light when NEC created improved versions of the 8086, the NEC V20 and V30 microprocessors, and was sued by Intel. Intel analyzed NEC's microcode but was disappointed to find that NEC's chip did not include the hidden instruction, showing that NEC hadn't copied the microcode.3 Although a Federal judge ruled in 1989 that NEC hadn't infringed Intel's copyright, the 5-year trial ruined NEC's market momentum.

The SALC instruction is implemented with three micro-instructions, shown below.4 The first micro-instruction jumps if the carry (CY) is set. If not, the next instruction moves 0 to the AL register. RNI (Run Next Instruction) ends the microcode execution causing the next machine instruction to run. If the carry was set, all-ones (i.e. FF hex) is moved to the AL register and RNI ends the microcode sequence.

           JMPS CY 2 SALC: jump on carry
ZERO → AL  RNI       Move 0 to AL, run next instruction
ONES → AL  RNI       2:Move FF to AL, run next instruction
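In higher-level terms, the effect on AL can be sketched in Python; this is a behavioral model of the three micro-instructions above, not the hardware:

def salc(carry_flag):
    # AL becomes FF if the carry flag is set, 0 otherwise.
    return 0xFF if carry_flag else 0x00

print(hex(salc(True)))   # 0xff
print(hex(salc(False)))  # 0x0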

0F: POP CS

The 0F opcode is the first hole in the opcode table. The 8086 has instructions to push and pop the four segment registers, except opcode 0F is undefined where POP CS should be. This opcode performs POP CS successfully, so why is it undefined? The reason is that POP CS is essentially useless and doesn't do what you'd expect, so Intel figured it was best not to document it.

To understand why POP CS is useless, I need to step back and explain the 8086's segment registers. The 8086 has a 20-bit address space, but 16-bit registers. To make this work, the 8086 has the concept of segments: memory is accessed in 64K chunks called segments, which are positioned in the 1-megabyte address space. Specifically, there are four segments: Code Segment, Stack Segment, Data Segment, and Extra Segment, with four segment registers that define the start of the segment: CS, SS, DS, and ES.
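As a quick worked example of the segment arithmetic, here is a Python sketch of how a 16-bit segment register and offset combine into a 20-bit physical address (segment × 16 + offset):

def physical_address(segment, offset):
    # Shift the segment left 4 bits (multiply by 16) and add the offset,
    # keeping 20 bits as on the 8086's address bus.
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical_address(0x1234, 0x5678)))  # 0x179b8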

An inconvenient part of segment addressing is that if you want to access more than 64K, you need to change the segment register. So you might push the data segment register, change it temporarily so you can access a new part of memory, and then pop the old data segment register value off the stack. This would use the PUSH DS and POP DS instructions. But why not POP CS?

The 8086 executes code from the code segment, with the instruction pointer (IP) tracking the location in the code segment. The main problem with POP CS is that it changes the code segment, but not the instruction pointer, so now you are executing code at the old offset in a new segment. Unless you line up your code extremely carefully, the result is that you're jumping to an unexpected place in memory. (Normally, you want to change CS and the instruction pointer at the same time, using a CALL or JMP instruction.)

The second problem with POP CS is prefetching. For efficiency, the 8086 prefetches instructions before they are needed, storing them in a 6-byte prefetch queue. When you perform a jump, for instance, the microcode flushes the prefetch queue so execution will continue with the new instructions, rather than the old instructions. However, the instructions that pop a segment register don't flush the prefetch buffer. Thus, POP CS not only jumps to an unexpected location in memory, but it will execute an unpredictable number of instructions from the old code path.

The POP segment register microcode below packs a lot into three micro-instructions. The first micro-instruction pops a value from the stack. Specifically, it moves the stack pointer (SP) to the Indirect (IND) register. The Indirect register is an internal register, invisible to the programmer, that holds the address offset for memory accesses. The first micro-instruction also performs a memory read (R) from the stack segment (SS) and then increments IND by 2 (P2, plus 2). The second micro-instruction moves IND to the stack pointer, updating the stack pointer with the new value. It also tells the microcode engine that this micro-instruction is the next-to-last (NXT) and the next machine instruction can be started. The final micro-instruction moves the value read from memory to the appropriate segment register and runs the next instruction. Specifically, reads and writes put data in the internal OPR (Operand) register. The hardware uses the register N to indicate the register specified by the instruction. That is, the value will be stored in the CS, DS, ES, or SS register, depending on the bit pattern in the instruction. Thus, the same microcode works for all four segment registers. This is why POP CS works even though POP CS wasn't explicitly implemented in the microcode; it uses the common code.

SP → IND  R SS,P2 POP sr: read from stack, compute IND plus 2
IND → SP  NXT     Put updated value in SP, start next instruction.
OPR → N   RNI     Put stack value in specified segment register

But why does POP CS run this microcode in the first place? The microcode to execute is selected based on the instruction, but multiple instructions can execute the same microcode. You can think of the address decoder as pattern-matching on the instruction's bit patterns, where some of the bits can be ignored. In this case, the POP sr microcode above is run by any instruction with the bit pattern 000??111, where a question mark can be either a 0 or a 1. You can verify that this pattern matches POP ES (07), POP SS (17), and POP DS (1F). However, it also matches 0F, which is why the 0F opcode runs the above microcode and performs POP CS. In other words, to make 0F do something other than POP CS would require additional circuitry, so it was easier to leave the action implemented but undocumented.
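The idea of matching an opcode against a pattern with don't-care bits can be illustrated with a short Python sketch; the matches function is my own illustration, not Intel's decoder circuitry:

def matches(opcode, pattern):
    # pattern is a string of '0', '1', and '?' characters, most significant bit first.
    for i, p in enumerate(pattern):
        bit = (opcode >> (7 - i)) & 1
        if p != '?' and bit != int(p):
            return False
    return True

# The POP sr pattern matches the three documented pops plus the undocumented 0F:
print([hex(op) for op in range(256) if matches(op, "000??111")])
# ['0x7', '0xf', '0x17', '0x1f']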

60-6F: conditional jumps

One whole row of the opcode table is unused: values 60 to 6F. These opcodes simply act the same as 70 to 7F, the conditional jump instructions.

The conditional jumps use the following microcode. It fetches the jump offset from the instruction prefetch queue (Q) and puts the value into the ALU's tmpBL register, the low byte of the tmpB register. It tests the condition in the instruction (XC) and jumps to the RELJMP micro-subroutine if satisfied. The RELJMP code (not shown) updates the program counter to perform the jump.

Q → tmpBL                Jcond cb: Get offset from prefetch queue
           JMP XC RELJMP Test condition, if true jump to RELJMP routine
           RNI           No jump: run next instruction

This code is executed for any instruction matching the bit pattern 011?????, i.e. anything from 60 to 7F. The condition is specified by the four low bits of the instruction. The result is that any instruction 60-6F is an alias for the corresponding conditional jump 70-7F.

C0, C8: RET/RETF imm

These undocumented opcodes act like a return instruction, RET imm16 (source). Specifically, the instruction C0 is the same as C2, the near return, while C8 is the same as CA, the far return.

The microcode below is executed for the instruction bits 1100?0?0, so it is executed for C0, C2, C8, and CA. It gets two bytes from the instruction prefetch queue (Q) and puts them in the tmpA register. Next, it calls FARRET, which performs either a near return (popping PC from the stack) or a far return (popping PC and CS from the stack). Finally, it adds the original argument to the SP, equivalent to popping that many bytes.

Q → tmpAL    ADD tmpA    RET/RETF iw: Get word from prefetch, set up ADD
Q → tmpAH    CALL FARRET Call Far Return micro-subroutine
IND → tmpB               Move SP (in IND) to tmpB for ADD
Σ → SP       RNI         Put sum in Stack Pointer, end

One tricky part is that the FARRET micro-subroutine examines bit 3 of the instruction to determine whether it does a near return or a far return. This is why documented instruction C2 is a near return and CA is a far return. Since C0 and C8 run the same microcode, they will perform the same actions, a near return and a far return respectively.

C1: RET

The undocumented C1 opcode is identical to the documented C3, near return instruction. The microcode below is executed for instruction bits 110000?1, i.e. C1 and C3. The first micro-instruction reads from the Stack Pointer, incrementing IND by 2. Prefetching is suspended and the prefetch queue is flushed, since execution will continue at a new location. The Program Counter is updated with the value from the stack, read into the OPR register. Finally, the updated address is put in the Stack Pointer and execution ends.

SP → IND  R SS,P2  RET:  Read from stack, increment by 2
          SUSP     Suspend prefetching
OPR → PC  FLUSH    Update PC from stack, flush prefetch queue
IND → SP  RNI      Update SP, run next instruction

C9: RETF

The undocumented C9 opcode is identical to the documented CB, far return instruction. This microcode is executed for instruction bits 110010?1, i.e. C9 and CB, so C9 is identical to CB. The microcode below simply calls the FARRET micro-subroutine to pop the Program Counter and CS register. Then the new value is stored into the Stack Pointer. One subtlety is that FARRET looks at bit 3 of the instruction to switch between a near return and a far return, as described earlier. Since C9 and CB both have bit 3 set, they both perform a far return.

          CALL FARRET  RETF: call FARRET routine
IND → SP  RNI          Update stack pointer, run next instruction

F1: LOCK prefix

The final hole in the opcode table is F1. This opcode is different because it is implemented in logic rather than microcode. The Group Decode ROM indicates that F1 is a prefix, one-byte logic, and LOCK. The Group Decode outputs are the same as F0, so F1 also acts as a LOCK prefix.

Holes in two-byte opcodes

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the second byte specifies the instruction: the reg field of the ModR/M byte provides an opcode extension that selects the instruction.5 These fall into four categories which Intel labeled "Immed", "Shift", "Group 1", and "Group 2", corresponding to opcodes 80-83, D0-D3, F6-F7, and FE-FF. The table below shows how the second byte selects the instruction. Note that "Shift", "Group 1", and "Group 2" all have gaps, resulting in undocumented values.

Meaning of the reg field in two-byte opcodes. From MCS-86 Assembly Language Reference Guide.

These sets of instructions are implemented in two completely different ways. The "Immed" and "Shift" instructions run microcode in the standard way, selected by the first byte. For a typical arithmetic/logic instruction such as ADD, bits 5-3 of the first instruction byte are latched into the X register to indicate which ALU operation to perform. The microcode specifies a generic ALU operation, while the X register controls whether the operation is an ADD, SUB, XOR, or so forth. However, the Group Decode ROM indicates that for the special "Immed" and "Shift" instructions, the X register latches the bits from the second byte. Thus, when the microcode executes a generic ALU operation, it ends up with the one specified in the second byte.6

The "Group 1" and "Group 2" instructions (F0-F1, FE-FF), however, run different microcode for each instruction. Bits 5-3 of the second byte replace bits 2-0 of the instruction before executing the microcode. Thus, F0 and F1 act as if they are opcodes in the range F0-F7, while FE and FF act as if they are opcodes in the range F8-FF. Thus, each instruction specified by the second byte can have its own microcode, unlike the "Immed" and "Shift" instructions. The trick that makes this work is that all the "real" opcodes in the range F0-FF are implemented in logic, not microcode, so there are no collisions.

The hole in "Shift": SETMO, D0..D3/6

There is a "hole" in the list of shift operations when the second byte has the bits 110 (6). (This is typically expressed as D0/6 and so forth; the value after the slash is the opcode-selection bits in the ModR/M byte.) Internally, this value selects the ALU's SETMO (Set Minus One) operation, which simply returns FF or FFFF, for a byte or word operation respectively.7

The microcode below is executed for the bit pattern 1101000? (D0 and D1). The first micro-instruction gets the value from the M register and sets up the ALU to do whatever operation was specified in the instruction (indicated by XI). Thus, the same microcode is used for all the "Shift" instructions, including SETMO. The result is written back to M. If no writeback to memory is required (NWB), then RNI runs the next instruction, ending the microcode sequence. However, if the result is going to memory, then the last line writes the value to memory.

M → tmpB  XI tmpB, NXT  rot rm, 1: get argument, set up ALU
Σ → M     NWB,RNI F     Store result, maybe run next instruction
          W DS,P0 RNI   Write result to memory

The D2 and D3 instructions (1101001?) perform a variable number of shifts, specified by the CL register, so they use different microcode (below). This microcode loops the number of times specified by CL, but the control flow is a bit tricky to avoid shifting if the initial counter value is 0. The code sets up the ALU to pass the counter (in tmpA) unmodified the first time (PASS) and jumps to 4, which updates the counter and sets up the ALU for the shift operation (XI). If the counter is not zero, it jumps back to 3, which performs the previously-specified shift and sets up the ALU to decrement the counter (DEC). This time, the code at 4 decrements the counter. The loop continues until the counter reaches zero. The microcode stores the result as in the previous microcode.

ZERO → tmpA               rot rm,CL: 0 to tmpA
CX → tmpAL   PASS tmpA    Get count to tmpAL, set up ALU to pass through
M → tmpB     JMPS 4       Get value, jump to loop (4)
Σ → tmpB     DEC tmpA F   3: Update result, set up decrement of count
Σ → tmpA     XI tmpB      4: update count in tmpA, set up ALU
             JMPS NZ 3    Loop if count not zero
tmpB → M     NWB,RNI      Store result, maybe run next instruction
             W DS,P0 RNI  Write result to memory

The hole in "group 1": TEST, F6/1 and F7/1

The F6 and F7 opcodes are in "group 1", with the specific instruction specified by bits 5-3 of the second byte. The second-byte table showed a hole for the 001 bit sequence. As explained earlier, these bits replace the low-order bits of the instruction, so F6 with 001 is processed as if it were the opcode F1. The microcode below matches against instruction bits 1111000?, so F6/1 and F7/1 have the same effect as F6/0 and F7/0 respectively, that is, the byte and word TEST instructions.

The microcode below gets one or two bytes from the prefetch queue (Q); the L8 condition tests if the operation is an 8-bit (i.e. byte) operation and skips the second micro-instruction. The third micro-instruction ANDs the argument and the fetched value. The condition flags (F) are set based on the result, but the result itself is discarded. Thus, the TEST instruction tests a value against a mask, seeing if any bits are set.

Q → tmpBL    JMPS L8 2     TEST rm,i: Get byte, jump if operation length = 8
Q → tmpBH                  Get second byte from the prefetch queue
M → tmpA     AND tmpA, NXT 2: Get argument, AND with fetched value
Σ → no dest  RNI F         Discard result but set flags.
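As a behavioral summary, here is a Python sketch of the byte TEST operation; it is simplified to the zero and sign flags and is not the actual ALU logic:

def test8(value, mask):
    result = value & mask & 0xFF
    zero_flag = (result == 0)
    sign_flag = bool(result & 0x80)
    return zero_flag, sign_flag   # the ANDed result itself is discarded

print(test8(0x55, 0x0F))  # (False, False): value and mask share some bits
print(test8(0x55, 0xAA))  # (True, False): no bits in common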

I explained the processing of these instructions in more detail in my microcode article.

The hole in "group 2": PUSH, FE/7 and FF/7

The FE and FF opcodes are in "group 2", which has a hole for the 111 bit sequence in the second byte. After replacement, this will be processed as the FF opcode, which matches the pattern 1111111?. In other words, the instruction will be processed the same as the 110 bit pattern, which is PUSH. The microcode gets the Stack Pointer and sets up the ALU to decrement it by 2. The new value is written to SP and IND. Finally, the register value is written to stack memory.

SP → tmpA  DEC2 tmpA   PUSH rm: set up decrement SP by 2
Σ → IND                Decremented SP to IND
Σ → SP                 Decremented SP to SP
M → OPR    W SS,P0 RNI Write the value to memory, done

82 and 83 "Immed" group

Opcodes 80-83 are the "Immed" group, performing one of eight arithmetic operations, specified in the ModR/M byte. The four opcodes differ in the size of the values: opcode 80 applies an 8-bit immediate value to an 8-bit register, 81 applies a 16-bit value to a 16-bit register, 82 applies an 8-bit value to an 8-bit register, and 83 applies an 8-bit value to a 16-bit register. The opcode 82 has the strange situation that some sources say it is undocumented, but it shows up in some Intel documentation as a valid bit combination (e.g. below). Note that 80 and 82 have the 8-bit to 8-bit action, so the 82 opcode is redundant.

ADC is one of the instructions with opcode 80-83. From the 8086 datasheet, page 27.

The microcode below is used for all four opcodes. If the ModR/M byte specifies memory, the appropriate micro-subroutine is called to compute the effective address in IND, and fetch the byte or word into OPR. The first two instructions below get the two immediate data bytes from the prefetch queue; for an 8-bit operation, the second byte is skipped. Next, the second argument M is loaded into tmpA and the desired ALU operation (XI) is configured. The result Σ is stored into the specified register M and the operation may terminate with RNI. But if the ModR/M byte specified memory, the following write micro-operation saves the value to memory.

Q → tmpBL  JMPS L8 2    alu rm,i: get byte, test if 8-bit op
Q → tmpBH               Maybe get second byte
M → tmpA   XI tmpA, NXT 2: 
Σ → M      NWB,RNI F    Save result, update flags, done if no memory writeback
           W DS,P0 RNI  Write result to memory if needed

The tricky part of this is the L8 condition, which tests if the operation is 8-bit. You might think that bit 0 acts as the byte/word bit in a nice, orthogonal way, but the 8086 has a bunch of special cases. The Group Decode ROM creates a signal indicating if bit 0 should be used as the byte/word bit. It generates a second signal indicating that an instruction should be forced to operate on bytes, for instructions such as DAA and XLAT. Another Group Decode ROM signal indicates that bit 3 of the instruction should select byte or word; this is used for the MOV instructions with opcodes Bx. Yet another Group Decode ROM signal indicates that inverted bit 1 of the instruction should select byte or word; this is used for a few opcodes, including 80-87.

The important thing here is that for the opcodes under discussion (80-83), the L8 micro-condition uses both bits 0 and 1 to determine if the instruction is 8 bits or not. The result is that only opcode 81 is considered 16-bit by the L8 test, so it is the only one that uses two immediate bytes from the instruction. However, the register operations use only bit 0 to select a byte or word transfer. The result is that opcode 83 has the unusual behavior of using an 8-bit immediate operand with a 16-bit register. In this case, the 8-bit value is sign-extended to form a 16-bit value. That is, the top bit of the 8-bit value fills the entire upper half of the 16-bit value, converting an 8-bit signed value to a 16-bit signed value (e.g. -1 is FF, which becomes FFFF). This makes sense for arithmetic operations, but not much sense for logical operations.
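The sign extension used by opcode 83 can be summarized with a tiny Python sketch (an illustration, not the ALU circuit):

def sign_extend(byte):
    # Copy bit 7 of the 8-bit value into the upper half of the 16-bit result.
    return byte | 0xFF00 if byte & 0x80 else byte

print(hex(sign_extend(0xFF)))  # 0xffff, i.e. -1
print(hex(sign_extend(0x7F)))  # 0x7f, i.e. +127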

Intel documentation is inconsistent about which opcodes are listed for which instructions. Intel opcode maps generally define opcodes 80-83. However, lists of specific instructions show opcodes 80, 81, and 83 for arithmetic operations but only 80 and 81 for logical operations.8 That is, Intel omits the redundant 82 opcode as well as omitting logic operations that perform sign-extension (83).

More FE holes

For the "group 2" instructions, the FE opcode performs a byte operation while FF performs a word operation. Many of these operations don't make sense for bytes: CALL, JMP, and PUSH. (The only instructions supported for FE are INC and DEC.) But what happens if you use the unsupported instructions? The remainder of this section examines those cases and shows that the results are not useful.

CALL: FE/2

This instruction performs an indirect subroutine call within a segment, reading the target address from the memory location specified by the ModR/M byte.

The microcode below is a bit convoluted because the code falls through into the shared NEARCALL routine, so there is some unnecessary register movement. Before this microcode executes, the appropriate ModR/M micro-subroutine will read the target address from memory. The code below copies the destination address from M to tmpB and stores it into the PC later in the code to transfer execution. The code suspends prefetching, corrects the PC to cancel the offset from prefetching, and flushes the prefetch queue. Finally, it decrements the SP by two and writes the old PC to the stack.

M → tmpB    SUSP        CALL rm: read value, suspend prefetch
SP → IND    CORR        Get SP, correct PC
PC → OPR    DEC2 tmpC   Get PC to write, set up decrement
tmpB → PC   FLUSH       NEARCALL: Update PC, flush prefetch
IND → tmpC              Get SP to decrement
Σ → IND                 Decremented SP to IND
Σ → SP      W SS,P0 RNI Update SP, write old PC to stack

This code will mess up in two ways when executed as a byte instruction. First, when the destination address is read from memory, only a byte will be read, so the destination address will be corrupted. (I think that the behavior here depends on the bus hardware. The 8086 will ask for a byte from memory but will read the word that is placed on the bus. Thus, if memory returns a word, this part may operate correctly. The 8088's behavior will be different because of its 8-bit bus.) The second issue is writing the old PC to the stack: only a byte of the PC will be written, so when the code returns from the subroutine call, the return address will be corrupt.

CALL: FE/3

This instruction performs an indirect subroutine call between segments, reading the target address from the memory location specified by the ModR/M byte.

IND → tmpC  INC2 tmpC    CALL FAR rm: set up IND+2
Σ → IND     R DS,P0      Read new CS, update IND
OPR → tmpA  DEC2 tmpC    New CS to tmpA, set up SP-2
SP → tmpC   SUSP         FARCALL: Suspend prefetch
Σ → IND     CORR         FARCALL2: Update IND, correct PC
CS → OPR    W SS,M2      Push old CS, decrement IND by 2
tmpA → CS   PASS tmpC    Update CS, set up for NEARCALL
PC → OPR    JMP NEARCALL Continue with NEARCALL

As in the previous CALL, this microcode will fail in multiple ways when executed in byte mode. The new CS and PC addresses will be read from memory as bytes, which may or may not work. Only a byte of the old CS and PC will be pushed to the stack.

JMP: FE/4

This instruction performs an indirect jump within a segment, reading the target address from the memory location specified by the ModR/M byte. The microcode is short, since the ModR/M micro-subroutine does most of the work. I believe this will have the same problem as the previous CALL instructions, that it will attempt to read a byte from memory instead of a word.

        SUSP       JMP rm: Suspend prefetch
M → PC  FLUSH RNI  Update PC with new address, flush prefetch, done

JMP: FE/5

This instruction performs an indirect jump between segments, reading the new PC and CS values from the memory location specified by the ModR/M byte. The ModR/M micro-subroutine reads the new PC address. This microcode increments IND and suspends prefetching. It updates the PC, reads the new CS value from memory, and updates the CS. As before, the reads from memory will read bytes instead of words, so this code will not meaningfully work in byte mode.

IND → tmpC  INC2 tmpC   JMP FAR rm: set up IND+2
Σ → IND     SUSP        Update IND, suspend prefetch
tmpB → PC   R DS,P0     Update PC, read new CS from memory
OPR → CS    FLUSH RNI   Update CS, flush prefetch, done

PUSH: FE/6

This instruction pushes the register or memory value specified by the ModR/M byte. It decrements the SP by 2 and then writes the value to the stack. As a byte operation, it writes just one byte to the stack but still decrements the SP by 2, so one byte of old stack data will remain on the stack alongside the data byte.

SP → tmpA  DEC2 tmpA    PUSH rm: Set up SP decrement 
Σ → IND                 Decremented value to IND
Σ → SP                  Decremented value to SP
M → OPR    W SS,P0 RNI  Write the data to the stack

Undocumented instruction values

The next category of undocumented instructions is where the first byte indicates a valid instruction, but there is something wrong with the second byte.

AAM: ASCII Adjust after Multiply

The AAM instruction is a fairly obscure one, designed to support binary-coded decimal arithmetic (BCD). After multiplying two BCD digits, you end up with a binary value between 0 and 81 (0×0 to 9×9). If you want a BCD result, the AAM instruction converts this binary value to BCD, for instance splitting 81 into the decimal digits 8 and 1, where the upper digit is 81 divided by 10, and the lower digit is 81 modulo 10.

The interesting thing about AAM is that the 2-byte instruction is D4 0A. You might notice that hex 0A is 10, and this is not a coincidence. There wasn't an easy way to get the value 10 in the microcode, so instead they made the instruction provide that value in the second byte. The undocumented (but well-known) part is that if you provide a value other than 10, the instruction will convert the binary input into digits in that base. For example, if you provide 8 as the second byte, the instruction returns the value divided by 8 and the value modulo 8.
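A behavioral Python sketch of AAM with an arbitrary second byte is below; it models the described behavior, not the microcode. (Divide-by-zero is one of the corner cases mentioned later; the sketch simply raises ZeroDivisionError.)

def aam(al, base=10):
    # Returns (AH, AL): the quotient goes in AH, the remainder in AL.
    return al // base, al % base

print(aam(81))      # (8, 1): the documented BCD case
print(aam(81, 8))   # (10, 1): undocumented base 8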

The microcode for AAM, below, sets up the registers, calls the CORD (Core Division) micro-subroutine to perform the division, and then puts the results into AH and AL. In more detail, the CORD routine divides tmpA/tmpC by tmpB, putting the complement of the quotient in tmpC and leaving the remainder in tmpA. (If you want to know how CORD works internally, see my division post.) The important step is that the AAM microcode gets the divisor from the prefetch queue (Q). After calling CORD, it sets up the ALU to perform a 1's complement of tmpC and puts the result (Σ) into AH. It sets up the ALU to pass tmpA through unchanged, puts the result (Σ) into AL, and updates the flags accordingly (F).

Q → tmpB                    AAM: Move byte from prefetch to tmpB
ZERO → tmpA                 Move 0 to tmpA
AL → tmpC    CALL CORD      Move AL to tmpC, call CORD.
             COM1 tmpC      Set ALU to complement
Σ → AH       PASS tmpA, NXT Complement AL to AH
Σ → AL       RNI F          Pass tmpA through ALU to set flags

The interesting thing is why this code has undocumented behavior. The 8086's microcode only has support for the constants 0 and all-1's (FF or FFFF), but the microcode needs to divide by 10. One solution would be to implement an additional micro-instruction and more circuitry to provide the constant 10, but every transistor was precious back then. Instead, the designers took the approach of simply putting the number 10 as the second byte of the instruction and loading the constant from there. Since the AAM instruction is not used very much, making the instruction two bytes long wasn't much of a drawback. But if you put a different number in the second byte, that's the divisor the microcode will use. (Of course you could add circuitry to verify that the number is 10, but then the implementation is no longer simple.)

Intel could have documented the full behavior, but that creates several problems. First, Intel would be stuck supporting the full behavior into the future. Second, there are corner cases to deal with, such as divide-by-zero. Third, testing the chip would become harder because all these cases would need to be tested. Fourth, the documentation would become long and confusing. It's not surprising that Intel left the full behavior undocumented.

AAD: ASCII Adjust before Division

The AAD instruction is analogous to AAM but used for BCD division. In this case, you want to divide a two-digit BCD number by something, where the BCD digits are in AH and AL. The AAD instruction converts the two-digit BCD number to binary by computing AH×10+AL, before you perform the division.

The microcode for AAD is shown below. The microcode sets up the registers, calls the multiplication micro-subroutine CORX (Core Times), and then puts the results in AH and AL. In more detail, the multiplier comes from the instruction prefetch queue Q. The CORX routine multiplies tmpC by tmpB, putting the result in tmpA/tmpC. Then the microcode adds the low BCD digit (AL) to the product (tmpB + tmpC), putting the sum (Σ) into AL, clearing AH and setting the status flags F appropriately.

One interesting thing is that the second-last micro-instruction jumps to AAEND, which is the last micro-instruction of the AAM microcode above. By reusing the micro-instruction from AAM, the microcode is one micro-instruction shorter, but the jump adds one cycle to the execution time. (The CORX routine is used for integer multiplication; I discuss the internals in this post.)

Q → tmpC              AAD: Get byte from prefetch queue.
AH → tmpB   CALL CORX Call CORX
AL → tmpB   ADD tmpC  Set ALU for ADD
ZERO → AH   JMP AAEND Zero AH, jump to AAEND
...
Σ → AL      RNI F     AAEND: Sum to AL, done.

As with AAM, the constant 10 is provided in the second byte of the instruction. The microcode accepts any value here, but values other than 10 are undocumented.
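A matching behavioral sketch of AAD in Python, again with the base as a parameter:

def aad(ah, al, base=10):
    # Returns (AH, AL): AH is cleared and AL gets AH*base + AL, truncated to a byte.
    return 0, (ah * base + al) & 0xFF

print(aad(8, 1))     # (0, 81): converts BCD digits 8 and 1 back to binary 81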

8C, 8E: MOV sr

The opcodes 8C and 8E perform a MOV register to or from the specified segment register, using the register specification field in the ModR/M byte. There are four segment registers and three selection bits, so an invalid segment register can be specified. However, the hardware that decodes the register number ignores instruction bit 5 for a segment register. Thus, specifying a segment register 4 to 7 is the same as specifying a segment register 0 to 3. For more details, see my article on 8086 register codes.
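The aliasing can be sketched in Python; the table follows the 8086's segment register encoding (00=ES, 01=CS, 10=SS, 11=DS):

SEGMENT_REGS = ['ES', 'CS', 'SS', 'DS']

def segment_reg(reg_field):
    # The top bit of the 3-bit reg field is ignored, so values 4-7 alias 0-3.
    return SEGMENT_REGS[reg_field & 0b011]

print(segment_reg(0b001))  # CS
print(segment_reg(0b101))  # CS again: the "invalid" value aliases the valid one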

Unexpected REP prefix

REP IMUL / IDIV

The REP prefix is used with string operations to cause the operation to be repeated across a block of memory. However, if you use this prefix with an IMUL or IDIV instruction, it has the unexpected behavior of negating the product or the quotient (source).

The reason for this behavior is that the string operations use an internal flag called F1 to indicate that a REP prefix has been applied. The multiply and divide code reuses this flag to track the sign of the input values, toggling F1 for each negative value. If F1 is set, the value at the end is negated. (This handles "two negatives make a positive.") The consequence is that the REP prefix puts the flag in the 1 state when the multiply/divide starts, so the computed sign will be wrong at the end and the result is the negative of the expected result. The microcode is fairly complex, so I won't show it here; I explain it in detail in this blog post.
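The sign handling can be sketched in Python; this models only the sign of the product as described above, ignoring registers, flags, and overflow:

def imul_with_f1(a, b, rep_prefix=False):
    f1 = rep_prefix         # a REP prefix leaves F1 set before the multiply starts
    if a < 0:
        f1 = not f1
    if b < 0:
        f1 = not f1
    product = abs(a) * abs(b)
    return -product if f1 else product

print(imul_with_f1(3, -5))                    # -15: the normal result
print(imul_with_f1(3, -5, rep_prefix=True))   # 15: negated by the leftover F1 flag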

REP RET

Wikipedia lists REP RET (i.e. RET with a REP prefix) as a way to implement a two-byte return instruction. This is kind of trivial; the RET microcode (like almost every instruction) doesn't use the F1 internal flag, so the REP prefix has no effect.

REPNZ MOVS/STOS

Wikipedia mentions that the use of the REPNZ prefix (as opposed to REPZ) is undefined with string operations other than CMPS/SCAS. An internal flag called F1Z distinguishes between the REPZ and REPNZ prefixes. This flag is only used by CMPS/SCAS. Since the other string instructions ignore this flag, they will ignore the difference between REPZ and REPNZ. I wrote about string operations in more detail in this post.

Using a register instead of memory

Some instructions are documented as requiring a memory operand. However, the ModR/M byte can specify a register. The behavior in these cases can be highly unusual, providing access to hidden registers. Examining the microcode shows how this happens.

LEA reg

Many instructions have a ModR/M byte that indicates the memory address that the instruction should use, perhaps through a complicated addressing mode. The LEA (Load Effective Address) instruction is different: it doesn't access the memory location but returns the address itself. The undocumented part is that the ModR/M byte can specify a register instead of a memory location. In that case, what does the LEA instruction do? Obviously it can't return the address of a register, but it needs to return something.

The behavior of LEA is explained by how the 8086 handles the ModR/M byte. Before running the microcode corresponding to the instruction, the microcode engine calls a short micro-subroutine for the particular addressing mode. This micro-subroutine puts the desired memory address (the effective address) into the tmpA register. The effective address is copied to the IND (Indirect) register and the value is loaded from memory if needed. On the other hand, if the ModR/M byte specified a register instead of memory, no micro-subroutine is called. (I explain ModR/M handling in more detail in this article.)

The microcode for LEA itself is just one line. It stores the effective address in the IND register into the specified destination register, indicated by N. This assumes that the appropriate ModR/M micro-subroutine was called before this code, putting the effective address into IND.

IND → N   RNI  LEA: store IND register in destination, done

But if a register was specified instead of a memory location, no ModR/M micro-subroutine gets called. Instead, the LEA instruction will return whatever value was left in IND from before, typically the previous memory location that was accessed. Thus, LEA can be used to read the value of the IND register, which is normally hidden from the programmer.

LDS reg, LES reg

The LDS and LES instructions load a far pointer from memory into the specified segment register and general-purpose register. The microcode below assumes that the appropriate ModR/M micro-subroutine has set up IND and read the first value into OPR. The microcode updates the destination register, increments IND by 2, reads the next value, and updates DS. (The microcode for LES is a copy of this, but updates ES.)

OPR → N               LDS: Copy OPR to dest register
IND → tmpC  INC2 tmpC Set up incrementing IND by 2
Σ → IND     R DS,P0   Update IND, read next location
OPR → DS    RNI       Update DS

If the LDS instruction specifies a register instead of memory, a micro-subroutine will not be called, so IND and OPR will have values from a previous instruction. OPR will be stored in the destination register, while the DS value will be read from the address IND+2. Thus, these instructions provide a mechanism to access the hidden OPR register.

JMP FAR rm

The JMP FAR rm instruction normally jumps to the far address stored in memory at the location indicated by the ModR/M byte. (That is, the ModR/M byte indicates where the new PC and CS values are stored.) But, as with LEA, the behavior is undocumented if the ModR/M byte specifies a register, since a register doesn't hold a four-byte value.

The microcode explains what happens. As with LEA, the code expects a micro-subroutine to put the address into the IND register. In this case, the micro-subroutine also loads the value at that address (i.e. the destination PC) into tmpB. The microcode increments IND by 2 to point to the CS word in memory and reads that into CS. Meanwhile, it updates the PC with tmpB. It suspends prefetching and flushes the queue, so instruction fetching will restart at the new address.

IND → tmpC  INC2 tmpC   JMP FAR rm: set up to add 2 to IND
Σ → IND     SUSP        Update IND, suspend prefetching
tmpB → PC   R DS,P0     Update PC with tmpB. Read new CS from specified address
OPR → CS    FLUSH RNI   Update CS, flush queue, done

If you specify a register instead of memory, the micro-subroutine won't get called. Instead, the program counter will be loaded with whatever value was in tmpB and the CS segment register will be loaded from the memory location two bytes after the location that IND was referencing. Thus, this undocumented use of the instruction gives access to the otherwise-hidden tmpB register.

The end of undocumented instructions

Microprocessor manufacturers soon realized that undocumented instructions were a problem, since programmers find them and often use them. This creates an issue for future processors, or even revisions of the current processor: if you eliminate an undocumented instruction, previously-working code that used the instruction will break, and it will seem like the new processor is faulty.

The solution was for processors to detect undocumented instructions and prevent them from executing. By the early 1980s, processors had enough transistors (thanks to Moore's law) that they could include the circuitry to block unsupported instructions. In particular, the 80186/80188 and the 80286 generated a trap of type 6 when an unused opcode was executed, blocking use of the instruction.9 This trap is also known as #UD (Undefined instruction trap).10

Conclusions

The 8086, like many early microprocessors, has undocumented instructions but no traps to stop them from executing.11 For the 8086, these fall into several categories. Many undocumented instructions simply mirror existing instructions. Some instructions are implemented but not documented for one reason or another, such as SALC and POP CS. Other instructions can be used outside their normal range, such as AAM and AAD. Some instructions are intended to work only with a memory address, so specifying a register can have strange effects such as revealing the values of the hidden IND and OPR registers.

Keep in mind that my analysis is based on transistor-level simulation and examining the microcode; I haven't verified the behavior on a physical 8086 processor. Please let me know if you see any errors in my analysis or undocumented instructions that I have overlooked. Also note that the behavior could change between different versions of the 8086; in particular, some versions by different manufacturers (such as the NEC V20 and V30) are known to be different.

I plan to write more about the 8086, so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected] and Bluesky as @righto.com so you can follow me there too.

Notes and references

  1. The 6502 processor, for instance, has illegal instructions with various effects, including causing the processor to hang. The article How MOS 6502 illegal opcodes really work describes in detail how the instruction decoding results in various illegal opcodes. Some of these opcodes put the internal bus into a floating state, so the behavior is electrically unpredictable. 

  2. The 8086 used up almost all the single-byte opcodes, which made it difficult to extend the instruction set. Most of the new instructions for the 386 or later are multi-byte opcodes, either using 0F as a prefix or reusing the earlier REP prefix (F3). Thus, the x86 instruction set is less efficient than it could be, since many single-byte opcodes were "wasted" on hardly-used instructions such as BCD arithmetic, forcing newer instructions to be multi-byte. 

  3. For details on the "magic instruction" hidden in the 8086 microcode, see NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?, page 49. I haven't found anything stating that SALC was the hidden instruction, but this is the only undocumented instruction that makes sense as something deliberately put into the microcode. The court case is complicated since NEC had a licensing agreement with Intel, so I'm skipping lots of details. See NEC v. Intel: Breaking new ground in the law of copyright for more. 

  4. The microcode listings are based on Andrew Jenner's disassembly. I have made some modifications to (hopefully) make it easier to understand. 

  5. Specifying the instruction through the ModR/M reg field may seem a bit random, but there's a reason for this. A typical instruction such as ADD has two arguments specified by the ModR/M byte. But other instructions such as shift instructions or NOT only take one argument. For these instructions, the ModR/M reg field would be wasted if it specified a register. Thus, using the reg field to specify instructions that only use one argument makes the instruction set more efficient. 

  6. Note that "normal" ALU operations are specified by bits 5-3 of the instruction; in order these are ADD, OR, ADC, SBB, AND, SUB, XOR, and CMP. These are exactly the same ALU operations that the "Immed" group performs, specified by bits 5-3 of the second byte. This illustrates how the same operation selection mechanism (the X register) is used in both cases. Bit 6 of the instruction switches between the set of arithmetic/logic instructions and the set of shift/rotate instructions. 

  7. As far as I can tell, SETMO isn't used by the microcode. Thus, I think that SETMO wasn't deliberately implemented in the ALU, but is a consequence of how the ALU's control logic is implemented. That is, all the even entries are left shifts and the odd entries are right shifts, so operation 6 activates the left-shift circuitry. But it doesn't match a specific left shift operation, so the ALU doesn't get configured for a "real" left shift. In other words, the behavior of this instruction is due to how the ALU handles a case that it wasn't specifically designed to handle.

    This function is implemented in the ALU somewhat like a left shift. However, instead of passing each input bit to the left, the bit from the right is passed to the left. That is, the input to bit 0 is shifted left to all of the bits of the result. By setting this bit to 1, all bits of the result are set, yielding the minus-1 result. 

  8. This footnote provides some references for the "Immed" opcodes. The 8086 datasheet has an opcode map showing opcodes 80 through 83 as valid. However, in the listings of individual instructions it only shows 80 and 81 for logical instructions (i.e. bit 1 must be 0), while it shows 80-83 for arithmetic instructions. The modern Intel 64 and IA-32 Architectures Software Developer's Manual is also contradictory. Looking at the instruction reference for AND (Vol 2A 3-78), for instance, shows opcodes 80, 81, and 83, explicitly labeling 83 as sign-extended. But the opcode map (Table A-2 Vol 2D A-7) shows 80-83 as defined except for 82 in 64-bit mode. The instruction bit diagram (Table B-13 Vol 2D B-7) shows 80-83 valid for the arithmetic and logical instructions. 

  9. The 80286 was more thorough about detecting undefined opcodes than the 80186, even taking into account the differences in instruction set. The 80186 generates a trap when 0F, 63-67, F1, or FFFF is executed. The 80286 generates invalid opcode exception number 6 (#UD) on any undefined opcode, handling the following cases:

    • The first byte of an instruction is completely invalid (e.g., 64H).
    • The first byte indicates a 2-byte opcode and the second byte is invalid (e.g., 0F followed by 0FFH).
    • An invalid register is used with an otherwise valid opcode (e.g., MOV CS,AX).
    • An invalid opcode extension is given in the REG field of the ModR/M byte (e.g., 0F6H /1).
    • A register operand is given in an instruction that requires a memory operand (e.g., LGDT AX).
     

  10. In modern x86 processors, most undocumented instructions cause faults. However, there are still a few undocumented instructions that don't fault. These may be for internal use or corner cases of documented instructions. For details, see Breaking the x86 Instruction Set, a video from Black Hat 2017. 

  11. Several sources have discussed undocumented 8086 opcodes before. The article Undocumented 8086 Opcodes describes undocumented opcodes in detail. Wikipedia has a list of undocumented x86 instructions. The book Undocumented PC discusses undocumented instructions in the 8086 and later processors. This StackExchange Retrocomputing post describes undocumented instructions. These Hacker News comments discuss some undocumented instructions. There are other sources with more myth than fact, claiming that the 8086 treats undocumented instructions as NOPs, for instance. 

The Group Decode ROM: The 8086 processor's first step of instruction decoding

A key component of any processor is instruction decoding: analyzing a numeric opcode and figuring out what actions need to be taken. The Intel 8086 processor (1978) has a complex instruction set, making instruction decoding a challenge. The first step in decoding an 8086 instruction is something called the Group Decode ROM, which categorizes instructions into about 35 types that control how the instruction is decoded and executed. For instance, the Group Decode ROM determines if an instruction is executed in hardware or in microcode. It also indicates how the instruction is structured: if the instruction has a bit specifying a byte or word operation, if the instruction has a byte that specifies the addressing mode, and so forth.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The diagram above shows the position of the Group Decode ROM on the silicon die, as well as other key functional blocks. The 8086 chip is partitioned into a Bus Interface Unit that communicates with external components such as memory, and the Execution Unit that executes instructions. Machine instructions are fetched from memory by the Bus Interface Unit and stored in the prefetch queue registers, which hold 6 bytes of instructions. To execute an instruction, the queue bus transfers an instruction byte from the prefetch queue to the instruction register, under control of a state machine called the Loader. Next, the Group Decode ROM categorizes the instruction according to its structure. In most cases, the machine instruction is implemented in low-level microcode. The instruction byte is transferred to the Microcode Address Register, where the Microcode Address Decoder selects the appropriate microcode routine that implements the instruction. The microcode provides the micro-instructions that control the Arithmetic/Logic Unit (ALU), registers, and other components to execute the instruction.

In this blog post, I will focus on a small part of this process: how the Group Decode ROM decodes instructions. Be warned that this post gets down into the weeds, so you might want to start with one of my higher-level posts, such as how the 8086's microcode engine works.

Microcode

Most instructions in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors have another layer of software underneath: microcode. With microcode, instead of building the CPU's control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode.

Microcode is only used if the Group Decode ROM indicates that the instruction is implemented in microcode. In that case, the microcode address register is loaded with the instruction and the address decoder selects the appropriate microcode routine. However, there's a complication. If the second byte of the instruction is a Mod R/M byte, the Group Decode ROM indicates this and causes a memory addressing micro-subroutine to be called.

Some simple instructions are implemented entirely in hardware and don't use microcode. These are known as 1-byte logic instructions (1BL) and are also indicated by the Group Decode ROM.

The Group Decode ROM's structure

The Group Decode ROM takes an 8-bit instruction as input, along with an interrupt signal. It produces 15 outputs that control how the instruction is handled. In this section I'll discuss the physical implementation of the Group Decode ROM; the various outputs are discussed in a later section.

Although the Group Decode ROM is called a ROM, its implementation is really a PLA (Programmable Logic Array), two levels of highly-structured logic gates.1 The idea of a PLA is to create two levels of NOR gates, each in a grid. This structure has the advantages that it implements the logic densely and is easy to modify. Although physically two levels of NOR gates, a PLA can be thought of as an AND layer followed by an OR layer. The AND layer matches particular bit patterns and then the OR layer combines multiple values from the first layer to produce arbitrary outputs.

The Group Decode ROM. This photo shows the metal layer on top of the die.

The Group Decode ROM. This photo shows the metal layer on top of the die.

Since the output values are highly structured, a PLA implementation is considerably more efficient than a ROM, since in a sense it combines multiple entries. In the case of the Group Decode ROM, using a ROM structure would require 256 columns (one for each 8-bit instruction pattern), while the PLA implementation requires just 36 columns, about 1/7 the size.

The diagram below shows how one column of the Group Decode ROM is wired in the "AND" plane. In this die photo, I removed the metal layer with acid to reveal the polysilicon and silicon underneath. The vertical lines show where the metal line for ground and the column output had been. The basic idea is that each column implements a NOR gate, with a subset of the input lines selected as inputs to the gate. The pull-up resistor at the top pulls the column line high by default. But if any of the selected inputs are high, the corresponding transistor turns on, connecting the column line to ground and pulling it low. Thus, this implements a NOR gate. However, it is more useful to think of it as an AND of the complemented inputs (via De Morgan's Law): if all the inputs are "correct", the output is high. In this way, each column matches a particular bit pattern.

Closeup of a column in the Group Decode ROM.

Closeup of a column in the Group Decode ROM.

The structure of the ROM is implemented through the silicon doping pattern, which is visible above. A transistor is formed where a polysilicon wire crosses a doped silicon region: the polysilicon forms the gate, turning the transistor on or off. At each intersection point, a transistor can be created or not, depending on the doping pattern. If a particular transistor is created, then the corresponding input must be 0 to produce a high output.

At the top of the diagram above, the column outputs are switched from the metal layer to polysilicon wires and become the inputs to the upper "OR" plane. This plane is implemented in a similar fashion as a grid of NOR gates. The plane is rotated 90 degrees, with the inputs vertical and each row forming an output.
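
To make the two-plane structure concrete, here is a minimal sketch in Python of how a PLA-style decoder operates. The column patterns and output groupings are illustrative examples taken from instruction encodings discussed in this post; they are not the actual contents of the Group Decode ROM's planes.

def match(byte, pattern):
    # "AND"-plane column: does the byte match a pattern with don't-care (X) bits?
    for bit, p in zip(f"{byte:08b}", pattern):
        if p != "X" and p != bit:
            return False
    return True

# "AND" plane: each column matches one bit pattern (illustrative examples only).
columns = {
    "alu_reg":    "00XXX0XX",   # register/memory ALU operations
    "mov_imm":    "1011XXXX",   # MOV immediate to register
    "seg_prefix": "001XX110",   # segment override prefix
}

# "OR" plane: each output row ORs together a subset of the columns.
outputs = {
    "uses_microcode": ["alu_reg", "mov_imm"],
    "is_prefix":      ["seg_prefix"],
}

def group_decode(byte):
    active = {name for name, pat in columns.items() if match(byte, pat)}
    return {out: bool(active & set(cols)) for out, cols in outputs.items()}

print(group_decode(0x03))   # ADD: uses_microcode=True, is_prefix=False
print(group_decode(0x26))   # ES segment prefix: is_prefix=True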

Intermediate decoding in the Group Decode ROM

The first plane of the Group Decode ROM categorizes instructions into 36 types based on the instruction bit pattern.2 The table below shows the 256 instruction values, colored according to their categorization.3 For instance, the first blue block consists of the 32 ALU instructions corresponding to the bit pattern 00XXX0XX, where X indicates that the bit can be 0 or 1. These instructions are all decoded and executed in a similar way. Almost all instructions have a single category, that is, they activate a single column line in the Group Decode ROM. However, a few instructions activate two lines and have two colors below.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Grid of 8086 instructions, colored according to the first level of the Group Decode Rom.

Note that the instructions do not have arbitrary numeric opcodes, but are assigned in a way that makes decoding simpler. Because these blocks correspond to bit patterns, there is little flexibility. One of the challenges of instruction set design for early microprocessors was to assign numeric values to the opcodes in a way that made decoding straightforward. It's a bit like a jigsaw puzzle, fitting the instructions into the 256 available values, while making them easy to decode.

Outputs from the Group Decode ROM

The Group Decode ROM has 15 outputs, one for each row of the upper half. In this section, I'll briefly discuss these outputs and their roles in the 8086. For an interactive exploration of these signals, see this page, which shows the outputs that are triggered by each instruction.

Out 0 indicates an IN or OUT instruction. This signal controls the M/IO (S2) status line, which distinguishes between a memory read/write and an I/O read/write. Apart from this, memory and I/O accesses are basically the same.

Out 1 indicates (inverted) that the instruction has a Mod R/M byte and performs a read/modify/write on its argument. This signal is used by the Translation ROM when dispatching an address handler (details). (This signal distinguishes between, say, ADD [BX],AX and MOV [BX],AX. The former both reads and writes [BX], while the latter only writes to it.)

Out 2 indicates a "group 3/4/5" opcode, an instruction where the second byte specifies the particular instruction, and thus decoding needs to wait for the second byte. This controls the loading of the microcode address register.

Out 3 indicates an instruction prefix (segment, LOCK, or REP). This causes the next byte to be decoded as a new instruction, while blocking interrupt handling.

Out 4 indicates (inverted) a two-byte ROM instruction (2BR), i.e. an instruction that is handled by the microcode ROM but requires the second byte for decoding. This is an instruction with a Mod R/M byte. This signal controls the loader, indicating that it needs to fetch the second byte. This signal is almost the same as output 1, with a few differences.

Out 5 specifies the top bit for an ALU operation. The 8086 uses a 5-bit field to specify an ALU operation. If not specified explicitly by the microcode, the field uses bits 5 through 3 of the opcode. (These bits distinguish, say, an ADD instruction from AND or SUB.) This control line sets the top bit of the ALU field for instructions such as DAA, DAS, AAA, AAS, INC, and DEC that fall into a different set from the "regular" ALU instructions.

Out 6 indicates an instruction that sets or clears a condition code directly: CLC, STC, CLI, STI, CLD, or STD (but not CMC). This signal is used by the flag circuitry to update the condition code.

Out 7 indicates an instruction that uses the AL or AX register, depending on the instruction's size bit. (For instance MOVSB vs MOVSW.) This signal is used by the register selection circuitry, the M register specifically.

Out 8 indicates a MOV instruction that uses a segment register. This signal is used by the register selection circuitry, the N register specifically.

Out 9 indicates the instruction has a d bit, where bit 1 of the instruction swaps the source and destination. This signal is used by the register selection circuitry, swapping the roles of the M and N registers according to the d bit.

Out 10 indicates a one-byte logic (1BL) instruction, a one-byte instruction that is implemented in logic, not microcode. These instructions are the prefixes, HLT, and the condition-code instructions. This signal controls the loader, causing it to move to the next instruction.

Out 11 indicates instructions where bit 0 is the byte/word indicator. This signal controls the register handling and the ALU functionality.

Out 12 indicates an instruction that operates only on a byte: DAA, DAS, AAA, AAS, AAM, AAD, and XLAT. This signal operates in conjunction with the previous output to select a byte versus word.

Out 13 forces the instruction to use a byte argument if instruction bit 1 is set, overriding the regular byte/word pattern. Specifically, it forces the L8 (length 8 bits) condition for the JMP direct-within-segment and the ALU instructions that are immediate with sign extension (details).

Out 14 allows a carry update. Since this output is inactive for the INC and DEC operations, those instructions leave the carry flag unchanged. This signal is used by the flag circuitry.

Columns

Most of the Group Decode ROM's column signals are used to derive the outputs listed above. However, some column outputs are also used as control signals directly. These are listed below.

Column 10 indicates an immediate MOV instruction. These instructions use instruction bit 3 (rather than bit 1) to select byte versus word, because the three low bits specify the register. This signal affects the L8 condition described earlier and also causes the M register selection to be converted from a word register to a byte register if necessary.

Column 12 indicates an instruction with bits 5-3 specifying the ALU instruction. This signal causes the X register to be loaded with the bits in the instruction that specify the ALU operation. (To be precise, this signal prevents the X register from being reloaded from the second instruction byte.)

Column 13 indicates the CMC (Complement Carry) instruction. This signal is used by the flags circuitry to complement the carry flag (details).

Column 14 indicates the HLT (Halt) instruction. This signal stops instruction processing by blocking the instruction queue.

Column 31 indicates a REP prefix. This signal causes the REPZ/NZ latch to be loaded with instruction bit 0 to indicate if the prefix is REPNZ or REPZ. It also sets the REP latch.

Column 32 indicates a segment prefix. This signal loads the segment latches with the desired segment type.

Column 33 indicates a LOCK prefix. It sets the LOCK latch, locking the bus.

Column 34 indicates a CLI instruction. This signal immediately blocks interrupt handling to avoid an interrupt between the CLI instruction and when the interrupt flag bit is cleared.
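
To illustrate how the prefix-related columns (31 through 33) drive the latches described above, here is a small Python sketch of a decoding loop. The opcode values are the documented 8086 prefix encodings; the function and latch names are my own, and the code is only an illustration, not the hardware's behavior.

def decode_prefixes(code):
    # Consume prefix bytes, setting latches like columns 31-33, and return
    # (latches, first non-prefix opcode byte).
    latches = {"rep": False, "repz": False, "lock": False, "segment": None}
    segments = {0x26: "ES", 0x2E: "CS", 0x36: "SS", 0x3E: "DS"}
    i = 0
    while True:
        byte = code[i]
        if byte in (0xF2, 0xF3):              # column 31: REP prefix
            latches["rep"] = True
            latches["repz"] = bool(byte & 1)  # bit 0 distinguishes REPNZ/REPZ
        elif byte in segments:                # column 32: segment prefix
            latches["segment"] = segments[byte]
        elif byte == 0xF0:                    # column 33: LOCK prefix
            latches["lock"] = True
        else:
            return latches, byte
        i += 1

print(decode_prefixes(bytes([0xF3, 0x26, 0xA4])))   # REP ES: MOVSB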

Timing

One important aspect of the Group Decode ROM is that its outputs are not instantaneous. It takes a clock cycle to get the outputs from the Group Decode ROM. In particular, when instruction decoding starts, the timing signal FC (First Clock) is activated to indicate the first clock cycle. However, the Group Decode ROM's outputs are not available until the Second Clock SC.

One consequence of this is that even the simplest instruction (such as a flag operation) takes two clock cycles, as does a prefix. The problem is that even though the instruction could be performed in one clock cycle, it takes two clock cycles for the Group Decode ROM to determine that the instruction only needs one cycle. This illustrates how a complex instruction format impacts performance.

The FC and SC timing signals are generated by a state machine called the Loader. These signals may seem trivial, but there are a few complications. First, the prefetch queue may run empty, in which case the FC and/or SC signal is delayed until the prefetch queue has a byte available. Second, to increase performance, the 8086 can start decoding an instruction during the last clock cycle of the previous instruction. Thus, if the microcode indicates that there is one cycle left, the Loader can proceed with the next instruction. Likewise, for a one-byte instruction implemented in hardware (one-byte logic or 1BL), the loader proceeds as soon as possible.

The diagram below shows the timing of an ADD instruction. Each line is half of a clock cycle. Execution is pipelined: the instruction is fetched during the first clock cycle (First Clock). During Second Clock, the Group Decode ROM produces its output. The microcode address register also generates the micro-address for the instruction's microcode. The microcode ROM supplies a micro-instruction during the third clock cycle and execution of the micro-instruction takes place during the fourth clock cycle.

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

This diagram shows the execution of an ADD instruction and what is happening in various parts of the 8086. The arrows show the flow from step to step. The character µ is short for "micro".

The Group Decode ROM's outputs during Second Clock control the decoding. Most importantly, the ADD imm instruction uses microcode; it is not a one-byte logic instruction (1BL). Moreover, it does not have a Mod R/M byte, so it does not need two bytes for decoding (2BR). For a 1BL instruction, microcode execution would be blocked and the next instruction would be immediately fetched. On the other hand, for a 2BR instruction, the loader would tell the prefetch queue that it was done with the second byte during the second half of Second Clock. Microcode execution would be blocked during the third cycle and the fourth cycle would execute a microcode subroutine to determine the memory address.

For more details, see my article on the 8086 pipeline.

Interrupts

The Group Decode ROM takes the 8 bits of the instruction as inputs, but it has an additional input indicating that an interrupt is being handled. This signal blocks most of the Group Decode ROM outputs. This prevents the current instruction's outputs from interfering with interrupt handling. I wrote about the 8086's interrupt handling in detail here, so I won't go into more detail in this post.

Conclusions

The Group Decode ROM indicates one of the key differences between CISC processors (Complex Instruction Set Computer) such as the 8086 and the RISC processors (Reduced Instruction Set Computer) that became popular a few years later. A RISC instruction set is designed to make instruction decoding very easy, with a small number of uniform instruction forms. On the other hand, the 8086's CISC instruction set was designed for compactness and high code density. As a result, instructions are squeezed into the available opcode space. Although there is a lot of structure to the 8086 opcodes, this structure is full of special cases and any patterns only apply to a subset of the instructions. The Group Decode ROM brings some order to this chaotic jumble of instructions, and the number of outputs from the Group Decode ROM is a measure of the instruction set's complexity.

The 8086's instruction set was extended over the decades to become the x86 instruction set in use today. During that time, more layers of complexity were added to the instruction set. Now, an x86 instruction can be up to 15 bytes long with multiple prefixes. Some prefixes change the register encoding or indicate a completely different instruction set such as VEX (Vector Extensions) or SSE (Streaming SIMD Extensions). Thus, x86 instruction decoding is very difficult, especially when trying to decode multiple instructions in parallel. This has an impact in modern systems, where x86 processors typically have 4 complex instruction decoders while Apple's ARM processors have 8 simpler decoders; this is said to give Apple a performance benefit. Thus, architectural decisions from 45 years ago are still impacting the performance of modern processors.

I've written numerous posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. Thanks to Arjan Holscher for suggesting this topic.

Notes and references

  1. You might wonder what the difference is between a ROM and a PLA. Both of them produce arbitrary outputs for a set of inputs. Moreover, you can replace a PLA with a ROM or vice versa. Typically a ROM has all the input combinations decoded, so it has a separate row for each input value, i.e. 2^N rows. So you can think of a ROM as a fully-decoded PLA.

    Some ROMs are partially decoded, allowing identical rows to be combined and reducing the size of the ROM. This technique is used in the 8086 microcode, for instance. A partially-decoded ROM is fairly similar to a PLA, but the technical distinction is that a ROM has only one output row active at a time, while a PLA can have multiple output rows active and the results are OR'd together. (This definition is from The Architecture of Microprocessors p117.)

    The Group Decode ROM, however, has a few cases where multiple rows are active at the same time (for instance the segment register POP instructions). Thus, the Group Decode ROM is technically a PLA and not a ROM. This distinction isn't particularly important, but you might find it interesting. 

  2. The Group Decode ROM has 38 columns, but two columns (11 and 35) are unused. Presumably, these were provided as spares in case a bug fix or modification required additional decoding. 

  3. Like the 8008 and 8080, the 8086's instruction set was designed around a 3-bit octal structure. Thus, the 8086 instruction set makes much more sense if viewed in octal instead of hexadecimal. The table below shows the instructions with an octal organization. Each 8×8 block uses the two low octal digits, while the four large blocks are positioned according to the top octal digit (labeled). As you can see, the instruction set has a lot of structure that is obscured in the usual hexadecimal table.

    The 8086 instruction set, put in a table according to the octal opcode value.

    The 8086 instruction set, put in a table according to the octal opcode value.

    For details on the octal structure of the 8086 instruction set, see The 80x86 is an Octal Machine

Reverse-engineering the multiplication algorithm in the Intel 8086 processor

While programmers today take multiplication for granted, most microprocessors in the 1970s could only add and subtract — multiplication required a slow and tedious loop implemented in assembly code.1 One of the nice features of the Intel 8086 processor (1978) was that it provided machine instructions for multiplication,2 able to multiply 8-bit or 16-bit numbers with a single instruction. Internally, the 8086 still performed a loop, but the loop was implemented in microcode: faster and transparent to the programmer. Even so, multiplication was a slow operation, about 24 to 30 times slower than addition.

In this blog post, I explain the multiplication process inside the 8086, analyze the microcode that it used, and discuss the hardware circuitry that helped it out.3 My analysis is based on reverse-engineering the 8086 from die photos. The die photo below shows the chip under a microscope. I've labeled the key functional blocks; the ones that are important to this post are darker. At the left, the ALU (Arithmetic/Logic Unit) performs the arithmetic operations at the heart of multiplication: addition and shifts. Multiplication also uses a few other hardware features: the X register, the F1 flag, and a loop counter. The microcode ROM at the lower right controls the process.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip with the metal and polysilicon removed, revealing the silicon underneath. Click on this image (or any other) for a larger version.

Microcode

The multiplication routines in the 8086 are implemented in microcode. Most people think of machine instructions as the basic steps that a computer performs. However, many processors (including the 8086) have another layer of software underneath: microcode. With microcode, instead of building the control circuitry from complex logic gates, the control logic is largely replaced with code. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. This is especially useful for a machine instruction such as multiplication, which requires many steps in a loop.

A micro-instruction in the 8086 is encoded into 21 bits as shown below. Every micro-instruction has a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits depends on the type field and can be anything from an ALU operation to a memory read or write to a change of microcode control flow. Thus, an 8086 micro-instruction typically does two things in parallel: the move and the action. For more about 8086 microcode, see my microcode blog post.

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

The behavior of an ALU micro-operation is important for multiplication. The ALU has three temporary registers that are invisible to the programmer: tmpA, tmpB, and tmpC. An ALU operation takes its first argument from any temporary register, while the second argument always comes from tmpB. An ALU operation requires two micro-instructions. The first micro-instruction specifies the ALU operation and source register, configuring the ALU. For instance, ADD tmpA adds tmpA to the default second argument, tmpB. In the next micro-instruction (or a later one), the ALU result can be accessed through the Σ register and moved to another register.

Before I get into the microcode routines, I should explain two ALU operations that play a central role in multiplication: LRCY and RRCY, Left Rotate through Carry and Right Rotate through Carry. (These correspond to the RCL and RCR machine instructions, which rotate through carry left or right.) These operations shift the bits in a 16-bit word, similar to the << and >> bit-shift operations in high-level languages, but with an additional feature. Instead of discarding the bit on the end, that bit is moved into the carry flag (CF). Meanwhile, the bit formerly in the carry flag moves into the word. You can think of this as rotating the bits while treating the carry flag as a 17th bit of the word.

The left rotate through carry and right rotate through carry micro-instructions.

The left rotate through carry and right rotate through carry micro-instructions.

These shifts perform an important part of the multiplication process since shifting can be viewed as multiplying by two. LRCY also provides a convenient way to move the most-significant bit to the carry flag, where it can be tested for a conditional jump. (This is important because the top bit is used as the sign bit.) Similarly, RRCY provides access to the least significant bit, very important for the multiplication process. Another important property is that performing RRCY on an upper word and then RRCY on a lower word will perform a 32-bit shift, since the low bit of the upper word will be moved into the high bit of the lower word via the carry bit.
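
Here is a sketch of these two rotates in Python for 16-bit words (my own helper functions, not the microcode itself). The carry flag acts like a 17th bit, and chaining two RRCY operations produces the 32-bit right shift described above.

def lrcy(word, carry):
    # Left rotate through carry: returns (new_word, new_carry).
    new_carry = (word >> 15) & 1                  # top bit goes to carry
    new_word = ((word << 1) | carry) & 0xFFFF     # carry enters at bit 0
    return new_word, new_carry

def rrcy(word, carry):
    # Right rotate through carry: returns (new_word, new_carry).
    new_carry = word & 1                          # low bit goes to carry
    new_word = (word >> 1) | (carry << 15)        # carry enters at bit 15
    return new_word, new_carry

# Chaining RRCY across two words yields a 32-bit shift through carry:
hi, lo, cf = 0x8001, 0x0003, 0
hi, cf = rrcy(hi, cf)      # low bit of the upper word goes into the carry
lo, cf = rrcy(lo, cf)      # and from the carry into the top of the lower word
print(f"{hi:04x} {lo:04x} carry={cf}")   # 4000 8001 carry=1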

Binary multiplication

The shift-and-add method of multiplication (below) is similar to grade-school long multiplication, except it uses binary instead of decimal. In each row, the multiplicand is multiplied by one digit of the multiplier. (The multiplicand is the value that gets repeatedly added, and the multiplier controls how many times it gets added.) Successive rows are shifted left one digit. At the bottom, the rows are added together to yield the product. The example below shows how 6×5 is calculated in binary using long multiplication.

    0110
   ×0101
    0110
   0000
  0110
 0000
00011110

Binary long multiplication is much simpler than decimal multiplication: at each step, you're multiplying by 0 or 1. Thus, each row is either zero or the multiplicand appropriately shifted (0110 in this case). (Unlike decimal long multiplication, you don't need to know the multiplication table.) This simplifies the hardware implementation, since each step either adds the multiplicand or doesn't. In other words, each step tests a bit of the multiplier, starting with the low bit, to determine if an add should take place or not. This bit can be obtained by shifting the multiplier one bit to the right each step.
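
In code, this shift-and-add approach looks something like the sketch below (plain Python, shifting the multiplicand left; the 8086 instead shifts the sum right, as described next).

def shift_add_multiply(multiplicand, multiplier):
    product = 0
    while multiplier:
        if multiplier & 1:            # test the low bit of the multiplier
            product += multiplicand   # add the (shifted) multiplicand
        multiplicand <<= 1            # shift the multiplicand left each step
        multiplier >>= 1              # consume one multiplier bit
    return product

print(shift_add_multiply(6, 5))   # 30, matching the 0110 x 0101 example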

Although the diagram above shows the sum at the end, a real implementation performs the addition at each step of the loop, keeping a running total. Moreover, in the 8086, instead of shifting the multiplicand to the left during each step, the sum shifts to the right. (The result is the same but it makes the implementation easier.) Thus, multiplying 6×5 goes through the steps below.

  0101
 ×0110
 00000
 001010
 0011110
 00011110

Why would you shift the result to the right? There's a clever reason for this. Suppose you're multiplying two 16-bit numbers, which yields a 32-bit result. That requires four 16-bit words of storage if you use the straightforward approach. But if you look more closely, the first sum fits into 16 bits, and then you need one more bit at each step. Meanwhile, you're "using up" one bit of the multiplier at each step. So if you squeeze the sum and the multiplier together, you can fit them into two words. Shifting right accomplishes this, as the diagram below illustrates for 0xffff×0xf00f. The sum (blue) starts in a 16-bit register called tmpA, while the multiplier (green) is stored in the 16-bit tmpC register. In each step, they are both shifted right, so the sum gains one bit and the multiplier loses one bit. By the end, the sum takes up all 32 bits, split across both registers.

sum (tmpA)        multiplier (tmpC)
0000000000000000  1111000000001111
0111111111111111  1111100000000111
1011111111111111  0111110000000011
1101111111111111  0011111000000001
1110111111111111  0001111100000000
0111011111111111  1000111110000000
0011101111111111  1100011111000000
0001110111111111  1110001111100000
0000111011111111  1111000111110000
0000011101111111  1111100011111000
0000001110111111  1111110001111100
0000000111011111  1111111000111110
0000000011101111  1111111100011111
1000000001110111  0111111110001111
1100000000111011  0011111111000111
1110000000011101  0001111111100011
1111000000001110  0000111111110001

The multiplication microcode

The 8086 has four multiply instructions to handle signed and unsigned multiplication of byte and word operands. These machine instructions are implemented in microcode. I'll start by describing the unsigned word multiplication, which multiplies two 16-bit values and produces a 32-bit result. The source word is provided by either a register or memory. It is multiplied by AX, the accumulator register. The 32-bit result is returned in the DX and AX registers.

The microcode below is the main routine for word multiplication, both signed and unsigned. Each micro-instruction specifies a register move on the left, and an action on the right. The moves transfer words between the visible registers and the ALU's temporary registers, while the actions are mostly subroutine calls to other micro-routines.

  move        action
AX → tmpC   LRCY tmpC        iMUL rmw:
M → tmpB    CALL X0 PREIMUL   called for signed multiplication
            CALL CORX         the core routine
            CALL F1 NEGATE    called for negative result
            CALL X0 IMULCOF   called for signed multiplication
tmpC → AX   JMPS X0 7  
            CALL MULCOF       called for unsigned multiplication
tmpA → DX   RNI  

The microcode starts by moving one argument AX into the ALU's temporary C register and setting up the ALU to perform a Left Rotate through Carry on this register, in order to access the sign bit. Next, it moves the second argument M into the temporary B register; M references the register or memory specified in the second byte of the instruction, the "ModR/M" byte. For a signed multiply instruction, the PREIMUL micro-subroutine is called, but I'll skip that for now. (The X0 condition tests bit 3 of the instruction, which in this case distinguishes MUL from IMUL.) Next, the CORX subroutine is called, which is the heart of the multiplication.4 If the result needs to be negated (indicated by the F1 condition), the NEGATE micro-subroutine is called. For signed multiplication, IMULCOF is then called to set the carry and overflow flags, while MULCOF is called for unsigned multiplication. Meanwhile, the two halves of the result are moved from the temporary C and temporary A registers to the AX and DX registers. Finally, RNI runs the next machine instruction, ending the microcode routine.

CORX

The heart of the multiplication code is the CORX routine, which performs the multiplication loop, computing the product through shifts and adds. The first two lines set up the loop, initializing the sum (tmpA) to 0. The number of loops is controlled by a special-purpose loop counter. The MAXC micro-instruction initializes the counter to 7 or 15, for a byte or word multiply respectively. The first shift of tmpC is performed, putting the low bit into the carry flag.

The loop body performs the shift-and-add step. It tests the carry flag, the low bit of the multiplier. It skips over the ADD if there is no carry (NCY). Otherwise, tmpB is added to tmpA. (As tmpA gets shifted to the right, tmpB gets added to higher and higher positions in the result.) The tmpA and tmpC registers are rotated right. This also puts the next bit of the multiplier into the carry flag for the next cycle. The microcode jumps to the top of the loop if the counter is not zero (NCZ). Otherwise, the subroutine returns with the result in tmpA and tmpC.

ZERO → tmpA  RRCY tmpC   CORX: initialize right rotate
Σ → tmpC     MAXC          get rotate result, initialize counter to max value
             JMPS NCY 8  5: top of loop
             ADD tmpA     conditionally add
Σ → tmpA               F  sum to tmpA, update flags to get carry
             RRCY tmpA   8: 32-bit shift of tmpA/tmpC
Σ → tmpA     RRCY tmpC  
Σ → tmpC     JMPS NCZ 5   loop to 5 if counter is not 0
             RTN  
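
Here is a sketch of what CORX computes, written in Python. It is not cycle-accurate: the real microcode passes the multiplier bit through the carry flag using the rotates described earlier, while this sketch simply tests the low bit of tmpC directly. The register names follow the microcode; the function itself is my own illustration.

def corx(multiplier, multiplicand, bits=16):
    # Returns (tmpA, tmpC): the high and low halves of multiplier * multiplicand.
    mask = (1 << bits) - 1
    tmpA, tmpB, tmpC = 0, multiplicand, multiplier    # tmpA holds the running sum
    for _ in range(bits):                             # loop counter (MAXC/NCZ)
        carry = 0
        if tmpC & 1:                                  # next bit of the multiplier
            total = tmpA + tmpB                       # conditional ADD
            tmpA, carry = total & mask, total >> bits
        # 32-bit right shift across tmpA and tmpC (RRCY tmpA, then RRCY tmpC):
        low_bit = tmpA & 1
        tmpA = (tmpA >> 1) | (carry << (bits - 1))
        tmpC = (tmpC >> 1) | (low_bit << (bits - 1))
    return tmpA, tmpC

high, low = corx(0xF00F, 0xFFFF)
print(hex((high << 16) | low))    # 0xf00e0ff1, i.e. 0xffff * 0xf00f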

MULCOF

The last subroutine is MULCOF, which configures the carry and overflow flags. The 8086 uses the rule that if the upper half of the result is nonzero, the carry and overflow flags are set, otherwise they are cleared. The first two lines pass tmpA (the upper half of the result) through the ALU to set the zero flag for the conditional jump. As a side-effect, the other status flags will get set but these values are "undefined" in the documentation.6 If the test is nonzero, the carry and overflow flags are set (SCOF), otherwise they are cleared (CCOF).5 The SCOF and CCOF micro-operations were implemented solely for use by multiplication, illustrating how microcode can be designed around specific needs.

             PASS tmpA  MULCOF: pass tmpA through to test if zero
Σ → no dest  JMPS 12   F update flags

             JMPS Z 8   12: jump if zero
             SCOF RTN    otherwise set carry and overflow

             CCOF RTN   8: clear carry and overflow

8-bit multiplication

The 8086 has separate instructions for 8-bit multiplication. The process for 8-bit multiplication is similar to 16-bit multiplication, except the values are half as long and the shift-and-add loop executes 8 times instead of 16. As shown below, the 8-bit sum starts in the low half of the temporary A register and is shifted right into the top of tmpC. Meanwhile, the 8-bit multiplier starts in the low half of tmpC and is shifted out to the right. At the end, the result is split between tmpA and tmpC.

sum (tmpA)        multiplier (tmpC)
0000000000000000  0000000001010101
0000000001111111  1000000000101010
0000000000111111  1100000000010101
0000000010011111  0110000000001010
0000000001001111  1011000000000101
0000000010100111  0101100000000010
0000000001010011  1010110000000001
0000000010101001  0101011000000000
0000000001010100  1010101100000000

The 8086 supports many instructions with byte and word versions, using 8-bit or 16-bit arguments. In most cases, the byte and word instructions use the same microcode, with the ALU and register hardware using bytes or words based on the instruction. However, the byte- and word-multiply instructions use different registers, requiring microcode changes. In particular, the multiplier is in AL, the low half of the accumulator. At the end, the 16-bit result is returned in AX, the full 16-bit accumulator; two micro-instructions assemble the result from tmpC and tmpA into the two bytes of the accumulator, 'AL' and 'AH' respectively. Apart from those changes, the microcode is the same as the word multiply microcode discussed earlier.

AL → tmpC    LRCY tmpC         iMUL rmb:
M → tmpB     CALL X0 PREIMUL  
             CALL CORX  
             CALL F1 NEGATE  
             CALL X0 IMULCOF  
tmpC → AL    JMPS X0 7  
             CALL MULCOF  
tmpA → AH    RNI
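
As a rough sketch of the byte case, the loop below runs 8 times and the two halves of the product land in AH and AL. For simplicity this sketch keeps everything 8 bits wide; on the die the values live in the 16-bit temporary registers, as the table above shows. The function is my own illustration, not the microcode.

def mul8(al, operand):
    # 8-bit shift-and-add; tmpA ends up in AH and tmpC in AL (simplified sketch).
    tmpA, tmpC = 0, al
    for _ in range(8):                   # MAXC loads 7 for a byte multiply
        carry = 0
        if tmpC & 1:
            total = tmpA + operand
            tmpA, carry = total & 0xFF, total >> 8
        low_bit = tmpA & 1
        tmpA = (tmpA >> 1) | (carry << 7)
        tmpC = (tmpC >> 1) | (low_bit << 7)
    return (tmpA << 8) | tmpC            # assemble AX from AH:AL

print(hex(mul8(0x55, 0xFF)))             # 0x54ab, the product from the table above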

Signed multiplication

The 8086 (like most computers) represents signed numbers using a format called two's complement. While a regular byte holds a number from 0 to 255, a signed byte holds a number from -128 to 127. A negative number is formed by flipping all the bits (known as the one's complement) and then adding 1, yielding the two's complement value.7 For instance, +5 is 0x05 while -5 is 0xfb. (Note that the top bit of a number is set for a negative number; this is the sign bit.) The nice thing about two's complement numbers is that the same addition and subtraction operations work on both signed and unsigned values. Unfortunately, this is not the case for signed multiplication, since signed and unsigned values yield different results due to sign extension.

The 8086 has separate IMUL (Integer Multiply) instructions to perform signed multiplication. The 8086 performs signed multiplication by converting the arguments to positive values, performing unsigned multiplication, and then negating the result if necessary. As shown above, signed and unsigned multiplication both use the same microcode, but the microcode conditionally calls some subroutines for signed multiplication. I will discuss those micro-subroutines below.
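
The overall strategy can be sketched in a few lines of Python (my own illustration; the real work happens in the microcode subroutines discussed below).

def imul16(a, b):
    # Signed 16x16 multiply via unsigned multiply plus an F1-style sign flag.
    f1 = False                         # toggled once per negative argument
    if a & 0x8000:                     # PREIMUL step: make the arguments positive
        a = (-a) & 0xFFFF
        f1 = not f1
    if b & 0x8000:
        b = (-b) & 0xFFFF
        f1 = not f1
    product = a * b                    # stands in for the unsigned CORX loop
    if f1:                             # NEGATE step: two negatives cancel out
        product = (-product) & 0xFFFFFFFF
    return product

print(hex(imul16(0xFFFB, 0x0007)))     # -5 * 7 = -35 -> 0xffffffdd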

PREIMUL

The first subroutine for signed multiplication is PREIMUL, performing preliminary operations for integer multiplication. It converts the two arguments, stored in tmpC and tmpB, to positive values. It keeps track of the signs using an internal flag called F1, toggling this flag for a negative argument. This conveniently handles the rule that two negatives make a positive since complementing the F1 flag twice will clear it.

This microcode, below, illustrates the complexity of microcode and how micro-operations are carefully arranged to get the right values at the right time. The first micro-instruction performs one ALU operation and sets up a second operation. The calling code had set up the ALU to perform LRCY tmpC, so that's the result returned by Σ (and discarded). Performing a left rotate and discarding the result may seem pointless, but the important side-effect is that the top bit (i.e. the sign bit) ends up in the carry flag. The microcode does not have a conditional jump based on the sign, but has a conditional jump based on carry, so the point is to test if tmpC is negative. The first micro-instruction also sets up negation (NEG tmpC) for the next ALU operation.

Σ → no dest  NEG tmpC   PREIMUL: set up negation of tmpC
             JMPS NCY 7  jump if tmpC positive
Σ → tmpC     CF1         if negative, negate tmpC, flip F1
             JMPS 7      jump to shared code

             LRCY tmpB  7:
Σ → no dest  NEG tmpB    set up negation of tmpB
             JMPS NCY 11 jump if tmpB positive
Σ → tmpB     CF1 RTN     if negative, negate tmpB, flip F1
             RTN        11: return

For the remaining lines, if the carry is clear (NCY), the next two lines are skipped. Otherwise, the ALU result (Σ) is written to tmpC, making it positive, and the F1 flag is complemented with CF1. (The second short jump (JMPS) may look unnecessary, but I reordered the code for clarity.) The second half of the microcode performs a similar test on tmpB. If tmpB is negative, it is negated and F1 is toggled.

NEGATE

The microcode below is called after computing the result, if the result needs to be made negative. Negation is harder than you might expect because the result is split between the tmpA and tmpC registers. The two's complement operation (NEG) is applied to the low word, while either two's complement or one's complement (COM1) is applied to the upper word, depending on the carry, for mathematical reasons.8 The code also toggles F1 and makes tmpB positive; I think this code is only useful for division, which also uses the NEGATE subroutine.

             NEG tmpC   NEGATE: negate tmpC
Σ → tmpC     COM1 tmpA F maybe complement tmpA
             JMPS CY 6  
             NEG tmpA    negate tmpA if there's no carry
Σ → tmpA     CF1        6: toggle F1 for some reason

             LRCY tmpB  7: test sign of tmpB
Σ → no dest  NEG tmpB    maybe negate tmpB
             JMPS NCY 11 skip if tmpB positive
Σ → tmpB     CF1 RTN     else negate tmpB, toggle F1
             RTN        11: return
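
Footnote 8 explains the mathematics; as a sketch, the two-word negation that NEGATE applies to the result looks like the following (illustrative Python, not the microcode): the low word gets NEG, and the high word gets a one's complement if NEG set the carry (nonzero low word), otherwise a two's complement.

def negate32(high, low):
    # Negate a 32-bit value split across two 16-bit words.
    neg_low = (-low) & 0xFFFF
    carry = low != 0                   # 8086 NEG sets the carry for a nonzero operand
    if carry:
        neg_high = (~high) & 0xFFFF    # one's complement (COM1)
    else:
        neg_high = (-high) & 0xFFFF    # two's complement (NEG)
    return neg_high, neg_low

print([hex(x) for x in negate32(0x0000, 0x0101)])   # -257 -> ['0xffff', '0xfeff']
print([hex(x) for x in negate32(0x0001, 0x0000)])   # -65536 -> ['0xffff', '0x0']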

IMULCOF

The IMULCOF routine is similar to MULCOF, but the calculation is a bit trickier for a signed result. This routine sets the carry and overflow flags if the upper half of the result is significant, that is, it is not just the sign extension of the lower half.9 In other words, the top byte is not significant if it duplicates the top bit (the sign bit) of the lower byte. The trick in the microcode is to add the top bit of the lower byte to the upper byte by putting it in the carry flag and performing an add with carry (ADC) of 0. If the result is 0, the upper byte is not significant, handling the positive and negative cases. (This also holds for words instead of bytes.)

ZERO → tmpB  LRCY tmpC  IMULCOF: get top bit of tmpC
Σ → no dest  ADC tmpA    add to tmpA and 0 (tmpB)
Σ → no dest   F          update flags
             JMPS Z 8   12: jump if zero result
             SCOF RTN    otherwise set carry and overflow

             CCOF RTN   8: clear carry and overflow
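
The significance test itself can be sketched in a few lines (illustrative Python): add the top bit of the lower half to the upper half and check whether the result is zero, which is true exactly when the upper half is just the sign extension of the lower half.

def signed_overflow(high, low, bits=16):
    # True means IMULCOF would set the carry and overflow flags (SCOF).
    top_bit_of_low = (low >> (bits - 1)) & 1
    check = (high + top_bit_of_low) & ((1 << bits) - 1)   # the ADC-with-zero trick
    return check != 0

print(signed_overflow(0x0000, 0x0005))   # False: 0x00000005 is just +5
print(signed_overflow(0xFFFF, 0xFFFB))   # False: 0xfffffffb is just -5
print(signed_overflow(0x0000, 0x8000))   # True: +32768 does not fit in a signed word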

The hardware for multiplication

For the most part, the 8086 uses the regular ALU addition and shifts for the multiplication algorithm. Some special hardware features provide assistance.

Loop counter

The 8086 has a special 4-bit loop counter for multiplication. This counter starts at 7 for byte multiplication and 15 for word multiplication, based on the instruction. This loop counter allows the microcode to decrement the counter, test for the end, and perform a conditional branch in one micro-operation. The counter is implemented with four flip-flops, along with logic to compute the value after decrementing by one. The MAXC (Maximum Count) micro-instruction sets the counter to 7 or 15 for byte or word operations respectively. The NCZ (Not Counter Zero) micro-instruction has two actions. First, it performs a conditional jump if the counter is nonzero. Second, it decrements the counter.

X register

The multiplication microcode uses an internal register called the X register to distinguish between the MUL and IMUL instructions. The X register is a 3-bit register that holds the ALU opcode, indicated by bits 5–3 of the instruction.10 Since the instruction is held in the Instruction Register, you might wonder why a separate register is required. The motivation is that some opcodes specify the type of ALU operation in the second byte of the instruction, the ModR/M byte, bits 5–3.11 Since the ALU operation is sometimes specified in the first byte and sometimes in the second byte, the X register was added to handle both these cases.

For the most part, the X register indicates which of the eight standard ALU operations is selected (ADD, OR, ADC, SBB, AND, SUB, XOR, CMP). However, a few instructions use bit 0 of the X register to distinguish between other pairs of instructions. For instance, it distinguishes between MUL and IMUL, DIV and IDIV, CMPS and SCAS, MOVS and LODS, or AAA and AAS. While these instruction pairs may appear to have arbitrary opcodes, they have been carefully assigned. The microcode can test this bit using the X0 condition and perform conditional jumps.

The implementation of the X register is straightforward, consisting of three flip-flops to hold the three bits of the instruction. The flip-flops are loaded from the prefetch queue bus during First Clock and during Second Clock for appropriate instructions, as the instruction bytes travel over the bus. Testing bit 0 of the X register with the X0 condition is supported by the microcode condition evaluation circuitry, so it can be used for conditional jumps in the microcode.

The F1 flag

The multiplication microcode uses an internal flag called F1,12 which has two distinct uses. The flag keeps track of a REP prefix for use with a string operation. But the F1 flag is also used by signed multiplication and division to keep track of the sign. The F1 flag can be toggled by microcode through the CF1 (Complement F1) micro-instruction. The F1 flag is implemented with a flip-flop, along with a multiplexer to select the value. It is cleared when a new instruction starts, set by a REP prefix, and toggled by the CF1 micro-instruction.

The diagram below shows how the F1 latch and the loop counter appear on the die. In this image, the metal layer has been removed, showing the silicon and the polysilicon wiring underneath.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

The counter and F1 latch as they appear on the die. The latch for the REP state is also here.

Later advances in multiplication

The 8086 was pretty slow at multiplying compared to later Intel processors.13 The 8086 took up to 133 clock cycles to multiply unsigned 16-bit values due to the complicated microcode loops. By 1982, the Intel 286 processor cut this time down to 21 clock cycles. The Intel 486 (1989) used an improved algorithm that could end early, so multiplying by a small number could take just 9 cycles.

Although these optimizations improved performance, they still depended on looping over the bits. With the shift to 32-bit processors, the loop time became unwieldy. The solution was to replace the loop with hardware: instead of performing 32 shift-and-add loops, an array of adders could compute the multiplication in one step. This quantity of hardware was unreasonable in the 8086 era, but as Moore's law made transistors smaller and cheaper, hardware multiplication became practical. For instance, the Cyrix Cx486SLC (1992) had a 16-bit hardware multiplier that cut word multiply down to 3 cycles. The Intel Core 2 (2006) was even faster, able to complete a 32-bit multiplication every clock cycle.

Hardware multiplication is a fairly complicated subject, with many optimizations to maximize performance while minimizing hardware.14 Simply replacing the loop with a sequence of 32 adders is too slow because the result would be delayed while propagating through all the adders. The solution is to arrange the adders as a tree to provide parallelism. The first layer has 16 adders to add pairs of terms. The next layer adds pairs of these partial sums, and so forth. The resulting tree of adders is 5 layers deep rather than 32, reducing the time to compute the sum. Real multipliers achieve further performance improvements by splitting up the adders and creating a more complex tree: the venerable Wallace tree (1964) and Dadda multiplier (1965) are two popular approaches. Another optimization is the Booth algorithm (1951), which performs signed multiplication directly, without converting the arguments to positive values first. The Pentium 4 (2000) used a Booth encoder and a Wallace tree (ref), but research in the early 2000s found the Dadda tree is faster and it is now more popular.

Conclusions

Multiplication is much harder to compute than addition or subtraction. The 8086 processor hid this complexity from the programmer by providing four multiplication instructions for byte and word multiplication of signed or unsigned values. These instructions implemented multiplication in microcode, performing shifts and adds in a loop. By using microcode subroutines and conditional execution, these four machine instructions share most of the microcode. As the microcode capacity of the 8086 was very small, this was a critical feature of the implementation.

If you made it through all the discussion of microcode, congratulations! Microcode is even harder to understand than assembly code. Part of the problem is that microcode is very fine-grain, with even ALU operations split into multiple steps. Another complication is that 8086 microcode performs a register move and another operation in parallel, so it's hard to keep track of what's going on. Microcode can seem a bit like a jigsaw puzzle, with pieces carefully fit together as compactly as possible. I hope the explanations here made sense, or at least gave you a feel for how microcode operates.

I've written multiple posts on the 8086 so far and plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

  1. Mainframes going back to ENIAC had multiply and divide instructions. However, early microprocessors took a step back and didn't support these more complex operations. (My theory is that the decline in memory prices made it more cost-effective to implement multiply and divide in software than hardware.) The National Semiconductor IMP-16, a 16-bit bit-slice microprocessor from 1973, may be the first with multiply and divide instructions. The 8-bit Motorola 6809 processor (1978) included 8-bit multiplication but not division. I think the 8086 was the first Intel processor to support multiplication.

  2. The 8086 also supported division. Although the division instructions are similar to multiplication in many ways, I'm focusing on multiplication and ignoring division for this blog post. 

  3. My microcode analysis is based on Andrew Jenner's 8086 microcode disassembly

  4. I think CORX stands for Core Multiply and CORD stands for Core Divide

  5. The definitions of carry and overflow are different for multiplication compared to addition and subtraction. Note that the result of a multiplication operation will always fit in the available result space, which is twice as large as the arguments. For instance, the biggest value you can get by multiplying 16-bit values is 0xffff×0xffff=0xfffe0001 which fits into 32 bits. (Signed and 8-bit multiplications fit similarly.) This is in contrast to addition and subtraction, which can exceed their available space. A carry indicates that an addition exceeded its space when treated as unsigned, while an overflow indicates that an addition exceeded its space when treated as signed.

  6. The Intel documentation states that the sign, carry, overflow, and parity flags are undefined after the MUL operation, even though the microcode causes them to be computed. The meaning of "undefined" is that programmers shouldn't count on the flag values because Intel might change the behavior in later chips. This thread discusses the effects of MUL on the flags, and how the behavior is different on the NEC V20 chip. 

  7. It may be worth explaining why the two's complement of a number is defined by adding 1 to the one's complement. The one's complement of a number simply flips all the bits. If you take a byte value n, 0xff - n is the one's complement, since a 1 bit in n produces a 0 bit in the result.

    Now, suppose we want to represent -5 as a signed byte. Adding 0x100 will keep the same byte value with a carry out of the byte. But 0x100 - 5 = (1 + 0xff) - 5 = 1 + (0xff - 5) = 1 + (one's complement of 5). Thus, it makes sense mathematically to represent -5 by adding 1 to the one's complement of 5, and this holds for any value. 

  8. The negation code is a bit tricky because the result is split across two words. In most cases, the upper word is bitwise complemented. However, if the lower word is zero, then the upper word is negated (two's complement). I'll demonstrate with 16-bit values to keep the examples small. The number 257 (0x0101) is negated to form -257 (0xfeff). Note that the upper byte is the one's complement (0x01 vs 0xfe) while the lower byte is two's complement (0x01 vs 0xff). On the other hand, the number 256 (0x0100) is negated to form -256 (0xff00). In this case, the upper byte is the two's complement (0x01 vs 0xff) and the lower byte is also the two's complement (0x00 vs 0x00).

    (Mathematical explanation: the two's complement is formed by taking the one's complement and adding 1. In most cases, there won't be a carry from the low byte to the upper byte, so the upper byte will remain the one's complement. However, if the low byte is 0, the complement is 0xff and adding 1 will form a carry. Adding this carry to the upper byte yields the two's complement of that byte.)

    To support multi-word negation, the 8086's NEG instruction clears the carry flag if the operand is 0, and otherwise sets the carry flag. (This is the opposite from the above because subtractions (including NEG) treat the carry flag as a borrow flag, with the opposite meaning.) The microcode NEG operation has identical behavior to the machine instruction, since it is used to implement the machine instruction.

    Thus to perform a two-word negation, the microcode negates the low word (tmpC) and updates the flags (F). If the carry is set, the one's complement is applied to the upper word (tmpA). But if the carry is cleared, the two's complement is applied to tmpA. 

  9. The IMULCOF routine considers the upper half of the result significant if it is not the sign extension of the lower half. For instance, dropping the top byte of 0x0005 (+5) yields 0x05 (+5). Dropping the top byte of 0xfffb (-5) yields 0xfb (-5). Thus, the upper byte is not significant in these cases. Conversely, dropping the top byte of 0x00fb (+251) yields 0xfb (-5), so the upper byte is significant. 

  10. Curiously, the 8086 patent states that the X register is a 4-bit register holding bits 3–6 of the byte (col. 9, line 20). But looking at the die, it is a 3-bit register holding bits 3–5 of the byte. 

  11. Some instructions are specified by bits 5–3 in the ModR/M byte rather than in the first opcode byte. The motivation is to avoid wasting bits for instructions that use a ModR/M byte but don't need a register specification. For instance, consider the instruction ADD [BX],0x1234. This instruction uses a ModR/M byte to specify the memory address. However, because it uses an immediate operand, it does not need the register specification normally provided by bits 5–3 of the ModR/M byte. This frees up the bits to specify the instruction. From one perspective, this is an ugly hack, while from another perspective it is a clever optimization. 

  12. Andrew Jenner discusses the F1 flag and the interaction between REP and multiplication here

  13. Here are some detailed performance numbers. The 8086 processor takes 70–77 clock cycles to multiply 8-bit values and 118–133 clock cycles to multiply 16-bit values. Signed multiplies are a bit slower because of the sign calculations: 80–98 and 128–154 clock cycles respectively. The time is variable because of the conditional jumps in the multiplication process.

    The Intel 186 (1982) optimized multiplication slightly, bringing the register word multiply down to 35–37 cycles. The Intel 286 (also 1982) reduced this to 21 clocks. The 486 (1989) used a shift-add multiply function but it had an "early out" algorithm that stopped when the remaining bits were zero, so a 16-bit multiply could take from 9 to 22 clocks. The 8087 floating point coprocessor (1980) used radix-4 multiplication, multiplying by pairs of bits at a time and either adding or subtracting. This yields half the addition cycles. The Pentium's P5 micro-architecture (1993) took the unusual approach of reusing the floating-point unit's hardware multiplier for integer multiplication, taking 10 cycles for a 32-bit multiplication. 

  14. This presentation gives a good overview of implementations of multiplication in hardware.