Ken Shirriff's blog

Inside the 8086 processor's instruction prefetch circuitry

The groundbreaking 8086 microprocessor was introduced by Intel in 1978 and led to the x86 architecture that still dominates desktop and server computing. One way that the 8086 increased performance was by prefetching: the processor fetches instructions from memory before they are needed, so the processor can execute them without waiting on the (relatively slow) memory. I've been reverse-engineering the 8086 from die photos and this blog post discusses what I've uncovered about the prefetch circuitry.

The 8086 was introduced at an interesting point in microprocessor history, where memory was becoming slower than the CPU. For the first microprocessors, the speed of the CPU and the speed of memory were comparable.1 However, as processors became faster, the speed of memory failed to keep up. The 8086 was probably the first microprocessor to prefetch instructions to improve performance. While modern microprocessors have megabytes of fast cache2 to act as a buffer between the CPU and much-slower main memory, the 8086 has just 6 bytes of prefetch queue. However, this was enough to increase performance by about 50%.3

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; ones that are important to the prefetch queue are highlighted in red and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Prefetching and the architecture of the 8086

Prefetching had a major impact on the design of the 8086. Earlier processors such as the 6502, 8080, or Z80 were deterministic. The processor fetched an instruction, executed the instruction, fetched the next instruction, and so forth. Memory accesses corresponded directly to instruction fetching and execution and instructions took a predictable number of clock cycles. This all changed with the introduction of the prefetch queue. Memory operations became unlinked from instruction execution since prefetches happen as needed and when the memory bus is available.

Since memory operations and instruction execution happen independently, the implementors of the 8086 split the chip into two processing units: the Bus Interface Unit (BIU) that handles memory accesses, and the Execution Unit (EU) that executes instructions, as shown below.4 The Bus Interface Unit contains the 6-byte instruction prefetch queue; it supplies instructions to the Execution Unit via the Q (queue) bus. The adder (Σ) performs address calculation, adding the segment register base to an address offset, among other things. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The address adder and the ALU are independent arithmetic units. The segment registers (CS, DS, SS, ES) and the Instruction Pointer (IP) are in the Bus Interface Unit since they are directly involved in memory accesses, while the general-purpose registers are in the Execution Unit.

Block diagram of the 8086 processor. This diagram differs from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively. From The 8086 Family User's Manual.

The 8086's segment registers play an important part in this architecture, so I'll review them quickly. One of the challenges of the 8086 was how to support more than 64K of memory with 16-bit registers. The much-reviled solution was to create a 1-megabyte (20-bit) address space consisting of 64K segments, with segment registers indicating the start of each segment. Specifically, a memory address was specified by a 16-bit offset address along with a particular segment register selecting a segment (Code Segment, Data Segment, Stack Segment, or Extra Segment). The segment register's value was shifted by 4 bits to give the segment's 20-bit base address. The 16-bit offset address was added, yielding a 20-bit memory address. This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register.

It may seem inefficient for the Bus Interface Unit to have its own adder instead of using the ALU, but there are a couple of reasons for the separate adder. First, every memory access uses the adder at least once to add the segment base and offset. The adder is also used to increment the PC or index registers. Since these operations are so frequent, they would create a bottleneck if they used the ALU. Second, since the Execution Unit and the Bus Interface Unit run asynchronously with respect to each other, it would be complicated to share the ALU without causing delays and conflicts.

Prefetching had another major but little-known effect on the 8086 architecture: the designers were considering making the 8086 a two-chip microprocessor. Prefetching, however, required a one-chip design because the number of control signals required to synchronize prefetching across two chips exceeded the package pins available. This became a compelling argument for the one-chip design that was used for the 8086.3 (The unsuccessful Intel iAPX 432, which was under development at the same time, ended up being a two-chip processor: one to fetch and decode instructions, and one to execute them.)

Implementing the queue

The instruction prefetch queue is implemented with three 16-bit queue registers along with two hardware pointers that keep track of the current position in the queue. One two-bit counter keeps track of the current read position from 0 to 2, i.e. the queue register that will provide the next instruction. The second counter keeps track of the current write position, i.e. the queue register that will receive the next instruction from memory. As words are fetched from the queue, the read pointer advances. As words are added to the queue, the write pointer advances. Because the queue registers hold words, while the prefetch circuitry provides bytes, another flip-flop keeps track of whether the high byte or the low byte of the word is being used. I call this the HL flip-flop. This causes the low byte to be provided first and then the high byte (since the 8086 is little-endian).

The diagram below shows an example queue configuration with four bytes. The first two queue registers (Q0 and Q1) hold data. The read pointer and HL pointer indicate that the next prefetched byte will come from the low byte of Q0. The write pointer indicates that the next prefetched word will go into Q2.

A queue configuration with four bytes in the prefetch queue. Bytes in blue hold prefetched data.

The diagram below shows how the queue pointers can wrap around. In this configuration, one byte has been used from Q2 so the next byte will be Q2's high byte. Q0 holds the next prefetched word. The next word to be prefetched will be stored in Q1, as indicated by the write pointer.

A queue configuration with three bytes in the prefetch queue.

The relative positions of the write and read pointers indicate how much data is in the queue. If the write pointer is one position before the read pointer (modulo 3), the queue holds 3 or 4 bytes. If the write pointer is one position after the read pointer (modulo 3), the queue holds 1 or 2 bytes. But what about when the read pointer and write pointer indicate the same register? This can either indicate that the queue is empty or that the queue is full (5 or 6 bytes). To distinguish these cases, a flip-flop is set if the queue enters the empty state. This flip-flop generates a signal that Intel called MT (empty).

Another complication occurs if you jump to an odd address. Because of its 16-bit bus, the 8086 will prefetch a word from the even address one less. This loads one usable byte and one byte that needs to be discarded. The 8086 handles this by setting the HL flip-flop high, using a handful of gates to detect this case. As in the diagram above, the unwanted low byte will be skipped.

The diagram below zooms in on the prefetch and queue control circuitry on the die, with the main flip-flops and circuitry labeled. The lower half manages the queue, keeping track of the read and write positions and computing the queue length. The upper circuitry controls prefetch operations and interacts with the rest of the memory cycle circuitry.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

Even though there is not a lot of circuitry involved (about a dozen flip-flops and associated logic gates), this circuitry occupies a substantial part of the 8086 die. (The relatively small amount of circuitry did not make this easy to reverse-engineer, however!) Compared to modern chips, the density of the 8086 is very low; you can almost see the flip-flops with the naked eye. This diagram only shows the circuitry directly involved in prefetching. Additional circuitry is scattered through the memory cycle control circuitry to deal with prefetching, and the queue registers take up a substantial part of the register file. Thus, prefetching was a moderately expensive feature for the 8086, as far as die area.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the Bus Interface Unit, but this is not entirely straightforward. The main problem is that the queue can be empty, in which case instruction decoding must block until a byte is available from the queue. The second problem is that instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by providing synchronization between the prefetch queue and the instruction decoder. The loader uses a small state machine to efficiently fetch bytes from the queue at the right time and to provide timing signals to the decoder and microcode engine.

In more detail, as the loader requests the first two instruction bytes from the prefetch queue, it generates two timing signals that control the microcode execution. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the prefetch queue could be empty or contain just one byte, in which case the First Clock and/or Second Clock would be delayed. The instruction decoding circuitry and the microcode engine are controlled by the First Clock and Second Clock signals, so they remain synchronized with the bytes supplied by the prefetch queue.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, fetching and decoding the next instruction is a bit slow so microcode execution would be blocked for a cycle. In many cases, this slowdown can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction. This achieves a degree of pipelining in most cases; fetching the next instruction is overlapped with finishing the execution of the previous instruction.

The state machine for the 8086 "loader" circuit. The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode. From patent US4449184.

The diagram above shows the state machine for the loader. I won't explain it in detail, but essentially it keeps track of whether it is waiting for a First Clock byte or a Second Clock byte, and if it is performing a fetch in advance (NXT) or at the end of an instruction (RNI). The state machine is implemented with two flip-flops to support its four states.

Other memory accesses

The loader takes care of fetching an instruction that consists of an opcode byte and a Mod R/M (addressing mode) byte. However, many instructions have additional bytes or don't follow this format For example, an opcode such as "ADD AX" can be followed by an 8- or 16-bit immediate value, adding that value to the AX register. Or a "move memory to AX" instruction can be followed by a 16-bit memory address The microcode uses a separate mechanism for fetching these instruction bytes from the queue. Specifically, each micro-instruction contains a source register and a destination register that specify a data move. By specifying "Q" (the queue) as the source, a byte is fetched from the prefetch queue.

A third path is used for arbitrary memory reads and writes, such as when an instruction stores a register's contents to memory. In this case, the microcode puts the memory address in the IND (indirect) register. The microcode then issues a read or write micro-operation which causes the memory contents to be read into the OPR (operand register) or written from the OPR. (The IND and OPR registers are internal 8086 registers that are not visible to the programmer.) In the 8086, a memory cycle takes at least four clock cycles (called T1 through T4), including adding the segment register to compute the memory address. An "unaligned" memory access takes twice as long, though, because the 8086 has a word-based 16-bit data bus. Thus, if you try to access a word from an odd address, two memory accesses are required, one for the first byte and one for the second byte.5

As you can see, a memory access is a fairly complex operation. In the 8086, all these steps are done by hardware in the Bus Interface Unit, rather than being performed by microcode. (I'll discuss the complex memory control circuitry in detail in a future post.) After issuing a memory read or write, the microcode engine is blocked until the memory request completes.

Microcode instructions and the correction circuitry

The microcode interacts with prefetching in several ways. In addition to requesting a byte from the queue (as discussed above), microcode can perform three micro-instructions that involve prefetching: SUSP, FLUSH, and CORR. The SUSP (suspend) micro-instruction stops prefetching, typically before a change to execution flow. The FLUSH micro-instruction flushes the prefetch queue and resumes prefetching. To implement these, the prefetching circuitry has a flip-flop to keep track of the suspended state, and logic to reset the queue pointer counters to flush the queue.

The CORR (correct) micro-instruction corrects the Instruction Pointer to point to the next execution position. This is an interesting and more complicated micro-instruction. Like most processors, the 8086 has a program counter (PC) to keep track of what instruction to execute; the 8086 calls this the Instruction Pointer (IP). In the programmer's view, the Instruction Pointer points to the memory address of the next instruction to execute. However, in the hardware, the Instruction Pointer points to the next instruction to be fetched, which is generally several bytes after the next instruction to be executed.6

For the most part, this doesn't matter; the queue provides instructions in the order they were fetched and it doesn't matter if the Instruction Pointer runs ahead. However, there are a few cases where the "real" Instruction Pointer address is needed. For example, a relative jump instruction causes execution to jump to an address relative to the current instruction. When performing a subroutine call, the return address must be pushed on the stack. The correct Instruction Pointer value is also needed for an interrupt. Thus, the 8086 needs a mechanism to compute the real Instruction Pointer value from the value in the Instruction Pointer register.

The solution is the CORR micro-instruction, which corrects the Instruction Pointer value by subtracting the prefetch queue length, so the Instruction Pointer holds the "true" value. For instance, if there are 4 bytes in the queue, then the address in the Instruction Pointer register is four more than the desired Instruction Pointer address. The Bus Interface Unit performs this subtraction by using the addressing adder and a small table of constants called the Constant ROM.7

The diagram below zooms in on the Constant ROM, located next to the adder. The Constant ROM is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates. The first level (bottom) selects the desired correction constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit. The necessary correction constant is selected based on the length of the queue in words, the HL pointer, and the empty (MT) flag.

The Constant ROM, highlighted on the die.

The Constant ROM is used for more than just address correction. For example, it is also used to increment the Instruction Pointer by 2 after a prefetch. Other constants are used for the 8086's string operations, which act on a block of memory. The index registers are incremented or decremented by 1 for bytes or 2 for words. When popping a value from the stack, the stack pointer is decremented, which uses the constant -2. Additional constants are required to increment and decrement the IND register when accessing words from unaligned (odd) addresses. These increment/decrement values are selected in the upper part of the Constant ROM. In total, the Constant ROM holds values from -6 to +2.

Policy

There are some "policy" decisions on prefetching, and it's interesting to see how the 8086 implements them. Prefetching is not free: there is a tradeoff when performing a prefetch between saving time later versus delaying memory accesses from an executing instruction. Moreover, if a jump operation takes place, the prefetch queue is discarded and the memory cycles were wasted. Thus, the length of the queue is an "extremely tricky design problem, because performance can deteriorate if the queue is too long as well as if it is too short."3

Intel performed simulations to determine the best queue length. A 4-byte queue provided a large benefit, while a 6-byte queue (which they chose) was slightly better. The designers were surprised to find that performance flattened out after that; they expected a much longer queue would be necessary. The 8088 process has only a 4-byte prefetch queue because its 8-bit bus changes the tradeoffs.8

The basic prefetch policy is that if a memory access and a prefetch are requested at the same time, the memory access "wins", since it is guaranteed to be useful while the prefetch is just speculative. If the queue holds 0 to 2 bytes, prefetch happens during the next free memory cycle. If the queue holds 5 or 6 bytes, no prefetch can happen, since prefetch happens a word at a time. However, if the queue holds 3 or 4 bytes, prefetch is delayed for two clock cycles, which is an interesting choice. This gives an instruction more opportunities to perform a memory operation without being delayed by a prefetch. There is a tradeoff because maybe delaying the prefetch will waste two cycles of memory bandwidth, but performing the prefetch might waste four cycles of memory bandwidth. The motivation for this delay is that the last two bytes in the queue are less valuable because they are more likely to be discarded.

Another policy decision is how to handle a change in execution flow, such as a jump or subroutine call. The 8086 simply discards the prefetch queue and starts fetching from the new address. The 8086 designers considered better ways of handling jumps, but it wasn't practical to implement at the time. There is no intelligence if the instructions are already in the queue (e.g. jumping forward a couple of bytes). There is also no branch prediction; prefetching proceeds linearly regardless of branch instructions.

The 8086 does nothing to ensure consistency between the prefetch queue and memory if a prefetched instruction is modified in memory.9 In this case, the "stale" instruction in the queue is executed. This situation may seem contrived, but self-modifying code used to be fairly popular, where a program would change its own instructions.10

Prefetching and the 8087 coprocessor

One feature of the 8086 microprocessor is that it supports coprocessors such as the 8087 floating point chip.11 The 8087 implements high-performance floating-point computation, performing arithmetic and transcendental computations up to 100 times faster than the 8086. The 8087 gets instructions in an interesting fashion, executing floating-point instructions from the 8086's instruction stream. Specifically, an "ESCAPE" opcode indicates an instruction that is performed by the 8087 rather than the 8086. However, prefetching adds a lot of complexity to the coprocessor because the 8087 monitors the bus to determine when it should execute an instruction. With prefetching, the instruction on the bus doesn't match the instruction being executed. An instruction may be executed many cycles after it was fetched over the bus. A prefetched instruction may even be discarded and never executed.

To solve this problem, the 8087 manages its own copy of the prefetch queue to determine when the 8086 would be executing a floating-point instruction. The 8087 watches the bus to see when instructions are prefetched. The 8086 provides queue status signals (QS0 and QS1) to indicate when it takes bytes from the queue or flushes the queue. These signals allow the 8087 coprocessor to keep track of the 8086's queue state so it can tell what instruction the 8086 is executing. In other words, the 8086 doesn't tell the 8087 coprocessor what to do; instead, the two chips process the instruction stream in parallel. Another complication is the 8087 coprocessor can be used with the 8088 processor chip, which has a smaller 4-byte queue. Thus, the 8087 coprocessor must detect whether it is connected to an 8086 or an 8088 and maintain its queue appropriately.

Brief history

Caching and prefetching were used in mainframe computers dating back to the 1960s. For instance, the IBM System/360 Model 91 (1966) had a cache with prefetching. Minicomputers such as the VAX 11/780 (1977) later used caching and prefetching. However, these features took a while to trickle down to microprocessors. The Motorola 68000 (1980) had a 4-byte prefetch queue. As far as I can tell, the 8086 was the first microprocessor with a prefetch queue.

We can view the 8086 as a stepping-stone towards the large caches first used externally in the 80386 and internally in the 486. The 80186 and 80286 kept the 6-byte prefetch buffer size of the 8086. The 80386 has a 16-byte prefetch buffer, although apparently due to a bug it was shrunk to 12 bytes in later revisions. As well as the prefetch queue, the 80386 supported an external cache.

Early microprocessors such as the 6502 or Z80 could fetch the next instruction while they were finishing the previous instruction. This minimal two-stage pipelining improved performance, but was much more limited than 8086-style prefetching. An Intel study found that this simple overlap provides a 35% performance increase with 15% more hardware, while implementing prefetching provided an additional 11% gain with 14% more hardware.3 This illustrates how the increasing transistor counts from Moore's law opened up new opportunities to improve performance. But it also shows diminishing returns as performance increases become smaller and more expensive.

Conclusions

Well, this was supposed to be a quick post about the prefetch queue, but the topic turned out to have a lot more complexity than I expected. A six-byte prefetch queue may seem like a simple feature to add to a processor, but it affects many parts of the system. Prefetching is tied closely to the memory access circuitry, of course, but it also required a Constant ROM to handle the difference between the execution address and the prefetch address. Prefetching also impacted the microcode, with three micro-instructions to support prefetching.

Prefetching also illustrates some of the ways that each feature and corner case of a processor like the 8086 leads to more complexity. For instance, byte-aligned (rather than word-aligned) instructions require a mechanism to fetch bytes as well as words. Supporting multiple instruction formats (1-byte opcodes, an opcode byte followed by a Mod R/M byte, multi-byte instructions) resulted in the loader state machine. The segment registers required an adder to compute the memory address for every access. Looking at the 8086 internals makes it easier to understand the motivation behind RISC processors, discarding the complexity and corner cases to create a simpler but faster processor.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @oldbytes.space@kenshirriff. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

Steve Furber, co-creator of the ARM chip, mentions that "The first integrated CPUs were coincidentally quite well matched to semiconductor memory speeds, and were therefore built without caches. This can now be seen as a temporary aberration." See VLSI Risc Architecture and Organization p77. To make this concrete, the Apple II (1977) used a MOS 6502 processor running at about 1 megahertz while its 4116 DRAM chips could perform an access in 250 nanoseconds (4 times the clock speed). The 8086 processor ran at 5-10 MHz which meant that 250 ns DRAM chips were slower than the clock speed. Nowadays, processors run at 4 GHz but DRAM access speed is about 50 nanoseconds (1/200 the clock speed). ↩
Modern processors use caches to improve memory performance; caches are often megabytes in size. Accessing data from a cache is faster than accessing it from main memory, but the tradeoff is that caches are smaller. The 8086's prefetch queue is similar to a cache in some ways, but there are some key differences. First, the prefetch queue is strictly sequential. If you jump ahead two bytes, even if the prefetch queue has those instruction bytes, the processor can't use them. Second, the prefetch queue can't reuse bytes. If you have a 6-byte loop, even though all the code fits in the prefetch queue, it will be reloaded every time. Third, the prefetch queue doesn't provide any consistency. If you modify an instruction in memory a couple of bytes ahead of the PC, the 8086 will run the old instruction if it's in the queue. ↩
The design decisions for the 8086 prefetch cache (and many other aspects of the chip) are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. ↩↩↩↩
A detailed block diagram of the 8086 is provided in the patent. Conveniently, the layout of the diagram is close to the physical layout of the chip.

Detailed block diagram of the 8086, based on patent US4449184. I have modified the register names to match the common naming.

I won't discuss this block diagram in detail here, but I'll point out the Q (queue) control logic in the upper center, with its associated read and write pointers. The Q bus connects the queue to various parts of the instruction decoding circuitry.

↩
Supporting misaligned memory accesses (i.e. accessing a word at an odd address) adds complexity to the 8086. It's not surprising that many RISC chips prohibit unaligned accesses. On SPARC, for instance, an unaligned access fails with a "bus error", which the Sun programmers out there probably recognize. ARM processors before ARMv7 didn't support unaligned accesses. RISC-V supports misaligned data accesses but not misaligned instructions. ↩
The 8086 patent describes how the program counter in the 8086 does not hold the "real" value:

PC is not a real or true program counter in that it does not, nor does any other register within CPU, maintain the actual execution point at any time. PC actually points to the next byte to be input into queue. The real program counter is calculated by instruction whenever a relative jump or call is required by subtracting the number of accessed instructions still remaining unused in queue from PC.
↩
The CORR correction operation adds more complexity to the system than you might expect, with synchronization between the Bus Interface Unit and the Execution Unit. Because the correction computation uses the addressing adder, the correction operation must be synchronized with memory accesses that also use the adder. To accomplish this, the Bus Interface Unit waits until any memory operation is finished and then generates two clock cycles of "fake" memory operation, keeping the adder free for the CORR instruction. As a result, the memory control circuitry needs logic to implement this memory cycle. Meanwhile, the microcode engine is stopped until the CORR instruction completes, requiring synchronization circuitry. ↩
The 8088 is famous as the processor in the original IBM PC. The 8088 processor is essentially the same as the 8086 except that it has an 8-bit data bus instead of a 16-bit data bus, so it performs memory accesses a byte at a time instead of a word at a time. Internally, the 8088 is nearly identical to the 8086 but there are a few differences in microcode and in the bus circuitry. The most visible difference is that the 8088 has a 4-byte prefetch queue instead of a 6-byte prefetch queue. Simulations showed that a 4-byte queue was sufficient for the 8088. Because it fetches one byte at a time instead of two bytes, the 8088 fills the prefetch queue more slowly and wouldn't get much benefit from the larger queue. I haven't looked at the 8088's prefetch circuitry in detail, so I can't describe it exactly. ↩
At some point, Intel implemented consistency between cached instructions and memory. This ensures that self-modifying code will run the latest version of an instruction rather than a stale instruction in the cache. I couldn't determine exactly when this was implemented; various sources say the 486, the Pentium, or the Pentium Pro. (If you have a definitive answer, please let me know.) ↩
Self-modifying code can be used as a way to distinguish between the 8086 and 8088 chips in software. Since the 8086 has a 6-byte queue and the 8088 has a 4-byte queue, you can create self-modifying code that will run a prefetched instruction on the 8086 but run the modified instruction on the 8088. ↩
Although the 8087 is the most well-known coprocessor for the 8086, it was not the only coprocessor. The Intel 8089 input/output coprocessor provided mainframe-style I/O channels, offloading I/O processing from the 8086. More than just a DMA engine, the 8089 was a separate processor with its own instruction set. Unlike the 8087, the 8089 didn't take instructions from the 8086's instruction stream so it didn't interact with prefetching; instead, the 8086 sent a Channel Attention signal to the 8089 and the 8089 read instructions from shared memory. The 8089 was complex and expensive and wasn't very popular. The Intel 82586 Ethernet coprocessor used a similar Channel Attention scheme. ↩

How the 8086 processor's microcode engine works

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. The 8086 chip uses microcode internally to implement its instruction set. I've been reverse-engineering the 8086 from die photos and this blog post discusses how the chip's microcode engine operated. I'm not going to discuss the contents of the microcode1 or how the microcode controls the rest of the processor here. Instead, I'll look at how the 8086 decides what microcode to run, steps through the microcode, handles jumps and calls inside the microcode, and physically stores the microcode. It was a challenge to fit the microcode onto the chip with 1978 technology, so Intel used many optimization techniques to reduce the size of the microcode.

In brief, the microcode in the 8086 consists of 512 micro-instructions, each 21 bits wide. The microcode engine has a 13-bit register that steps through the microcode, along with a 13-bit subroutine register to store the return address for microcode subroutine calls. The microcode engine is assisted by two smaller ROMs: the "Group Decode ROM" to categorize machine instructions, and the "Translation ROM" to branch to microcode subroutines for address calculation and other roles. Physically, the microcode is stored in a 128×84 array. It has a special address decoder that optimizes the storage. The microcode circuitry is visible in the die photo below.

What is microcode?

Machine instructions are generally considered the basic steps that a computer performs. However, each instruction usually requires multiple operations inside the processor. For instance, an ADD instruction may involve computing the memory address, accessing the value, moving the value to the Arithmetic-Logic Unit (ALU), computing the sum, and storing the result in a register. One of the hardest parts of computer design is creating the control logic that signals the appropriate parts of the processor for each step of an instruction. The straightforward approach is to build a circuit from flip-flops and gates that moves through the various steps and generates the control signals. However, this circuitry is complicated and error-prone.

In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control circuitry from complex logic gates, the control logic could be replaced with another layer of code (i. e. microcode) stored in a special memory called a control store. To execute a machine instruction, the computer internally executes several simpler micro-instructions, specified by the microcode. In other words, microcode forms another layer between the machine instructions and the hardware. The main advantage of microcode is that it turns the processor's control logic into a programming task instead of a difficult logic design task. Microcode also permits complex instructions and a large instruction set to be implemented without making the processor more complex (apart from the size of the microcode). Finally, it is generally easier to fix a bug in microcode than in circuit logic.

Early computers didn't use microcode, largely due to the lack of good storage technologies to hold the microcode. This changed in the 1960s; for example IBM made extensive use of microcode in the System/360 (1964). (I've written about that here.) But early microprocessors didn't use microcode, returning to hard-coded control logic with logic gates.3 This logic was generally more compact and ran faster than microcode, since the circuitry could be optimized. Since space was at a premium in early microprocessors and the instruction sets were relatively simple, this tradeoff made sense. But as microprocessor instruction sets became complex and transistors became cheaper, microcode became appealing. This led to the use of microcode in the Intel 8086 (1978) and 8088 (1979) and Motorola 68000 (1979), for instance.2

The 8086's microcode

The 8086's microcode is much simpler than in most processors, but it's still fairly complex. The code below is the microcode routine from the 8086 for a routine called "CORD", part of integer division, consisting of 16 micro-instructions. I'm not going to explain how this microcode works in detail, but I want to give a flavor of it. Each line has an address on the left (blue) and the micro-instruction on the right (yellow), specifying the low-level actions during one time step (i.e. clock cycle). Each micro-instruction performs a move, transferring data from a source register (S) to a destination register (D). (The source Σ indicates the ALU output.) For parallelism, the micro-instruction performs an operation or two at the same time as the move. This operation is specified by the "a" and "b" fields; their meanings depend on the type field. For instance, type 1 indicates an ALU instruction such as subtract (SUBT) or left-rotate through carry (LRCY). Type 4 selects two general operations such as "RTN" which returns from a microcode subroutine. Type 0 indicates a jump operation; "UNC 10" is an unconditional jump to line 10 while "CY 13" jumps to line 13 if the carry flag is set. Finally, the "F" field indicates if the condition code flags should be updated. The key points are that the micro-instructions are simple and execute in one clock cycle, they can perform multiple operations in parallel to maximize performance, and they include control-flow operations such as conditional jumps and subroutines.

An example of a microcode routine. The CORD routine implements integer division with subtracts and left rotates. This is from patent 4,449,184.

Each instruction is stored at a 13-bit address (blue) which consists of 9 bits shown explicitly and a 4-bit sequence counter "CR". The eight numbered address bits usually correspond to the machine instruction's opcode. The "X" bit is an extra bit to provide more address space for code that is not directly tied to a machine instruction, such as reset and interrupt code, address computation, and the multiply/divide algorithms.

A micro-instruction is encoded into 21 bits as shown below. Every micro-instruction contains a move from a source register to a destination register, each specified with 5 bits. The meaning of the remaining bits is a bit tricky since it depends on the type field, which is two or three bits long. The "short jump" (type 0) is a conditional jump within the current block of 16 micro-instructions. The ALU operation (type 1) sets up the arithmetic-logic unit to perform an operation. Bookkeeping operations (type 4) are anything from flushing the prefetch queue to ending the current instruction. A memory read or write is type 6. A "long jump" (type 5) is a conditional jump to any of 16 fixed microcode locations (specified in an external table). Finally, a "long call" (type 7) is a conditional subroutine call to one of 16 locations (different from the jump targets).

The encoding of a micro-instruction into 21 bits. Based on NEC v. Intel: Will Hardware Be Drawn into the Black Hole of Copyright?

This "vertical" microcode format reduces the storage required for the microcode by encoding control signals into various fields. However, it requires some decoding logic to process the fields and generate the low-level control signals. Surprisingly, there's no specific "microcode decoder" circuit. Instead, the logic is scattered across the chip, looking for various microcode bit patterns to generate control signals where they are needed.

How instructions map onto the ROM

One interesting issue is how the micro-instructions are organized in the ROM, and how the right micro-instructions are executed for a particular machine instruction. The 8086 uses a clever mapping from the machine instruction to a microcode address that allows machine instructions to share microcode.

Different processors use a variety of approaches to microcode organization. One technique is for each micro-instruction to contain a field with the address of the next micro-instruction. This provides complete flexibility for the arrangement of micro-instructions, but requires a field to hold the address, increasing the number of bits in each micro-instruction. A common alternative is to execute micro-instructions sequentially, with a micro-program-counter stepping through each micro-address unless there is an explicit jump to a new address. This approach avoids the cost of an address field in each instruction, but requires a program counter with an incrementer, increasing the hardware complexity.

The 8086 uses a hybrid approach. A 4-bit program counter steps through the bottom 4 bits of the address, so up to 16 micro-instructions can be executed in sequence without a jump. This approach has the advantage of requiring a smaller 4-bit incrementer for the program counter, rather than a 13-bit incrementer. The microcode engine provides a "short jump" operation that makes it easy to jump within the group of 16 instructions using a 4-bit jump target, rather than a full 13-bit address.

Another important design decision in microcode is how to determine the starting micro-address for each machine instruction. In other words, if you want to do an ADD, where does the microcode for ADD start? One approach is a table of starting addresses: the system looks in the table to find the starting address for ADD, but this requires a large table of 256 entries. A second approach is to use the opcode code value as the starting address. That is, an ADD instruction 0x05 would start at micro-address 5. This approach has two problems. First, you can't run the microcode sequentially since consecutive micro-instructions belong to different machine instructions. Moreover, you can't share microcode since each instruction has a different address in the microcode ROM.

The 8086 solves these problems in two ways. First, the machine instructions are spaced sixteen slots apart in the microcode. In other words, the opcode is multiplied by 16 (has four zeros appended) to form the starting address in the microcode ROM, so there is plenty of space to implement each machine instruction. The second technique is that the ROM's addressing is partially decoded rather than fully decoded, so multiple micro-addresses can correspond to the same physical storage.4

To make this concrete, consider the 8086's arithmetic-logic instructions: one-byte add register to memory, one-byte add memory to register, one-word subtract memory from register, one-word xor register to memory, and so forth. There are 8 ALU operations and each can be byte- or word-sized, with memory as source or destination. This yields 32 different machine opcodes. These opcodes were carefully assigned, so they all have the format 00xxx0xx. The ROM address decoder is designed to look for three 0 bits in those positions, and ignore the other bits, so it will match that pattern. The result is that all 32 of these ALU instructions activate the same ROM column select line, and thus they all share the same microcode, shrinking the size of the ROM.

The microcode ROM's physical layout

The microcode ROM holds 512 words of 21 bits, so the obvious layout would be 512 columns and 21 rows. However, these dimensions are not practical for physically building the ROM because it would be too long and skinny. Instead, the ROM is constructed by grouping four words in each column, resulting in 128 columns of 84 rows, much closer to square. Not only does this make the physical layout more convenient, but it also reduces the number of column decoders from 512 to 128, reducing the circuitry size. Although the ROM now requires 21 multiplexers to select which of the four rows corresponds to each output bit, the circuitry is still much smaller. There is a tradeoff with the ability to merge addresses together by ignoring bits, though. Each decoder now selects a column of four words, rather than a single word, so each block of four words must have consecutive addresses.

The main components of the microcode engine. The metal layer has been removed to show the silicon and polysilicon underneath. If you zoom in, the bit pattern is visible in the silicon doping pattern.

The image above shows how microcode is stored and accessed. At the top is the 13-bit microcode address register, which will be discussed in detail below. The column selection circuit decodes 11 of the 13 address bits to select one column of the microcode storage. At the left, multiplexers select one bit out of each four rows using the two remaining address bits (specifically, the two lowest sequence bits). The selected 21 microcode outputs are latched and fed to the rest of the processor, where they are decoded as described earlier and control the processor's actions.

Optimizing the microcode

In 1978, the number of bits that could be stored in the microcode ROM was rather limited. In particular, the 8086 holds only 512 micro-instructions. Since it has approximately 256 machine-code instructions in its one-byte opcode, combined with multiple addressing modes, and each instruction requires multiple micro-instructions, compression and optimization were necessary to make the microcode fit.5 The main idea was to move functionality out of the microcode and into discrete logic when it made sense. I'll describe some of the ways they did this.

The 8086 has an arithmetic-logic unit (ALU) that performs operations such as addition and subtraction, as well as logical operations such as AND and XOR. Consider the machine instruction ADD, implemented with a few micro-operations that compute the memory address, fetch data, perform the addition, and store the result. The machine instructions for subtraction, AND, or XOR require identical steps, except that the ALU performs a different operation. In total, the 8086 has eight ALU-based operations that are identical except for the operation performed by the ALU.6 The 8086 uses a "trick" where these eight machine instructions share the same microcode. Specifically, the microcode tells the ALU to perform a special operation XI, which indicates that the ALU should look at the appropriate bits of the instruction and do the appropriate operation.7 This shrinks the microcode for these operations by a factor of eight, at the cost of requiring additional logic for the ALU. In particular, the ALU control circuitry has a register to hold the relevant instruction bits, and a PLA to decode these bits into low-level ALU control signals.

Similarly, the 8086 has eight machine instructions to increment a specific register (out of a set of 8), and eight instructions to decrement a register. All 16 instructions are handled by the same set of micro-instructions and the ALU does the increment or decrement as appropriate. Moreover, the register control circuitry determines which register is specified by the instruction, without involvement from the microcode.

Another optimization is that the 8086 has many machine instructions in pairs: an 8-bit version and a 16-bit version. One approach would be to have separate microcode for the two instructions, one to handle a single byte and one to handle two bytes. Instead, the machine instructions share microcode. The complexity is moved to the circuitry that moves data on the bus: it looks at the low bit of the instruction to determine if it should process a byte or a word. This cuts the microcode size in half for the many affected instructions.

Finally, simple instructions that can take place in one cycle are implemented with logic gates, rather than through microcode. For instance, the CLC (clear carry flag) instruction updates the flag directly. Similarly, prefix instructions for segment selection, instruction locking, or repetition are performed in logic. These instructions don't use any microcode at all, which will be important later below.

Using techniques such as these, about 75 different instruction types are implemented in the microcode (instead of about 256), making the microcode much smaller. The tradeoff is that the 8086 requires more logic circuitry, but the designers found the tradeoff to be worthwhile.

The ModR/M byte

There's another complication for 8086 microcode, however. Most 8086 instructions have a second byte: the ModR/M byte, which controls the addressing mode for the instructions in a complex way (shown below). This byte gives 8086 instructions a lot of flexibility: you can use two registers, a register and a memory location, or a register and an "immediate" value specified in the instruction. The memory location can be specified by 8 index register combinations with a one-byte or two-byte displacement optionally added. (This is useful for accessing data in an array or structure, for instance.) Although these addressing modes are powerful, they pose a problem for the microcode.

A summary of the ModR/M byte, from MCS-86 Assembly Language Reference Guide.

These different addressing modes need to be implemented in microcode, since different addressing modes require different sequences of steps. In other words, you can't use the previous trick of pushing the problem into logic gates. And you clearly don't want a separate implementation of each instruction for each addressing mode since the size of the microcode would multiply out of control.

The solution is to use a subroutine (in microcode) to compute the memory address. Thus, instructions can share the microcode for each addressing mode. This adds a lot of complexity to the microcode engine, however, since it needs to store the micro-address for a micro-subroutine-call so it can return to the right location. To support this, the microcode engine has a register to hold this return address. (Since it doesn't have a full stack, you can't perform nested subroutine calls, but this isn't a significant limitation.)

The microcode ends up having about 10 subroutines for the different addressing modes, as well as four routines for the different sizes of displacement. (The 8 possibilities for source registers are handled in the register selection logic, rather than microcode.) Thus, the microcode handles the 256 different addressing modes with about 14 short routines that add the appropriate address register(s) and the displacement to obtain the memory address.

One more complication is that machine instructions can switch the source and destination specified by the ModR/M byte, depending on the opcode. For example, one subtract instruction will subtract a memory location from a register, while a different subtract instruction subtracts a register from a memory location. The two variants are distinguished by bit 1 of the instruction, the "direction" bit. These variants are handled by the control logic, so the microcode can ignore them. Specifically, before the source and destination specifications go to the register control circuitry, a crossover circuit can swap them based on the value of the direction bit.

The Translation ROM

As explained above, the starting address for a machine instruction is derived directly from the instruction's opcode. However, the microcode engine needs a mechanism to provide the address for jump and call operations. In the 8086, this address is hard-coded into the Translation ROM, which provides a 13-bit address.8 It holds ten destination addresses for jump operations and ten (different) addresses for call operations.

A second role of the Translation ROM is to hold target addresses for each ModR/M addressing mode, pointing to the code to compute the effective address. As a complication, two of the jump table entries in the Translation ROM are implemented with conditional logic, depending on whether or not the instruction's memory address calculation includes a displacement. By wiring this condition into the Translation ROM, the microcode avoids the need to test this condition.

The image below shows how the Translation ROM appears on the die. It is implemented as a partially-decoded ROM with multiplexed inputs.9 The inputs are at the bottom left. For a jump or call, the ROM uses 4 input bits from the microcode output, since the microcode selects the jump targets. For an address computation, it takes 5 bits from the instruction's ModR/M byte, so the routine is selected by the instruction. The ROM has additional input bits to select the mode (jump, call, or address) and for the conditional jumps. The decoding logic (left half) activates a row in the right half, generating the output address. This address exits at the bottom and is loaded into the micro-address register below the Translation ROM.

The Translation ROM holds addresses of routines in the microcode.

The Group Decode ROM

In the discussion above, I've discussed how various categories of instructions are optimized. For instance, many instructions have a bit that selects if they act on a byte or a word. Many instructions have a bit to reverse the direction of the operation's memory and register accesses. These features are implemented in logic rather than microcode. Other instructions are implemented outside microcode entirely. How does the 8086 determine which way to process an instruction?

The Group Decode ROM takes an instruction opcode and generate 15 signals that indicate various categories of instructions that are handled differently.10 The outputs from the Group Decode ROM are used by various logic circuits to determine how to handle the instruction. Some cases affect the microcode, for instance calling a microcode addressing routine if the instruction has a ModR/M byte. In other cases, these signals act "downstream" of the microcode, for example to determine if the operation should act on a byte or a word. Other signals cause the microcode to be bypassed completely.

A closeup of the Group Decode ROM. The circuit uses two layers of NOR gates to generate the output signals from the opcode inputs. This image shows a composite of the metal, polysilicon, and silicon layers.

Specially-encoded instructions

For most of the 8086 instructions, the first byte specifies the instruction. However, the 8086 has a few instructions where the ModR/M byte completely changes the meaning of the first byte. For instance, opcode 0xF6 (Grp 1 below) can be a TEST, NOT, NEG, MUL, IMUL, DIV, or IDIV instruction based on the value of the ModR/M byte. Similarly, opcode 0xFE (Grp 2) indicates an INC, DEC, CALL, JMP, or PUSH instruction.11

The 8086 instruction map for opcodes 0xF0 to 0xFF. Based on MCS-86 Assembly Language Reference Guide.

This encoding may seem a bit random, but there's a reason behind it. Most instructions act on a source and a destination. But some, such as INC (increment) use the same register or memory location for the source and the destination. Others, such as CALL or JMP, only use one address. Thus, the "reg" field in the ModR/M byte is redundant. Since these bits would be otherwise "wasted", they are used instead to specify different instructions. (There are only 256 single-byte opcodes, so you want to make the best use of them.)

The implementation of these instructions in microcode is interesting. Since the instructions share the same first byte, the standard microcode mapping would put them at the same microcode address. However, these instructions are treated specially, with the "reg" field from the ModR/M byte copied into the lower bits of the microcode address. In effect, the instructions are treated as opcodes 0xF0 through 0xFF, so the different instruction variants execute at separate microcode addresses. You might expect a collision with the opcodes that really have the values 0xF0 through 0xFF. However, the 8086 opcodes were cleverly arranged so none of the other instructions in this range use microcode. As you can see above, the other instructions are prefixes (LOCK, REP, REPZ), halt (HLT), or flag operations (CMC, CLC, STC, CLI, STI, CLD, STD), all implemented outside microcode. Thus, the range 0xF0-0xFF is freed up for the "expanded" instructions.

The hardware implementation for this is not too complex. The Group ROM produces an output for these special instructions. This causes the microcode address register to load the appropriate bits from the ModR/M byte, causing the appropriate microcode routine to be executed.

The microcode address register

The heart of the microcode engine is the microcode address register, which determines which microcode address to execute. As described earlier, the microcode address is 13 bits, of which 8 bits generally correspond to the instruction opcode, one bit is an extra "X" instruction bit, and 4 bits are sequentially incremented. The diagram below shows how the circuitry for the bits is arranged. The 9 instruction bits each have a nearly-identical circuit. The sequence bits have more circuitry and each one is different, because the circuit to increment the address is different for each bit.

Layout of the microcode address register. Each bit has a roughly vertical block of circuitry.

The schematic below shows the circuitry for one bit in the microcode address register. It has two flip-flops: one to hold the current address bit and one to hold the old address while performing a subroutine call. A multiplexer (mux) selects the input to each flip-flop. For instance, if the microcode is waiting for a memory access, the "hold" input to the multiplexer causes the current address to loop around and get reloaded into the flip-flop. For a subroutine call, the "call" input saves the current address in the subroutine flip-flop. Conversely, when returning from a subroutine, the "return" input loads the old address from the subroutine flip-flop. The address flip-flop also has inputs to load the instruction as the address, to load an address from the translation ROM, or to load an interrupt microcode handler address. The circuit sends the address bit (and inverted address bit) to the microcode ROM's address decoder.

Schematic of a typical bit in the microcode address register.

Each bit has some special-case handling, so this schematic should be viewed as an illustration, not an accurate wiring diagram. In particular, the sequence bits also have inputs from the incrementer, so they can step to the next address. The low-order bits have instruction inputs to handle the specially-encoded "group" instructions discussed in the previous section.

The control signals for the multiplexers are generated from various sources. A circuit called the loader starts processing of an instruction, synchronized to the prefetch queue and instruction fetch from memory. The call and return operations are microcode instructions. The Group Decode ROM controls some of the inputs, for instance to process a ModR/M byte. Thus, there is a moderate amount of conditional logic that determines the microcode address and thus what microcode gets executed.

Conclusions

This has been a lot of material, so thank you for sticking with it to the end. I draw three conclusions from studying the microcode engine of the 8086. First, the implementation of microcode is considerably more complex than the clean description of microcode that is presented in books. A lot of functionality is implemented in logic outside of microcode, so it's not a "pure" microcode implementation. Moreover, there are many optimizations and corner cases. The microcode engine has two supporting ROMs: the Translation ROM and the Group Decode ROM. Even the microcode address register has complications.

Second, the need for all these optimizations shows how the 8086 was just on the edge of what was practical. The designers clearly went to a lot of effort to get the microcode to fit in the space available.

Finally, looking at the 8086 in detail shows how complex its instruction set is. I knew in the abstract that it was much more convoluted than, say, an ARM chip. But seeing all the special case circuitry on the die to handle the corner cases of the instruction set really makes this clear.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected]. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

The 8086 microcode was disassembled (link) a couple of years ago by Andrew Jenner by extracting the bits from my die photos. My post here is a bit different, looking at hardware that runs the microcode, rather than the contents of the microcode. ↩
According to Wikipedia, the Zilog Z8000 (1979) didn't use microcode, which is a bit surprising for that timeframe. This design decision had the advantage of reducing transistor count, but the disadvantage of hard-to-fix logic bugs in the instruction decoder. ↩
As an example of a non-microcoded processor, the MOS 6502 (1975) used a PLA (Programmable Logic Array) to perform much of the decoding (details). A PLA provides a structured way of implementing logic gates in a relatively dense array. A PLA is kind of like a ROM—a PLA can implement a ROM and vice versa—so it can be hard to see the difference. The usual distinction is that only one row of a ROM is active at a time, the address that you're reading out. A PLA is more general since multiple rows can be active at a time, combined to form the outputs.

The Z80 had a slightly different implementation. It used a smaller PLA to perform basic decoding of instructions into various types. It then generated control signals through a large amount of "random" logic (so-called because of its appearance, not because it's actually random). This logic combined instruction types and timing signals to generate the appropriate control signals. ↩
In a "normal" ROM, the address bits are decoded with logic gates so each unique address selects a different storage column in the ROM. However, parts of the decoder circuitry can "ignore" bits so multiple addresses select the same storage column. For a hypothetical example, suppose you're decoding 5-bit addresses. Instead of decoding 11010 and 11011 separately, you could ignore the last bit so both addresses access the same ROM data. or you could ignore the last three bits so all 8 addresses of the form 11xxx go to the same ROM location (where x indicates a bit that can be either 0 or 1). This makes the ROM more like a PLA (Programmable Logic Array), but still accessing a single row at a time. ↩
The Intel 8087 was the floating point co-processor for the 8086. The 8087 required a lot of microcode, more than could fit in a standard ROM on the die. To get the microcode to fit, Intel created a special ROM that stored two bits per transistor (instead of one) by using four different sizes of transistors to generate four different voltages. Analog circuitry converted each voltage level into two bits. This complex technique doubled the density (at least in theory), allowing the microcode to fit on the chip. I wrote about the 8087's non-binary ROM here. ↩
The ALU operations that are grouped together are add, add with carry, subtract, subtract with borrow, logical AND, logical XOR, logical OR, and compare. The compare operation may seem out of place in this list, but it is implemented as a subtract operation that updates the condition flags without changing the value. Operations such as increment, decrement, negation, and logical NOT may seem like they should be included, but since they operate on a single argument instead of two, they are implemented differently at the microcode level. Increment and decrement are combined in the microcode, however. Negation and logical NOT could be combined except that negation affects the condition code flags, while NOT doesn't, so they need separate microcode. (This illustrates how subtle features of the instruction set can have more impact than you might expect.) Since the ALU doesn't have hardware multiplication and division, the multiplication and division operations are implemented separately in microcode with loops. ↩
The ALU itself isn't examining instruction bits to decide what to do. There's some control circuitry next to the ALU that uses a PLA (Programmable Logic Array) to examine the instruction bits and the microcode's ALU command to generate low-level control signals for the ALU. These signals control things such as carry propagation, argument negation, and logical operation selection to cause the ALU to perform the desired operation. ↩
The Translation ROM has one additional output: a wire indicating an address mode that uses the BP register. This output goes to the segment register selection circuitry and selects a different segment register. The reason is that the 8086 uses the Data Segment by default for effective address computation, unless the address mode uses BP as a base register. In that case, the Stack Segment is used. This is an example of how the 8086 architecture is not orthogonal and has lots of corner cases. ↩
You can also view the Translation ROM as a PLA (Programmable Logic Array) constructed from two layers of NOR gates. The conditional entries make it seem more like a PLA than a ROM. Technically, it can be considered a ROM since a single row is active at a time. I'm using the name "Translation ROM" because that's what Intel calls it in the patents. ↩
Although the Group Decode ROM is called a ROM in the patent, I'd consider it more of a PLA (programmable logic array). Conceptually it holds 256 words, one for each instruction. But its implementation is an array of logic functions. ↩
These instructions were called "Group 1" and "Group 2" instructions in the 8086 documentation. Later Intel documentation renamed them as "Unary Group 3", "INC/DEC Group 4" and "Indirect Group 5". Some details are here. The 8086 has two other groups of instructions where the reg field defines the instruction: the "Immediate" instructions 0x80-0x83 and the "Shift" instructions 0xD0-0xD3. For these opcodes, the different instructions were implemented by the ALU. As far as the microcode was concerned, these were "normal" instructions so I won't discuss them in this post.

I should mention that although the 8086 opcodes are always expressed in hexadecimal, the encoding makes much more sense if you look at it in octal. Details here. The octal encoding also applies to other related chips including the 8008, 8080, and Z80. ↩

A bug fix in the 8086 microprocessor, revealed in the die's silicon

The 8086 microprocessor was a groundbreaking processor introduced by Intel in 1978. It led to the x86 architecture that still dominates desktop and server computing. While reverse-engineering the 8086 from die photos, a particular circuit caught my eye because its physical layout on the die didn't match the surrounding circuitry. This circuit turns out to implement special functionality for a couple of instructions, subtlely changing the way they interacted with interrupts. Some web searching revealed that this behavior was changed by Intel in 1978 to fix a problem with early versions of the 8086 chip. By studying the die, we can get an idea of how Intel dealt with bugs in the 8086 microprocessor.

In modern CPUs, bugs can often be fixed through a microcode patch that updates the CPU during boot.1 However, prior to the Pentium Pro (1995), microprocessors could only be fixed through a change to the design that fixed the silicon. This became a big problem for Intel with the famous Pentium floating-point division bug. The chip turned out to have a bug that resulted in rare but serious errors when dividing. Intel recalled the defective processors in 1994 and replaced them, at a cost of $475 million.

The circuit on the die

The microscope photo below shows the 8086 die with the main functional blocks labeled. This photo shows the metal layer on top of the silicon. While modern chips can have more than a dozen layers of metal, the 8086 has a single layer. Even so, the metal mostly obscures the underlying silicon. Around the outside of the die, you can see the bond wires that connect pads on the chip to the 40 external pins.

The 8086 die with main functional blocks labeled. Click this image (or any other) for a larger version.

The relevant part of the chip is the Group Decode ROM in the upper center. The purpose of this circuit is to categorize instructions into groups that control how they are decoded and processed. For instance, very simple instructions (such as setting a flag) can be performed directly in one cycle. Other instructions are not complete instructions, but a prefix that modifies the following instruction. The remainder of the instructions are implemented in microcode, which is stored in the lower-right corner of the chip. Many of these instructions have a second byte, the "Mod R/M" byte that specifies a register and the memory addressing scheme. Some instructions have two versions: one for an 8-bit operand and one for a 16-bit operand. Some operations have a bit to swap the source and destination. The Group Decode ROM is responsible for looking at the 8 bits of the instruction and deciding which groups the instruction falls into.

A closeup of the Group Decode ROM. This image is a composite showing the metal, polysilicon, and silicon layers.

The photo above shows the Group Decode ROM in more detail. Strictly speaking, the Group Decode ROM is more of a PLA (Programmable Logic Array) than a ROM, but Intel calls it a ROM. It is a regular grid of logic, allowing gates to be packed together densely. The lower half consists of NOR gates that match various instruction patterns. The instruction bits are fed horizontally from the left, and each NOR gate is arranged vertically. The outputs from these NOR gates feed into a set of horizontal NOR gates in the upper half, combining signals from the lower half to produce the group outputs. These NOR gates have vertical inputs and horizontal outputs.

The diagram below is a closeup of the Group Decode ROM, showing how the NOR gates are constructed. The pinkish regions are silicon, doped with impurities to make it a semiconductor. The gray horizontal lines are polysilicon, a special type of silicon on top. Where a polysilicon crosses conductive silicon, it forms a transistor. The transistors are wired together by metal wiring on top. (I dissolved the metal layer with acid to show the silicon; the blue lines show where two of the metal wires were.) When an input is high, it turns on the corresponding transistors, pulling the vertical lines low. This creates NOR gates with multiple inputs. The key idea of the PLA is that at each point where horizontal and vertical lines cross, a transistor can be present or absent, to select the desired gate inputs. By doping the silicon in the desired pattern, transistors can be created or omitted as needed. In the diagram below, two of the transistors are highlighted. You can see that some of the other locations have transistors, while others do not. Thus, the PLA provides a dense, flexible way to produce a set of outputs from a set of inputs.

Cioseup of part of the Gate Decode ROM showing a few of the transistors. I dissolved the metal layer for this image, to reveal the silicon and polysilicon underneath.

Zooming out a bit, the PLA is connected to some unusual circuitry, shown below. The last two columns in the PLA are a bit peculiar. The upper half is unused. Instead, two signals leave the side of the PLA horizontally and bypass the top of the PLA. These signals go to a NOR gate and an inverter that are kind of in the middle of nowhere, separated from the rest of the logic. The output from these gates goes to a three-input NOR gate, which is curiously split into two pieces. The lower part is a normal two-input NOR gate, but then the transistor for the third input (the one we're looking at) is some distance away. It's unusual for a gate to be split across a distance like this.

The circuitry as it appears on the die.

It can be hard to keep track of the scale of these diagrams. The highlighted box in the image below corresponds to the region above. As you can see, the circuit under discussion spans a fairly large fraction of the die.

The red rectangle in this figure highlights the region in the diagram above.

My next question was what instructions were affected by this mystery circuitry. By looking at the transistor pattern in the Group Decode ROM, I determined that the two curious columns matched instructions with bits 10001110 and 000xx111. A look at the 8086 reference shows that the first bit pattern corresponds to the instructions MOV sr,xxx, which loads a value into a segment register. The second bit pattern corresponds to the instructions POP sr, which pops a value from the stack into a segment register. But why did these instructions need special handling?

The interrupt bug

After searching for information on these instructions, I came across errata stating: "Interrupts Following MOV SS,xxx and POP SS Instructions May Corrupt Memory. On early Intel 8088 processors (marked “INTEL ‘78” or “(C) 1978”), if an interrupt occurs immediately after a MOV SS,xxx or POP SS instruction, data may be pushed using an incorrect stack address, resulting in memory corruption." The fix to this bug turns out to be the mystery circuitry.

I'll give a bit of background. The 8086, like most processors, has an interrupt feature where an external signal, such as a timer or input/output, can interrupt the current program. The processor starts running different code to handle the interrupt, and then returns to the original program, continuing where it left off. When interrupted, the processor uses its stack in memory to keep track of what it was doing in the original program so it can continue. The stack pointer (SP) is a register that keeps track of where the stack is in memory.

A complication is that the 8086 uses "segmented memory", where memory is divided into chunks (segments) with different purposes. On the 8086, there are four segments: the Code Segment, Data Segment, Stack Segment, and Extra Segment. Each segment has an associated segment register that holds the starting memory address for that segment. Suppose you want to change the location of the stack in memory, maybe because you're starting a new program. You need to change the Stack Segment register (called SS) to point to the new location for the stack segment. And you also need to change the Stack Pointer register (SP) to point to the stack's current position within the stack segment.

A problem arises if the processor receives an interrupt after the Stack Segment register has been changed, but before the Stack Pointer register has been changed. The processor will store information on the stack using the old stack pointer address but in the new segment. Thus, the information is stored into essentially a random location in memory, which is bad.2 Intel's fix was to delay an interrupt after an update to the stack segment register, so you had a chance to update the stack pointer.3 The stack segment register could be changed in two ways. First, you could move a value to the register ("MOV SS, xxx" in assembly language), or you could pop a value off the stack into the stack segment register ("POP SS"). These are the two instructions affected by the mystery circuitry. Thus, we can see that Intel added circuitry to delay an interrupt immediately after one of these instructions and avoid the bug.

Conclusions

One of the interesting things about reverse-engineering the 8086 is when I find a curious feature on the die and then find that it matches an obscure part of the 8086 documentation. Most of these are deliberate design decisions, but they show how complex and ad-hoc the 8086 architecture is, with many special cases. Each of these cases results in some circuitry and gates, complicating the chip. (In comparison, I've reverse-engineered the ARM1 processor, a RISC processor that started the ARM architecture. The ARM1 has a much simpler architecture with very few corner cases. This is reflected in circuitry that is much simpler.)

The case of the segment registers and interrupts, however, is the first circuit that I've found on the 8086 die that is part of a bug fix. This fix appears to have been fairly tricky, with multiple gates scattered in unused parts of the chip. It would be interesting to get a die photo of a very early 8086 chip, prior to this bug fix, to confirm the change and see if anything else was modified.

If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to write more about the 8086 so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @[email protected].

Notes and references

The modern microcode update process is more complicated than I expected with updates possible before the BIOS is involved, during boot, or even while applications are running. Intel provides details here. Apparently Intel originally added patchable microcode to the Pentium Pro for chip debugging and testing, but realized that it would be a useful feature to fix bugs in the field (details). ↩
The obvious workaround for this problem is to disable interrupts while you're changing the Stack Segment register, and then turn interrupts back on when you're done. This is the standard way to prevent interrupts from happening at a "bad time". The problem is that the 8086 (like most microprocessors) has a non-maskable interrupt (NMI), an interrupt for very important things that can't be disabled. ↩
Intel documents the behavior in a footnote on page 2-24 of the User's Manual:

There are a few cases in which an interrupt request is not recognized until after the following instruction. Repeat, LOCK and segment override prefixes are considered "part of" the instructions they prefix; no interrupt is recognized between execution of a prefix and an instruction. A MOV (move) to segment register instruction and a POP segment register instruction are treated similarly: no interrupt is recognized until after the following instruction. This mechanism protects a program that is changing to a new stack (by updating SS and SP). If an interrupt were recognized after SS had been changed, but before SP had been altered, the processor would push the flags, CS and IP into the wrong area of memory. It follows from this that whenever a segment register and another value must be updated together, the segment register should be changed first, followed immediately by the instruction that changes the other value. There are also two cases, WAIT and repeated string instructions, where an interrupt request is recognized in the middle of an instruction. In these cases, interrupts are accepted after any completed primitive operation or wait test cycle.

Curiously, the fix on the chip is unnecessarily broad: a MOV or POP for any segment register delays interrupts. There was no hardware reason for this: the structure of the PLA means that all the necessary instruction bits were present and it would be no harder to test for the Stack Segment register specifically. The fix of delaying interrupts after a POP or MOV remains in the x86 architecture today. However, it has been cleaned up so only instructions affecting the Stack Segment register cause the delay; operations on other segment registers have no effect. ↩