
A look at the die of the 8086 processor

The Intel 8086 microprocessor was introduced 42 years ago this month,1 so I made some high-res die photos of the chip to celebrate. The 8086 is one of the most influential chips ever created; it started the x86 architecture that still dominates desktop and server computing today. By looking at the chip's silicon, we can see its internal features.

The photo below shows the die of the 8086. In this photo, the chip's metal layer is visible, mostly obscuring the silicon underneath. Around the edges of the die, thin bond wires provide connections between pads on the chip and the external pins. (The power and ground pads each have two bond wires to support the higher current.) The chip was complex for its time, containing 29,000 transistors.

Die photo of the 8086, showing the metal layer. Around the edges, bond wires are connected to pads on the die. Click for a large, high-resolution image.

Looking inside the chip

To examine the die, I started with the 8086 integrated circuit below. Most integrated circuits are packaged in epoxy, so dangerous acids are necessary to dissolve the package. To avoid that, I obtained the 8086 in a ceramic package instead. Opening a ceramic package is a simple matter of tapping it along the seam with a chisel, popping the ceramic top off.

The 8086 chip, in 40-pin ceramic DIP package.

With the top removed, the silicon die is visible in the center. The die is connected to the chip's metal pins via tiny bond wires. This is a 40-pin DIP package, the standard packaging for microprocessors at the time. Note that the silicon die itself occupies a small fraction of the chip's size.

The 8086 die is visible in the middle of the integrated circuit package.

Using a metallurgical microscope, I took dozens of photos of the die and stitched them into a high-resolution image using a program called Hugin (details). The photo at the beginning of the blog post shows the metal layer of the chip, but this layer hid the silicon underneath.

Under the microscope, the 8086 part number is visible as well as the copyright date. A bond wire is connected to a pad. Part of the microcode ROM is at the top.

For the die photo below, the metal and polysilicon layers were removed, showing the underlying silicon with its 29,000 transistors.2 The labels show the main functional blocks, based on my reverse engineering. The left side of the chip contains the 16-bit datapath: the chip's registers and arithmetic circuitry. The adder and upper registers form the Bus Interface Unit that communicates with external memory, while the lower registers and the ALU form the Execution Unit that processes data. The right side of the chip has control circuitry and instruction decoding, along with the microcode ROM that controls each instruction.

Die of the 8086 microprocessor showing main functional blocks.

One feature of the 8086 was instruction prefetching, which improved performance by fetching instructions from memory before they were needed. This was implemented by the Bus Interface Unit in the upper left, which accessed external memory. The upper registers include the 8086's infamous segment registers, which provided access to a larger address space than the 64 kilobytes allowed by a 16-bit address. For each memory access, a segment register and a memory offset were added to form the final memory address. For performance, the 8086 had a separate adder for these memory address computations, rather than using the ALU. The upper registers also include six bytes of instruction prefetch buffer and the program counter.

The lower-left corner of the chip holds the Execution Unit, which performs data operations. The lower registers include the general-purpose registers and index registers such as the stack pointer. The 16-bit ALU performs arithmetic operations (addition and subtraction), Boolean logical operations, and shifts. The ALU does not implement multiplication or division; these operations are performed through a sequence of shifts and adds/subtracts, so they are relatively slow.

Microcode

One of the hardest parts of computer design is creating the control logic that tells each part of the processor what to do to carry out each instruction. In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control logic from complex logic gate circuitry, the control logic could be replaced with special code called microcode. To execute an instruction, the computer internally executes several simpler micro-instructions, which are specified by the microcode. With microcode, building the processor's control logic becomes a programming task instead of a logic design task.
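To make the idea concrete, here is a minimal sketch of the concept in Python. The opcodes and micro-operations below are invented for illustration; they are not the 8086's actual microcode.

```python
# Toy illustration of microcoded control: each machine instruction is implemented
# as a short sequence of simpler micro-instructions stored in a ROM. The opcodes
# and micro-operations here are invented placeholders.
MICROCODE_ROM = {
    "INC reg":        ["reg -> tmpA", "ALU: add 1", "result -> reg"],
    "ADD reg1, reg2": ["reg1 -> tmpA", "reg2 -> tmpB", "ALU: add", "result -> reg1"],
}

def execute(instruction):
    """Step through the micro-instructions for one machine instruction.
    In real hardware, each step would drive a set of control lines."""
    for micro_instruction in MICROCODE_ROM[instruction]:
        print(f"{instruction:16s} {micro_instruction}")

execute("INC reg")
execute("ADD reg1, reg2")
```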

Microcode was common in mainframe computers of the 1960s, but early microprocessors such as the 6502 and Z-80 didn't use microcode because early chips didn't have room to store it. However, later chips such as the 8086 and 68000 used microcode, taking advantage of increasing chip densities. This allowed the 8086 to implement complex instructions (such as multiplication and string copying) without making the circuitry more complex. The downside was that the microcode took up a large fraction of the 8086's die; the microcode is visible in the lower-right corner of the die photos.3

A section of the microcode ROM. Bits are stored by the presence or absence of transistors. The transistors are the small white rectangles above and/or below each dark rectangle. The dark rectangles are connections to the horizontal output buses in the metal layer.

The photo above shows part of the microcode ROM. Under a microscope, the contents of the microcode ROM are visible, and the bits can be read out, based on the presence or absence of transistors in each position. The ROM consists of 512 micro-instructions, each 21 bits wide. Each micro-instruction specifies movement of data between a source and destination. It also specifies a micro-operation which can be a jump, ALU operation, memory operation, microcode subroutine call, or microcode bookkeeping. The microcode is fairly efficient; a simple instruction such as increment or decrement consists of two micro-instructions, while a more complex string copy instruction is implemented in eight micro-instructions.3
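As a rough sketch of what reading out one of these 21-bit words might involve, the snippet below unpacks a micro-instruction into a source, a destination, and an operation field. The field widths and positions here are hypothetical placeholders chosen for illustration; they are not the 8086's actual encoding.

```python
def decode_microinstruction(word):
    """Unpack a 21-bit micro-instruction into move and operation fields.
    The 5/5/11-bit split is a hypothetical placeholder, not the real 8086 layout."""
    source      = (word >> 16) & 0x1F    # register or bus supplying the data
    destination = (word >> 11) & 0x1F    # register or bus receiving the data
    operation   =  word        & 0x7FF   # jump / ALU / memory / call / bookkeeping
    return source, destination, operation

src, dst, op = decode_microinstruction(0b10101_01010_11001100110)
print(src, dst, bin(op))
```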

History of the 8086

The path to the 8086 was not as direct and planned as you might expect. Its earliest ancestor was the Datapoint 2200, a desktop computer/terminal from 1970. The Datapoint 2200 was before the creation of the microprocessor, so it used an 8-bit processor built from a board full of individual TTL integrated circuits. Datapoint asked Intel and Texas Instruments if it would be possible to replace that board of chips with a single chip. Copying the Datapoint 2200's architecture, Texas Instruments created the TMX 1795 processor (1971) and Intel created the 8008 processor (1972). However, Datapoint rejected these processors, a fateful decision. Although Texas Instruments couldn't find a customer for the TMX 1795 processor and abandoned it, Intel decided to sell the 8008 as a product, creating the microprocessor market. Intel followed the 8008 with the improved 8080 (1974) and 8085 (1976) processors. (I've written more about early microprocessors here.)

Datapoint 2200 computer. Photo courtesy of Austin Roche.

In 1975, Intel's next big plan was the 8800 processor, designed to be the company's chief architecture for the 1980s. This processor was called a "micromainframe" because of its planned high performance. It had an entirely new instruction set designed for high-level languages such as Ada, and supported object-oriented programming and garbage collection at the hardware level. Unfortunately, this chip was too ambitious for the time and fell drastically behind schedule. It eventually launched in 1981 (as the iAPX 432) with disappointing performance, and was a commercial failure.

Because the iAPX 432 was behind schedule, Intel decided in 1976 that they needed a simple, stop-gap processor to sell until the iAPX 432 was ready. Intel rapidly designed the 8086, a 16-bit processor somewhat compatible with the 8-bit 8080,4 and released it in 1978. The 8086 had its big break with the introduction of the IBM Personal Computer (PC) in 1981. By 1983, the IBM PC was the best-selling computer and became the standard for personal computers. The processor in the IBM PC was the 8088, a variant of the 8086 with an 8-bit bus. The success of the IBM PC made the 8086 architecture a standard that still persists, 42 years later.

Why did the IBM PC pick the Intel 8088 processor?7 According to Dr. David Bradley, one of the original IBM PC engineers, a key factor was the team's familiarity with Intel's development systems and processors. (They had used the Intel 8085 in the earlier IBM Datamaster desktop computer.) Another engineer, Lewis Eggebrecht, said the Motorola 68000 was a worthy competitor6 but its 16-bit data bus would significantly increase cost (as with the 8086). He also credited Intel's better support chips and development tools.5

In any case, the decision to use the 8088 processor cemented the success of the x86 family. The IBM PC AT (1984) upgraded to the compatible but more powerful 80286 processor. In 1985, the x86 line moved to 32 bits with the 80386, and then to 64 bits in 2003 with AMD's Opteron architecture. The x86 architecture is still being extended with features such as AVX-512 vector operations (2016). But through all these changes, the x86 architecture retains compatibility with the original 8086.

Transistors

The 8086 chip was built with a type of transistor called NMOS. The transistor can be considered a switch, controlling the flow of current between two regions called the source and drain. These transistors are built by doping areas of the silicon substrate with impurities to create "diffusion" regions that have different electrical properties. The transistor is activated by the gate, made of a special type of silicon called polysilicon, layered above the substrate silicon. The transistors are wired together by a metal layer on top, building the complete integrated circuit. While modern processors may have over a dozen metal layers, the 8086 had a single metal layer.

Structure of a MOSFET in the integrated circuit.

The closeup photo of the silicon below shows some of the transistors from the arithmetic-logic unit (ALU). The doped, conductive silicon has a dark purple color. The white stripes are where a polysilicon wire crossed the silicon, forming the gate of a transistor. (I count 23 transistors forming 7 gates.) The transistors have complex shapes to make the layout as efficient as possible. In addition, the transistors have different sizes to provide higher power where needed. Note that neighboring transistors can share the source or drain, causing them to be connected together. The circles are connections (called vias) between the silicon layer and the metal wiring, while the small squares are connections between the silicon layer and the polysilicon.

Closeup of some transistors in the 8086. The metal and polysilicon layers have been removed in this photo. The doped silicon has a dark purple appearance due to thin-film interference.

Conclusions

The 8086 was intended as a temporary stop-gap processor until Intel released their flagship iAPX 432 chip, and was the descendant of a processor built from a board full of TTL chips. But from these humble beginnings, the 8086's architecture (x86) unexpectedly ended up dominating desktop and server computing until the present.

Although the 8086 is a complex chip, it can be examined under a microscope down to individual transistors. I plan to analyze the 8086 in more detail in future blog posts8, so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed. Here's a bonus high-resolution photo of the 8086 with the metal and polysilicon removed; click for a large version.

Die photo of the Intel 8086 processor. The metal and polysilicon have been removed to reveal the underlying silicon.

Notes and references

  1. The 8086 was released on June 8, 1978. 

  2. To expose the chip's silicon, I used Armour Etch glass etching cream to remove the silicon dioxide layer. Then I dissolved the metal using hydrochloric acid (pool acid) from the hardware store. I repeated these steps until the bare silicon remained, revealing the transistors. 

  3. The designers of the 8086 used several techniques to keep the size of the microcode manageable. For instance, instead of implementing separate microcode routines for byte operations and word operations, they re-used the microcode and implemented control circuitry (with logic gates) to handle the different sizes. Similarly, they used the same microcode for increment and decrement instructions, with circuitry to add or subtract based on the opcode. The microcode is discussed in detail in New options from big chips and patent 4449184. 

  4. The 8086 was designed to provide an upgrade path from the 8080, but the architectures had significant differences, so they were not binary compatible or even compatible at the assembly code level. Assembly code for the 8080 could be converted to 8086 assembly via a program called CONV-86, which would usually require manual cleanup afterward. Many of the early programs for the 8086 were conversions of 8080 programs. 

  5. Eggebrecht, one of the original engineers on the IBM PC, discusses the reasons for selecting the 8088 in Interfacing to the IBM Personal Computer (1990), summarized here. He discussed why other chips were rejected: IBM microprocessors lacked good development tools, and 8-bit processors such as the 6502 or Z-80 had limited performance and would make IBM a follower of the competition. I get the impression that he would have preferred the Motorola 68000. He concludes, "The 8088 was a comfortable solution for IBM. Was it the best processor architecture available at the time? Probably not, but history seems to have been kind to the decision." 

  6. The Motorola 68000 processor was a 32-bit processor internally, with a 16-bit bus, and is generally considered a more advanced processor than the 8086/8088. It was used in systems such as Sun workstations (1982), Silicon Graphics IRIS (1984), the Amiga (1985), and many Apple systems. Apple used the 68000 in the original Apple Macintosh (1984), upgrading to the 68030 in the Macintosh IIx (1988), and the 68040 with the Macintosh Quadra (1991). However, in 1994, Apple switched to the RISC PowerPC chip, built by an alliance of Apple, IBM, and Motorola. In 2006, Apple moved to Intel x86 processors, almost 28 years after the introduction of the 8086. Now, Apple is rumored to be switching from Intel to its own ARM-based processors. 

  7. For more information on the development of the IBM PC, see A Personal History of the IBM PC by Dr. Bradley. 

  8. The main reason I haven't done more analysis of the 8086 is that I etched the chip for too long while removing the metal and removed the polysilicon as well, so I couldn't photograph and study the polysilicon layer. Thus, I can't determine how the 8086 circuitry is wired together. I've ordered another 8086 chip to try again. 

Inside the 8086 processor's instruction prefetch circuitry

The groundbreaking 8086 microprocessor was introduced by Intel in 1978 and led to the x86 architecture that still dominates desktop and server computing. One way that the 8086 increased performance was by prefetching: the processor fetches instructions from memory before they are needed, so the processor can execute them without waiting on the (relatively slow) memory. I've been reverse-engineering the 8086 from die photos and this blog post discusses what I've uncovered about the prefetch circuitry.

The 8086 was introduced at an interesting point in microprocessor history, where memory was becoming slower than the CPU. For the first microprocessors, the speed of the CPU and the speed of memory were comparable.1 However, as processors became faster, the speed of memory failed to keep up. The 8086 was probably the first microprocessor to prefetch instructions to improve performance. While modern microprocessors have megabytes of fast cache2 to act as a buffer between the CPU and much-slower main memory, the 8086 has just 6 bytes of prefetch queue. However, this was enough to increase performance by about 50%.3

The die photo below shows the 8086 microprocessor under a microscope. The metal layer on top of the chip is visible, with the silicon and polysilicon mostly hidden underneath. Around the edges of the die, bond wires connect pads to the chip's 40 external pins. I've labeled the key functional blocks; ones that are important to the prefetch queue are highlighted in red and will be discussed in detail below. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles memory accesses, while the Execution Unit (EU) executes instructions.

The 8086 die under a microscope, with main functional blocks labeled. This photo shows the chip's single metal layer; the polysilicon and silicon are underneath. Click on this image (or any other) for a larger version.

Prefetching and the architecture of the 8086

Prefetching had a major impact on the design of the 8086. Earlier processors such as the 6502, 8080, or Z80 were deterministic. The processor fetched an instruction, executed the instruction, fetched the next instruction, and so forth. Memory accesses corresponded directly to instruction fetching and execution and instructions took a predictable number of clock cycles. This all changed with the introduction of the prefetch queue. Memory operations became unlinked from instruction execution since prefetches happen as needed and when the memory bus is available.

Since memory operations and instruction execution happen independently, the implementors of the 8086 split the chip into two processing units: the Bus Interface Unit (BIU) that handles memory accesses, and the Execution Unit (EU) that executes instructions, as shown below.4 The Bus Interface Unit contains the 6-byte instruction prefetch queue; it supplies instructions to the Execution Unit via the Q (queue) bus. The adder (Σ) performs address calculation, adding the segment register base to an address offset, among other things. The Execution Unit is what comes to mind when you think of a processor: it has most of the registers, the arithmetic/logic unit (ALU), and the microcode that implements instructions. The address adder and the ALU are independent arithmetic units. The segment registers (CS, DS, SS, ES) and the Instruction Pointer (IP) are in the Bus Interface Unit since they are directly involved in memory accesses, while the general-purpose registers are in the Execution Unit.

Block diagram of the 8086 processor. This diagram differs from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. The "Internal Communication Registers" consist of the Indirect Register (IND) and the Operand Register (OPR). These hold a memory address and memory data value respectively. From The 8086 Family User's Manual.

The 8086's segment registers play an important part in this architecture, so I'll review them quickly. One of the challenges of the 8086 was how to support more than 64K of memory with 16-bit registers. The much-reviled solution was to create a 1-megabyte (20-bit) address space consisting of 64K segments, with segment registers indicating the start of each segment. Specifically, a memory address was specified by a 16-bit offset address along with a particular segment register selecting a segment (Code Segment, Data Segment, Stack Segment, or Extra Segment). The segment register's value was shifted by 4 bits to give the segment's 20-bit base address. The 16-bit offset address was added, yielding a 20-bit memory address. This gave the processor a 1-megabyte address space, although only 64K could be accessed without changing a segment register.
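The address calculation described above is simple enough to express directly; here is a small sketch (the masks model the 16-bit registers and the 20-bit address bus):

```python
def physical_address(segment, offset):
    """8086 segmented addressing: the 16-bit segment register is shifted left
    4 bits and added to the 16-bit offset, giving a 20-bit physical address."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF

# Example: segment 0x1234, offset 0x0010 -> physical address 0x12350
print(hex(physical_address(0x1234, 0x0010)))
```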

It may seem inefficient for the Bus Interface Unit to have its own adder instead of using the ALU, but there are a couple of reasons for the separate adder. First, every memory access uses the adder at least once to add the segment base and offset. The adder is also used to increment the PC or index registers. Since these operations are so frequent, they would create a bottleneck if they used the ALU. Second, since the Execution Unit and the Bus Interface Unit run asynchronously with respect to each other, it would be complicated to share the ALU without causing delays and conflicts.

Prefetching had another major but little-known effect on the 8086 architecture: the designers were considering making the 8086 a two-chip microprocessor. Prefetching, however, required a one-chip design because the number of control signals required to synchronize prefetching across two chips exceeded the package pins available. This became a compelling argument for the one-chip design that was used for the 8086.3 (The unsuccessful Intel iAPX 432, which was under development at the same time, ended up being a two-chip processor: one to fetch and decode instructions, and one to execute them.)

Implementing the queue

The instruction prefetch queue is implemented with three 16-bit queue registers along with two hardware pointers that keep track of the current position in the queue. One two-bit counter keeps track of the current read position (0 to 2), i.e. the queue register that will provide the next instruction byte. A second two-bit counter keeps track of the current write position, i.e. the queue register that will receive the next word from memory. As words are consumed from the queue, the read pointer advances. As words are added to the queue, the write pointer advances. Because the queue registers hold words, while the prefetch circuitry provides bytes, another flip-flop keeps track of whether the high byte or the low byte of the word is being used; I call this the HL flip-flop. It causes the low byte to be provided first and then the high byte (since the 8086 is little-endian).

The diagram below shows an example queue configuration with four bytes. The first two queue registers (Q0 and Q1) hold data. The read pointer and HL pointer indicate that the next prefetched byte will come from the low byte of Q0. The write pointer indicates that the next prefetched word will go into Q2.

A queue configuration with four bytes in the prefetch queue. Bytes in blue hold prefetched data.

The diagram below shows how the queue pointers can wrap around. In this configuration, one byte has been used from Q2 so the next byte will be Q2's high byte. Q0 holds the next prefetched word. The next word to be prefetched will be stored in Q1, as indicated by the write pointer.

A queue configuration with three bytes in the prefetch queue.

The relative positions of the write and read pointers indicate how much data is in the queue. If the write pointer is one position before the read pointer (modulo 3), the queue holds 3 or 4 bytes. If the write pointer is one position after the read pointer (modulo 3), the queue holds 1 or 2 bytes. But what about when the read pointer and write pointer indicate the same register? This can either indicate that the queue is empty or that the queue is full (5 or 6 bytes). To distinguish these cases, a flip-flop is set if the queue enters the empty state. This flip-flop generates a signal that Intel called MT (empty).

Another complication occurs if you jump to an odd address. Because of its 16-bit bus, the 8086 prefetches a full word starting at the even address one below the target, loading one usable byte along with one byte that must be discarded. The 8086 handles this by setting the HL flip-flop high, using a handful of gates to detect this case. As in the diagram above, the unwanted low byte is then skipped.
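To tie these pieces together, here is a behavioral sketch of the queue bookkeeping described above: three word registers, the two pointer counters, the HL flip-flop, and the MT flag. It models my understanding of the logic, not the gate-level implementation.

```python
class PrefetchQueueModel:
    """Behavioral model of the 8086 prefetch queue bookkeeping: three word
    registers, 2-bit read/write pointers, the HL (high/low byte) flip-flop,
    and the MT (empty) flag. A sketch, not a gate-level transcription."""

    def __init__(self):
        self.regs = [[0, 0] for _ in range(3)]  # [low byte, high byte] per queue register
        self.read_ptr = 0     # queue register that supplies the next byte
        self.write_ptr = 0    # queue register that receives the next prefetched word
        self.hl = 0           # 0 = low byte next, 1 = high byte next
        self.mt = True        # queue-empty flag

    def room_for_prefetch(self):
        """Full (5 or 6 bytes) when the pointers coincide and MT is not set."""
        return self.mt or self.read_ptr != self.write_ptr

    def length_in_bytes(self):
        """Occupancy inferred from the pointers, HL, and MT."""
        if self.mt:
            return 0
        words = (self.write_ptr - self.read_ptr) % 3 or 3
        return 2 * words - self.hl

    def write_word(self, low, high):
        """Store a prefetched word (low byte first; the 8086 is little-endian)."""
        assert self.room_for_prefetch()
        self.regs[self.write_ptr] = [low, high]
        self.write_ptr = (self.write_ptr + 1) % 3
        self.mt = False

    def read_byte(self):
        """Supply the next instruction byte to the Execution Unit."""
        assert not self.mt
        byte = self.regs[self.read_ptr][self.hl]
        if self.hl == 0:
            self.hl = 1
        else:
            self.hl = 0
            self.read_ptr = (self.read_ptr + 1) % 3
            self.mt = self.read_ptr == self.write_ptr  # pointers meet after a read: empty
        return byte

    def flush(self, jump_target_is_odd=False):
        """Discard the queue after a jump; an odd target skips the first low byte."""
        self.read_ptr = self.write_ptr = 0
        self.hl = 1 if jump_target_is_odd else 0
        self.mt = True
```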

The diagram below zooms in on the prefetch and queue control circuitry on the die, with the main flip-flops and circuitry labeled. The lower half manages the queue, keeping track of the read and write positions and computing the queue length. The upper circuitry controls prefetch operations and interacts with the rest of the memory cycle circuitry.

The queue and prefetch circuitry on the die. The metal layer has been removed for the closeup to show the silicon of the underlying transistors.

Even though there is not a lot of circuitry involved (about a dozen flip-flops and associated logic gates), this circuitry occupies a substantial part of the 8086 die. (The relatively small amount of circuitry did not make this easy to reverse-engineer, however!) Compared to modern chips, the density of the 8086 is very low; you can almost see the flip-flops with the naked eye. This diagram only shows the circuitry directly involved in prefetching. Additional circuitry is scattered through the memory cycle control circuitry to deal with prefetching, and the queue registers take up a substantial part of the register file. Thus, prefetching was a moderately expensive feature for the 8086, as far as die area.

The loader

To decode and execute an instruction, the Execution Unit must get instruction bytes from the Bus Interface Unit, but this is not entirely straightforward. The main problem is that the queue can be empty, in which case instruction decoding must block until a byte is available from the queue. The second problem is that instruction decoding is relatively slow, so for maximum performance, the decoder needs a new byte before the current instruction is finished. A circuit called the "loader" solves these problems by providing synchronization between the prefetch queue and the instruction decoder. The loader uses a small state machine to efficiently fetch bytes from the queue at the right time and to provide timing signals to the decoder and microcode engine.

In more detail, as the loader requests the first two instruction bytes from the prefetch queue, it generates two timing signals that control the microcode execution. The FC (First Clock) indicates that the first instruction byte is available, while the SC (Second Clock) indicates the second instruction byte. Note that the First Clock and Second Clock are not necessarily consecutive clock cycles because the prefetch queue could be empty or contain just one byte, in which case the First Clock and/or Second Clock would be delayed. The instruction decoding circuitry and the microcode engine are controlled by the First Clock and Second Clock signals, so they remain synchronized with the bytes supplied by the prefetch queue.

At the end of a microcode sequence, the Run Next Instruction (RNI) micro-operation causes the loader to fetch the next machine instruction. However, fetching and decoding the next instruction is a bit slow so microcode execution would be blocked for a cycle. In many cases, this slowdown can be avoided: if the microcode knows that it is one micro-instruction away from finishing, it issues a Next-to-last (NXT) micro-operation so the loader can start loading the next instruction. This achieves a degree of pipelining in most cases; fetching the next instruction is overlapped with finishing the execution of the previous instruction.

The state machine for the 8086 "loader" circuit. The 1BL signal indicates a 1-byte instruction implemented in logic rather than microcode. From patent US4449184.

The diagram above shows the state machine for the loader. I won't explain it in detail, but essentially it keeps track of whether it is waiting for a First Clock byte or a Second Clock byte, and if it is performing a fetch in advance (NXT) or at the end of an instruction (RNI). The state machine is implemented with two flip-flops to support its four states.
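The sketch below shows the loader's job in simplified form, using the queue model sketched earlier (or any object with an mt flag and a read_byte method): it pulls the first two instruction bytes, signaling FC and SC as each byte becomes available and stalling while the queue is empty. This is a behavioral simplification, not a transcription of the patent's state machine.

```python
def loader(queue):
    """Simplified loader behavior: fetch the first two instruction bytes from the
    prefetch queue, yielding FC (First Clock) and SC (Second Clock) events as each
    byte becomes available. 'wait' models the stall when the queue is empty."""
    while queue.mt:
        yield "wait"                 # queue empty: First Clock is delayed
    yield ("FC", queue.read_byte())  # opcode byte available; decoding starts
    while queue.mt:
        yield "wait"                 # Second Clock delayed until another byte arrives
    yield ("SC", queue.read_byte())  # second byte (e.g. Mod R/M) available

# Demo with the PrefetchQueueModel sketched earlier:
# q = PrefetchQueueModel(); q.write_word(0x90, 0x40)
# for event in loader(q): print(event)
```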

Other memory accesses

The loader takes care of fetching an instruction that consists of an opcode byte and a Mod R/M (addressing mode) byte. However, many instructions have additional bytes or don't follow this format. For example, an opcode such as "ADD AX" can be followed by an 8- or 16-bit immediate value, adding that value to the AX register. Or a "move memory to AX" instruction can be followed by a 16-bit memory address. The microcode uses a separate mechanism for fetching these instruction bytes from the queue. Specifically, each micro-instruction contains a source register and a destination register that specify a data move. By specifying "Q" (the queue) as the source, a byte is fetched from the prefetch queue.

A third path is used for arbitrary memory reads and writes, such as when an instruction stores a register's contents to memory. In this case, the microcode puts the memory address in the IND (indirect) register. The microcode then issues a read or write micro-operation which causes the memory contents to be read into the OPR (operand register) or written from the OPR. (The IND and OPR registers are internal 8086 registers that are not visible to the programmer.) In the 8086, a memory cycle takes at least four clock cycles (called T1 through T4), including adding the segment register to compute the memory address. An "unaligned" memory access takes twice as long, though, because the 8086 has a word-based 16-bit data bus. Thus, if you try to access a word from an odd address, two memory accesses are required, one for the first byte and one for the second byte.5
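As a sketch of the alignment penalty described above, a word access can be modeled as one bus cycle at an even address and two at an odd address (the cycle counts are illustrative; each real bus cycle is at least four clocks):

```python
def read_word(memory, addr):
    """Model of an 8086 word read: one bus cycle at an even address, two bus
    cycles (one per byte) at an odd address. Returns the little-endian word
    and the number of bus cycles used."""
    bus_cycles = 1 if addr % 2 == 0 else 2   # misaligned access takes twice as long
    value = memory[addr] | (memory[addr + 1] << 8)
    return value, bus_cycles

mem = bytearray(range(16))
print(read_word(mem, 4))   # aligned word at address 4: one bus cycle
print(read_word(mem, 5))   # unaligned word at address 5: two bus cycles
```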

As you can see, a memory access is a fairly complex operation. In the 8086, all these steps are done by hardware in the Bus Interface Unit, rather than being performed by microcode. (I'll discuss the complex memory control circuitry in detail in a future post.) After issuing a memory read or write, the microcode engine is blocked until the memory request completes.

Microcode instructions and the correction circuitry

The microcode interacts with prefetching in several ways. In addition to requesting a byte from the queue (as discussed above), microcode can perform three micro-instructions that involve prefetching: SUSP, FLUSH, and CORR. The SUSP (suspend) micro-instruction stops prefetching, typically before a change to execution flow. The FLUSH micro-instruction flushes the prefetch queue and resumes prefetching. To implement these, the prefetching circuitry has a flip-flop to keep track of the suspended state, and logic to reset the queue pointer counters to flush the queue.

The CORR (correct) micro-instruction corrects the Instruction Pointer to point to the next execution position. This is an interesting and more complicated micro-instruction. Like most processors, the 8086 has a program counter (PC) to keep track of what instruction to execute; the 8086 calls this the Instruction Pointer (IP). In the programmer's view, the Instruction Pointer points to the memory address of the next instruction to execute. However, in the hardware, the Instruction Pointer points to the next instruction to be fetched, which is generally several bytes after the next instruction to be executed.6

For the most part, this doesn't matter; the queue provides instructions in the order they were fetched and it doesn't matter if the Instruction Pointer runs ahead. However, there are a few cases where the "real" Instruction Pointer address is needed. For example, a relative jump instruction causes execution to jump to an address relative to the current instruction. When performing a subroutine call, the return address must be pushed on the stack. The correct Instruction Pointer value is also needed for an interrupt. Thus, the 8086 needs a mechanism to compute the real Instruction Pointer value from the value in the Instruction Pointer register.

The solution is the CORR micro-instruction, which corrects the Instruction Pointer value by subtracting the prefetch queue length, so the Instruction Pointer holds the "true" value. For instance, if there are 4 bytes in the queue, then the address in the Instruction Pointer register is four more than the desired Instruction Pointer address. The Bus Interface Unit performs this subtraction by using the addressing adder and a small table of constants called the Constant ROM.7
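In effect, the correction is a single subtraction performed on the addressing adder; a sketch:

```python
def corrected_ip(prefetch_ip, queue_length_bytes):
    """CORR in a nutshell: the IP register tracks the next byte to be fetched,
    so the next byte to be *executed* is IP minus the number of unused bytes
    still in the queue, wrapped to 16 bits."""
    return (prefetch_ip - queue_length_bytes) & 0xFFFF

# Example: IP has run ahead to 0x0105 with 4 bytes queued -> true IP is 0x0101.
print(hex(corrected_ip(0x0105, 4)))
```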

The diagram below zooms in on the Constant ROM, located next to the adder. The Constant ROM is implemented as a PLA (programmable logic array), a two-level structured arrangement of gates. The first level (bottom) selects the desired correction constant, while the second level (middle) generates the bits of the constant: three bits plus a sign bit. The necessary correction constant is selected based on the length of the queue in words, the HL pointer, and the empty (MT) flag.

The Constant ROM, highlighted on the die.

The Constant ROM is used for more than just address correction. For example, it is also used to increment the Instruction Pointer by 2 after a prefetch. Other constants are used for the 8086's string operations, which act on a block of memory. The index registers are incremented or decremented by 1 for bytes or 2 for words. When pushing a value onto the stack, the stack pointer is decremented, which uses the constant -2. Additional constants are required to increment and decrement the IND register when accessing words from unaligned (odd) addresses. These increment/decrement values are selected in the upper part of the Constant ROM. In total, the Constant ROM holds values from -6 to +2.

Policy

There are some "policy" decisions on prefetching, and it's interesting to see how the 8086 implements them. Prefetching is not free: there is a tradeoff when performing a prefetch between saving time later versus delaying memory accesses from an executing instruction. Moreover, if a jump operation takes place, the prefetch queue is discarded and the memory cycles were wasted. Thus, the length of the queue is an "extremely tricky design problem, because performance can deteriorate if the queue is too long as well as if it is too short."3

Intel performed simulations to determine the best queue length. A 4-byte queue provided a large benefit, while a 6-byte queue (which they chose) was slightly better. The designers were surprised to find that performance flattened out after that; they expected a much longer queue would be necessary. The 8088 processor has only a 4-byte prefetch queue because its 8-bit bus changes the tradeoffs.8

The basic prefetch policy is that if a memory access and a prefetch are requested at the same time, the memory access "wins", since it is guaranteed to be useful while the prefetch is just speculative. If the queue holds 0 to 2 bytes, prefetch happens during the next free memory cycle. If the queue holds 5 or 6 bytes, no prefetch can happen, since prefetch happens a word at a time. However, if the queue holds 3 or 4 bytes, prefetch is delayed for two clock cycles, which is an interesting choice. This gives an instruction more opportunities to perform a memory operation without being delayed by a prefetch. There is a tradeoff: delaying the prefetch may waste two cycles of memory bandwidth, while performing it may waste four. The motivation for this delay is that the last two bytes in the queue are less valuable because they are more likely to be discarded.
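The policy boils down to a small decision function; here is a behavioral sketch of my reading of it, not the actual bus-control logic:

```python
def bus_decision(instruction_access_pending, queue_length_bytes, idle_cycles):
    """Sketch of the 8086 prefetch policy described above: a real memory access
    always wins, a full queue never prefetches, a queue of 3-4 bytes waits two
    idle cycles before prefetching, and otherwise a word is prefetched on the
    next free memory cycle."""
    if instruction_access_pending:
        return "memory access"       # guaranteed useful; prefetch is speculative
    if queue_length_bytes >= 5:
        return "idle"                # no room for another prefetched word
    if queue_length_bytes >= 3 and idle_cycles < 2:
        return "idle"                # delay: the last two bytes are least valuable
    return "prefetch word"
```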

Another policy decision is how to handle a change in execution flow, such as a jump or subroutine call. The 8086 simply discards the prefetch queue and starts fetching from the new address. The 8086 designers considered better ways of handling jumps, but it wasn't practical to implement at the time. There is no intelligence if the instructions are already in the queue (e.g. jumping forward a couple of bytes). There is also no branch prediction; prefetching proceeds linearly regardless of branch instructions.

The 8086 does nothing to ensure consistency between the prefetch queue and memory if a prefetched instruction is modified in memory.9 In this case, the "stale" instruction in the queue is executed. This situation may seem contrived, but self-modifying code used to be fairly popular, where a program would change its own instructions.10

Prefetching and the 8087 coprocessor

One feature of the 8086 microprocessor is that it supports coprocessors such as the 8087 floating point chip.11 The 8087 implements high-performance floating-point computation, performing arithmetic and transcendental computations up to 100 times faster than the 8086. The 8087 gets instructions in an interesting fashion, executing floating-point instructions from the 8086's instruction stream. Specifically, an "ESCAPE" opcode indicates an instruction that is performed by the 8087 rather than the 8086. However, prefetching adds a lot of complexity to the coprocessor because the 8087 monitors the bus to determine when it should execute an instruction. With prefetching, the instruction on the bus doesn't match the instruction being executed. An instruction may be executed many cycles after it was fetched over the bus. A prefetched instruction may even be discarded and never executed.

To solve this problem, the 8087 manages its own copy of the prefetch queue to determine when the 8086 would be executing a floating-point instruction. The 8087 watches the bus to see when instructions are prefetched. The 8086 provides queue status signals (QS0 and QS1) to indicate when it takes bytes from the queue or flushes the queue. These signals allow the 8087 coprocessor to keep track of the 8086's queue state so it can tell what instruction the 8086 is executing. In other words, the 8086 doesn't tell the 8087 coprocessor what to do; instead, the two chips process the instruction stream in parallel. Another complication is the 8087 coprocessor can be used with the 8088 processor chip, which has a smaller 4-byte queue. Thus, the 8087 coprocessor must detect whether it is connected to an 8086 or an 8088 and maintain its queue appropriately.
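A coprocessor's queue tracking can be sketched as follows. I'm assuming the QS1/QS0 encoding from Intel's 8086 documentation (00 = no operation, 01 = first byte of an opcode taken, 10 = queue flushed, 11 = subsequent byte taken); the shadow-queue handling is a behavioral sketch, not the 8087's actual circuitry.

```python
def track_queue_status(qs1, qs0, shadow_queue):
    """Mirror the CPU's prefetch queue from the QS1/QS0 status pins.
    shadow_queue is a list of instruction bytes the coprocessor captured from
    the bus during the CPU's prefetch cycles. Returns the byte the CPU just
    consumed (so ESC opcodes can be recognized), or None."""
    status = (qs1 << 1) | qs0
    if status in (0b01, 0b11):        # CPU took a byte from its queue
        return shadow_queue.pop(0)
    if status == 0b10:                # CPU flushed its queue (e.g. after a jump)
        shadow_queue.clear()
    return None                       # 0b00: nothing happened this cycle
```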

Brief history

Caching and prefetching were used in mainframe computers dating back to the 1960s. For instance, the IBM System/360 Model 91 (1966) had a cache with prefetching. Minicomputers such as the VAX 11/780 (1977) later used caching and prefetching. However, these features took a while to trickle down to microprocessors. The Motorola 68000 (1980) had a 4-byte prefetch queue. As far as I can tell, the 8086 was the first microprocessor with a prefetch queue.

We can view the 8086 as a stepping-stone towards the large caches first used externally in the 80386 and internally in the 486. The 80186 and 80286 kept the 6-byte prefetch buffer size of the 8086. The 80386 has a 16-byte prefetch buffer, although apparently due to a bug it was shrunk to 12 bytes in later revisions. As well as the prefetch queue, the 80386 supported an external cache.

Early microprocessors such as the 6502 or Z80 could fetch the next instruction while they were finishing the previous instruction. This minimal two-stage pipelining improved performance, but was much more limited than 8086-style prefetching. An Intel study found that this simple overlap provides a 35% performance increase with 15% more hardware, while implementing prefetching provided an additional 11% gain with 14% more hardware.3 This illustrates how the increasing transistor counts from Moore's law opened up new opportunities to improve performance. But it also shows diminishing returns as performance increases become smaller and more expensive.

Conclusions

Well, this was supposed to be a quick post about the prefetch queue, but the topic turned out to have a lot more complexity than I expected. A six-byte prefetch queue may seem like a simple feature to add to a processor, but it affects many parts of the system. Prefetching is tied closely to the memory access circuitry, of course, but it also required a Constant ROM to handle the difference between the execution address and the prefetch address. Prefetching also impacted the microcode, with three micro-instructions to support prefetching.

Prefetching also illustrates some of the ways that each feature and corner case of a processor like the 8086 leads to more complexity. For instance, byte-aligned (rather than word-aligned) instructions require a mechanism to fetch bytes as well as words. Supporting multiple instruction formats (1-byte opcodes, an opcode byte followed by a Mod R/M byte, multi-byte instructions) resulted in the loader state machine. The segment registers required an adder to compute the memory address for every access. Looking at the 8086 internals makes it easier to understand the motivation behind RISC processors, discarding the complexity and corner cases to create a simpler but faster processor.

I plan to continue reverse-engineering the 8086 die so follow me on Twitter @kenshirriff or RSS for updates. I've also started experimenting with Mastodon recently as @kenshirriff@oldbytes.space. If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier.

Notes and references

  1. Steve Furber, co-creator of the ARM chip, mentions that "The first integrated CPUs were coincidentally quite well matched to semiconductor memory speeds, and were therefore built without caches. This can now be seen as a temporary aberration." See VLSI Risc Architecture and Organization p77. To make this concrete, the Apple II (1977) used a MOS 6502 processor running at about 1 megahertz while its 4116 DRAM chips could perform an access in 250 nanoseconds (4 times the clock speed). The 8086 processor ran at 5-10 MHz which meant that 250 ns DRAM chips were slower than the clock speed. Nowadays, processors run at 4 GHz but DRAM access speed is about 50 nanoseconds (1/200 the clock speed). 

  2. Modern processors use caches to improve memory performance; caches are often megabytes in size. Accessing data from a cache is faster than accessing it from main memory, but the tradeoff is that caches are smaller. The 8086's prefetch queue is similar to a cache in some ways, but there are some key differences. First, the prefetch queue is strictly sequential. If you jump ahead two bytes, even if the prefetch queue has those instruction bytes, the processor can't use them. Second, the prefetch queue can't reuse bytes. If you have a 6-byte loop, even though all the code fits in the prefetch queue, it will be reloaded every time. Third, the prefetch queue doesn't provide any consistency. If you modify an instruction in memory a couple of bytes ahead of the PC, the 8086 will run the old instruction if it's in the queue. 

  3. The design decisions for the 8086 prefetch cache (and many other aspects of the chip) are described in: J. McKevitt and J. Bayliss, "New options from big chips," in IEEE Spectrum, vol. 16, no. 3, pp. 28-34, March 1979, doi: 10.1109/MSPEC.1979.6367944. 

  4. A detailed block diagram of the 8086 is provided in the patent. Conveniently, the layout of the diagram is close to the physical layout of the chip.

    Detailed block diagram of the 8086, based on patent US4449184. I have modified the register names to match the common naming.

    I won't discuss this block diagram in detail here, but I'll point out the Q (queue) control logic in the upper center, with its associated read and write pointers. The Q bus connects the queue to various parts of the instruction decoding circuitry.

     

  5. Supporting misaligned memory accesses (i.e. accessing a word at an odd address) adds complexity to the 8086. It's not surprising that many RISC chips prohibit unaligned accesses. On SPARC, for instance, an unaligned access fails with a "bus error", which the Sun programmers out there probably recognize. ARM processors before ARMv7 didn't support unaligned accesses. RISC-V supports misaligned data accesses but not misaligned instructions. 

  6. The 8086 patent describes how the program counter in the 8086 does not hold the "real" value:

    PC is not a real or true program counter in that it does not, nor does any other register within CPU, maintain the actual execution point at any time. PC actually points to the next byte to be input into queue. The real program counter is calculated by instruction whenever a relative jump or call is required by subtracting the number of accessed instructions still remaining unused in queue from PC.
     

  7. The CORR correction operation adds more complexity to the system than you might expect, with synchronization between the Bus Interface Unit and the Execution Unit. Because the correction computation uses the addressing adder, the correction operation must be synchronized with memory accesses that also use the adder. To accomplish this, the Bus Interface Unit waits until any memory operation is finished and then generates two clock cycles of "fake" memory operation, keeping the adder free for the CORR instruction. As a result, the memory control circuitry needs logic to implement this memory cycle. Meanwhile, the microcode engine is stopped until the CORR instruction completes, requiring synchronization circuitry. 

  8. The 8088 is famous as the processor in the original IBM PC. The 8088 processor is essentially the same as the 8086 except that it has an 8-bit data bus instead of a 16-bit data bus, so it performs memory accesses a byte at a time instead of a word at a time. Internally, the 8088 is nearly identical to the 8086 but there are a few differences in microcode and in the bus circuitry. The most visible difference is that the 8088 has a 4-byte prefetch queue instead of a 6-byte prefetch queue. Simulations showed that a 4-byte queue was sufficient for the 8088. Because it fetches one byte at a time instead of two bytes, the 8088 fills the prefetch queue more slowly and wouldn't get much benefit from the larger queue. I haven't looked at the 8088's prefetch circuitry in detail, so I can't describe it exactly. 

  9. At some point, Intel implemented consistency between cached instructions and memory. This ensures that self-modifying code will run the latest version of an instruction rather than a stale instruction in the cache. I couldn't determine exactly when this was implemented; various sources say the 486, the Pentium, or the Pentium Pro. (If you have a definitive answer, please let me know.) 

  10. Self-modifying code can be used as a way to distinguish between the 8086 and 8088 chips in software. Since the 8086 has a 6-byte queue and the 8088 has a 4-byte queue, you can create self-modifying code that will run a prefetched instruction on the 8086 but run the modified instruction on the 8088. 

  11. Although the 8087 is the most well-known coprocessor for the 8086, it was not the only coprocessor. The Intel 8089 input/output coprocessor provided mainframe-style I/O channels, offloading I/O processing from the 8086. More than just a DMA engine, the 8089 was a separate processor with its own instruction set. Unlike the 8087, the 8089 didn't take instructions from the 8086's instruction stream so it didn't interact with prefetching; instead, the 8086 sent a Channel Attention signal to the 8089 and the 8089 read instructions from shared memory. The 8089 was complex and expensive and wasn't very popular. The Intel 82586 Ethernet coprocessor used a similar Channel Attention scheme. 

Inside a counterfeit 8086 processor

Intel introduced the 8086 processor in 1978, leading to the x86 architecture in use today. I'm currently reverse-engineering the circuitry of the 8086 so I've been purchasing vintage 8086 chips off eBay. One chip I received is shown below. From the outside, it looks like a typical Intel 8086.

The package of the fake 8086. It is labeled as an Intel 8086 from 1978.

I opened up the chip and looked at it under the microscope, creating the die photo below. The whitish lines are the metal layer, connecting the chip's circuitry. Underneath, the silicon has a purple hue. Around the outside of the die, bond wires connect the square pads to the 40 external pins on the IC.

Die photo of the fake 8086, showing the metal layer on top. The thick horizontal and vertical strips provide power and ground, while the other wiring connects the components.

I quickly noticed, however, that this wasn't an 8086 processor but something entirely different! For comparison, look at my die photo of a genuine 8086 below. As you can see, the chips are entirely different and the 8086 is much more complex. Someone had taken a random 40-pin chip and relabeled it as an Intel 8086 processor. The genuine 8086 has various functional blocks visible: the 16-bit registers and ALU on the left, the large microcode ROM in the lower right, and various other blocks of circuitry throughout the chip. (The genuine chip also has a tiny Intel copyright and the 8086 part number in the lower right. Click the image to magnify.) The fake chip above, on the other hand, is an irregular grid of horizontal and vertical wiring, with thicker horizontal and vertical lines for power.

Die photo of a genuine 8086 chip.

The ULA or Uncommitted Logic Array

If the chip isn't an 8086, what is it? I believe the fake chip is an Uncommitted Logic Array, a type of gate array. A gate array is a way of making semi-custom integrated circuits without the expense of a fully-custom design. The idea behind a gate array is that the silicon die has a standard array of transistors that can be wired up to create the desired logic functions. This wiring is done in the chip's metal layers, which are designed for the customer's requirements.2 Although a gate array doesn't provide the flexibility of a fully-custom design, it was considerably cheaper and faster to design.

Ferranti invented the ULA in 1972, claiming that it was the first "to turn the logic array concept into a practical proposition." A ULA allowed a single LSI chip to replace hundreds or even thousands of gates that otherwise would be implemented in a board full of 7400-series TTL chips. The most well-known users of a ULA are the popular Sinclair ZX 81 and ZX Spectrum home computers.3

A ULA was based on a matrix of identical cells that were wired to form the logic gates. Around the edges of the chip, standardized peripheral cells provided the desired I/O capabilities. The diagram below shows a typical cell in the matrix. The cell contains multiple transistors and resistors, which are mostly unconnected by default. The ULA is customized by creating connections between the components to build a set of logic gates.

Layout and schematic of a ULA matrix cell. From Ferranti Quick Reference Guide.

The photo below shows the fake chip with the metal layers removed, revealing the transistor array underneath. Each small green/yellow rectangle is a transistor; there are nearly 1000 of them. Note the repeated pattern of cells in the matrix,1 as well as the different peripheral cells around the outside. The density of transistors is fairly low; the chip has empty columns to provide room to route the metal layer.

Die photo of the fake 8086 showing the underlying silicon. The metal layers were removed for this photo.

The fake chip uses bipolar transistors,4 completely different from the NMOS transistors in the 8086 processor. The closeup below shows transistors (the striped rectangles) and the two layers of metal wiring connecting them. (The genuine 8086 only has one metal layer, so the fake chip is probably more recent, from the 1980s.)

A closeup of the fake chip showing transistors.

There is no manufacturer printed on the die of the fake chip. The matrix cells don't look like the Ferranti cells. The photo below shows a ULA built by Plessey, another ULA manufacturer. That die has a smaller transistor matrix than my chip, but the overall structure is roughly similar, so Plessey might be the manufacturer.

A Plessey ULA die. From "Computer Aided Design and New Manufacturing Methods for Electronic Materials", 1985.

The photo below shows another detail of the fake chip. Matrix cells are at the top. The peripheral cell below has much larger transistors for I/O. (There are also resistors in the brownish regions, but they aren't really visible.) The upper metal layer consists of horizontal wiring, while the lower metal layer is mostly vertical. The thick metal line at the right is for power (or perhaps ground) and is connected to a horizontal power distribution trace at the bottom.

Detail of the fake 8086, showing transistors, resistors, and metal wiring.

To summarize, the position of the transistors and resistors in the ULA is fixed. This allows the same underlying silicon wafers to be manufactured for all the customers, keeping volume high and costs low. But by customizing the metal wiring layers, the ULA can be completed to fulfill the logic functions each customer needs.

Conclusions

Why would someone go to all the work of relabeling a $3.80 chip? I guess someone had a stack of old custom ICs with no value. By relabeling them, they could at least get something for them. It hardly seems worth the effort, but I guess they make up for it in volume. The seller has sold over 215 of these 8086s, although I don't know if they were all fake or if I was unlucky. In any case, the seller gave me a prompt refund.

The fake 8086 for sale on eBay.

The seller's feedback (below) shows a lot of complaints about fake chips. Even so, the seller's feedback is 99.2% positive, so I suspect that there are just a few fake chips mixed in with many types of real chips. It's also possible that most vintage 8086s are purchased by IC collectors who never test the chip.

Feedback on the seller.

I've been asked if this chip would actually work as an 8086. Sometimes counterfeiters sell a lower-quality chip in place of the real thing, such as the fake expensive op amps found by Zeptobars. But other times the fake chip is unrelated, such as the vintage bipolar RAM chip that I determined was really a Touch-Tone dialer. Since an 8086 has 29,000 MOS transistors but the fake chip has under 1000 bipolar transistors, it's clear that this chip won't function as an 8086.

The moral is to always be careful when you're buying chips, since you never know what you might find. Semiconductor counterfeiting is a big business and I've encountered just a tiny piece of it. I plan to write more about reverse-engineering the (real) 8086, so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed.

Notes and references

  1. I think the fake chip has a matrix of 8×12 cells, with each of the large "IXI" patterns composed of four cells. 

  2. At first, a ULA was designed by hand by an engineer drawing the interconnects on paper, but by the 1980s, CAD software automated most of the design and testing. The CAD station below is pretty wild.

    CAD system for designing ULAs at Plessey. From "Computer Aided Design and New Manufacturing Methods for Electronic Materials", 1985.

      

  3. The book The ZX Spectrum ULA: How to design a microcomputer discusses Ferranti ULAs in detail along with a complete explanation of the ULA in the ZX Spectrum. 

  4. Early ULAs used bipolar transistors, with CMOS circuitry introduced later. Different logic families were supported, depending on the needs of the application. Ferranti's ULAs had three types of matrix cells: RTL (resistor-transistor logic), CML (current-mode logic), and buffered current-mode logic. Other ULAs supported fast ECL (emitter-coupled logic) or standard TTL (transistor-transistor logic). 

Latches inside: Reverse-engineering the Intel 8086's instruction register

The Intel 8086 microprocessor is one of the most influential chips ever created; it led to the x86 architecture that dominates desktop and server computing today. But it is still simple enough that its circuitry can be studied under the microscope and understood. In this post, I explain the implementation of a dynamic latch, a circuit that holds a single bit. The 8086 has over 80 latches scattered throughout the chip, holding a variety of important processor state bits,1 but I'll focus on the eight latches that implement the instruction register and hold the instruction that is being executed.

The 8086 die, showing the 8-bit instruction register.

The photo above shows the silicon die of the 8086 processor under a microscope. I removed the metal and polysilicon layers to reveal the transistors, approximately 29,000 of them. The highlighted region indicates the 8086's 8-bit instruction buffer, consisting of eight latches. (This 1978 processor is simple enough that a single 8-bit register occupies a substantial region of the die.) The closeup shows the silicon and transistors making up a single latch.

The dynamic latch and how it works

The latch is one of the most important circuits in the 8086, since the latches keep track of what the processor is doing. While latches can be made in many ways,2 the 8086 uses a compact circuit called the dynamic latch. The dynamic latch depends on a two-phase clock, commonly used to control microprocessors of that era.3 A two-phase clock consists of two clock signals that are active in alternation. In the first phase, clock is high and its complement (written clock' here) is low. Then they switch, so clock is low and clock' is high. This cycle repeats at the clock frequency, such as 5 MHz.
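As a minimal sketch (idealized; a real clock generator also guarantees a non-overlap gap between phases), the two phases can be modeled in Python as a pair of signals that simply alternate:

    def two_phase_clock(cycles):
        """Yield (clock, clock') pairs, one per half-cycle."""
        for _ in range(cycles):
            yield (1, 0)   # first phase: clock high, clock' low
            yield (0, 1)   # second phase: clock low, clock' high

    for half, (clk, clk_bar) in enumerate(two_phase_clock(2)):
        print(f"half-cycle {half}: clock={clk} clock'={clk_bar}")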

A latch in the 8086 processor is built from four pass transistors and two inverters. The latch runs off the alternating clock signals. The control signals are load and hold.

The schematic above shows a typical latch in the 8086. It consists of two inverters and several pass transistors. For our purposes, a pass transistor can be considered a switch: if the gate input is 1, the transistor passes the signal through; if the gate input is 0, the transistor blocks the signal. The pass transistors are controlled by several signals: load, which loads a bit into the latch; hold, which holds the existing bit value; clock, the first clock phase; and clock', the second, inverted clock phase.

The diagram below shows how a value (1 in this case) is loaded into the latch. The load signal is brought high, allowing the input (1 in this example) to pass through the first transistor. Since clock is high, the signal passes through the second transistor to the first inverter, which outputs 0. At this point, the third transistor (controlled by clock') blocks the signal.

The input is loaded into the latch when the load signal is high.

In the next clock phase (below), clock' goes high, allowing the 0 signal to reach the second inverter, which outputs 1. Since hold is high, the signal loops back, but it is blocked by the clock transistor. The important point, which makes this circuit dynamic, is that at this time there is no active input to the first inverter. Instead, its input remains 1 (shown in gray) due to the capacitance of the circuit. Eventually, this charge would leak away, losing the value, but the clocks toggle before that happens.

When clock' is high, the value passes through the second inverter. The (grayed-out) input to the first inverter is maintained by the circuit's capacitance.

After the clocks switch state, the second inverter's input is provided by the capacitance of the circuit (below). The signal loops around, recharging and refreshing the input to the first inverter. As the clock signals continue to toggle, the latch switches between this diagram and the previous diagram, preserving the value in the latch and keeping the output stable.4

When clock is high, the value passes through the first inverter.
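To tie the three steps together, here is a small behavioral model in Python (my own sketch, not Intel's circuit netlist). The pass transistors are treated as switches, and a node simply keeps its previous value when nothing drives it, standing in for the charge held on the gate capacitance.

    def inverter(x):
        return 0 if x else 1

    class DynamicLatch:
        def __init__(self):
            self.node1 = 0   # input of the first inverter (capacitively held)
            self.node2 = 0   # input of the second inverter (capacitively held)
            self.out = 0

        def step(self, data, load, hold, clock, clock_bar):
            # Input selection: new data when load is high, or the fed-back
            # output when hold is high; otherwise the node is left floating.
            drive = data if load else (self.out if hold else None)
            if clock and drive is not None:
                self.node1 = drive                  # clock-phase transistor passes
            if clock_bar:
                self.node2 = inverter(self.node1)   # clock'-phase transistor passes
            self.out = inverter(self.node2)
            return self.out

    latch = DynamicLatch()
    latch.step(data=1, load=1, hold=0, clock=1, clock_bar=0)   # load a 1
    for _ in range(3):                                         # keep the clocks toggling
        latch.step(data=0, load=0, hold=1, clock=0, clock_bar=1)
        latch.step(data=0, load=0, hold=1, clock=1, clock_bar=0)
    print("held value:", latch.out)   # prints 1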

The implementation in silicon

The 8086 and other processors of that era were built from a type of transistor called NMOS. They were constructed from a silicon substrate that was "doped" by diffusion of arsenic or boron to form the transistors. On top of the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (In comparison, modern processors are built from CMOS technology, which combines NMOS and PMOS transistors, and they have many layers of metal wiring.)

Structure of an NMOS transistor (MOSFET) as implemented in an integrated circuit.

The diagram above shows the structure of a transistor. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. Applying voltage to the gate lets current flow between the source and drain, while pulling the gate to 0 volts blocks the current flow. The gate is separated from the silicon by an insulating oxide layer; this makes the gate act like a capacitor as seen in the dynamic latch.

An inverter (below) is built from an NMOS transistor and a resistor.5 With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on. This connects the output to ground, pulling the output low. Thus, the circuit inverts the input signal.

This schematic shows how an inverter is created from a transistor and resistor. The photo shows the implementation inside the chip. The metal layer was removed to show the polysilicon and silicon underneath.

The photo on the right shows how an inverter is physically constructed in the 8086. The yellowish regions are conductive doped silicon and the speckled regions are the polysilicon on top. A transistor is created where polysilicon crosses doped silicon: the polysilicon forms the transistor's gate, while the silicon regions on either side are the transistor's source and drain. The large polysilicon rectangle forms the pull-up resistor between +5 volts and the output. These physical structures can be matched with the schematic.

The diagram below shows the implementation of a latch on the chip. The pass transistors and the two inverters are indicated; the first inverter is the one described above. Polysilicon wiring connects the components together; the metal layer (removed) provided additional wiring. The transistors have complex shapes to make the most efficient use of the space.

Microscope photo of a latch in the 8086 processor. The metal wiring was removed, but traces remain as reddish vertical lines. Note: this photo is rotated 180° to match the schematic.

The latch includes output buffers, not shown on the schematic above, that provide high-current signals for the output and inverted output. This type of buffer has the amusing name "superbuffer" because it provides much higher current than a regular NMOS inverter. The problem with an NMOS inverter is that it is slow when driving something with high capacitance. Since the superbuffer provides more current, it switches the signal much faster. The superbuffer accomplishes this by replacing the pull-up resistor with a transistor, which provides higher current. The downside is that the pull-up transistor requires an inverter to drive it, so the superbuffer circuit is more complex. Thus, superbuffers are only used when necessary, typically when sending a signal to many gates or when driving a long bus line.

Superbuffer implementation in the 8086's latch. Note that the +5V and ground connections are switched on the rightmost transistors.

The diagram above shows the superbuffer circuit in the 8086's latches. Unlike the typical superbuffer, this one includes both an inverting and a non-inverting superbuffer. To understand the circuit, note that the central resistor and transistor form an inverter. The inverter output drives the upper transistors, while the uninverted input drives the lower transistors. Thus, if the input is 1, the lower transistors turn on, while if the input is 0, the upper transistors turn on due to the inverter. As a result, a 1 input makes the lower transistors pull Output high and the complemented Output low, while a 0 input makes the upper transistors pull Output low and the complemented Output high.6
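As a logic-level sketch (a behavioral model of the description above, not the transistor netlist), the dual superbuffer can be written in Python like this, with the swapped supply connections on the right-hand pair producing the non-inverted output:

    def superbuffer(inp):
        """Return (output, complemented output) for the dual superbuffer."""
        inv = 0 if inp else 1                       # central inverter
        upper_on, lower_on = (inv == 1), (inp == 1)
        # Left output pair: the upper transistor pulls to +5V, the lower to ground.
        out_bar = 1 if upper_on else (0 if lower_on else None)
        # Right output pair: +5V and ground are swapped (see the caption above),
        # so the same gate signals produce the non-inverted output.
        out = 0 if upper_on else (1 if lower_on else None)
        return out, out_bar

    for x in (0, 1):
        out, out_bar = superbuffer(x)
        print(f"in={x} -> Output={out}, complemented Output={out_bar}")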

The instruction register

The 8086, like most processors, has an instruction register that holds the instruction that is currently being executed. In the 8086, the instruction register holds the first byte of an instruction (which may consist of multiple bytes), so it is built from eight latches (below). You might expect the latches to be identical, but each latch has a different shape. Since the layout of the 8086 is highly optimized, each latch is shaped to make the best use of the available space, constrained by the neighboring wiring. In particular, note that some latches are merged together so they can share power and ground connections. Layout optimization is also probably why the latches are not in sequential order.

The 8 latches all have somewhat different shapes, optimized for the wiring around them. The previous sections described latch 1, rotated 180° from this photo. The red vertical lines are traces of the removed metal layer.

An instruction takes a winding journey through the 8086 chip. The 8086 processor uses prefetching, improving performance by loading instructions from memory before they are required. Prefetched instructions are stored in the instruction queue, a 6-byte queue in the middle of the 8086's register file. (In comparison, modern processors can have megabytes of instruction cache.) When an instruction is executed, it is stored in the instruction register, roughly in the middle of the chip. (The relatively large distances explain the use of superbuffers.) The instruction register feeds the instruction to the "group decode ROM". This ROM determines the high-level characteristics of the instruction, such as whether it is a single-byte instruction, a multi-byte instruction, or an instruction prefix. (This is only a piece of the 8086's complex instruction handling. Other latches hold pieces of the instruction indicating register usage and the ALU operation, while a separate circuit controls the microcode engine, but I'll discuss that in another post.)
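As a rough illustration of this flow (the opcode classifications below are my own simplification, not a dump of the actual group decode ROM), a few lines of Python can model the prefetch queue feeding the instruction register and a first-level decode:

    from collections import deque

    prefetch_queue = deque(maxlen=6)      # 6-byte queue in the register file

    def prefetch(memory, addr):
        """Fill the queue with upcoming bytes before they are needed."""
        while len(prefetch_queue) < prefetch_queue.maxlen and addr < len(memory):
            prefetch_queue.append(memory[addr])
            addr += 1
        return addr                       # next address to fetch

    def group_decode(opcode):
        """Very rough classification, standing in for the group decode ROM."""
        if opcode in (0x26, 0x2E, 0x36, 0x3E, 0xF2, 0xF3):
            return "prefix"
        if opcode in (0x90, 0xF8, 0xF9, 0xFC, 0xFD):
            return "single-byte"
        return "multi-byte"

    memory = [0x2E, 0x8B, 0x07, 0x90]     # example byte stream: CS prefix, MOV, NOP
    next_fetch = prefetch(memory, 0)

    instruction_register = prefetch_queue.popleft()   # first byte of the instruction
    print(hex(instruction_register), "->", group_decode(instruction_register))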

The 8086 die, showing key pieces for instruction processing. Around the outside of the die, bond wires connect the die to the external pins.

Conclusions

The 8086 makes extensive use of dynamic latches to store state internally. These latches are visible under a microscope and their circuitry can be traced out and understood. The 8086 is an interesting subject for die analysis since, unlike modern processors, its transistors are large enough to see under a microscope. It was a complex processor for its time, with 29,000 transistors, but it is still simple enough to study and understand in detail.

I've written multiple posts about the internals of the 8086 processor lately. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or RSS for updates.

Notes and references

  1. The 8086 has over 80 latches. Some latches hold values for the AD (address/data) pins or control pins. Other latches hold the current microcode address and the microinstruction, as well as the return address for a microcode subroutine call. Other latches hold the source and destination register bits from the instruction, and the ALU operation from the instruction. Many latches hold internal state values that I'm still investigating. 

  2. Many microprocessors use cross-coupled NOR (or NAND) gates to form an SR latch. An SR latch typically takes up more space than a dynamic latch, especially if additional circuitry is added to make it clocked. Edge-triggered flip flops are popular, but are even more complex, using six gates. In many cases, a pass transistor provides sufficient storage; it can hold a value across a clock cycle, but doesn't provide the long-term storage of a latch. 

  3. Processors always have a maximum clock speed, the fastest they can run. (The original 8086 ran at up to 5 MHz, while the later 8086-1 supported 10 MHz.) However, due to the use of dynamic logic, the 8086 also had a minimum clock speed of 2 MHz. If the clock ran slower than that, there was a risk of the charge on a wire leaking away before it was used, causing errors. 

  4. A key to the operation of the latch is that there are two inverters, so the output is stable. An odd number of inverters would result in oscillation, a feature used by the 8086's charge pump oscillator. The 8086's register file also uses pairs of inverters to store bits. However, in the register file, the two inverters are connected to each other directly, without the clocked pass transistors, resulting in storage that is more compact but more difficult to control. 

  5. The pull-up resistor in an NMOS gate is implemented by a special transistor. The depletion-mode transistor acts as a resistor but is more compact and performs better than an actual resistor. 

  6. Some more information on superbuffers. The problem with an NMOS inverter is that the pull-up resistor provides limited current. When outputting a 0, the transistor in an inverter pulls the output low quickly, with a relatively high current. However, when outputting a 1, the output is pulled high by the much weaker pullup resistor.

    The superbuffer is somewhat like a CMOS inverter in that it has a pullup transistor and a pulldown transistor. The difference is that CMOS uses both PMOS and NMOS transistors, and the PMOS transistor has an inverted gate input. In contrast, with an NMOS superbuffer, a separate inverter is required. In other words, a CMOS inverter uses two transistors, while a superbuffer is much less efficient, requiring four transistors.

    The superbuffer uses a depletion mode transistor for the pullup and an enhancement mode transistor for the pulldown. The depletion-mode transistor has a threshold voltage below zero, allowing its output (source) to get pulled up to 5V, rather than shutting off a bit lower. When the output is low, the depletion-mode transistor will still be (somewhat) on, acting like the pullup in a regular inverter, so there is some current flow through it. For more on superbuffers, see Introduction to VLSI Systems, page 28.