Ken Shirriff's blog

Latches inside: Reverse-engineering the Intel 8086's instruction register

The Intel 8086 microprocessor is one of the most influential chips ever created; it led to the x86 architecture that dominates desktop and server computing today. But it is still simple enough that its circuitry can be studied under the microscope and understood. In this post, I explain the implementation of a dynamic latch, a circuit that holds a single bit. The 8086 has over 80 latches scattered throughout the chip, holding a variety of important processor state bits,1 but I'll focus on the eight latches that implement the instruction register and hold the instruction that is being executed.

The 8086 die, showing the 8-bit instruction register.

The photo above shows the silicon die of the 8086 processor under a microscope. I removed the metal and polysilicon layers to reveal the transistors, approximately 29,000 of them. The highlighted region indicates the 8086's 8-bit instruction register, consisting of eight latches. (This 1978 processor is simple enough that a single 8-bit register occupies a substantial region of the die.) The closeup shows the silicon and transistors making up a single latch.

The dynamic latch and how it works

The latch is one of the most important circuits in the 8086, since the latches keep track of what the processor is doing. While latches can be made in many ways,2 the 8086 uses a compact circuit called the dynamic latch. The dynamic latch depends on a two-phase clock, commonly used to control microprocessors of that era.3 A two-phase clock consists of two clock signals that are active in alternation. In the first phase, clock is high and the complement clock is low. Then they switch so clock is low and the complement clock is high. This cycle repeats at the clock frequency, such as 5 MHz.

A latch in the 8086 processor is built from four pass transistors and two inverters. The latch runs off the alternating clock signals. The control signals are load and hold.

The schematic above shows a typical latch in the 8086. It consists of two inverters and several pass transistors. For our purposes, the pass transistor can be considered a switch: if the gate input is 1, the transistor passes the signal through. If the gate input is 0, the transistor blocks the signal. The pass transistors are controlled by several signals: load, which loads a bit into the latch; hold, which holds the existing bit value; clock, the first clock phase; and the complement clock, the second, inverted clock phase.

The diagram below shows how a value (1 in this case) is loaded into the latch. The load signal is brought high, allowing the input (1 in this example) to pass through the first transistor. Since clock is high, the signal passes through the second transistor to the inverter, which outputs 0. At this point, the third (complement clock) transistor blocks the signal.

The input is loaded into the latch when the load signal is high.

In the next clock phase (below), the complement clock goes high, allowing the 0 signal to reach the second inverter, which outputs 1. Since hold is high, the signal loops back, but is blocked by the clock transistor, since clock is now low. The important point, which makes this circuit dynamic, is that at this time there is no active input to the first inverter. Instead, its input remains 1 (shown in gray) due to the capacitance of the circuit. Eventually, this charge would leak away, losing the value, but before that happens, the clocks toggle.

When the complement clock is high, the value passes through the second inverter. The (grayed-out) input to the first inverter is maintained by the circuit's capacitance.

After the clocks switch state, the second inverter's input is provided by the capacitance of the circuit (below). The signal loops around, recharging and refreshing the input to the first inverter. As the clock signals continue to toggle, the latch switches between this diagram and the previous diagram, preserving the value in the latch and keeping the output stable.4

When clock is high, the value passes through the first inverter.
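The load and hold behavior described above can be sketched as a small behavioral model. This is only an illustration, not a circuit-level simulation: the class name `DynamicLatch`, the node names, and the `phase` method are made up for this sketch, the two capacitive nodes are modeled as plain variables, and charge leakage is ignored entirely.

```python
class DynamicLatch:
    """Behavioral sketch of the dynamic latch (hypothetical names, no leakage)."""

    def __init__(self, value=0):
        # Charge held on the inputs of the two inverters.
        self.node_a = value        # input of the first inverter
        self.node_b = 1 - value    # input of the second inverter

    @property
    def output(self):
        # The second inverter drives the latch output.
        return 1 - self.node_b

    def phase(self, clock, data=0, load=0, hold=1):
        """Advance one clock phase; clock=1 is the first phase, clock=0 the complement phase."""
        if clock:
            # The clock transistor is on: the first inverter's input is driven
            # either by the data input (load) or by the fed-back output (hold).
            if load:
                self.node_a = data
            elif hold:
                self.node_a = self.output
            # With neither load nor hold, node_a simply keeps its charge.
        else:
            # The complement-clock transistor is on: the first inverter's
            # output charges the second inverter's input.
            self.node_b = 1 - self.node_a
        return self.output


latch = DynamicLatch()
latch.phase(clock=1, data=1, load=1, hold=0)   # load a 1
latch.phase(clock=0)                           # value reaches the output
for _ in range(3):                             # hold while the clocks toggle
    latch.phase(clock=1, data=0)               # the data input is ignored
    latch.phase(clock=0)
print(latch.output)
```

In this model the stored value survives only because each phase rewrites one node from the other, which mirrors the refresh loop in the two diagrams.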

The implementation in silicon

The 8086 and other processors of that era were built from a type of transistor called NMOS. They were constructed from a silicon substrate that was "doped" by diffusion of arsenic or boron to form the transistors. On top of the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (In comparison, modern processors are built from CMOS technology, which combines NMOS and PMOS transistors, and they have many layers of metal wiring.)

Structure of an NMOS transistor (MOSFET) as implemented in an integrated circuit.

The diagram above shows the structure of a transistor. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. Applying voltage to the gate lets current flow between the source and drain, while pulling the gate to 0 volts blocks the current flow. The gate is separated from the silicon by an insulating oxide layer; this makes the gate act like a capacitor as seen in the dynamic latch.

An inverter (below) is built from an NMOS transistor and a resistor.5 With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on. This connects the output to ground, pulling the output low. Thus, the circuit inverts the input signal.

This schematic shows how an inverter is created from a transistor and resistor. The photo shows the implementation inside the chip. The metal layer was removed to show the polysilicon and silicon underneath.

The photo on the right shows how an inverter is physically constructed in the 8086. The yellowish regions are conductive doped silicon and the speckled regions are the polysilicon on top. A transistor is created where polysilicon crosses doped silicon: the polysilicon forms the transistor's gate, while the silicon regions on either side are the transistor's source and drain. The large polysilicon rectangle forms the pull-up resistor between +5 volts and the output. These physical structures can be matched with the schematic.

The diagram below shows the implementation of a latch on the chip. The pass transistors and the two inverters are indicated; the first inverter is the one described above. Polysilicon wiring connects the components together; the metal layer (removed) provided additional wiring. The transistors have complex shapes to make the most efficient use of the space.

Microscope photo of a latch in the 8086 processor. The metal wiring was removed, but traces remain as reddish vertical lines. Note: this photo is rotated 180° to match the schematic.

The latch includes output buffers, not shown on the schematic above, that provide high-current signals for the output and inverted output. This type of buffer has the amusing name "superbuffer" because it provides much higher current than a regular NMOS inverter. The problem with an NMOS inverter is that it is slow when driving something with high capacitance. Since the superbuffer provides more current, it will switch the signal much faster. The superbuffer accomplishes this by replacing the pullup resistor with a transistor, which provides higher current. The downside is that the pullup transistor requires an inverter to drive it, so the superbuffer circuit is more complex. Thus, superbuffers are only used when necessary, typically when sending a signal to many gates or when driving a long bus line.

Superbuffer implementation in the 8086's latch. Note that the +5V and ground connections are switched on the rightmost transistors.

The diagram above shows the superbuffer circuit in the 8086's latches. Unlike the typical superbuffer, this one includes both an inverting and non-inverting superbuffer. To understand the circuit, note that the central resistor and transistor form an inverter. The inverter output is connected to the upper transistors, while the uninverted input is connected to the lower transistors. Thus, if the input is 1, the lower transistors will turn on, while if the input is 0, the upper transistors will turn on due to the inverter. As a result, for a 1 input, the lower transistors will pull Output high and the complement Output low. But for a 0 input, the upper transistors will pull Output low and the complement Output high.6
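As a rough switch-level illustration of that description, the sketch below treats each transistor as an ideal switch. The function name `superbuffer` and the constants are invented for the example, and currents, thresholds, and the depletion-mode pullup behavior are not modeled.

```python
VCC, GND = 1, 0  # logic levels standing in for +5 V and ground


def superbuffer(inp):
    """Return (output, complement_output) for a 0/1 input (idealized switches)."""
    inverted = 1 - inp              # central resistor + transistor inverter
    lower_on = inp == 1             # lower transistors are gated by the input
    upper_on = inverted == 1        # upper transistors are gated by the inverter
    # Exactly one transistor of each pair conducts, so each output is
    # always actively driven rather than pulled by a weak resistor.
    assert lower_on != upper_on
    output = VCC if lower_on else GND
    complement_output = GND if lower_on else VCC
    return output, complement_output


for bit in (0, 1):
    print(bit, superbuffer(bit))
```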

The instruction register

The 8086, like most processors, has an instruction register that holds the instruction that is currently being executed. In the 8086, the instruction register holds the first byte of an instruction (which may consist of multiple bytes), so it is built from eight latches (below). You might expect the latches to be identical, but each latch has a different shape. Since the layout of the 8086 is highly optimized, each latch is shaped to make the best use of the available space, constrained by the neighboring wiring. In particular, note that some latches are merged together so they can share power and ground connections. Layout optimization is also probably why the latches are not in sequential order.

The 8 latches all have somewhat different shapes, optimized for the wiring around them. The previous sections described latch 1, rotated 180° from this photo. The red vertical lines are traces of the removed metal layer.

An instruction takes a winding journey through the 8086 chip. The 8086 processor uses prefetching, improving performance by loading instructions from memory before they are required. Prefetched instructions are stored in the instruction queue, a 6-byte queue in the middle of the 8086's register file. (In comparison, modern processors can have megabytes of instruction cache.) When an instruction is executed, it is stored in the instruction register, roughly in the middle of the chip. (The relatively large distances explain the use of superbuffers.) The instruction register feeds the instruction to the "group decode ROM". This ROM determines the high-level characteristics of the instruction, such as if it is a single-byte instruction, a multi-byte instruction, or an instruction prefix. (This is only a piece of the 8086's complex instruction handling. Other latches hold pieces of the instruction indicating register usage and the ALU operation, while a separate circuit controls the microcode engine, but I'll discuss that in another post.)

The 8086 die, showing key pieces for instruction processing. Around the outside of the die, bond wires connect the die to the external pins.

Conclusions

The 8086 makes extensive use of dynamic latches to store state internally. These latches are visible under a microscope and their circuitry can be traced out and understood. The 8086 is an interesting subject for die analysis since, unlike modern processors, its transistors are large enough to see under a microscope. It was a complex processor at the time, with 29,000 transistors, but it is still simple enough that the circuitry can be traced out and understood.

I've written multiple posts about the internals of the 8086 processor lately. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or RSS for updates.

Notes and references

The 8086 has over 80 latches. Some latches hold values for the AD (address/data) pins or control pins. Other latches hold the current microcode address and the microinstruction, as well as the return address for a microcode subroutine call. Other latches hold the source and destination register bits from the instruction, and the ALU operation from the instruction. Many latches hold internal state values that I'm still investigating. ↩
Many microprocessors use cross-coupled NOR (or NAND) gates to form an SR latch. An SR latch typically takes up more space than a dynamic latch, especially if additional circuitry is added to make it clocked. Edge-triggered flip flops are popular, but are even more complex, using six gates. In many cases, a pass transistor provides sufficient storage; it can hold a value across a clock cycle, but doesn't provide the long-term storage of a latch. ↩
Processors always have a maximum clock speed, the fastest they can run. (The original 8086 ran at up to 5 MHz, while the later 8086-1 supported 10 MHz.) However, due to the use of dynamic logic, the 8086 also had a minimum clock speed of 2 MHz. If the clock ran slower than that, there was a risk of the charge on a wire leaking away before it was used, causing errors. ↩
A key to the operation of the latch is that there are two inverters, so the output is stable. An odd number of inverters would result in oscillation, a feature used by the 8086's charge pump oscillator. The 8086's register file also uses pairs of inverters to store bits. However, in the register file, the two inverters are connected to each other directly, without the clocked pass transistors, resulting in storage that is more compact but more difficult to control. ↩
The pull-up resistor in an NMOS gate is implemented by a special transistor. The depletion-mode transistor acts as a resistor but is more compact and performs better than an actual resistor. ↩
Some more information on superbuffers. The problem with an NMOS inverter is that the pull-up resistor provides limited current. When outputting a 0, the transistor in an inverter pulls the output low quickly, with a relatively high current. However, when outputting a 1, the output is pulled high by the much weaker pullup resistor.

The superbuffer is somewhat like a CMOS inverter in that it has a pullup transistor and a pulldown transistor. The difference is that CMOS uses both PMOS and NMOS transistors, and the PMOS transistor has an inverted gate input. In contrast, with an NMOS superbuffer, a separate inverter is required. In other words, a CMOS inverter uses two transistors, while a superbuffer is much less efficient, requiring four transistors.

The superbuffer uses a depletion mode transistor for the pullup and an enhancement mode transistor for the pulldown. The depletion-mode transistor has a threshold voltage below zero, allowing its output (source) to get pulled up to 5V, rather than shutting off a bit lower. When the output is low, the depletion-mode transistor will still be (somewhat) on, acting like the pullup in a regular inverter, so there is some current flow through it. For more on superbuffers, see Introduction to VLSI Systems, page 28. ↩

Reverse-engineering the adder inside the Intel 8086

The Intel 8086 processor contains many interesting components that can be understood through reverse engineering. In this article, I'll discuss the adder that is used for address calculations. The photo below shows the tiny silicon die of the 8086 processor under a microscope. The left part of the chip has the 16-bit datapath including the registers and the Arithmetic-Logic Unit (ALU); you can see the pattern of circuitry repeated 16 times. The rectangle in the lower-right is the microcode ROM, defining the execution of each instruction.

Die photo of the 8086 microprocessor, highlighting the 16-bit address adder. The microcode ROM is in the lower right. The metal layer has been removed for this photo, revealing the silicon and polysilicon underneath. The colors are due to thin-film effects from partially-removed oxide layers.

The 16-bit adder, the topic of this post, is in the upper left. The magnified view shows how the adder is constructed from 16 stages, one for each bit. The upper row handles the top bits (15-8) and the lower row handles the low bits (7-0).1 Studying the die reveals how this 16-bit adder was optimized through clever circuit design, specialized logic gates, and careful layout techniques.

How the adder is used in the 8086

You might wonder why the 8086 contains both an adder and an ALU (arithmetic-logic unit). The reason is that the adder is used for address calculations, while the ALU is used for data calculations. The 8086 prefetches instructions using a "Bus Interface Unit", which runs semi-independently from the "Execution Unit" that executes instructions. It would have been difficult for the Bus Interface Unit and the Execution Unit to share the ALU without conflicts. By providing both an adder2 and the ALU, the two calculations can take place in parallel.

Microprocessors of the early 1970s typically had 16-bit addresses, capable of accessing 64 kilobytes of memory. At first, 64 kilobytes seemed like more memory than anyone would need (or afford), but as the price of memory chips plunged, the demand for memory grew.4 To support a larger address space, Intel added segment registers to the 8086, a hack that allowed the processor to access a megabyte of memory but led to years of gnashed teeth. The concept is to break memory into 64-kilobyte segments. A segment register specifies the start of the memory segment, and a 16-bit address indicates an address within that segment. These are combined in the adder, as shown below, to obtain the memory address. One downside is that accessing regions of memory larger than 64 kilobytes is difficult; the segment register must be modified to get outside the current segment.3

The segment register and the offset are added to create a 20-bit physical address. From iAPX 86,88 User's Manual, page 2-13.

How does the 16-bit adder compute a 20-bit address? The trick is that since the segment register is shifted 4 bits, the adder sums the 16 bits of the segment register and the top 12 bits of the offset. The four low bits of the offset bypass the adder since they are unchanged. For other purposes (such as incrementing the instruction counter), the adder operates on unshifted 16-bit addresses. Thus, the register circuitry has logic to feed either shifted or non-shifted values to the adder.
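The arithmetic behind this trick can be checked with a short sketch. The function names are made up for illustration. The sketch also assumes that a carry out of the 16-bit adder is simply dropped, so addresses wrap around at one megabyte; that matches the 8086's well-known wraparound behavior, but it is not something read off the die here.

```python
def physical_address(segment, offset):
    """20-bit address from a 16-bit add, as described above (illustrative)."""
    assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
    low4 = offset & 0xF                           # low 4 bits bypass the adder
    upper = (segment + (offset >> 4)) & 0xFFFF    # 16-bit add; carry-out dropped
    return (upper << 4) | low4


def reference_address(segment, offset):
    """The textbook formula, for comparison: shift the segment and add."""
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x1234, 0x0022)))   # 0x12362
print(physical_address(0x1234, 0x0022) == reference_address(0x1234, 0x0022))
```

The two functions agree because the low four bits of the shifted segment are always zero, so no carry can come out of the bottom four bit positions.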

The diagram below, from the 8086 patent, shows how the adder sits between the segment registers and the address pins, computing the address. In the patent, the segment registers were named RC, RD, RS, and RA, not their current names: CS, DS, SS, and ES.

The adder, highlighted in yellow, is a key part of the Bus Interface Unit. The upper register file (separate from the general-purpose registers) is connected to the adder. IND and OPR are internal registers, not visible to the programmer. From the 8086 patent.

The adder implementation

If you've studied digital logic, you may be familiar with the full adder, a building-block for adding binary numbers. Specifically, a full adder takes two bits and a carry-in bit. It adds these three bits and outputs the 1-bit sum, as well as a carry-out bit. (For instance 1+0+1 = 10 in binary, so the carry-out is 1 and the sum bit is 0.) A 16-bit adder can be created by joining 16 full-adders, with the carry-out from one feeding into the carry-in of the next. Just as you add two decimal numbers, moving carries to the next column on the left, each full adder adds one column in the binary numbers, and the carry is passed on to the left.
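As a minimal sketch of this textbook scheme (not the 8086's actual circuit), the code below chains 16 full adders so that each carry-out feeds the next carry-in. The function names are chosen just for this example.

```python
def full_adder(a, b, carry_in):
    """Add three bits; return (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def ripple_add16(x, y, carry_in=0):
    """Add two 16-bit values by chaining 16 full adders, low bit first."""
    result = 0
    carry = carry_in
    for bit in range(16):
        sum_bit, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= sum_bit << bit
    return result, carry


print(full_adder(1, 0, 1))          # (0, 1): 1+0+1 = 10 in binary
print(ripple_add16(0xFFFF, 1))      # (0, 1): the carry ripples through all 16 stages
```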

A full adder can be implemented in different ways; the 8086's circuit is shown below. (This circuit is repeated 16 times in the 16-bit adder.) Each adder stage takes two inputs (at the bottom) and the carry-in (inverted, at the right). These are summed to form a 1-bit sum output (bottom) and a carry-out (at the left). The sum bit is formed by the two exclusive-NOR gates that combine the two inputs and the carry-in.5 The output passes through a tri-state buffer (at the top), allowing it to be connected to an internal data bus.6

Schematic of one stage of the 8086's adder. The schematic layout corresponds to the physical layout on the chip.
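To see how two exclusive-NOR gates and an inverted carry-in can still yield the ordinary sum bit, here is a small logic-level sketch. The order of the gates and the position of the final inverter are assumptions made for the example; only the count of inversions (four, as described in note 5) comes from the text.

```python
from itertools import product


def xnor(a, b):
    """Exclusive-NOR: 1 when the two bits are equal."""
    return 1 - (a ^ b)


def stage_sum(in1, in2, carry_in_bar):
    """Sum bit of one adder stage, given an inverted carry-in (assumed gate order)."""
    first = xnor(in1, in2)               # first XNOR combines the two inputs
    second = xnor(first, carry_in_bar)   # second XNOR folds in the inverted carry
    return 1 - second                    # final inverter


# Four inversions in total, so the result is the plain XOR of the three bits.
for in1, in2, carry_in in product((0, 1), repeat=3):
    assert stage_sum(in1, in2, 1 - carry_in) == in1 ^ in2 ^ carry_in
print("sum bit matches in1 XOR in2 XOR carry-in for all 8 cases")
```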

The carry computation uses an optimization called the Manchester carry chain7, dating back to 1959. The problem with addition is that carries are slow; in the straightforward approach, each bit sum can't be computed until the carry to the right has been computed. (This is similar to computing 99999999+1 with long addition; each digit requires you to carry the one.) If each bit must wait for the previous carry, addition becomes a slow, serial process.

The idea behind the Manchester carry chain is to decide, in parallel, if each stage will generate a carry, propagate an existing carry, or block any carry. Then, the carry can rapidly flow through the "carry chain" without sequential evaluation. To understand this, consider the cases when adding two bits and a carry-in. For 0+0, there will be no carry, regardless of any carry-in. On the other hand, adding 1+1 will always produce a carry, regardless of any carry-in; this case is called "carry generate". The interesting cases are 0+1 and 1+0; there will be a carry-out if there was a carry-in. This case is called "carry propagate" since the carry-in propagates through the stage unchanged.

The "carry generate" and "carry propagate" signals are used to open or close switches (i.e. transistors) in the carry line. For "carry propagate", carry-in is connected to carry-out, so the carry can flow through. Otherwise, the incoming carry is disconnected. For "carry generate", a carry signal is sent to carry-out. Since these switches can all be set in parallel, carry computation is quick. There is still some propagation delay as the carry-in flows through the switches, potentially from bit 0 all the way to bit 15, but this is much faster than computing the carry through a sequence of logic gates.
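The generate/propagate/block decision can be sketched in software as follows. This is a logical model only, with invented function names: the Python loop visits the stages one at a time, whereas in the chip the switches are all set at once and the carry flows through them electrically, so the sketch shows the logic and not the speed advantage.

```python
def manchester_carries(x, y, carry_in=0, width=16):
    """Return (carries into each bit, final carry-out) using generate/propagate."""
    generate = x & y       # 1+1: this stage produces a carry by itself
    propagate = x ^ y      # 0+1 or 1+0: this stage passes its carry-in along
    carries = []
    carry = carry_in
    for bit in range(width):
        carries.append(carry)               # carry arriving at this stage
        if (generate >> bit) & 1:
            carry = 1                       # switch injects a carry
        elif (propagate >> bit) & 1:
            pass                            # switch closed: carry flows through
        else:
            carry = 0                       # 0+0: any incoming carry is blocked
    return carries, carry


def manchester_add(x, y, carry_in=0, width=16):
    """Sum built from the propagate bits and the chain's carries."""
    carries, carry_out = manchester_carries(x, y, carry_in, width)
    propagate = x ^ y
    total = 0
    for bit, carry in enumerate(carries):
        total |= (((propagate >> bit) & 1) ^ carry) << bit
    return total, carry_out


print(manchester_add(0x00FF, 0x0001))   # (256, 0): the carry propagates through bits 1-7
```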

Four stages of the adder, with the carry chain indicated. In this photo, the metal layer on top of the chip is visible, mostly obscuring the polysilicon and silicon underneath. The input and output wiring for each stage is at the bottom.

The carry chain is visible on the die; the photo above shows four stages of the adder. The horizontal lines are the metal wiring: control signals, ground, and power (the thick line near the bottom). The silicon circuitry is barely visible underneath the metal. The carry chain wires are interrupted at each stage, to connect to the transistors underneath, and the new carry continues on to the next stage.

Carry-skip

Careful examination of the adder shows that while the 16 single-bit stages are very similar, they are not all identical. The extra circuitry indicated below turns out to be a performance optimization called the carry-skip adder.

These two stages of the 8086's adder are almost identical, except for the circuitry indicated by the arrow. In this photo, the metal and polysilicon layers were removed, showing the underlying silicon.

The idea of carry-skip is to skip over some of the stages in the carry chain if possible, reducing the worst-case delay through the chain. For example, if there is a carry-in to bit 8, and the carry propagate is set for bits 8, 9, 10, and 11, then it can be immediately determined that there is a carry-in to bit 12. Thus, by ANDing together the carry-in and the four carry-propagate values, the carry-in to bit 12 can be calculated immediately for this case. In other words, the carry skips from bit 8 to bit 12. Likewise, similar carry-skip circuits allow the carry to skip from bit 2 to bit 4, and bit 4 to bit 8. These carry-skip circuits reduced the adder's worst-case computation time.8
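The bit 8 to bit 12 case can be illustrated with a short sketch. The helper names are hypothetical, and the code works with uninverted signals for readability, whereas the chip uses inverted signals and a NOR gate (see note 8).

```python
def carry_into(bit, x, y, carry_in=0):
    """Carry arriving at `bit`, found by walking the chain one stage at a time."""
    carry = carry_in
    for i in range(bit):
        a, b = (x >> i) & 1, (y >> i) & 1
        if a and b:
            carry = 1          # generate
        elif a == b:
            carry = 0          # 0+0 blocks the carry
        # otherwise propagate: carry is unchanged
    return carry


def skip_8_to_12(x, y, carry_in=0):
    """True when the skip condition fires: carry into bit 8 AND propagate on bits 8-11."""
    propagate = x ^ y
    all_propagate = ((propagate >> 8) & 0xF) == 0xF
    return carry_into(8, x, y, carry_in) == 1 and all_propagate


x, y = 0x00FF, 0x0F01
print(skip_8_to_12(x, y), carry_into(12, x, y))   # True 1
```

Whenever the skip condition is true, the slower stage-by-stage result is also 1, so the skip only provides the answer earlier; when it is false, the regular chain still determines the carry.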

Regular logic vs dynamic logic

The performance of the adder is critical to the overall speed of the 8086, so it uses some interesting techniques to implement fast logic gates. Some of the adder's gates are built with dynamic logic. A standard logic gate is straightforward: you put signals in and you get the result out. In contrast, a dynamic logic gate uses a periodic clock signal to compute the logic function.9 Since dynamic logic can be faster and more compact, it is used in modern processors, in the form of domino logic.

Dynamic logic depends on a two-phase clock, commonly used for timing in microprocessors of that era. The two-phase clock consists of two clock signals that are active in alternation. First, phase 1 (ɸ1) is high and phase 2 (ɸ2) is low. Then phase 1 is low and phase 2 is high. This cycle repeats at the clock frequency, such as 5 MHz.

The schematic below shows a dynamic NAND gate from the adder. In phase 1, the clock ɸ1 turns on the lower transistor, pulling the input to the inverter low. Phase 2 is the evaluation phase, where the logic function is computed. If both inputs are high, the two input transistors will turn on, allowing clock ɸ2 to pass through to the inverter input, pulling it high and causing the output to be low. On the other hand, if either input is low, the clock ɸ2 cannot pass through the transistors. Instead, the inverter input remains low from the previous phase, due to the stray capacitance of the wire, so the output is high. Thus, in either case, the circuit implements the NAND functionality, with a low output only if the inputs are both high. Note that unlike a standard logic gate, the dynamic logic gate's output is only valid during clock phase 2.

Implementation of a NAND gate using dynamic logic. The gate is controlled by the two-phase clock signals.
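The two-phase behavior of this gate can be mimicked with a small state machine. This is a behavioral sketch with invented names: the stray capacitance is just a stored variable, and leakage and timing are ignored.

```python
class DynamicNand:
    """Behavioral sketch of the dynamic NAND gate (hypothetical names)."""

    def __init__(self):
        self.inverter_input = 0   # charge held on the wire feeding the inverter

    def phase1(self):
        # The phase-1 transistor pulls the inverter input to ground.
        self.inverter_input = 0

    def phase2(self, input1, input2):
        # Evaluation: the phase-2 clock reaches the node only when both
        # input transistors are on; otherwise the node keeps its old value.
        if input1 and input2:
            self.inverter_input = 1
        return 1 - self.inverter_input   # inverter output, valid in phase 2


gate = DynamicNand()
for a in (0, 1):
    for b in (0, 1):
        gate.phase1()
        print(a, b, gate.phase2(a, b))
```

One consequence visible in the model: once the node has been pulled high during phase 2, a later drop in an input cannot pull it back down until the next phase 1.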

The diagram below shows how the dynamic NAND gate is physically implemented on the die; the layout of the schematic corresponds to the physical layout. In the photo, the metal layer has been removed, showing the silicon underneath. The yellowish regions are doped, conductive silicon. The brownish, metallic lines are polysilicon, a special type of silicon used as wiring. A transistor is formed when polysilicon crosses doped silicon; the polysilicon is the gate, controlling conduction between the silicon on either side. The transistors have complex, twisted shapes to fit the circuitry in as little space as possible. Each transistor was given a particular size for the best balance between speed and power consumption. For example, the input transistors are small, while the inverter transistor is much larger.

A dynamic NAND gate in the 8086, with corresponding schematic. The metal layer has been removed for this photo, revealing the silicon and polysilicon. The layout is slightly different between the lower stages (shown) and the upper stages.

The diagram below shows the location of a NAND gate in the 8086 chip. The first box zooms in on one of the 16 single-bit adder circuits. The second box shows the position of the NAND gate within the adder. The NAND gate is almost visible in the overall die photo, showing how large the features are compared to a modern chip.

Each stage of the adder has a dynamic NAND gate. The NAND gate in one of these stages is highlighted.

Another interesting dynamic logic gate in the adder is exclusive-NOR (XNOR, the complement of XOR), which outputs 1 if both inputs are the same, and 0 otherwise. The schematic below shows the implementation of XNOR.10 As before, during phase 1, the inverter input is pulled to ground. In the evaluation phase, clock ɸ2 can pull the inverter input high through either the upper pair of transistors or the lower pair of transistors. This will happen if the inputs are different (input 2 is high and input 1 is low, or if input 1 is high and input 2 is low), causing the inverter output to be low. Otherwise, the inverter input will remain low from phase 1, and the inverter output will be high. Thus, the output is high if the two inputs are equal, and low otherwise, the desired XNOR behavior.

A dynamic XNOR gate, as implemented on the 8086.
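A logic-level sketch of this gate follows. It assumes each pair of pass transistors is gated by one input and the complement of the other, which is what the "inputs are different" condition implies; which pair is drawn as "upper" or "lower" is a guess, and the names are made up for the example.

```python
def dynamic_xnor(input1, input2):
    """One phase-1/phase-2 cycle of the dynamic XNOR (idealized, no leakage)."""
    inverter_input = 0                         # phase 1: node pulled to ground
    # Phase 2: either pair of series transistors can pass the clock to the node.
    upper_pair = input2 == 1 and input1 == 0   # assumed pairing
    lower_pair = input1 == 1 and input2 == 0   # assumed pairing
    if upper_pair or lower_pair:
        inverter_input = 1                     # inputs differ: node pulled high
    return 1 - inverter_input                  # inverter output


for a in (0, 1):
    for b in (0, 1):
        print(a, b, dynamic_xnor(a, b))
```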

Conclusions

The adder in the 8086 has a critical role, computing addresses for every memory access. A 16-bit adder may seem like a straightforward circuit, but the adder in the 8086 was highly optimized so it wouldn't be a performance bottleneck. To speed up carry processing, the adder uses a Manchester carry chain, with carry-skip circuitry on top of that. The adder uses three different designs for logic gates: standard NMOS gates, pass-transistor logic, and dynamic logic. Even at the transistor level, the circuit is highly optimized, with transistors of all shapes and sizes carefully packed together.

The Intel 8086 is an interesting processor with complex circuits but still simple enough that its circuits can be studied under a microscope. The 8086 has 29,000 transistors and features that are a few micrometers large. In comparison, modern processors have billions of transistors and transistors that are measured in nanometers. While the progress of Moore's law has yielded great improvements in modern processors, the processors of the 1970s are much better for reverse engineering.

If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to write more about the 8086 so follow me on Twitter @kenshirriff or RSS for updates.

Notes and references

The adder's layout has bits 15-8 in the top and bits 7-0 below. This layout is a consequence of the bit ordering in the data path: the bits are interleaved 15-7-14-6-...-8-0, instead of linearly 15-14-...-0. The reason behind this interleaving is that it makes it easy to swap the two bytes in the 16-bit word, by swapping pairs of bits. The adder is split into two rows so it fits into the horizontal space available. Even with the tall, narrow layout of an adder stage, a bit of the adder is wider than a bit of the register file. Splitting the adder into two rows keeps the bit spacing approximately the same, avoiding long wires between the register file and the adder. ↩
Many early microprocessors (such as the 6502 and Z-80) had an incrementer for the program counter, separate from the ALU. (One motivation was the ALU was 8 bits while the program counter was 16 bits.) The 68000 had address adders, separate from the ALU. ↩
The 8086's segmented architecture led to programming with near pointers and far pointers. A near pointer was a 16-bit pointer that could be held in a register and manipulated easily, but couldn't access more than 64 kilobytes. A far pointer was the combination of an offset and a segment value, resulting in a pointer that could access the full memory but required twice the storage for each pointer. Comparing far pointers was problematic, since they were not unique; multiple offset/segment combinations could address the same physical memory address. ↩
In contrast to the 8086, the Motorola 68000 microprocessor (1979) had 32-bit registers. Its address bus was 24 bits wide, allowing it to access 16 megabytes of memory directly, without segment registers. The 68020 (1984) extended the address bus to 32 bits, allowing 4 gigabytes of memory to be accessed.

The 68000 was provided in a 64-pin package, providing plenty of pins for the 24 address lines and 16 data lines. In comparison, Intel didn't like large IC packages and used a 40-pin package for the 8086. As a result, the 8086 used 20 pins for the address lines, and reused (i.e. multiplexed) 16 of these pins for data lines. The 8086 also multiplexed many of the control pins, complicating system design. ↩
The desired sum output is input1⊕input2⊕carry-in. In the 8086 adder, the carry-in is inverted, there are two exclusive-NOR gates, and an inverter in the path. Thus, the circuit has four inversions in total; since this number is even, they cancel out and the circuit produces the desired exclusive-OR of the three values. ↩
A tri-state buffer has three different outputs: high (1), low (0), or high-impedance (hi-Z). In the hi-Z state, the buffer is not outputting anything and is electrically disconnected. The motivation for this is that multiple signals can be connected to a bus through tri-state buffers. By enabling one buffer and disabling the rest, the desired signal can be output to the bus. (Regular buffers wouldn't work because electrical problems would arise if one buffer outputs a 1 and another outputs a 0.) Open-collector outputs are an alternative for connecting multiple signals to a bus. ↩
The Manchester carry chain was developed by the University of Manchester and described in the article Parallel addition in digital computers: a new fast 'carry' circuit, 1959. It was used in the Atlas supercomputer (1962).

The original diagram showing how the Manchester carry chain is implemented, from 1959.

The diagram above, from the original article, shows the structure of the Manchester carry chain. Although the switches look like relay contacts, the carry chain was implemented with transistors (2N501 micro-alloy diffused-base transistors). The structure of the carry chain in the 8086 is similar to the diagram above, but the top switches are replaced by XNOR gates. ↩
A few notes on the carry-skip implementation. Conceptually the signals are ANDed together, but the implementation uses a NOR gate since the carry and propagate signal inputs are inverted. For carry-skip to be useful, computing the carry with a gate must be faster than the carry chain, which was achieved by skipping four stages at a time. (I don't know why the first stage was implemented with a smaller skip.) Note that carry-skip helps in specific cases (which include the worst-case), so the regular carry circuitry is still required. ↩
Processors always have a maximum clock speed, the fastest they can run. (The original 8086 ran at up to 5 MHz, while the later 8086-1 supported 10 MHz.) However, due to the use of dynamic logic, the 8086 also had a minimum clock speed of 2 MHz. If the clock ran slower than that, there was a risk of the charge on a wire leaking away before it was used, causing errors. ↩
Surprisingly, the adder uses a completely different implementation for the upper XNOR gate; it is implemented with pass-transistor logic rather than dynamic logic. I think the motivation is that the carry-in signal to these XNOR gates is not quite synchronous, due to propagation delay through the carry chain. Dynamic logic has the disadvantage that if an input signal switches low after the clock, the gate can't recover; the circuit has been discharged and won't be recharged until the next clock phase. In particular, if a carry comes in after clock phase 2 has started, it can't switch the output high. By using non-dynamic logic, the output will switch correctly when the carry arrives, even if it is not aligned with the clock.

Pass-transistor logic is different from "regular" NMOS logic gates, but provides a more efficient way of implementing XNOR. The circuit is similar to the XNOR in the Z-80 microprocessor, which I've described earlier, so I won't go into more detail here.

Pass-transistor logic is also used to implement the input and output latches on the adder. On the patent diagram shown earlier, these latches appear as "TMP B" and "TMP C" on the input side of the adder and "TMP ɸ1" on the output side. These latches are necessary because otherwise the adder's output would be connected directly to the input, causing the adder to repeatedly add. The implementation of these latches is simply clocked pass transistors in the path, holding the value by capacitance. ↩

Inside the 8086 processor, tiny charge pumps create a negative voltage

Introduced in 1978, the revolutionary Intel 8086 microprocessor led to the x86 processors used in most desktop and server computing today. This chip is built from digital circuits, as you would expect. However, it also has analog circuits: charge pumps that turn the 8086's 5-volt supply into a negative voltage to improve performance.1 I've been reverse-engineering the 8086 from die photos, and in this post I discuss the construction of these charge pumps and how they work.

Die photo of the 8086 microprocessor. The ALU and registers are on the left. The microcode ROM is in the lower right. Click for a high-resolution image.

The photo above shows the tiny silicon die of the 8086 processor under a microscope. The metal layer on top of the chip is visible, with the silicon hidden underneath. Around the outside edge, bond wires connect pads on the die to the chip's 40 external pins. However, careful examination shows that the die has 42 bond pads, not 40. Why are there two extra ones?

An integrated circuit starts with a silicon substrate, and transistors are built on this. For high-performance integrated circuits, it is beneficial to apply a negative "bias" voltage to the substrate. 2 To obtain this substrate bias voltage, many chips in the 1970s had an external pin that was connected to -5V,3 but this additional power supply was inconvenient for the engineers using these chips. By the end of the 1970s, however, on-chip "charge pump" circuits were designed that generated the negative voltage internally. These chips used a single convenient +5V supply, making engineers happier.

A closeup of the 8086 chip showing the silicon die and the bond wires connecting it to the lead frame.

On the 8086 die, the two extra pads feed this negative bias voltage to the substrate. The photo above shows the silicon die as mounted in the chip, with bond wires connected to the lead frame that forms the pins. Looking carefully, there are two small gray squares above and below the die, each connected to one of the "extra" bond pads. The charge pumps on the 8086 die generate a negative voltage, which passes through the bond wires to these squares, and then through the metal plate underneath to the 8086's substrate.

How the charge pumps work

The photo below highlights the two charge pumps in the 8086. I'll discuss the top one; the bottom one has the same circuitry but a different layout to fit in the available space. Each pump has driver circuitry, a large capacitor, and a pad with the bond wire to the substrate. Each pump is located next to one of the 8086's two ground pads, presumably to minimize electrical noise.

Die photo of the 8086 microprocessor, zooming in on the two substrate bias generators.

You might wonder how a charge pump can turn a positive voltage into a negative voltage. The trick is a "flying" capacitor, as shown below. On the left, the capacitor is charged to 5 volts. Now, disconnect the capacitor and connect the positive side to ground. The capacitor still has its 5-volt charge, so now the low side must be at -5 volts. By rapidly switching the capacitor between the two states, the charge pump produces a negative voltage.

On the left, the "flying capacitor" is charged to 5 volts. By switching ground to the upper terminal, the capacitor now outputs -5 volts. (source)
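The flying-capacitor trick comes down to one subtraction, which a minimal Python sketch can make explicit. This is an idealized illustration only: it assumes a lossless capacitor with no load, and the function name is made up for the example.

```python
def flying_capacitor_output(supply_volts: float) -> float:
    """Ideal flying-capacitor inversion: no losses and no load (an assumption)."""
    # State 1: the upper terminal is at the supply and the lower terminal at ground.
    upper, lower = supply_volts, 0.0
    stored = upper - lower  # voltage held across the capacitor

    # State 2: the upper terminal is switched to ground. The capacitor keeps
    # its voltage, so the lower terminal must sit that far below ground.
    upper = 0.0
    lower = upper - stored
    return lower


print(flying_capacitor_output(5.0))  # -5.0
```

A real pump loses some voltage in its switches and diodes, so the output is less negative than this ideal figure.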

The 8086's charge pump circuit uses MOSFET transistors and diodes to switch the capacitor between the two states, with an oscillator to control the transistors, as shown in the schematic below. The ring oscillator consists of three inverters connected in a loop (or ring). Because the number of inverters is odd, the system is unstable and will oscillate.5 For instance, if the input to the first inverter is 0, its output will be 1, the second output will be 0, and the third output will be 1. This will flip the first inverter, and the "flip" will travel through the loop causing oscillation. To slow down the oscillation rate, two resistor-capacitor networks are inserted into the ring. Since the capacitors will take some time to charge and discharge, the oscillations will be slowed, giving the charge pump time to operate.4

Schematic of the charge pump used in the Intel 8086 to provide negative substrate bias.
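The ring oscillator's behavior can be sketched in a few lines of Python. This is a simplified model, not the chip's circuit: it assumes every inverter has the same unit delay and it leaves out the resistor-capacitor networks, so it shows only why the ring never settles, not the actual frequency. The function names are hypothetical.

```python
def step(outputs):
    """Advance one gate delay: every inverter inverts the previous stage's old output."""
    n = len(outputs)
    return [1 - outputs[(i - 1) % n] for i in range(n)]


def run(outputs, steps):
    """Return the list of ring states, starting with the initial one."""
    history = [list(outputs)]
    for _ in range(steps):
        outputs = step(outputs)
        history.append(outputs)
    return history


# Three inverters: the first inverter's input is 0, so the outputs start as 1, 0, 1.
ring3 = run([1, 0, 1], 6)
for state in ring3:
    print(state)
```

In this model the three-inverter ring never reaches a state where two consecutive steps agree, and it returns to its starting state after six gate delays, that is, one full oscillation per two trips around the ring.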

The outputs from the ring oscillator are fed to the transistors that drive the capacitor. In the first step, the upper transistor is switched on, causing the capacitor to charge through the first diode to 5 volts with respect to ground. The second step is where the magic happens. The lower transistor turns on, connecting the high side of the capacitor to ground. Since the capacitor is still charged to 5 volts, the low side of the capacitor must now be at -5 volts, producing the desired negative voltage. This goes through the second diode and the bond wire to the substrate. When the oscillator flips again, the upper transistor turns on and the cycle repeats. The charge pump gets its name because it pumps charge from the output to ground.6 The diodes are similar to check valves in a water pump, making sure charge moves in the right direction.
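The two-step pumping cycle can be approximated with a small charge-sharing simulation. Everything numeric here is an assumption for illustration: a 1-volt drop across each diode-connected transistor, a substrate modeled as a plain capacitor ten times larger than the pump capacitor, and no leakage. The function names are hypothetical.

```python
def pump_cycle(substrate_v, vdd=5.0, diode_drop=1.0, c_pump=1.0, c_sub=10.0):
    """Run one charge/pump cycle and return the new substrate voltage."""
    # Step 1: the upper transistor drives the capacitor's high side to vdd,
    # while the first diode holds the low side one diode drop above ground.
    cap_v = vdd - diode_drop

    # Step 2: the lower transistor grounds the high side, so the low side
    # swings to -cap_v.
    low_side = -cap_v

    # The second diode conducts only if the substrate is more than one
    # diode drop above the capacitor's low side.
    if substrate_v - low_side <= diode_drop:
        return substrate_v

    # Charge is shared until the substrate sits one diode drop above the low side.
    low_side = (c_sub * (substrate_v - diode_drop) + c_pump * low_side) / (c_sub + c_pump)
    return low_side + diode_drop


def run_pump(cycles, **kwargs):
    """Return the substrate voltage after each cycle, starting from 0 volts."""
    voltage = 0.0
    trace = [voltage]
    for _ in range(cycles):
        voltage = pump_cycle(voltage, **kwargs)
        trace.append(voltage)
    return trace


trace = run_pump(200)
print(round(trace[-1], 3))
```

With these assumed values the substrate voltage falls a little on every cycle and levels off two diode drops short of -5 volts. The real value depends on the actual transistor thresholds and on the load, which this sketch does not model.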

The implementation in silicon

The photo below shows the charge pump as it is implemented on the chip. In this photo, the metal wiring is visible on top, with reddish polysilicon underneath and beige silicon at the bottom. The main capacitor is visible in the center, with H-shaped wiring connecting it to the circuitry on the left. (Part of the capacitor is hidden under the wide metal power trace at the top.) On the right, the substrate bond wire is attached to the pad. A test pattern is below the pad; it has a square for each mask used to produce a layer of the chip.

The charge pump, showing the metal layer.

Removing the metal layer shows the circuitry more clearly, below. The large charge pump capacitor takes up the right half of the photo. Although microscopic, this capacitor is huge by chip standards, about the size of a 16-bit register. The capacitor consists of polysilicon over a silicon region, separated by insulating oxide; the polysilicon and silicon form the plates of the capacitor. On the left side are the smaller capacitors and the resistors that provide the R-C delay for the oscillator. Below them is the oscillator circuitry and the drive transistors.7

An 8086 charge pump, showing the key components. The metal has been removed for this photo, to show the silicon and polysilicon underneath.

One interesting feature of the charge pump is the two diodes, each built from eight transistors in a regular pattern. The diagram below shows the structure of a transistor. Regions of the silicon are doped with impurities to create diffusion regions with desired properties. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. A high voltage on the gate lets current flow between the source and drain, while a low voltage blocks current flow. These tiny transistors can be combined to form logic gates, the components of microprocessors and other digital chips. But in this case, the transistors are used as diodes.

Structure of an NMOS transistor (MOSFET) as implemented in an integrated circuit.

The photo below shows a transistor in the charge pump, viewed from above. As in the diagram, polysilicon forms the gate between the silicon diffusion regions on either side. A diode can be formed from a MOSFET by connecting the gate and drain together (details) through the silicon/polysilicon connection at the bottom of the photo. The silicon can also be connected to the metal layer through a "via". The metal layer was removed for this photo, but faint circles indicate the position of silicon/metal vias.

A transistor in the charge pump circuit. The polysilicon gate separates the transistor's source and drain on either side.

The diagram below shows how the two diodes are implemented from 16 transistors. To support the relatively high current of the charge pump, eight transistors are used in parallel for each diode. Note that neighboring transistors share source or drain regions, allowing transistors to be packed densely. The blue lines indicate the metal wires; the metal was removed for this photo. The dark circles indicate connections (vias) between the metal and silicon.

The charge pump has two diodes, each implemented with 8 transistors. The source, gate, and drain are indicated with S, G, and D.

Putting this all together, the upper eight transistors have their sources connected to ground by a metal wire. Their gates and drains are connected together by the polysilicon below the transistors, making them into diodes, and they are connected to the capacitor by a metal wire. The lower eight transistors form a second diode; their gates and drains are wired together by the lower metal wire loop. Note how the layout has been optimized; for example, the gates have bent shapes to avoid the vias (black dots).

Conclusions

The substrate bias generator on the 8086 chip9 is an interesting combination of digital circuitry (a ring oscillator formed from inverters) and an analog charge pump. While the bias generator may seem like an obscure part of 1970s computer history, bias generation is still part of modern integrated circuits. It is much more complex in modern chips which have multiple carefully regulated biases in multiple power domains. 8 In a sense it is analogous to the x86 architecture, something that started in the 1970s and is even more popular today, but has become unimaginably more complex in the quest for higher performance.

If you're interested in the 8086, I wrote about the 8086 die, its die shrink process and the 8086 registers earlier. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or RSS for updates.

Notes and references

Strictly speaking, the entire chip is analog: there's an old saying that "Digital computers are made from analog parts". This saying came from DEC engineer Don Vonada and was published in DEC's Computer Engineering in 1978.

Vonada's Engineering Maxims (text).

↩
Putting a negative bias voltage on the substrate had several benefits. It decreased parasitic capacitance making the chip faster, made the transistor threshold voltage more predictable, and reduced leakage current. ↩
Early DRAM memory chips and microprocessor chips often required three supplies: +5V (Vcc), +12V (Vdd) and -5V (Vbb) bias voltage. In the late 1970s, improvements in chip technology allowed a single supply to be used instead. For example, Mostek's MK4116 (a 16 kilobit DRAM from 1977) required three voltages while the improved MK4516 (1981) operated on a single +5V supply, simplifying hardware designs. (Amusingly, some of these chips still kept the Vbb and Vcc pins for backward compatibility but left them unconnected.) Intel's memory chips followed a similar path, with the 2116 DRAM (16K, 1977) using three voltages and the improved 2118 (1979) using a single voltage. Similarly, the famous Intel 8080 microprocessor (1974) used enhancement-mode transistors and required three voltages. An improved version, the 8085 (1976), used depletion-mode transistors and was powered by a single +5V supply. The Motorola 6800 microprocessor (1974) used a different approach for a single supply; although the 6800 was built from the older enhancement-load transistors, it avoided the +12V supply by implementing an on-chip voltage doubler, a charge pump that increased the voltage. ↩
I tried to measure the frequency of the charge pump by looking at the chip's current to see fluctuations due to the charge pump. I measured 90 MHz fluctuations, but I suspect I was measuring noise and not the charge pump's oscillations. ↩
Because the circuit has an odd number of inverters, it oscillates. If, on the other hand, it had an even number of inverters, it would be stable in two different states. This technique is used in the 8086's registers: a pair of inverters stores each bit (details). ↩
I've simplified the charge pump discussion slightly. Due to voltage drops in the transistors, the substrate voltage will probably be around -3V, not -5V. (If a chip requires a larger voltage drop, charge pump stages can be cascaded.) For the pump direction, I'm referring to current flow. If you think of it as pumping electrons, the negative electrons are being pumped the opposite direction, into the substrate. ↩
The oscillator is built from 13 transistors. Seven transistors form the 3 inverters (one inverter has an extra transistor to provide extra output current). The six drive transistors consist of two transistors pulling the output high and four transistors pulling the output low. The layout is strangely different from normal inverter circuitry, probably because the current requirements are different from normal digital logic. ↩
Bias generators are now available as IP blocks that can be licensed and be plugged into a chip design. For more information on bias in modern chips, see Body bias, Multi bias domain implementation, or this presentation. There is even a standard IEEE 1801 power format that allows IC design tools to generate the necessary circuitry. ↩
The Intel 8087, the math coprocessor chip that goes along with the 8086, also has a substrate bias generator. It uses the same principles, but unexpectedly has a different circuit, using 5 inverters. I wrote about it in detail here. ↩

The Intel 8086 processor's registers: from chip to transistors

The Intel 8086 microprocessor is one of the most influential chips ever created; it led to the x86 architecture that dominates desktop and server computing today. I've been reverse-engineering the 8086 from die photos, and in this post I discuss how its register file is implemented.

The 8086 die, showing the register storage. The upper registers are used by the Bus Interface Unit for memory accesses, while the general-purpose lower registers are used by the Execution Unit. The instruction buffer is a 6-byte queue of prefetched instructions.

The photo above shows the silicon die of the 8086 processor under a microscope. The metal layer on top of the chip is visible, with the silicon hidden underneath. Around the outside edge, bond wires connect pads on the die to the chip's 40 external pins.

The highlighted region indicates the 8086's fifteen 16-bit registers and six bytes of instruction prefetch queue.1 Registers take up a significant portion of the die, even though they are just 36 bytes in total. Due to space limitations, early microprocessors had a relatively small number of registers; in comparison, a modern processor chip has kilobytes of registers and megabytes of cache storage.2

How a register is implemented in silicon

I'll start by explaining how the 8086 is built from NMOS transistors. Then I'll explain how an inverter is constructed, how a single bit is stored using inverters, and how a register is constructed.

The 8086 and other chips of that era were built from a type of transistor called NMOS. These chips consisted of a silicon substrate, which was "doped" by diffusion of arsenic or boron to form transistors. Above the silicon, polysilicon wiring created the gates of the transistors and wired components together. Finally, a metal layer on top provided more wiring. (Modern processors, in comparison, use CMOS technology, which combines NMOS and PMOS transistors, and they have many metal layers.)

The schematic below shows an inverter built from an NMOS transistor and a resistor.3 With a low input, the transistor is off, so the pull-up resistor pulls the output high. With a high input, the transistor turns on, connecting ground and the output, pulling the output low. Thus, the input signal is inverted.

This schematic shows how an inverter is created from a transistor and resistor. The photo shows the implementation on the chip. The metal layer was removed to show the polysilicon and silicon underneath.

The photo above shows how an inverter is physically constructed in the 8086. The pinkish regions are conductive doped silicon and the sparkly copper-colored lines are polysilicon on top. A transistor is created where polysilicon crosses silicon: the polysilicon forms the transistor's gate, while the silicon regions on either side are the transistor's source and drain. The large polysilicon rectangle forms the pull-up resistor between +5 volts and the output. Thus, the chip's circuitry matches the inverter schematic. Under a microscope, circuits such as this inverter are visible and can be reverse-engineered.

The building block for the registers is two inverters in a feedback loop, storing a single bit, as shown below. If the top wire has a 0, the right inverter will output a 1 to the bottom wire. The left inverter will then output a 0 to the top wire, completing the cycle. Thus, the circuit is stable and will "remember" the 0. Likewise, if the top wire is a 1, this will get inverted to a 0 at the bottom wire, and back to a 1 at the top. Thus, this circuit can store either a 0 or a 1, forming a 1-bit memory.

In the 8086, two coupled inverters hold a single bit in the register. This circuit is stable in either the 0 or 1 state.
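A tiny Python sketch confirms that the two-inverter loop holds either value. It treats the inverters as ideal logic gates, which is an assumption; the function name is made up for the example.

```python
def go_around_loop(top):
    """Follow the feedback loop once and return (new top wire, bottom wire)."""
    bottom = 1 - top      # right inverter drives the bottom wire
    new_top = 1 - bottom  # left inverter drives the top wire
    return new_top, bottom


for stored in (0, 1):
    new_top, bottom = go_around_loop(stored)
    print(stored, "->", new_top, "(bottom wire:", bottom, ")")
```

Because the value that comes back around always equals the value that started, the loop reinforces whichever bit it currently holds.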

Three transistors are added to make a usable register cell from the inverter pair.4 One transistor selects the cell for reading, another transistor selects the cell for writing, and the third amplifies the signal when reading. In the center of the schematic below, two inverters store the bit. To read the bit, the read line is energized. This connects the inverter output to the bit line through the amplifying transistor. To write the bit, the write line is energized, connecting the bit line to the inverters. By putting a high-current 0 or 1 signal on the bit line, the inverters (and thus the stored bit) are forced to the desired value. Note that the bit line is used for both reading and writing.

Schematic diagram of a register cell storing a single bit. The register file is built from an array of these cells.

The register file consists of a matrix of register cells like the one above. The matrix is 16 cells wide since registers are 16 bits wide. Each register is arranged horizontally, so a read line or write line selects all the cells for a particular register. The 16 vertical bit lines form a bus, so all 16 bits in the selected register are read or written in parallel.
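The cell and the matrix can be modeled behaviorally in Python. This is a logical sketch under several assumptions: the class and method names are invented, signal polarity through the amplifying transistor is ignored, and column number is simply taken to equal bit number, which need not match the physical ordering on the die.

```python
class RegisterCell:
    """One bit: an inverter pair plus read and write select transistors."""

    def __init__(self):
        self.bit = 0  # state held by the inverter pair

    def access(self, bit_line, read=False, write=False):
        """Return the value on the shared bit line after this cell acts.

        bit_line is None when nothing else is driving the line.
        """
        if write and bit_line is not None:
            self.bit = bit_line  # the driven bit line overpowers the weak inverter
        if read:
            return self.bit      # the cell drives the bit line
        return bit_line


class RegisterFile:
    WIDTH = 16  # one cell per bit line

    def __init__(self, rows=8):
        self.cells = [[RegisterCell() for _ in range(self.WIDTH)]
                      for _ in range(rows)]

    def write(self, row, value):
        """Energize one row's write line with `value` on the 16 bit lines."""
        for col, cell in enumerate(self.cells[row]):
            cell.access((value >> col) & 1, write=True)

    def read(self, row):
        """Energize one row's read line and collect the 16 bit lines."""
        value = 0
        for col, cell in enumerate(self.cells[row]):
            value |= cell.access(None, read=True) << col
        return value


registers = RegisterFile()
registers.write(3, 0xBEEF)
print(hex(registers.read(3)))  # 0xbeef
```

Only the selected row responds, so the other registers keep their contents even though they share the same bit lines.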

The photo below zooms in on the 8086's general-purpose register file, showing the matrix of register cells: 16 columns and 8 rows for eight 16-bit registers. It then zooms in on a single register cell in the register file. I'll now explain how this cell is implemented.

Die photo of the 8086, zooming in on the lower register file (eight 16-bit registers) and then a single register cell. The metal and polysilicon were removed for this photo to show the silicon structures.

The 8086 is constructed from doped silicon and polysilicon wiring with metal wiring on top. The left photo below shows the vertical metal wiring of a register cell. The ground, power, and bit line wires are indicated. (The remaining wire crosses the register file but isn't connected to it.) In the right photo, the metal layer has been dissolved to show the polysilicon and silicon underneath. The read and write lines are horizontal polysilicon wires. (Because the chip has only one layer of metal, the register uses metal for the vertical lines and polysilicon for the horizontal lines so they don't run into each other.) The connections (called vias) between the metal and the silicon are visible as brighter circles in the metal photo and as circular spots in the silicon photo.

A register storage cell. The photo on the left shows the metal layer, while the photo on the right shows the corresponding polysilicon and silicon underneath. The bright circles on the metal layer are vias connected to the circles on the silicon.

The diagram below shows how the physical layout of the register cell matches up with the schematic. The inverters are formed from transistors A and B, along with the resistors. Transistors C, D, and E are formed by the labeled strips of polysilicon. The bit line is not visible below, since it is in the metal layer. Note that the layout of the memory cell is highly optimized to minimize its size. Also note that transistor A is much smaller than the other transistors; inverter A has a weak output so it can be overpowered by the bit line when a value is written.

A register cell in the 8086 with the corresponding schematic.

8-bit register support

Careful examination of the die shows that some of the register cells have a slightly different structure. On the left is a pair of the register cells discussed above,5 while the right photo shows a pair of register cells with two write control lines instead of one. In the left photo, the write line crosses the silicon in both register cells. However, in the right photo, the "write right" line crosses the silicon on the right side but goes between the silicon regions on the left. Conversely, the "write left" line crosses the silicon on the left side and goes between the silicon on the right. Thus, one write line controls writes to the right-hand bit, while the other controls writes to the left-hand bit. In the full 16-bit register, this allows alternating 8-bit parts to be written separately.6

Two pairs of memory cells, showing different circuitry. The left cells have a single write line, while the right cells have separate write lines for the left and right bits.

Why do some registers have two write lines while others have one? The reason is that the 8086 has 16-bit registers, but four of them can also be accessed as 8-bit registers, as shown below. For example, the 16-bit accumulator A can be accessed as an 8-bit AH (accumulator high) register and an 8-bit AL (accumulator low) register. By implementing the registers with two write control lines, either half of the register can be written separately.7

The general-purpose registers in the 8086 processor. The A, B, C, and D registers can be split into two 8-bit registers. From The 8086 Family User's Manual.
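The effect of the two write control lines can be sketched with bit masks in Python. This models only the logical result; the function name is hypothetical, and placing the two halves in the upper and lower bytes of the bus value is a simplification that says nothing about where the bits physically sit.

```python
LOW_MASK = 0x00FF   # bits selected by the "low half" write line
HIGH_MASK = 0xFF00  # bits selected by the "high half" write line


def write_register(stored, bus, write_low=False, write_high=False):
    """Return the new 16-bit contents after energizing the chosen write lines."""
    mask = (LOW_MASK if write_low else 0) | (HIGH_MASK if write_high else 0)
    # Selected bits come from the bus; unselected bits keep their old value.
    return (stored & ~mask & 0xFFFF) | (bus & mask)


ax = 0x1234
ax = write_register(ax, 0x00CD, write_low=True)    # write AL only
print(hex(ax))  # 0x12cd
ax = write_register(ax, 0xAB00, write_high=True)   # write AH only
print(hex(ax))  # 0xabcd
ax = write_register(ax, 0x5678, write_low=True, write_high=True)  # full 16-bit write
print(hex(ax))  # 0x5678
```

A register with a single write line behaves like the last call, where both halves are always written together.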

Multi-port registers

So far, I've discussed the eight general-purpose "lower registers". The 8086 also has seven "upper registers" used for memory accesses, including the infamous segment registers.8 These registers have a more complex "multi-port" design, allowing multiple reads and writes to take place simultaneously.9 For instance, the multi-ported register file would allow the program counter to be read, a segment register to be read, and a different segment register to be written, all at the same time.

The multi-ported register cell below is built around the same two-inverter circuit as before but it has three bit lines (compared to one earlier) and five control lines (compared to two). The three read control lines allow the register cell contents to be read to any of the three bit lines, while the two write control lines allow bit line A or bit line C to be written to the register cell.

A multi-ported register cell in the 8086 processor.
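A behavioral sketch in Python shows what the extra ports buy. It is a simplified model with assumptions stated up front: the class name and the register names used in the example are illustrative, every register is given all three read ports and both write ports, and which register uses which bit line in the real chip is not something this sketch claims to know.

```python
class MultiPortFile:
    READ_LINES = ("A", "B", "C")   # any bit line can be read onto
    WRITE_LINES = ("A", "C")       # only bit lines A and C can write a cell

    def __init__(self, names):
        self.regs = {name: 0 for name in names}

    def cycle(self, reads, write=None):
        """Perform several reads and at most one write in a single step.

        reads: {bit_line: register_name}
        write: (bit_line, register_name, value) or None
        """
        for line in reads:
            if line not in self.READ_LINES:
                raise ValueError(f"no read port on bit line {line}")
        if write is not None:
            write_line, target, value = write
            if write_line not in self.WRITE_LINES:
                raise ValueError(f"no write port on bit line {write_line}")
            if write_line in reads:
                raise ValueError(f"bit line {write_line} is already in use")
        # All reads see the contents from before the write.
        result = {line: self.regs[name] for line, name in reads.items()}
        if write is not None:
            self.regs[target] = value & 0xFFFF
        return result


upper = MultiPortFile(["PC", "CS", "ES"])
upper.regs["PC"] = 0x0100
upper.regs["CS"] = 0xF000
out = upper.cycle({"A": "PC", "B": "CS"}, write=("C", "ES", 0x2000))
print({line: hex(v) for line, v in out.items()}, hex(upper.regs["ES"]))
```

With a single bit line, the same three accesses would have to happen one after another.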

At first glance, the 8086's register file looked like a uniform set of registers, but close examination reveals that each register has been optimized based on its function.10 Some registers are simple 16-bit registers, which have the most compact layout. Other 16-bit registers can also be accessed as two 8-bit registers, requiring another control line. The most complex registers have two or three read ports and one or two write ports. In each case, the physical layout of the register cell has been carefully designed to be as compact as possible, with elaborate transistor shapes, as seen below. Intel's engineers shrank the register layout as much as possible to fit all the registers in the available space.

The upper register file, consisting of ten 16-bit registers. This photo shows the silicon and polysilicon. The vertical red lines are traces of the metal layer that was removed. Click for a larger image.

Conclusions

Although the 8086 processor is 42 years old, it still heavily influences modern computing through the x86 architecture in heavy use today. The registers of the 8086 still exist in modern x86 computers, although the registers are now 64 bits long and have been joined by many new registers.

The 8086 is an interesting subject for die analysis since its transistors are large enough to be visible under a microscope. It was a complex processor at the time, with 29,000 transistors, but it is still simple enough that the circuitry can be traced out and understood. I plan to analyze the 8086 in more detail in future blog posts so follow me on Twitter @kenshirriff or on RSS for updates.11

Notes and references

The 8086 was apparently the first microprocessor to implement instruction prefetching. The Motorola 68000 (1979) had a 4-byte instruction prefetch buffer. Prefetching in mainframes goes back to the IBM Stretch (1961), CDC 6600 (1964), and IBM System/360 Model 91 (1966). ↩
It's difficult to determine how many registers are in a modern processor; the only accurate description I could find was in The Anatomy of a High-Performance Microprocessor, which describes the AMD K6 processor (1997) in detail. Due to register renaming modern processors have many more physical registers than architectural registers (the registers visible to a programmer), and the number of physical registers is not documented. (In addition to the eight general-purpose x86 registers, the K6 had 16 microarchitecture scratch registers for renaming.)

Processors supporting AVX-512 include 32 512-bit registers, so that's 2 kilobytes of registers for that feature alone. This makes it even harder to determine the register size. As for cache size, high-end processors have up to 77 MB of cache storage. ↩
The pull-up resistor in an NMOS gate is actually a special transistor. The depletion-mode transistor acts as a resistor but is more compact and performs better than an actual resistor. ↩
Other processors use slightly different register storage cells. The 6502 uses an additional transistor in the inverter feedback loop to break the feedback loop when writing a new value. The Z-80 writes to both inverters at the same time, making the transition "easier" but requiring two write wires. While the 8086 has an amplification transistor in each register cell for reads, other processors read the outputs from both inverters and use an external differential amplifier to strengthen the signal. The 8086's basic register cell uses 7 transistors (7T), more than a typical 6-transistor (6T) or 4-transistor (4T) static RAM cell, but it only uses one bit line rather than two differential bit lines. Dynamic memory (DRAM) is much more efficient, using one transistor and a capacitor, but data will be lost without refresh. ↩
On the die, register cells are not repeated uniformly, but instead alternating cells are mirror images. This improves the density of the register cells because a power line running between two mirror-image cells can feed both of them (and the same with ground). Thus, the mirror-image layout reduces the number of power and ground lines by half. ↩
Although block diagrams always show the 16-bit registers split into a left half and a right half, the actual implementation alternates the bits from each half instead of storing one 8-bit part on the left and the other on the right. This implementation makes it easier to swap the two halves of a 16-bit word, which is required in several cases. (One is an unaligned memory read or write. Another is an ALU operation using the top half of a register, such as AH.) Swapping bits between the left half and the right half would require running long wires between the halves for each bit. But with the interleaved implementation, swapping the two halves is a matter of swapping each pair of neighboring bits, which doesn't need long wires. In other words, the interleaved layout in the 8086's registers simplifies the wiring for swapping the two halves of a word. ↩
If the register file only supported 16-bit registers instead of 8-bit half-registers, the processor could still work but would be less efficient. Writes to an 8-bit half could be done by reading the full 16 bits, modifying the 8-bit half, and then writing back the full 16 bits. This would take three register accesses instead of one. Note that the register file doesn't need special support for 8-bit reads since the unwanted half can be ignored. ↩
The block diagram below is different from most 8086 block diagrams because it shows the actual physical implementation, rather than the programmer's view of the processor. In particular, this diagram shows two "Internal Communication Registers" in the Bus Interface Unit registers (right) along with the segment registers, matching the 7 registers visible on the die. (The temporary registers below are physically part of the ALU, so I'm not discussing them in this blog post.)

Block diagram of the 8086 processor. From The 8086 Family User's Manual.

↩
The book Modern Processor Design discusses the complex register systems of processors from the early 2000s. It says that circuit complexity increases rapidly beyond 3 ports, but some high-end processors had register files with 20 ports or more. ↩
The upper registers have differing numbers of read and write ports, as follows: two registers with 3 read control lines and 2 write lines, one register with 2 read lines and 2 write lines, and four registers with 2 read lines and 1 write line. The first three registers are probably the program counter, the "indirect" temporary register, and the "operand" temporary register. The last four are probably the CS, DS, SS, and ES segment registers. There are also three instruction prefetch buffer registers, each with 1 read line and 1 write line.

The 8088 processor, used in the original IBM PC, was essentially identical to the 8086, except it had an external 8-bit bus instead of a 16-bit bus to reduce system cost. The 8088's prefetch buffer was four bytes instead of six, presumably because four bytes was sufficient with the 8088's slower memory bus.

Unlike the 8086, the prefetch registers in the 8088 support writing to 8-bit halves independently (similar to the 8088's A, B, C, and D registers, but with a different register cell design). The reason is the 8088 fetched instructions one byte at a time instead of one word at a time, due to its narrower bus. Thus, the 8088's prefetch registers need to support byte-sized writes, while the 8086 does word-sized prefetches. ↩
I wrote about the 8086 die and the die shrink process earlier. For more about register files, see my posts on registers in the Z-80 and in the 8085. ↩