Reverse-engineering and comparing two Game Boy audio amplifier chips

The Nintendo Game Boy contains an audio amplifier chip for sound through a speaker or headphones. In this post, I reverse-engineer this chip and compare it with the later Game Boy Color chip (reverse-engineered earlier). Unexpectedly the Game Boy Color uses an entirely different amplifier design from the original Game Boy, which may explain why the two systems sound different.

The diagram below shows the Game Boy amplifier's silicon die, with the main functional components labeled.1 The upper-left part of the chip has the two large driver transistors for the speaker output (one to pull the signal low and the other to pull the signal high). The headphone amplifier consists of two nearly-identical blocks: one for the left channel and one for the right. The circuitry for the current sources and current mirrors is shared by both headphone channels. The lower-left of the chip contains digital logic to enable either the speaker amp or the headphone amp, switching when the headphones are plugged in.

The chip with pins and key functional blocks labeled. The hi-res die photo is courtesy of John McMaster.

By examining the die closely, components such as transistors and resistors can be identified. From this, the complete circuit can be determined. In the photo above, the white lines are the chip's metal layer, connecting the components. The silicon itself appears greenish and is underneath the metal. The green squares around the outside are the pads where tiny bond wires connected the silicon die to the chip's 18 pins. Regions of the chip are treated (doped) to change the electrical properties of the silicon. The next sections explain how components are created from these different types of silicon.

NPN transistors

The amplifier chip is built from transistors known as NPN and PNP bipolar transistors, different from the low-power MOS transistors used in processors. These transistors have three connections: the emitter, the base, and the collector. The magnified photo below shows an NPN transistor from above. The slightly different tints in the silicon indicate regions that have been doped to form N and P regions, with dark lines separating the regions. The bubbly silverish areas are the metal layer of the chip on top of the silicon—these form the wires connected to the emitter, base, and collector.

An NPN transistor in the Game Boy Color amplifier chip. The collector (C), emitter (E), and base (B) are labeled, along with N and P doped silicon.

Underneath the photo is a vertical cross-section illustrating how the transistor is constructed. The emitter (E) wire is connected to N+ silicon. Below that is a P layer connected to the base contact (B). And below that is an N+ layer connected (indirectly) to the collector (C). If you look at the vertical cross-section below the 'E', you can find the N-P-N layers that form the transistor.

A different structure (below) is used for the high-current output transistors that drive the speaker. These transistors are much larger and have multiple interlocking "fingers" of the emitter and base, surrounded by the large collector. If you look back at the die photo, you can see two of these transistors filling the upper left part of the die.

A large, high-current NPN output transistor in the Game Boy Color audio amplifier chip. The collector (C), base (B), and emitter (E) are labeled.

PNP transistors

The chip also uses PNP transistors, which have an entirely different construction, as shown in the diagram below. The most obvious difference is that the PNP transistors are round.2 A PNP transistor has a small circular emitter (P-silicon), surrounded by a ring-shaped base region (N-silicon), which in turn is surrounded by the collector (P-silicon). (The emitter metal covers both the emitter and the base, but is only connected to the emitter.) These regions form a P-N-P sandwich horizontally (laterally), unlike the vertical structure of the NPN transistors. Note that although the base region physically surrounds the emitter, the metal connection to the base is further away; the base signal passes through the N region underneath the collector to reach the base region.

A PNP transistor in the Game Boy audio amplifier chip. Connections for the collector (C), emitter (E) and base (B) are labeled, along with N and P doped silicon. The base forms a ring around the emitter, and the collector forms a ring around the base.

Resistors

Resistors are an important component of analog chips. The photo below shows some long, zig-zagging resistors, formed from strips of P silicon, which appears beige in the photo. Its resistance is proportional to the length of the resistor, so large-value resistors have a zig-zag shape to fit in the available space. Because resistors are relatively large and inaccurate, chip designs try to minimize the number of resistors required. Even so, an analog chip like this one requires numerous resistors.

Some resistors in the Game Boy audio amplifier chip. In the center, two resistors in parallel provide a low resistance. The long, meandering resistors provide high resistance.

The photo below shows seven small resistors, but only the two in the middle are connected (in parallel) to the circuit. These extra resistors allow the resistance to be modified by modifying the metal layer, which is much easier than changing the silicon. (These resistors bias the output transistor, and it appears this is a critical resistance that required adjustment.)

This photo shows seven short resistors but some are not used.

Capacitors

This chip has three large capacitors, one for each amplifier. The photo below shows one of the capacitors. The capacitors are simply a large layer of metal over the underlying silicon, separated by a thin insulating oxide layer. At the top and right of the photo, you can see the connections between the metal wiring and the underlying silicon. In this chip, capacitors are used to ensure the stability of the amplifiers. Because they are large, the three capacitors are easy to spot in the chip die photo.

A capacitor on the Game Boy audio amplifier chip.

The LM380

The Game Boy amplifier chip has a design very similar to the popular LM380 power audio amplifier chip (1972), so I'll start with an overview of how the LM380 works. (See the footnote5 for details.) The LM380 has positive and negative inputs and an output that amplifies the difference between the inputs by a fixed factor of 50. This may sound like an op-amp, but the LM380 is intended as an audio amplifier and is different from an op-amp in several ways: it has a small, fixed gain, it doesn't have a negative power supply, and its internal implementation is different.

The schematic below shows the main functional blocks of the LM380. The inputs go into a differential pair circuit (blue)3. The output from the differential pair (green) goes into a single-transistor amplification stage that provides more gain. The capacitor across the amplification stage stabilizes the amplifier to prevent oscillation. Finally, the output stage (purple) produces the high-current output: power transistor Q7 pulls the output high, while Q8 and Q94 pull the output low. The feedback network controls the gain of the LM380, fixing the gain at a factor of 50. Note that unlike an op-amp, the LM380's feedback network is connected to internal points of the amplifier, not the inputs.

The LM380 audio amplifier. Diagram based on the application note.

Game Boy Audio chip: headphone amplifier

Game Boy circuit board. The audio amplifier chip is midway on the right-hand side. © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons).

The Game Boy amplifier chip contains three amplifiers: two identical amplifiers for the left and right headphone channels, and a more powerful mono amplifier for the speaker. The Game Boy headphone amplifiers and the speaker amplifier are somewhat different, but they are both similar to the LM380.

The schematic below shows the Game Boy headphone amplifier. Comparing it with the LM380 schematic above shows the similarities between the LM380 and the headphone amplifier, but also some differences. The input stage and feedback circuit of the LM380 are the most distinctive parts of that chip, and the headphone amplifier's circuit is essentially identical.6 The "Amplification" stage of the headphone amplifier has three transistors compared to one in the LM380, probably to produce more gain. The headphone amplifier's output stage is similar but simplified; the PNP/NPN pair that pulls the LM380 output low is replaced with a single PNP transistor. The biggest difference is the "Control" section of the headphone amplifier, which is not present in the LM380. This control circuitry powers down the headphone amplifier if headphones are not plugged in, conserving battery life.

Schematic of the Game Boy headphone amplifier that I created by reverse-engineering the die.

The photo below shows the left headphone amplifier. The output pin (lower-right, next to the part number SBG14) is driven by seven PNP transistors in parallel (top-left) and seven smaller NPN transistors in parallel (lower center). The capacitor is in the upper left, near the center. Many resistors snake around the die.

The left headphone amplifier circuit on the die. The right amplifier is a mirror image. Photo courtesy of John Mcmaster.

Game Boy Audio chip: speaker amplifier

The next schematic shows the Game Boy speaker amplifier. Unlike the two channels for headphone amplification, there is a single speaker amplifier, producing a mixture of the left and right channels. Again, the input stage and feedback are almost identical to the LM380. The output stage has only minor differences. However, the amplification stage for the speaker is completely different: it includes a four-transistor differential amplifier stage, which will provide much more amplification.7 Although this amplification stage looks very similar to the input stage at first glance, its is wired differently and uses NPN transistors.8

Schematic of the speaker amplifier in the Game Boy audio amplifier chip.

The chip provides pins for bypass capacitors to reduce the effect of power supply fluctuations.9 The headphone amplifiers have external bypass capacitors, but the speaker bypass capacitor is omitted for some reason (see the Game Boy schematic). Lack of this capacitor may contribute to the background hum that people hear in the Game Boy's sound.

Comparison with the Game Boy Color

I recently reverse-engineered the Game Boy Color's amplifier chip, so it's interesting to compare the two chips. The amplifier chips for the Game Boy and the Game Boy Color provide similar functions. Even at the die level (below), the two chips look similar. They both have power transistors in the upper-left for the speaker, control circuitry in the lower-left, and two headphone channels on the right.

Comparison of the audio amplifier chip from the Game Boy (left) and Game Boy Color (right). Photos courtesy of John McMaster.

Surprisingly, the implementations of the two chips are completely different. While the Game Boy uses LM380-style audio amplifiers, the Game Boy Color uses power op-amps with more complicated circuitry. The most important difference is that the Game Boy chip has internal feedback to control the gain, while the Game Boy Color also has an external feedback capacitor, which causes it to act as a high-pass filter. For more information, see my Game Boy Color amplifier article and schematic.

Collectors of Game Boy systems have noticed that the different versions have a very different sound (discussion). The original Game Boy has a "warm, bassy sound", while the Game Boy Color has a "thin sound" with background noise and hum. These aren't just subjective differences, but show up in the waveforms:

Graphs courtesy of Herbert Weixelbaum.

What's interesting is that we can explain much of the sound difference through the analysis of the amplifier chips. The Game Boy's output is close to a square wave, but the waveform drops somewhat due to the speaker's 100µF DC blocking capacitor (schematic). The amplifier in the Game Boy Color, on the other hand, is configured as a high-pass filter, so it outputs higher-frequency spikes, losing the bass sound.

Conclusion

The Game Boy (1989) and Game Boy Color (1998) use custom amplifier chips. By examining die photos, the circuitry can be reverse engineered. The chips are different from common amplifier chips in two main ways, which probably explains why custom chips were created. First, each chip has three amplifiers: two for headphone channels and one for the speaker. Second, to conserve power the chip has circuitry to power-down the unused amplifiers, based on whether or not headphones are plugged in. Reverse-engineering the chips also explains much of the difference in sound between the Game Boy and the Game Boy Color. The Game Boy Color's chip implements a high-pass filter, so the sound is thin and lacks the bass of the Game Boy.

I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed. My KiCad files for the schematic are on Github. Thanks to John McMaster for providing the chip photos; his page is here. Thanks to Herbert Weixelbaum for the sound waveforms.

Notes and references

The audio amplifier chip is labeled DMG-AMP, standing for "Dot Matrix Game amplifier". The part number on this 18-pin chip (made by Sharp) is IR3R40.

The IR3R40 chip. Photo courtesy of John McMaster.

Internally, the chip is labeled SBG14.

↩
The die is labeled SBG14.
Most of the PNP transistors on this chip are round. However, when multiple PNP transistors are combined, a square structure is used instead. The square PNP transistors are larger than the square NPN transistors. The chip also has some PNP transistors with multiple collectors. Other PNP transistors have no explicit collector connection but use the substrate (ground). ↩
The inputs to the LM380 (or Game Boy amplifier) go into a differential pair (Q3, Q4), but this differential pair is different from the standard one used in op-amps. In particular, the emitters receive different, varying currents, and this is where the feedback happens. ↩
The output stages of the LM380 and the Game Boy speaker amplifier use two transistors for pull-down configured as a Szilaki pair. The combined PNP and NPN transistors act as a higher-performance PNP transistor, somewhat like a Darlington pair. ↩
The LM380 is explained in detail in the National Semiconductor application note, and Power Audio Amplifier IC LM380. The similar LM386 is discussed in LM386 lecture and LM386 adventures.

I'll explain the feedback network since the Game Boy chip operates the same way. The diagram below shows how the feedback network in the LM380 operates with no input. In the upper left, the supply voltage V_S across R1 creates a current I. Transistors Q5 and Q6 form a current mirror: this forces the current through Q6 to match the current (I) through Q5. The current from Q4 to the rest of the chip must be approximately 0 (since it is strongly amplified by the rest of the chip). Putting this all together, the current through R2 (generated from the output voltage feedback) must also be I. Since R2 is half the resistance of R1, the output voltage must be half of the supply voltage. The conclusion is that the output voltage at idle will be half of the supply voltage, as desired.

The LM380 with no signal applied. Schematic of the amplifier feedback network, from the LM380 datasheet.

When inputs are applied, the feedback network acts as seen below. Suppose a voltage ΔV is applied to the positive input. Emitter-follower transistors Q3 and Q4 buffer and raise the inputs, so the same ΔV appears across resistor R3. This generates a current ΔI through the resistor. This increases the current through Q5 to I+ΔI, and because of the current mirror, the same current will flow through Q6. Adding up the various currents, the current through R2 must be I+2ΔI. Since R2 has 25 times the resistance of R3, 2ΔI corresponds to an increase in the output voltage of 50ΔV. Therefore, the input voltage is multiplied by a factor of 50. The point of this is that the feedback network fixes the gain at 50.

The LM380 with a small signal applied.

It seems to me that the best way to understand the LM380 is to consider it as constructed from an operational transresistance amplifier (OTRA), an obscure relative of the op-amp. An OTRA acts like an op-amp, except the two inputs are currents instead of voltages, and the difference between the currents is amplified to produce the output voltage. The two currents (I) into the OTRA must be approximately equal, but the input voltages can diverge (unlike an op-amp).

My simplified schematic of the LM380, using an operational transresistance amplifier.

The schematic above shows the LM380's circuitry represented as an operational transconductance amplifier and feedback network. Equating the two currents yields V_out = Vs/2 + 51V₊ - 50.5V_- or approximately V_out = Vs/2 + 50*(V₊-V_-). In other words, the output is centered at half the supply voltage, and the difference in input voltages is amplified by a factor of 50. (Nobody else describes the LM380 in this way, so it's quite possible that I am looking at it wrong, but this analysis makes sense to me.) ↩
I don't know the exact values of the resistances on the die, but by comparing lengths on the die I can determine ratios of resistances. Looking at resistors R48, R49, R50, and R51, I calculate that the speaker amplifier has a gain factor of 22. From resistors R2, R3, R4, and R7, I calculate that the speaker amplifier has a gain factor of 30, significantly more than the headphone amplifiers. ↩
Note that the overall amplification of the chip is limited by the feedback network. The idea of an op-amp is the raw gain will be something like 100,000, but the feedback reduces the gain to something reasonable like a factor of 50. The "extra" gain improves performance and reduces distortion. In other words, the additional amplification stage in the Game Boy compared to the LM380 isn't going to make it 100 times louder. ↩
I'm a bit puzzled by the second amplification stage for the speaker amplifier. It looks like a differential amplifier, except a differential amplifier normally has the emitters connected and this circuit has the collectors connected. ↩
The bypass capacitors used by the Game Boy chip (and the LM380) help reduce the impact of power supply fluctuations. It's common for chips to have a bypass capacitor between power and ground, but this bypass capacitor is a bit different. It is connected to a point in the feedback network where it is more effective than a regular bypass capacitor. ↩

A look at the die of the 8086 processor

The Intel 8086 microprocessor was introduced 42 years ago this month,1 so I made some high-res die photos of the chip to celebrate. The 8086 is one of the most influential chips ever created; it started the x86 architecture that still dominates desktop and server computing today. By looking at the chip's silicon, we can see the internal features of this chip.

The photo below shows the die of the 8086. In this photo, the chip's metal layer is visible, mostly obscuring the silicon underneath. Around the edges of the die, thin bond wires provide connections between pads on the chip and the external pins. (The power and ground pads each have two bond wires to support the higher current.) The chip was complex for its time, containing 29,000 transistors.

Die photo of the 8086, showing the metal layer. Around the edges, bond wires are connected to pads on the die. Click for a large, high-resolution image.

Looking inside the chip

To examine the die, I started with the 8086 integrated circuit below. Most integrated circuits are packaged in epoxy, so dangerous acids are necessary to dissolve the package. To avoid that, I obtained the 8086 in a ceramic package instead. Opening a ceramic package is a simple matter of tapping it along the seam with a chisel, popping the ceramic top off.

The 8086 chip, in 40-pin ceramic DIP package.

With the top removed, the silicon die is visible in the center. The die is connected to the chip's metal pins via tiny bond wires. This is a 40-pin DIP package, the standard packaging for microprocessors at the time. Note that the silicon die itself occupies a small fraction of the chip's size.

The 8086 die is visible in the middle of the integrated circuit package.

Using a metallurgical microscope, I took dozens of photos of the die and stitched them into a high-resolution image using a program called Hugin (details). The photo at the beginning of the blog post shows the metal layer of the chip, but this layer hid the silicon underneath.

Under the microscope, the 8086 part number is visible as well as the copyright date. A bond wire is connected to a pad. Part of the microcode ROM is at the top.

For the die photo below, the metal and polysilicon layers were removed, showing the underlying silicon with its 29,000 transistors.2 The labels show the main functional blocks, based on my reverse engineering. The left side of the chip contains the 16-bit datapath: the chip's registers and arithmetic circuitry. The adder and upper registers form the Bus Interface Unit that communicates with external memory, while the lower registers and the ALU form the Execution Unit that processes data. The right side of the chip has control circuitry and instruction decoding, along with the microcode ROM that controls each instruction.

Die of the 8086 microprocessor showing main functional blocks.

One feature of the 8086 was instruction prefetching, which improved performance by fetching instructions from memory before they were needed. This was implemented by the Bus Interface Unit in the upper left, which accessed external memory. The upper registers include the 8086's infamous segment registers, which provided access to a larger address space than the 64 kilobytes allowed by a 16-bit address. For each memory access, a segment register and a memory offset were added to form the final memory address. For performance, the 8086 had a separate adder for these memory address computations, rather than using the ALU. The upper registers also include six bytes of instruction prefetch buffer and the program counter.

The lower-left corner of the chip holds the Execution Unit, which performs data operations. The lower registers include the general-purpose registers and index registers such as the stack pointer. The 16-bit ALU performs arithmetic operations (addition and subtraction), Boolean logical operations, and shifts. The ALU does not implement multiplication or division; these operations are performed through a sequence of shifts and adds/subtracts, so they are relatively slow.

Microcode

One of the hardest parts of computer design is creating the control logic that tells each part of the processor what to do to carry out each instruction. In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control logic from complex logic gate circuitry, the control logic could be replaced with special code called microcode. To execute an instruction, the computer internally executes several simpler micro-instructions, which are specified by the microcode. With microcode, building the processor's control logic becomes a programming task instead of a logic design task.

Microcode was common in mainframe computers of the 1960s, but early microprocessors such as the 6502 and Z-80 didn't use microcode because early chips didn't have room to store microcode. However, later chips such as the 8086 and 68000, used microcode, taking advantage of increasing chip densities. This allowed the 8086 to implement complex instructions (such as multiplication and string copying) without making the circuitry more complex. The downside was the microcode took a large fraction of the 8086's die; the microcode is visible in the lower-right corner of the die photos.3

A section of the microcode ROM. Bits are stored by the presence or absence of transistors. The transistors are the small white rectangles above and/or below each dark rectangle. The dark rectangles are connections to the horizontal output buses in the metal layer.

The photo above shows part of the microcode ROM. Under a microscope, the contents of the microcode ROM are visible, and the bits can be read out, based on the presence or absence of transistors in each position. The ROM consists of 512 micro-instructions, each 21 bits wide. Each micro-instruction specifies movement of data between a source and destination. It also specifies a micro-operation which can be a jump, ALU operation, memory operation, microcode subroutine call, or microcode bookkeeping. The microcode is fairly efficient; a simple instruction such as increment or decrement consists of two micro-instructions, while a more complex string copy instruction is implemented in eight micro-instructions.3

History of the 8086

The path to the 8086 was not as direct and planned as you might expect. Its earliest ancestor was the Datapoint 2200, a desktop computer/terminal from 1970. The Datapoint 2200 was before the creation of the microprocessor, so it used an 8-bit processor built from a board full of individual TTL integrated circuits. Datapoint asked Intel and Texas Instruments if it would be possible to replace that board of chips with a single chip. Copying the Datapoint 2200's architecture, Texas Instruments created the TMX 1795 processor (1971) and Intel created the 8008 processor (1972). However, Datapoint rejected these processors, a fateful decision. Although Texas Instruments couldn't find a customer for the TMX 1795 processor and abandoned it, Intel decided to sell the 8008 as a product, creating the microprocessor market. Intel followed the 8008 with the improved 8080 (1974) and 8085 (1976) processors. (I've written more about early microprocessors here.)

Datapoint 2200 computer. Photo courtesy of Austin Roche.

In 1975, Intel's next big plan was the 8800 processor designed to be Intel's chief architecture for the 1980s. This processor was called a "micromainframe" because of its planned high performance. It had an entirely new instruction set designed for high-level languages such as Ada, and supported object-oriented programming and garbage collection at the hardware level. Unfortunately, this chip was too ambitious for the time and fell drastically behind schedule. It eventually launched in 1981 (as the iAPX 432) with disappointing performance, and was a commercial failure.

Because the iAPX 432 was behind schedule, Intel decided in 1976 that they needed a simple, stop-gap processor to sell until the iAPX 432 was ready. Intel rapidly designed the 8086 as a 16-bit processor somewhat compatible with the 8-bit 8080,4 released in 1978. The 8086 had its big break with the introduction of the IBM Personal Computer (PC) in 1981. By 1983, the IBM PC was the best-selling computer and became the standard for personal computers. The processor in the IBM PC was the 8088, a variant of the 8086 with an 8-bit bus. The success of the IBM PC made the 8086 architecture a standard that still persists, 42 years later.

Why did the IBM PC pick the Intel 8088 processor?7 According to Dr. David Bradley, one of the original IBM PC engineers, a key factor was the team's familiarity with Intel's development systems and processors. (They had used the Intel 8085 in the earlier IBM Datamaster desktop computer.) Another engineer, Lewis Eggebrecht, said the Motorola 68000 was a worthy competitor6 but its 16-bit data bus would significantly increase cost (as with the 8086). He also credited Intel's better support chips and development tools.5

In any case, the decision to use the 8088 processor cemented the success of the x86 family. The IBM PC AT (1984) upgraded to the compatible but more powerful 80286 processor. In 1985, the x86 line moved to 32 bits with the 80386, and then 64 bits in 2003 with AMD's Opteron architecture. The x86 architecture is still being extended with features such as AVX-512 vector operations (2016). But even though all these changes, the x86 architecture retains compatibility with the original 8086.

Transistors

The 8086 chip was built with a type of transistor called NMOS. The transistor can be considered a switch, controlling the flow of current between two regions called the source and drain. These transistors are built by doping areas of the silicon substrate with impurities to create "diffusion" regions that have different electrical properties. The transistor is activated by the gate, made of a special type of silicon called polysilicon, layered above the substrate silicon. The transistors are wired together by a metal layer on top, building the complete integrated circuit. While modern processors may have over a dozen metal layers, the 8086 had a single metal layer.

Structure of a MOSFET in the integrated circuit.

The closeup photo of the silicon below shows some of the transistors from the arithmetic-logic unit (ALU). The doped, conductive silicon has a dark purple color. The white stripes are where a polysilicon wire crossed the silicon, forming the gate of a transistor. (I count 23 transistors forming 7 gates.) The transistors have complex shapes to make the layout as efficient as possible. In addition, the transistors have different sizes to provide higher power where needed. Note that neighboring transistors can share the source or drain, causing them to be connected together. The circles are connections (called vias) between the silicon layer and the metal wiring, while the small squares are connections between the silicon layer and the polysilicon.

Closeup of some transistors in the 8086. The metal and polysilicon layers have been removed in this photo. The doped silicon has a dark purple appearance due to thin-film interference.

Conclusions

The 8086 was intended as a temporary stop-gap processor until Intel released their flagship iAPX 432 chip, and was the descendant of a processor built from a board full of TTL chips. But from these humble beginnings, the 8086's architecture (x86) unexpectedly ended up dominating desktop and server computing until the present.

Although the 8086 is a complex chip, it can be examined under a microscope down to individual transistors. I plan to analyze the 8086 in more detail in future blog posts8, so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed. Here's a bonus high-resolution photo of the 8086 with the metal and polysilicon removed; click for a large version.

Die photo of the Intel 8086 processor. The metal and polysilicon have been removed to reveal the underlying silicon.

Notes and references

The 8086 was released on June 8, 1978. ↩
To expose the chip's silicon, I used Armour Etch glass etching cream to remove the silicon dioxide layer. Then I dissolved the metal using hydrochloric acid (pool acid) from the hardware store. I repeated these steps until the bare silicon remained, revealing the transistors. ↩
The designers of the 8086 used several techniques to keep the size of the microcode manageable. For instance, instead of implementing separate microcode routines for byte operations and word operations, they re-used the microcode and implemented control circuitry (with logic gates) to handle the different sizes. Similarly, they used the same microcode for increment and decrement instructions, with circuitry to add or subtract based on the opcode. The microcode is discussed in detail in New options from big chips and patent 4449184. ↩↩
The 8086 was designed to provide an upgrade path from the 8080, but the architectures had significant differences, so they were not binary compatible or even compatible at the assembly code level. Assembly code for the 8080 could be converted to 8086 assembly via a program called CONV-86, which would usually require manual cleanup afterward. Many of the early programs for the 8086 were conversions of 8080 programs. ↩
Eggebrecht, one of the original engineers on the IBM PC, discusses the reasons for selecting the 8088 in Interfacing to the IBM Personal Computer (1990), summarized here. He discussed why other chips were rejected: IBM microprocessors lacked good development tools, and 8-bit processors such as the 6502 or Z-80 had limited performance and would make IBM a follower of the competition. I get the impression that he would have preferred the Motorola 68000. He concludes, "The 8088 was a comfortable solution for IBM. Was it the best processor architecture available at the time? Probably not, but history seems to have been kind to the decision." ↩
The Motorola 68000 processor was a 32-bit processor internally, with a 16-bit bus, and is generally considered a more advanced processor than the 8086/8088. It was used in systems such as Sun workstations (1982), Silicon Graphics IRIS (1984), the Amiga (1985), and many Apple systems. Apple used the 68000 in the original Apple Macintosh (1984), upgrading to the 68030 in the Macintosh IIx (1988), and the 68040 with the Macintosh Quadra (1991). However, in 1994, Apple switched to the RISC PowerPC chip, built by an alliance of Apple, IBM, and Motorola. In 2006, Apple moved to Intel x86 processors, almost 28 years after the introduction of the 8086. Now, Apple is rumored to be switching from Intel to its own ARM-based processors. ↩
For more information on the development of the IBM PC, see A Personal History of the IBM PC by Dr. Bradley. ↩
The main reason I haven't done more analysis of the 8086 is that I etched the chip for too long while removing the metal and removed the polysilicon as well, so I couldn't photograph and study the polysilicon layer. Thus, I can't determine how the 8086 circuitry is wired together. I've ordered another 8086 chip to try again. ↩

Die analysis of the 8087 math coprocessor's fast bit shifter

Floating-point numbers are very useful for scientific programming, but early microprocessors only supported integers directly.1 Although floating-point was common in mainframes back in the 1950s and 1960s, it wasn't until 1980 that Intel introduced the 8087 floating-point coprocessor for microcomputers.2 Adding this chip to a microcomputer such as the IBM PC made floating-point operations up to 100 times faster. This was a huge benefit for applications such as AutoCAD, spreadsheets, or flight simulators.3 The downside was the 8087 chip cost hundreds of dollars.4

It's hard to implement floating-point operations so they are computed quickly and accurately. Problems can arise from overflow, rounding, transcendental operations, and numerous edge cases. Prior to the 8087, each manufacturer had their own incompatible ad hoc implementation of floating point. Intel, however, enlisted numerical analysis expert William Kahan to design accurate floating point based on rigorous principles.5 The result was the floating-point architecture of the 8087. This became the IEEE 754 standard used in almost all modern computers, so I consider the 8087 one of the most influential chips ever designed.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5mm×6mm. The shifter is outlined in red. Click for a larger image.

To explore how the 8087 works, I opened up an 8087 chip and took photos of the silicon die with a microscope. Containing 40,000 transistors, the 8087 pushed chip manufacturing to the limit; in comparison, the companion 8086 microprocessor only had 29,000 transistors. To make the chip possible, Intel developed new techniques. In this article, I focus on the high-speed binary shifter (outlined in red above). The shifter takes up a large fraction of the chip's area, so minimizing its area was vital to making the 8087 possible.

A floating-point number consists of a fraction (also called significand or mantissa), an exponent, and a sign bit. (These are expressed in binary, but for a base-10 analogy, the number 6.02×10²³ has 6.02 as the fraction and 23 as the exponent.) The circuitry to process the fraction is at the bottom of the die photo. From left to right, the fraction circuitry consists of a constant ROM, a shifter (highlighted), adder/subtracters, and the register stack. The exponent processing circuitry is in the middle of the chip. Above it, the microcode engine and ROM control the chip.

The shifter

The role of the shifter is to shift binary numbers left or right, a task with several critical roles in floating-point operations. When two floating-point numbers are added or subtracted, the numbers must be shifted so the binary points line up. (The binary point is like the decimal point, but for a binary number.) The 8087's transcendental instructions are built around shift and add operations, using an algorithm called CORDIC. The shifter is also used to assemble a floating-point number from 16-bit chunks read from memory.8

Since shifts are so essential to performance, the 8087 uses a "barrel shifter", which can shift a number by any number of bits in a single step.6 Intel used a two-stage shifter design that kept its size manageable while still providing high performance. The first stage shifts the value by 0 to 7 bits, while the second stage shifts by 0 to 7 bytes. In combination, the two stages shift a value by any amount from 0 to 63 bits.

The bit shifter

I'll start by describing the bit shifter, which performs a shift of 0 to 7 bit positions. The diagram below outlines the structure of the bit shifter, showing five of the inputs and outputs; the full shifter supports 68 bits.7 The concept is that by activating a particular column, the input is shifted by the desired amount. Each circle indicates a transistor that can act as a switch between an input line and an output line. The vertical select lines are used to activate the desired transistors. Each input line is connected diagonally to eight transistors, allowing it to be directed to one of eight outputs. For example, the diagram shows shift select line 3 activated, turning on the associated transistors (green). The highlighted input 20 (orange) is directed to output 23 (blue). Similarly, the other inputs are connected to the corresponding outputs, yielding a shift by 3. By activating a different shift select line, the input will be shifted by a different amount between 0 and 7 bits.

Structure of the bit shifter. By energizing a shift select line, the inputs are connected to outputs with the desired bit shift.

To explain the internal construction of the shifter, I'll start by describing the NMOS transistors used in the 8087 chip. Transistors are built by doping areas of the silicon substrate with impurities to create "diffusion" regions with different electrical properties. The transistor can be considered a switch, controlling the flow of current between two regions called the source and drain. The transistor is activated by the gate, made of a special type of silicon called polysilicon, layered above the substrate silicon. Applying voltage to the gate lets current flow between the source and drain, which is otherwise blocked. Transistors are wired together by a metal layer on top, building a complex integrated circuit.

Structure of a MOSFET as implemented in an integrated circuit.

The photo below shows a transistor in the 8087 as it appears under the microscope. Its structure matches the diagram above, although its shape is more complex. The source, gate, and drain all continue out of the photo, connected to other transistors. In addition, wiring in the metal layer is connected to the silicon at the circular vias. (The metal layer was removed with acid for this photo.)

An NMOS transistor in the 8087 chip, as seen under the microscope.

Zooming out, the diagram below shows part of the bit shifter as implemented on the chip. About 48 transistors, similar to the one above, are in this photo. The orange and yellow diagonal corresponds to one of the inputs: the orange regions show transistors connected through the silicon, while the yellow lines show connections in the metal layer. (The metal layer is used to jump over the polysilicon select lines.) The green highlight shows the polysilicon line for shift-by-three. In the center, this polysilicon gate line turns on a transistor, connecting the input to the long yellow output line, shifting the highlighted input by three positions. (The other non-highlighted inputs are shifted similarly.) Thus, this circuit implements the shifter as described at the beginning of the section. The photo shows six of the 68 inputs, so the complete shifter is much taller.

Closeup of the silicon circuitry for the bit shifter. The path of one signal is shown, as controlled by the shift-by-three control (green).

The byte shifter

The byte shifter shifts its inputs by multiples of eight bits, rather than one bit. Its design is similar to the bit shifter, except each input connects to every eighth output. For instance, input 20 connects to outputs 20, 28, 36, and so forth, shifting by bytes. As a result, the diagonal connections are steep and packed tightly, with eight lines between each switch. In the diagram below, the line for shift-by-four is selected, with the connection from input 0 to output 32 highlighted. Note the lack of wires in the right half of the diagram because any bit shifted from beyond input 0 becomes zeroed. For instance, when shifting left by 4 bytes, low-order bits 31 and below become zero.

The structure of the byte shifter.

The die photo below shows part of the bit shifter and the byte shifter. This photo is zoomed-out to show the overall structure; individual transistors are barely visible. The bit shifter's area is densely packed with transistors, but the byte shifter consists mostly of wiring, with columns of transistors in between.9 Also note that the byte shifter is partially empty at the top, filling in with more wiring towards the bottom. The wiring layout isn't as orderly as in the diagram above, but is arranged for maximum efficiency.

The bit shifter and byte shifter in the 8087 chip.

The bidirectional drivers

So far, the bit and byte shifters only shift bits in one direction.11 However, bits need to be shifted in both directions. One of the key innovations of the 8087's shifter is its bidirectional design: data can be passed through the shifter in reverse to shift bits the opposite direction. This is possible because the shifter is constructed with pass transistors, not logic gates. Pass transistor logic uses transistors as switches that pass or block signals, so signals can travel in either direction. (In contrast, regular logic gates such as NOR gates have specific inputs and outputs.)

Special driver circuitry on the left and right sides of the shifter allows the shifter to operate in either direction. To send data from left to right, the left-hand driver reads data from the fraction bus and sends it into the shifter. The right-hand driver circuit receives this shifted data, latches it temporarily, and then writes it back to the fraction bus. To send data in the opposite direction, the driver circuits reverse roles: the right-hand driver sends data from the fraction bus into the shifter while the left-hand circuit receives the shifted data.10

The multiplexer / decoders

The final feature I'll describe is the circuitry that controlled the shifter. Three different sources control how many positions to shift. First, the microcode engine can specify the number directly. Second, the number can come from a loop counter; this is used as part of the CORDIC transcendental algorithms. Finally, the number can come from a leading zero counter; this allows numbers to be normalized by eliminating leading zeroes through shifting. Each of these sources provides a 6-bit shift number; the six multiplexers each select one bit from the desired source.12

The multiplexer/decoder circuitry.

Next, decoders activate one of eight bit-shift lines and one of eight byte-shift lines to control the appropriate pass transistors in the shifter. (Each decoder takes a 3-bit input and activates one of 8 output lines.) Because each decoder line controls a large column of pass transistors in the shifter, the decoder uses relatively large power transistors.13 At the bottom, the 16 control lines exit the circuitry.

Conclusion

The 8087 is a complex chip with many functional units. However, by examining the die closely, the circuits of the 8087 can be understood. This blog post described the 8087's fast barrel shifter, capable of shifting by up to 63 bits at a time.14 Intel received a patent on this innovative programmable bidirectional shifter.

The shifter was just one of the features that let the 8087 compute floating-point operations much faster than the 8086 processor could. The 8087 operates on 80 bits at a time instead of 16. The 8087 has 80-bit wide registers, reducing memory accesses during computations. The 8087 stores constants for transcendental operations in a ROM, also avoiding memory accesses. Hardware in the 8087 checked for NaN, underflow, overflow, etc., avoiding slow checks in code. The 8087's hardware made multiplication and division faster. I don't know the relative contributions of these factors, but in combination, they improved floating-point performance dramatically, by up to a factor of 100.

The benefits of floating point hardware are so great that Intel started integrating the floating-point unit into the processor with the 80486 (1989). Now, most processors include a floating-point unit and the expense of purchasing a separate floating-point coprocessor is a thing of the past.

Die photo of the 8087 with the metal layer removed. The colors are due to some of the oxide layer remaining. Click for a larger image.

For more information on the 8087, see my other articles: Extracting ROM constants from the 8087, The two-bit-per-transistor ROM and The substrate bias generator. I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed.

Notes and references

Even without floating-point hardware, early microcomputers could perform floating-point operations. The operations would be broken down into many integer operations, manipulating the exponent and fraction as necessary. In other words, floating-point support didn't make floating-point operations possible, it just made them much faster. (Another way to represent non-integers is fixed-point numbers, which have a fixed number of digits after the decimal. Fixed-point numbers are simpler than floating-point, but can't represent as large a range.) ↩
The 8087 wasn't the first floating-point chip. National Semiconductor introduced the MM57109 Number Cruncher Unit (that is the real name) in 1977. It was essentially a repackaged 12-digit scientific calculator chip, operating on binary-coded decimal values with values entered in Reverse Polish Notation. This chip was absurdly slow; a tangent, for instance, could take over a second. AMD introduced their floating-point chip, the Am9511, in 1978 (details). This chip supported 32-bit floating-point numbers and took up to 1.4 milliseconds for a tangent. (Intel ended up licensing the Am9511 from AMD and selling it as the 8231.) A 10-MHz 8087 in comparison, could do a tangent in 54 microseconds, operating on an 80-bit floating-point number. Thus, the 8087's performance and accuracy were far superior to previous chips. ↩
The original IBM PC (1981) had an empty socket on its motherboard for adding an 8087 coprocessor. a huge benefit for applications such as AutoCAD. The large empty socket is visible in the upper left below, above the 8088 microprocessor. A list of applications with support for the 8087 is here.

Motherboard of the original IBM PC (1981). Photo from Wikimedia, CC BY-SA 3.0.

↩
I couldn't find the original price for the 8087, but it was expensive. At first, Intel only sold the 8087 as a matched and tested pair with an 8088, due to timing flakiness with the 8087. By 1982, Intel dropped the price of the 8087 to $230, equivalent to about $500 in current dollars. Compared to today's open-source world, it seems strange that customers also had to pay for software support: using the 8087 with the BASIC language cost another $150, while Intel's 8087 development library was $1250. ↩
The designers of the 8087 commented on the guidance offered by Professor Kahan: "We did not do as well as he wanted, but we did better than he expected." Kahan later received a Turing Award for his work on floating point. ↩
Processors often include a variety of shift instructions, including rotate operations that shift bits from one end of the word to the other. The 8087 only performs straight shifts, not rotates. ↩
The shifter handles the 8087's 64-bit fraction, along with three extra bits for rounding accuracy, so it supports 67 bits. Unless I miscounted, the shifter also has an extra bit in the most significant position, making it 68 bits wide. ↩
Multiplication and division make heavy use of shifting; multiplication is performed by shifts and adds, while division uses shifts and subtracts. However, the 8087 does not use the general-purpose shifter for these operations, but has specialized shifters optimized for these operations. ↩
In order to pack the wiring as close together as possible, the shifter alternated wires of diffused silicon and wires of polysilicon. In the photo below, the diffused silicon wires are pinkish, while the polysilicon is yellowish. The 8087 was built with Intel's HMOS III process, which required a 4µm spacing for polysilicon and 5µm for diffusion, probably due to the resolution of the photolithography practice. However, the spacing between a diffusion line and a polysilicon line could be much smaller, probably because they were created with separate masks and were on separate layers. Thus, alternating diffusion and polysilicon lines could be packed together tightly, saving space.

↩
Wiring in the byte shifter consists of alternating, tightly-packed silicon and polysilicon lines. The large rectangles on either side are pairs of transistors, controlled by vertical polysilicon lines.
The driver circuitry has a few subtleties. Instead of sending data directly into the shifter, bits are transferred in two steps. First, the shifter lines are pre-charged to a high level. Then, any 1-bit inputs cause the corresponding shifter lines to be pulled low. In other words, the shifter lines are active-low, with a low voltage representing a 1. Since any unused outputs keep their high voltage (a 0 bit), 0 bits are shifted into low bit positions automatically. I think the pre-charge technique also was a better match for NMOS circuitry, which was better at pulling a signal low than pulling it high, so pre-charging the lines helped performance, especially given their relatively high capacitance. The latch between the shifter and the fraction bus prevents an unwanted cycle with the shifted data immediately flowing back into the shifter and getting re-shifted. ↩
This footnote will clarify the physical shift versus the logical shift. On the die, the fraction circuitry is arranged with the most-significant bit at the bottom. Passing data through the shifter from left to right shifts bits physically downward. This corresponds to a left-shift of a binary number, moving bits to a higher position. In the opposite direction, passing data through the shifter from right to left performs a right-shift of the data. ↩
The left/right direction also needs to be selected from one of the three shift sources, but I haven't located the circuitry for that yet. ↩
Each decoder essentially consists of eight NOR gates: seven will be pulled low and only the one with all inputs low will be high. However, it's not implemented as a straightforward logic gate. Instead, all outputs are precharged high, and then the seven undesired outputs are pulled low. This sort of dynamic precharge logic is still used in modern circuits; see the book Synchronous Precharge Logic. The multiplexers are also implemented with precharge logic. ↩
Intel's x86 processors didn't include a barrel shifter until the 80386 (1985), which provided a 64-bit barrel shifter. Before that, the 8086 and descendants shifted one bit at a time, so shifts by many bit positions were much slower. ↩

Extracting ROM constants from the 8087 math coprocessor's die

Intel introduced the 8087 chip in 1980 to improve floating-point performance on the 8086 and 8088 processors, and it was used with the original IBM PC. Since early microprocessors operated only on integers, arithmetic with floating-point numbers was slow and transcendental operations such as arctangent or logarithms were even worse. Adding the 8087 co-processor chip to a system made floating-point operations up to 100 times faster.

I opened up an 8087 chip and took photos with a microscope. The photo below shows the chip's tiny silicon die. Around the edges of the chip, tiny bond wires connect the chip to the 40 external pins. The labels show the main functional blocks, based on my reverse engineering. By examining the chip closely, various constants can be read out of the chip's ROM, numbers such as pi that the chip uses in its calculations.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The constant ROM is outlined in green. Click for a larger image.

The top half of the chip contains the control circuitry. Performing a floating-point instruction might require 1000 steps; the 8087 used microcode to specify these steps. The die photo above shows the "engine" that ran the microcode program; it is basically a simple CPU. Next to it is the large ROM that holds the microcode.

The bottom half of the die holds the circuitry that processes floating-point numbers. A floating-point number consists of a fraction (also called significand or mantissa), an exponent, and a sign bit. (For a base-10 analogy, in the number 6.02×10²³, 6.02 is the fraction and 23 is the exponent.) The chip has separate circuitry to process the fraction and the exponent in parallel. The fraction processing circuitry supports 67-bit values, a 64-bit fraction with three extra bits for accuracy. From left to right, the fraction circuitry consists of a constant ROM, a shifter, adder/subtracters, and the register stack. The constant ROM (highlighted in green) is the subject of this post.

The 8087 operated as a co-processor with the 8086 processor. When the 8086 encountered a special floating-point instruction, the processor ignored it and let the 8087 execute the instruction in parallel.1 I won't explain in detail how the 8087 works internally, but as an overview, floating-point operations are implemented using integer adds/subtracts and shifts. To add or subtract two floating-point numbers, the 8087 shifts the numbers until the binary points (i.e. the decimal points but in binary) line up, and then adds or subtracts the fraction. Multiplication, division, and square root are performed through repeated shifts and adds or subtracts. Transcendental operations (tan, arctan, log, power) use CORDIC algorithms, which use shifts and adds of special constants for efficient computation.

Implementation of the ROM

This post describes the ROM that holds constants (not to be confused with the larger, four-level microcode ROM.2) The constant ROM holds the constants (such as pi, ln(2), and sqrt(2)) that the 8087 needs for its computations. The photo below shows part of the constant ROM. The metal layer has been removed to show the silicon underneath. The pinkish regions are silicon doped to have different properties, while the reddish and greenish lines are polysilicon, a special type of silicon wiring layered on top. Note the regular grid structure of the ROM. The ROM consists of two columns of transistors, holding the bits. To explain how the ROM works, I'll start by explaining how a transistor works.

Part of the constant ROM, with the metal layer removed. The three columns of larger transistors are used to select between rows.

High-density integrated circuits in the 1970s were usually built from a type of transistor known as NMOS. (Modern computers are built from CMOS, which consists of NMOS transistors along with opposite-polarity PMOS transistors.) The diagram below shows the structure of an NMOS transistor. An integrated circuit is constructed from a silicon substrate, with transistors built on it. Regions of the silicon are doped with impurities to create "diffusion" regions with desired electrical properties. The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. Applying voltage to the gate lets current flow between the source and drain, which is otherwise blocked. The die of the 8087 is fairly complex, with about 40,000 of these transistors.3

Structure of a MOSFET as implemented in an integrated circuit.

Zooming in on the ROM shows the individual transistors. The pinkish regions are the doped silicon, forming transistor sources and drains. The vertical polysilicon select lines form the gates of the transistors. The indicated silicon regions are connected to ground, pulling one side of each transistor low. The circles are connections called vias between the silicon and the metal lines above. (The metal lines have been removed; the orange line shows the position of one.)

A portion of the constant ROM. Each select line selects a particular constant. Transistors are indicated by the yellow symbols. An X indicates a missing transistor, corresponding to a 0 bit. The orange line indicates the position of a metal wire. (The metal layer was dissolved for this picture.)

The important feature of the ROM is that some of the transistors are missing, the first one in the upper row, and two marked with X in the lower row. Bits are programmed into the ROM by changing the silicon doping pattern, creating transistors or leaving insulating regions. Each transistor or missing transistor represents one bit. When a select line is activated, all the transistors in that column will turn on, pulling the corresponding output lines low. But if the transistor is missing from a selected position, the corresponding output line will remain high. Thus, a value is read from the ROM by activating a select line, reading that ROM value onto the output lines.

Contents of the ROM

The constant ROM has 134 rows of 21 columns.5 Under a microscope, the bit pattern of the ROM is visible and can be extracted.4 How to interpret the raw bits is not obvious, though. The first question is if a transistor (versus a gap) indicates a 0 or a 1. (It turns out that a transistor indicates a 1 bit.) The next issue is how to map the 134×21 grid of bits into values.6

The chip's data path consists of 67 horizontal rows, so it seemed pretty clear that the 134 rows in the ROM corresponded to two sets of 67-bit constants. I extracted one set of constants for the odd rows and one for the even rows, but the values didn't make any sense. After more thought, I determined that the rows do not alternate but are arranged in a repeating "ABBA" pattern.7 Using this pattern yielded a bunch of recognizable constants, including pi and 1. Bits from those constants are shown in the diagram below. (In this photo, a 1 bit appears as a green stripe, while a 0 bit appears as a red stripe.) In binary, pi is 11.001001... and this value is visible in the upper labeled bits. The bottom value is the constant 1.8

Bit values labeled in the constant ROM. The top bits are the first part of pi, while the lower bits are the constant 1, This diagram has been rotated 90 degrees compared to the other diagrams. The unlabeled bits form other constants.

The next difficulty in interpretation is that this ROM holds just the fractional parts of the numbers, not the exponents. (I haven't found the separate exponent ROM yet.) I experimented with various exponents until I got values that were sensible numbers. Some were straightforward: for instance, the constant 1.204120 yielded log₁₀(2) when the exponent 2^-2 was used. Others were harder,9 such as 1.734723. Eventually, I figured out that 1.734723×2⁵⁹ is 10¹⁸.10

The complete table of constants is in the footnotes.11 Physically, the constants are arranged in three groups. The first group is values that the user can load (1, pi, log₂10, log₂e, log₁₀2, and ln 2)12 along with values used internally (10¹⁸, ln(2)/3, 3*log₂(e), log₂(e), and sqrt(2)). The second group is sixteen arctan constants, and the third is fourteen log₂ constants. The last two groups of constants are used to compute transcendental functions using the CORDIC algorithm, which I will discuss next.

The CORDIC algorithms

The constants in the ROM reveal some details about the algorithms used by the 8087. The ROM contains 16 arctangent values, the arctans of 2^-n. It also contains 14 log values, the base-2 logs of (1+2^-n). These may seem like unusual values, but they are used in an efficient algorithm called CORDIC, which was invented in 1958.

The basic idea of CORDIC is to compute tangent and arctangent by breaking down an angle into smaller angles, and rotating a vector by these angles. The trick is that by carefully choosing the smaller angles, each rotation can be computed with efficient shifts and adds instead of trig functions. Specifically, suppose we want to find tan(z). We can break z into a sum of smaller angles: z ≈ {atan(2^-1) or 0} + {atan(2^-2) or 0} + {atan(2^-3) or 0} + ... + {atan(2^-16) or 0}. Now, rotating a vector by, say atan(2^-2), can be done by multiplying by 2^-2 and adding. The key thing is that multiplying by 2^-2 is just a fast bit shift. Putting this all together, computing tan(z) can be done by comparing z with the atan constants, and then doing 16 cycles of additions and shifts, which are fast to perform in hardware.13 To make the algorithm work, the atan constants are precomputed and stored in the constant ROM.14

Computing the base-2 log and base-2 exponential also use CORDIC algorithms, with the associated logarithmic constants. The key observation is that multiplying by (1 + 2^-n) can be done quickly with a shift and addition. By multiplying one side of the equation by the sequence of values, and adding the corresponding log constants to the other side, the log or exponential can be computed.15

The 8087's support for transcendental functions is more limited than you might expect. It only supports tangent and arctangent, not sine or cosine; the user must apply trig identities to compute sine or cosine. Logs and exponentials only support base 2; for base 10 or base e, the user must apply the appropriate scale factor. At the time, the 8087 pushed the limits of what could fit on a chip, so the instruction set was limited to the essentials.

Conclusion

The 8087 is a complex chip and at first it looks like a hopeless maze of circuitry. But much of it can be understood with careful study. It contains 42 constants in a ROM, and the values of these constants can be extracted under a microscope. Some of the constants (such as pi) are expected, while others (such as ln(2)/3) are more puzzling. Many of the constants are used for computing the tangent, arctangent, log, and power functions, using fast CORDIC algorithms.

Die photo of the 8087 with the metal layer removed. Click for a larger image.

Even though Intel's 8087 floating point unit chip was introduced 40 years ago, it still has a large influence today. It spawned the IEEE 754 floating-point standard used for most modern floating-point arithmetic, and the 8087's instructions remain a part of the x86 processors used in most computers.

For more information on the 8087, see my other articles: the two-bit-per-transistor ROM and the substrate bias generator. I announce my latest blog posts on Twitter, so follow me @kenshirriff for future articles. I also have an RSS feed.

Notes and references

The interaction between the 8086 processor and the 8087 floating point unit is somewhat tricky; I'll discuss some highlights. The simplified view is that the 8087 watches the 8086's instruction stream, and executes any instructions that are 8087 instructions. The complication is that the 8086 has an instruction prefetch buffer, so the instruction being fetched isn't the one being executed. Thus, the 8087 duplicates the 8086's prefetch buffer (or the 8088's smaller prefetch buffer), so it knows that the 8086 is doing. (A Twitter thread discusses this in detail.) Another complication is the complex addressing modes used by the 8086, which use registers inside the 8086. The 8087 can't perform these addressing modes since it doesn't have access to the 8086 registers. Instead, when the 8086 sees an 8087 instruction, it does a memory fetch from the addressed location and ignores the result. Meanwhile, the 8087 grabs the address off the bus so it can use the address if it needs it. If there is no 8087 present, you might expect a trap, but that's not what happens. Instead, for a system without an 8087, the linker rewrites the 8087 instructions, replacing them with subroutine calls to the emulation library. ↩
The 8087's microcode ROM is built with an unusual technique that stores two bits per transistor. It does this by using three different transistor sizes or no transistor in each position. The four possibilities at each position represent two bits. This complex technique was necessary in order to fit the large ROM onto the 8087 die. I wrote a blog post with more details. The constant ROM, in comparison, is built using standard techniques. ↩
Sources provide inconsistent values for the number of transistors in the 8087: Intel claims 40,000 transistors while Wikipedia claims 45,000. The discrepancy could be due to different ways of counting transistors. In particular, since the number of transistors in a ROM, PLA or similar structure depends on the data stored in it, sources often count "potential" transistors rather than the number of physical transistors. Other discrepancies can be due to whether or not pull-up transistors are counted and if high-current drivers are counted as multiple transistors in parallel or one large transistor. ↩
Instead of copying bits from the ROM by hand, I made a simple JavaScript program to help me read out the ROM. I clicked on the ROM image to indicate each transistor, and the program produced the corresponding pattern of 0's and 1's. ↩
The ROM has 134 rows of 21 bits, except there is a 6×6 chunk missing from the upper left. Thus, the physical size is of the constant ROM is 2946 bits.

The upper-left corner of the constant ROM, showing the missing 6×6 section.

Because of the ROM layout, this missing section means that the first 12 constants are 64 bits long, rather than 67 bits. These are the non-CORDIC constants, which apparently don't require the extra bits for accuracy. ↩
There are two ways to determine the encoding of the bits. The first is to trace out the circuitry that reads from the ROM and examine how the data is used. The second is to look for patterns in the raw data, and determine what makes sense for an encoding. Since the 8087 is very complex, I wanted to avoid a full reverse-engineering to understand the constants and I used the second approach. ↩
The organization of the rows follows the pattern ABBAABBAABBA..., where "A" rows hold bits for one set of constants and "B" rows hold bits for the second set of constants. This layout was probably used instead of alternating rows ("ABAB") because one connection can drive two neighboring selection transistors. That is, each "AA" or "BB" group can be selected with one wire. ↩
A bit more trial-and-error was necessary to pull the values out of the ROM. I determined three key factors. First, the bits started at the bottom of the ROM, going up. Second, a transistor indicated a 1, rather than a 0. Third, the constants did not have an implicit 1 bit at the beginning. (In other words, the constant format does not match the external data format used by the 8087.) ↩
Some of the exponents were tricky to determine. I used brute force for some of them, seeing if any exponent would yield the log or power of some number. One of the hardest numbers to figure out was ln(2)/3; I'm not sure why this value is important. ↩
Why does the 8087 contain the constant 10¹⁸? Probably because the 8087 supports a packed BCD datatype holding 18 digits, so it can hold up to 10¹⁸. ↩

The following table summarizes the contents of the constant ROM. The "meaning" column is my interpretation of the number.

Constant	Decimal value	Meaning
1.204120×2^-2	0.3010300	log₁₀(2)
1.386294×2^-1	0.6931472	ln(2)
1.442695×2⁰	1.4426950	log₂(e)
1.570796×2¹	3.1415927	Pi
1.000000×2⁰	1.0000000	1
1.660964×2¹	3.3219281	log₂(10)
1.734723×2⁵⁹	1.000e+18	10¹⁸
1.734723×2⁵⁹	1.000e+18	10¹⁸
1.848392×2^-3	0.2310491	ln(2)/3
1.082021×2²	4.3280851	3*log₂(e)
1.442695×2⁰	1.4426950	log₂(e)
1.414214×2⁰	1.4142136	sqrt(2)
1.570796×2^-1	0.7853982	atan(2⁰)
1.854590×2^-2	0.4636476	atan(2^-1)
2.000000×2^-15	0.0000610	atan(2^-14)
2.000000×2^-16	0.0000305	atan(2^-15)
1.959829×2^-3	0.2449787	atan(2^-2)
1.989680×2^-4	0.1243550	atan(2^-3)
2.000000×2^-13	0.0002441	atan(2^-12)
2.000000×2^-14	0.0001221	atan(2^-13)
1.997402×2^-5	0.0624188	atan(2^-4)
1.999349×2^-6	0.0312398	atan(2^-5)
1.999999×2^-11	0.0009766	atan(2^-10)
2.000000×2^-12	0.0004883	atan(2^-11)
1.999837×2^-7	0.0156237	atan(2^-6)
1.999959×2^-8	0.0078123	atan(2^-7)
1.999990×2^-9	0.0039062	atan(2^-8)
1.999997×2^-10	0.0019531	atan(2^-9)
1.441288×2^-9	0.0028150	log₂(1+2^-9)
1.439885×2^-8	0.0056245	log₂(1+2^-8)
1.437089×2^-7	0.0112273	log₂(1+2^-7)
1.431540×2^-6	0.0223678	log₂(1+2^-6)
1.442343×2^-11	0.0007043	log₂(1+2^-11)
1.441991×2^-10	0.0014082	log₂(1+2^-10)
1.420612×2^-5	0.0443941	log₂(1+2^-5)
1.399405×2^-4	0.0874628	log₂(1+2^-4)
1.442607×2^-13	0.0001761	log₂(1+2^-13)
1.442519×2^-12	0.0003522	log₂(1+2^-12)
1.359400×2^-3	0.1699250	log₂(1+2^-3)
1.287712×2^-2	0.3219281	log₂(1+2^-2)
1.442673×2^-15	0.0000440	log₂(1+2^-15)
1.442651×2^-14	0.0000881	log₂(1+2^-14)

It's clear from the CORDIC constants that the values in the ROM are not physically stored in order, i.e. sequential rows are not addressed in order. I'm not sure why 10¹⁸ appears twice; probably one exponent is different. The binary exponents are not in the ROM that I examined, so I had to estimate them. ↩

The 8087 provides seven instructions to load constants directly. The instructions FDLZ, FLD1, FLDPI, FLD2T, FLD2E, FLDLG2, and FLDLN2 load onto the stack the constants 0, 1, pi, log₂10, log₂e, log₁₀2, and ln 2, respectively. Apart from 0, these constants can be found in the ROM. ↩
The 8087's CORDIC algorithm is described in Implementation of transcendental functions on a numerics processor. I wrote sample tangent code based on that description here. There are also a couple of multiplications and divisions in the 8087's full tan algorithm. It uses a simple rational approximation of tangent on the "leftover" angle, giving it a bit more accuracy than straight CORDIC. ↩
Computing the arctangent of an angle uses an algorithm that is similar to the tangent algorithm, but in reverse: as rotations are performed, the angles (from the constant ROM) are summed up to yield the resulting angle. ↩
I couldn't find documentation on the 8087's log and exponent algorithms. I think the algorithms are very similar to the ones on this page, except the 8087 uses base 2 instead of base e. I'm a bit puzzled why the 8087 doesn't need the constant log₂(1 + 2^-1), which is used by that algorithm. ↩

Ken Shirriff's blog

Reverse-engineering and comparing two Game Boy audio amplifier chips

NPN transistors

PNP transistors

Resistors

Capacitors

The LM380

Game Boy Audio chip: headphone amplifier

Game Boy Audio chip: speaker amplifier

Comparison with the Game Boy Color

Conclusion

Notes and references

A look at the die of the 8086 processor

Looking inside the chip

Microcode

History of the 8086

Transistors

Conclusions

Notes and references

Die analysis of the 8087 math coprocessor's fast bit shifter

The shifter

The bit shifter

The byte shifter

The bidirectional drivers

The multiplexer / decoders

Conclusion

Notes and references

Extracting ROM constants from the 8087 math coprocessor's die

Implementation of the ROM

Contents of the ROM

The CORDIC algorithms

Conclusion

Notes and references

Get new posts by email: