Ken Shirriff's blog

Inside the Apollo "8-Ball" FDAI (Flight Director / Attitude Indicator)

During the Apollo flights to the Moon, the astronauts observed the spacecraft's orientation on a special instrument called the FDAI (Flight Director / Attitude Indicator). This instrument showed the spacecraft's attitude—its orientation—by rotating a ball. This ball was nicknamed the "8-ball" because it was black (albeit only on one side). The instrument also acted as a flight director, using three yellow needles to indicate how the astronauts should maneuver the spacecraft. Three more pointers showed how fast the spacecraft was rotating.

An Apollo FDAI (Flight Director/Attitude Indicator) with the case removed. This FDAI is on its side to avoid crushing the needles.

Since the spacecraft rotates along three axes (roll, pitch, and yaw), the ball also rotates along three axes. It's not obvious how the ball can rotate to an arbitrary orientation while remaining attached. In this article, I look inside an FDAI from Apollo that was repurposed for a Space Shuttle simulator1 and explain how it operates. (Spoiler: the ball mechanism is firmly attached at the "equator" and rotates in two axes. What you see is two hollow shells around the ball mechanism that spin around the third axis.)

The FDAI in Apollo

For the missions to the Moon, the Lunar Module had two FDAIs, as shown below: one on the left for the Commander (Neil Armstrong in Apollo 11) and one on the right for the Lunar Module Pilot (Buzz Aldrin in Apollo 11). With their size and central positions, the FDAIs dominate the instrument panel, a sign of their importance. (The Command Module for Apollo also had two FDAIs, but with a different design; I won't discuss them here.2)

The instrument panel in the Lunar Module. From Apollo 15 Lunar Module, NASA, S71-40761. If you're looking for the DSKY, it is in the bottom center, just out of the picture.

Each Lunar Module FDAI could display inputs from multiple sources, selected by switches on the panel.3 The ball could display attitude from either the Inertial Measurement Unit or from the backup Abort Guidance System, selected by the "ATTITUDE MON" toggle switch next to either FDAI. The pitch attitude could also be supplied by an electromechanical unit called ORDEAL (Orbital Rate Display Earth And Lunar) that simulates a circular orbit. The error indications came from the Apollo Guidance Computer, the Abort Guidance System, the landing radar, or the rendezvous radar (controlled by the "RATE/ERROR MON" switches). The pitch, roll, and yaw rate displays were driven by the Rate Gyro Assembly (RGA). The rate indications were scaled by a switch below the FDAI, selecting 25°/sec or 5°/sec.

The FDAI mechanism

The ball inside the indicator shows rotation around three axes. I'll first explain these axes in the context of an aircraft, since the axes of a spacecraft are more arbitrary.4 The roll axis indicates the aircraft's angle if it rolls side-to-side along its axis of flight, raising one wing and lowering the other. Thus, the indicator shows the tilt of the horizon as the aircraft rolls. The pitch axis indicates the aircraft's angle if it pitches up or down, with the indicator showing the horizon moving down or up in response. Finally, the yaw axis indicates the compass direction that the aircraft is heading, changing as the aircraft turns left or right. (A typical aircraft attitude indicator omits yaw.)

I'll illustrate how the FDAI rotates the ball in three axes, using an orange as an example. Imagine pinching the horizontal axis between two fingers with your arm extended. Rotating your arm will roll the ball counter-clockwise or clockwise (red arrow). In the FDAI, this rotation is accomplished by a motor turning the frame that holds the ball. For pitch, the ball rotates forward or backward around the horizontal axis (yellow arrow). The FDAI has a motor inside the ball to produce this rotation. Yaw is a bit more difficult to envision: imagine hemisphere-shaped shells attached to the top and bottom shafts. When a motor rotates these shells (green arrow), the hemispheres will rotate, even though the ball mechanism (the orange) remains stationary.

A sphere, showing the three axes.

The diagram below shows the mechanism inside the FDAI. The indicator uses three motors to move the ball. The roll motor is attached to the FDAI's frame, while the pitch and yaw motors are inside the ball. The roll motor rotates the roll gimbal through gears, causing the ball to rotate clockwise or counterclockwise. The roll gimbal is attached to the ball mechanism at two points along the "equator"; these two points define the pitch axis. Numerous wires on the roll gimbal enter the ball along the pitch axis. The roll control transformer provides position feedback, as will be explained below.

The main components inside the FDAI.

Removing the hemispherical shells reveals the mechanism inside the ball. When the roll gimbal is rotated, this mechanism rotates with it. The pitch motor causes the ball mechanism to rotate around the pitch axis. The yaw motor and control transformer are not visible in this photo; they are behind the pitch components, oriented perpendicularly. The yaw motor turns the vertical shaft, with the two hemisphere shells attached to the top and bottom of the shaft. Thus, the yaw motor rotates the ball shells around the yaw axis, while the mechanism itself remains stationary. The control transformers for pitch and yaw provide position feedback.

The components inside the ball of the FDAI.

Why doesn't the wiring get tangled up as the ball rotates? The solution is two sets of slip rings to implement the electrical connections. The photo below shows the first slip ring assembly, which handles rotation around the roll axis. These slip rings connect the stationary part of the FDAI to the rotating roll gimbal. The vertical metal brushes are stationary; there are 23 pairs of brushes, one for each connection to the ball mechanism. Each pair of brushes contacts one metal ring on the striped shaft, maintaining contact as the shaft rotates. Inside the shaft, 23 wires connect the circular metal contacts to the roll gimbal.

The slip ring assembly in the FDAI.

A second set of slip rings inside the ball handles rotation around the pitch axis. These rings provide the electrical connection between the wiring on the roll gimbal and the ball mechanism. The yaw axis does not use slip rings since only the hemisphere shells rotate around the yaw axis; no wires are involved.

Synchros and the servo loop

In this section, I'll explain how the FDAI is controlled by synchros and servo loops. In the 1950s and 1960s, the standard technique for transmitting a rotational signal electrically was through a synchro. Synchros were used for everything from rotating an instrument indicator in avionics to rotating the gun on a navy battleship. A synchro produces an output that depends on the shaft's rotational position, and transmits this output signal on three wires. If you connect these wires to a second synchro, you can use the first synchro to control the second one: the shaft of the second synchro will rotate to the same angle as the first shaft. Thus, synchros are a convenient way to send a control signal electrically.

The photo below shows a typical synchro, with the input shaft on the top and five wires at the bottom: two for power and three for the output.

A synchro transmitter.

Internally, the synchro has a rotating winding called the rotor that is driven with 400 Hz AC. Three fixed stator windings provide the three AC output signals. As the shaft rotates, the voltages of the output signals change, indicating the angle. (A synchro resembles a transformer with three variable secondary windings.) If two connected synchros have different angles, the magnetic fields create a torque that rotates the shafts into alignment.

The schematic symbol for a synchro transmitter or receiver.

The downside of synchros is that they don't produce a lot of torque. The solution is to use a more powerful motor, controlled by the synchro and a feedback loop called a servo loop. The servo loop drives the motor in the appropriate direction to eliminate the error between the desired position and the current position.

The diagram below shows how the servo loop is constructed from a combination of electronics and mechanical components. The goal is to rotate the output shaft to an angle that exactly matches the input angle, specified by the three synchro wires. The control transformer compares the input angle and the output shaft position, producing an error signal. The amplifier uses this error signal to drive the motor in the appropriate direction until the error signal drops to zero. To improve the dynamic response of the servo loop, the tachometer signal is used as a negative feedback voltage. The feedback slows the motor as the system gets closer to the right position, so the motor doesn't overshoot the position and oscillate. (This is sort of like a PID controller.)

This diagram shows the structure of the servo loop, with a feedback loop ensuring that the rotation angle of the output shaft matches the input angle.

A control transformer is similar to a synchro in appearance and construction, but the rotating shaft operates as an input, not the output. In a control transformer, the three stator windings receive the inputs and the rotor winding provides the error output. If the rotor angle of the synchro transmitter and control transformer are the same, the signals cancel out and there is no error voltage. But as the difference between the two shaft angles increases, the rotor winding produces an error signal. The phase of the error signal indicates the direction of the error.

In the FDAI, the motor is a special motor/tachometer, a device that was often used in avionics servo loops. This motor is more complicated than a regular electric motor. The motor is powered by 115 volts AC at 400 hertz, but this won't spin the motor on its own. The motor also has two low-voltage control windings. Energizing the control windings with the proper phase causes the motor to spin in one direction or the other. The motor/tachometer unit also contains a tachometer to measure its speed for the feedback loop. The tachometer is driven by another 115-volt AC winding and generates a low-voltage AC signal that is proportional to the motor's rotational speed.

A motor/tachometer similar (but not identical) to the one in the FDAI.

The photo above shows a motor/tachometer with the rotor removed. The unit has many wires because of its multiple windings. The rotor has two drums. The drum on the left, with the spiral stripes, is for the motor. This drum is a "squirrel-cage rotor", which spins due to induced currents. (There are no electrical connections to the rotor; the drums interact with the windings through magnetic fields.) The drum on the right is the tachometer rotor; it induces a signal in the output winding proportional to the speed due to eddy currents. The tachometer signal is at 400 Hz like the driving signal, either in phase or 180º out of phase, depending on the direction of rotation. For more information on how a motor/tachometer works, see my teardown.

The amplifiers

The FDAI has three servo loops—one for each axis—and each servo loop has a separate control transformer, motor, and amplifier. The photo below shows one of the three amplifier boards. The construction is unusual and somewhat chaotic, with some components stacked on top of others to save space. Some of the component leads are long and protected with clear plastic sleeves.5 The cylindrical pulse transformer in the middle has five colorful wires coming out of it. At the left are the two transistors that drive the motor's control windings, with two capacitors between them. The transistors are mounted on a heat sink that is screwed down to the case of the amplifier assembly for cooling. Each amplifier is connected to the FDAI through seven wires with pins that plug into the sockets on the right of the board.6

One of the three amplifier boards. At the right front of the board, you can see a capacitor stacked on top of a resistor. The board is shiny because it is covered with conformal coating.

The function of the board is to amplify the error signal so the motor rotates in the appropriate direction. The amplifier also uses the tachometer output from the motor unit to slow the motor as the error signal decreases, preventing overshoot. The inputs to the amplifier are 400 hertz AC signals, with the magnitude indicating the amount of error or speed and the phase indicating the direction. The two outputs from the amplifier drive the two control windings of the motor, determining which direction the motor rotates.

The schematic for the amplifier board is below. 7 The two transistors on the left amplify the error and tachometer signals, driving the pulse transformer. The outputs of the pulse transformer will have opposite phases, driving the output transistors for opposite halves of the 400 Hz cycle. This activates the motor control winding, causing the motor to spin in the desired direction.8

The schematic of an amplifier board.

History of the FDAI

Bill Lear, born in 1902, was a prolific inventor with over 150 patents, creating everything from the 8-track tape to the Learjet, the iconic private plane of the 1960s. He created multiple companies in the 1920s as well as inventing one of the first car radios for Motorola before starting Lear Avionics, a company that specialized in aerospace instruments.9 Lear produced innovative aircraft instruments and flight control systems such as the F-5 automatic pilot, which received a trophy as the "greatest aviation achievement in America" for 1950.

Bill Lear went on to solve an indicator problem for the Air Force: the supersonic F-102 Delta Dagger interceptor (1953) could climb at steep angles, but existing attitude indicators could not handle nearly vertical flight. Lear developed a remote two-gyro platform that drove the cockpit indicator while avoiding "gimbal lock" during vertical flight. For the experimental X-15 rocket-powered aircraft (1959), Lear improved this indicator to handle three axes: roll, pitch, and yaw.

Meanwhile, the Siegler Corporation started in 1950 to manufacture space heaters for homes. A few years later, Siegler was acquired by John Brooks, an entrepreneur who was enthusiastic about acquisitions. In 1961, Lear Avionics became his latest acquisition, and the merged company was called Lear Siegler Incorporated, often known as LSI. (Older programmers may know Lear Siegler through the ADM-3A, an inexpensive video display terminal from 1976 that housed the display and keyboard in a stylish white case.)

The X-15's attitude indicator became the basis of the indicator for the F-4 fighter plane (the ARU/11-A). Then, after "a minimum of modification", the attitude-director indicator was used in the Gemini space program. In total, Lear Siegler provided 11 instruments in the Gemini instrument panel, with the attitude director the most important. Next, Gemini's indicator was modified to become the FDAI (flight director-attitude indicator) in the Lunar Module for Apollo.10 Lear Siegler provided numerous components for the Apollo program, from a directional gyro for the Lunar Rover to the electroluminescent display for the Apollo Guidance Computer's Display/Keyboard (DSKY).

An article titled "LSI Instruments Aid in Moon Landing" from LSI's internal LSI Log publication, July 1969. (Click for a larger version.)

In 1974, Lear Siegler obtained a contract to develop the Attitude-Director Indicator (ADI) for the Space Shuttle, producing a dozen ADI units for the Space Shuttle. However, by this time, Lear Siegler was losing enthusiasm for low-volume space avionics. The Instrument Division president said that "the business that we were in was an engineering business and engineers love a challenge." However, manufacturing refused to deal with the special procedures required for space manufacturing, so the Shuttle units were built by the engineering department. Lear Siegler didn't bid on later Space Shuttle avionics and the Shuttle ADI became its last space product. In the early 2000s, the Space Shuttle's instruments were upgraded to a "glass cockpit" with 11 flat-panel displays known as the Multi-function Electronic Display System (MEDS). The MEDS was produced by Lear Siegler's long-term competitor, Honeywell.

Getting back to Bill Lear, he wanted to manufacture aircraft, not just aircraft instruments, so he created the Learjet, the first mass-produced business jet. The first Learjet flew in 1963, with over 3000 eventually delivered. In the early 1970s, Lear designed a steam turbine automobile engine. Rather than water, the turbine used a secret fluorinated hydrocarbon called "Learium". Lear had visions of thousands of low-pollution "Learmobiles", but the engine failed to catch on. Lear had been on the verge of bankruptcy in the 1960s; one of his VPs explained that "the great creative minds can't be bothered with withholding taxes and investment credits and all this crap". But by the time of his death in 1978, Lear had a fortune estimated at $75 million.

Comparing the ARU/11-A and the FDAI

Looking inside our FDAI sheds more details on the evolution of Lear Siegler's attitude directors. The photo below compares the Apollo FDAI (top) to the earlier ARU/11-A used in the F-4 aircraft (bottom). While the basic mechanism and the electronic amplifiers are the same between the two indicators, there are also substantial changes.

Comparison of an FDAI (top) with an ARU-11/A (bottom). The amplifier boards and needles have been removed from the FDAI.

The biggest difference between the ARU-11/A indicator and the FDAI is that the electronics for the ARU-11/A are in a separate module that was plugged into the back of the indicator, while the FDAI includes the electronics internally, with boards mounted on the instrument frame. Specifically, the ARU-11/A has a separate unit containing a multi-winding transformer, a power supply board, and three amplifier boards (one for each axis), while the FDAI contains these components internally. The amplifier boards in the ARU-11/A and the FDAI are identical, constructed from germanium transistors rather than silicon.11 The unusual 11-pin transformers are also the same. However, the power supply boards are different, probably because the boards also contain scaling resistors that vary between the units.12 The power supply boards are also different shapes to fit the available space.

The ball assemblies of the ARU/11-A and the FDAI are almost the same, with the same motor assemblies and slip ring mechanism. The gearing has minor changes. In particular, the FDAI has two plastic gears, while the ARU/11-A uses exclusively metal gears.

The ARU/11-A has a patented pitch trim feature that was mostly—but not entirely—removed from the Apollo FDAI. The motivation for this feature is that an aircraft in level flight will be pitched up a few degrees, the "angle of attack". It is desirable for the attitude indicator to show the aircraft as horizontal, so a pitch trim knob allows the angle of attack to be canceled out on the display. The problem is that if you fly your fighter plane vertically, you want the indicator to show precisely vertical flight, rather than applying the pitch trim adjustment. The solution in the ARU-11/A is a special 8-zone potentiometer on the pitch axis that will apply the pitch trim adjustment in level flight but not in vertical flight, while providing a smooth transition between the regions. This special potentiometer is mounted inside the ball of the ARU-11/A. However, this pitch trim adjustment is meaningless for a spacecraft, so it is not implemented in the Apollo or Space Shuttle instruments. Surprisingly, the shell of the potentiometer still exists in our FDAI, but without the potentiometer itself or the wiring. Perhaps it remained to preserve the balance of the ball. In the photo below, the cylindrical potentiometer shell is indicated by an arrow. Note the holes in the front of the shell; in the ARU-11/A, the potentiometer's wiring terminals protrude through these holes, but in the FDAI, the holes are covered with tape.

Inside the ball of the FDAI. The potentiometer shell is indicated with an arrow.

Finally, the mounting of the ball hemispheres is slightly different. The ARU/11-A uses four screws at the pole of each hemisphere. Our FDAI, however, uses a single screw at each pole; the screw is tightened with a Bristol Key, causing the shaft to expand and hold the hemisphere in place.

To summarize, the Apollo FDAI occupies a middle ground: while it isn't simply a repurposed ARU-11/A, neither is it a complete redesign. Instead, it preserves the old design where possible, while stripping out undesired features such as pitch trim. The separate amplifier and mechanical units of the ARU/11-A were combined to form the larger FDAI.

Differences from Apollo

The FDAI that we examined is a special unit: it was originally built for Apollo but was repurposed for a Space Shuttle simulator. Our FDAI is labeled Model 4068F, which is a Lunar Module part number. Moreover, the FDAI is internally stamped with the date "Apr. 22 1968", over a year before the first Moon landing.

However, a closer look shows that several key components were modified to make the Apollo FDAI work in the Shuttle Simulator.14 The Apollo FDAI (and the Shuttle ADI) used resolvers as inputs to control the ball, while our FDAI uses synchros. (Resolvers and synchros are similar, except resolvers use sine and cosine inputs, 90° apart, on two wire pairs, while synchros use three inputs, 120° apart, on three wires.) NASA must have replaced the three resolver control transformers in the FDAI with synchro control transformers for use in the simulator.

The Apollo FDAI used electroluminescent lighting for the display, while ours uses eight small incandescent bulbs. The metal case of our FDAI has a Dymo embossed tape label "INCANDESCENT LIGHTING", alerting users to the change from Apollo's illumination. Our FDAI also contains a step-down transformer to convert the 115 VAC input into 5 VAC to power the bulbs, while the Shuttle powered its ADI illumination directly from 5 volts.

The dial of our FDAI was repainted to match the dial of the Shuttle FDAI. The Apollo FDAI had red bands on the left and right of the dial. A close examination of our dial shows that black paint was carefully applied over the red paint, but a few specks of red paint are still visible (below). Moreover, the edges of the lines and the lozenge show slight unevenness from the repainting. Second, the Apollo FDAI had the text "ROLL RATE", "PITCH RATE", and "YAW RATE" in white next to the needle scales. In our FDAI, this text has been hidden by black paint to match the Shuttle display.13 Third, the Apollo LM FDAI had a crosshair in the center of the instrument, while our FDAI has a white U-shaped indicator, the same as the Shuttle (and the Command Module's FDAI). Finally, the ball of the Apollo FDAI has red circular regions at the poles to warn of orientations that can cause gimbal lock. Our FDAI (like the Shuttle) does not have these circles. We couldn't see any evidence that these regions were repainted, so we suspect that our FDAI has Shuttle hemispheres on the ball.

A closeup of the dial on our FDAI shows specks of red paint around the dial markings. The color is probably Switzer DayGlo Rocket Red.

Our FDAI has also been modified electrically. Small green connectors (Micro-D MDB1) have been added between the slip rings and the motors, as well as on the gimbal arm. We think these connectors were added post-Apollo, since they are attached somewhat sloppily with glue and don't look flight-worthy. Perhaps these connectors were added to make disassembly and modification easier. Moreover, our FDAI has an elapsed time indicator, also mounted with glue.

The back of our FDAI is completely different from Apollo. First, the connector's pinout is completely different. Second, each of the six indicator needles has a mechanical adjustment as well as a trimpot (details). Finally, each of the three axes has an adjustment potentiometer.

The Shuttle's ADI (Attitude Director Indicator)

Each Space Shuttle had three ADIs (Attitude Director Indicators), which were very similar to the Apollo FDAI, despite the name change. The photo below shows the two octagonal ADIs in the forward flight deck, one on the left in front of the Commander, and one on the right in front of the Pilot. The aft flight deck station had a third ADI.15

This photo shows Discovery's forward flight deck on STS-063 (1999). The ADIs are indicated with arrows. The photo is from the National Archives.

Our FDAI appears to have been significantly modified for use in the Shuttle simulator, as described above. However, it is much closer to the Apollo FDAI than the ADI used in the Shuttle, as I'll show in this section. My hypothesis is that the simulator was built before the Shuttle's ADI was created, so the Apollo FDAI was pressed into service.

The Shuttle's ADI was much more complicated electrically than the Apollo FDAI and our FDAI, providing improved functionality.16 For instance, while the Apollo FDAI had a simple "OFF" indicator flag to show that the indicator had lost power, the Shuttle's ADI had extensive error detection. It contained voltage level monitors to check its five power supplies. (The Shuttle ADI used three DC power sources and two AC power sources, compared to the single AC supply for Apollo.) The Shuttle's ADI also monitored the ball servos to detect position errors. Finally, it received an external "Data OK" signal. If a fault was detected by any of these monitors, the "OFF" flag was deployed to indicate that the ADI could not be trusted.

The Shuttle's ADI had six needles, the same as Apollo, but the Shuttle used feedback to make the positions more accurate. Specifically, each Shuttle needle had a feedback sensor, a Linear Variable Differential Transformer (LVDT) that generates a voltage based on the needle position. The LVDT output drove a servo feedback loop to ensure that the needle was in the exact desired position. In the Apollo FDAI, on the other hand, the needle input voltage drove a galvanometer, swinging the needle proportionally, but there was no closed loop to ensure accuracy.

I assume that the Shuttle's ADI had integrated circuit electronics to implement this new functionality, considerably more modern than the germanium transistors in the Apollo FDAI. The Shuttle probably used the same mechanical structures to rotate the ball, but I can't confirm that.

Conclusions

The FDAI was a critical instrument in Apollo, indicating the orientation of the spacecraft in three axes. It wasn't obvious to me how the "8-ball" can rotate in three axes while still being securely connected to the instrument. The trick is that most of the mechanism rotates in two axes, while hollow hemispherical shells provide the third rotational axis.

The FDAI has an interesting evolutionary history, from the experimental X-15 rocket plane and the F-4 fighter to the Gemini, Apollo, and Space Shuttle flights. Our FDAI has an unusual position in this history: since it was modified from Apollo to function in a Space Shuttle simulator, it shows aspects of both Apollo and the Space Shuttle indicators. It would be interesting to compare the design of a Shuttle ADI to the Apollo FDAI, but I haven't been able to find interior photos of a Shuttle ADI (or of an unmodified Apollo FDAI).17

You can see a brief video of the FDAI in motion here. For more, follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS. (I've given up on Twitter.) I worked on this project with CuriousMarc, Mike Stewart, and Eric Schlapfer, so expect a video at some point. Thanks to Richard for providing the FDAI. I wrote about the F-4 fighter plane's attitude indicator here.

Inside the FDAI. The amplifier boards have been removed for this photo.

Notes and references

There were many Space Shuttle simulators, so it is unclear which simulator was the source of our FDAI. The photo below shows a simulator, with one of the ADIs indicated with an arrow. Presumably, our FDAI became available when a simulator was upgraded from physical instruments to the screens of the Multi-function Electronic Display System (MEDS).

"Forward flight deck of the fixed-base simulator." From Introduction to Shuttle Mission Simulation

The most complex simulators were the three Shuttle Mission Simulators, one of which could dynamically move to provide motion cues. These simulators were at the simulation facility in Houston—officially the Jake Garn Mission Simulator and Training Facility—which also had a guidance and navigation simulator, a Spacelab simulator, and integration with the WETF (Weightless Environment Training Facility, an underground pool to simulate weightlessness). The simulators were controlled by a computer complex containing dozens of networked computers. The host computers were three UNIVAC 1100/92 mainframes, 36-bit computers that ran the simulation models. These were supported by seventeen Concurrent Computer Corporation 3260 and 3280 super-minicomputers that simulated tracking, telemetry, and communication. The simulators also used real Shuttle computers running the actual flight software; these were IBM AP101S General-Purpose Computers (GPC). For more information, see Introduction to Shuttle Mission Simulation.

NASA had additional Shuttle training facilities beyond the Shuttle Mission Simulator. The Full Fuselage Trainer was a mockup of the complete Shuttle orbiter (minus the wings). It included full instrument panels (including the ADIs), but did not perform simulations. The Crew Compartment Trainers could be positioned horizontally or vertically (to simulate pre-launch operations). They contained accurate flight decks with non-functional instruments. Three Single System Trainers provided simpler mockups for astronauts to learn each system, both during normal operation and during malfunctions, before using the more complex Shuttle Mission Simulator. A list of Shuttle training facilities is in Table 3.1 of Preparing for the High Frontier. Following the end of the Shuttle program, the trainers were distributed to various museums (details). ↩
The Command Module for Apollo used a completely different FDAI (flight director-attitude indicator) that was built by Honeywell. The two designs can be easily distinguished: the Honeywell FDAI is round, while the Lear Siegler FDAI is octagonal. ↩
The FDAI's signals are more complicated than I described above. Among other things, the IMU's gimbal angles use a different coordinate system from the FDAI, so an electromechanical unit called GASTA (Gimbal Angle Sequence Transformation Assembly) used resolvers and motors to convert the coordinates. The digital attitude error signals from the computer are converted to analog by the Inertial Measurement Unit's Coupling Data Unit (IMU CDU). For attitude, the IMU is selected with the PGNS (Primary Guidance and Navigation System) switch setting. See the Lunar Module Systems Handbook, Lunar Module System Handbook Rev A, and the Apollo Operations Handbook for more.

The connections to the Apollo FDAIs. Adapted from LM-1 Systems Handbook. I think this diagram predates the ORDEAL system. (Click for a larger version.)

↩
The roll, pitch, and yaw axes of the Lunar Module are not as obvious as the axes of an airplane. The diagram below defines these axes.

The roll, pitch, and yaw axes of the Lunar Module. Adapted from LM Systems Handbook.

↩
The amplifier is constructed on a single-sided printed circuit board. Since the components are packed tightly on the board, routing of the board was difficult. However, some of the components have long leads, protected by plastic sleeves. This provides additional flexibility for the board routing since the leads could be positioned as desired, regardless of the geometry of the component. As a result, the style of this board is very different from modern circuit boards, where components are usually arranged in an orderly pattern. ↩
In our FDAI, the amplifier boards as well as the needle actuators are connected by pins that plug into sockets. These connections don't seem suitable for flight since they could easily vibrate loose. We suspect that the pin-and-socket connections made the module easier to reconfigure in the simulator, but were not used in flyable units. In particular, in the similar aircraft instruments (ARU/11-A) that we examined, the wires to the amplifier boards were soldered. ↩
The board has a 56-volt Zener diode, but the function of the diode is unclear. The board is powered by 28 volts, not enough voltage to activate the Zener. Perhaps the diode filters high-voltage transients, but I don't see how transients could arise in that part of the circuit. (I can imagine transients when the pulse transformer switches, but the Zener isn't connected to the transformer.) ↩
In more detail, each motor's control winding is a center-tapped winding, with the center connected to 28 volts DC. The amplifier board's output transistors will ground either side of the winding during alternate half-cycles of the 400 Hz cycle. This causes the motor to spin in one direction or the other. (Usually, control winding are driven 90° out of phase with the motor power, but I'm not sure how this phase shift is applied in the FDAI.) ↩
The history of Bill Lear and Lear Siegler is based on Love him or hate him, Bill Lear was a creator and On Course to Tomorrow: A History of Lear Siegler Instrument Division’s Manned Spaceflight Systems 1958-1981. ↩
Numerous variants of the Lear Siegler FDAI were built for Apollo, as shown before. Among other things, the length of the unit ("L MAX") varied from 8 inches to 11 inches. (Our FDAI is approximately 8 inches long.)

The Apollo FDAI part number chart from Grumman Specification Control Drawing LSC350-301. (Click for a larger view.)

↩
We examined a different ARU-11/A where the amplifier boards were not quite identical: the boards had one additional capacitor and some of the PCB traces were routed slightly differently. These boards were labeled "REV C" in the PCB copper, so they may have been later boards with a slight modification. ↩
The amplifier scaling resistors were placed on the power supply board rather than the amplifier boards, which may seem strange. The advantage of this approach is that it permitted the three amplifier boards to be identical, since the components that differ between the axes were not part of the amplifier boards. This simplified the manufacture and repair of the amplifier boards. ↩
On the front panel of our FDAI, the text "ROLL RATE", "PITCH RATE", and "YAW RATE" has been painted over. However, the text is still faintly visible (reversed) on the inside of the panel, as shown below.

The inside of the FDAI's front cover.

↩
The diagram below shows the internals of the Apollo LM FDAI at a high level. This diagram shows several differences between the LM FDAI and the FDAI that we examined. First, the roll, pitch, and yaw inputs to the LM FDAI are resolver inputs (i.e. sin and cos), rather than the synchro inputs to our FDAI. Second, the needle signals below are modulated on an 800 Hz carrier and are demodulated inside the FDAI. Our FDAI, however, uses positive or negative voltages to drive the needle galvanometers directly. A minor difference is that the diagram below shows the Power Off Flag wired to +28V internally, while our FDAI has the flag wired to connector pins, probably so the flag could be controlled by the simulator.

The diagram of the FDAI in the LM Systems Handbook. Click for a larger image.

↩
The Space Shuttle instruments were replaced with color LCD screens in the MEDS (Multifunction Electronic Display System) upgrade. This upgrade is discussed in New Displays for the Space Shuttle Cockpit. The Space Shuttle Systems Handbook shows the ADIs on the forward console (pages 263-264) and the aft console (page 275). The physical ADI is compared to the MEDS ADI display in Displays and Controls, Vol. 1 page 119. ↩
The diagram below shows the internals of the Shuttle's ADI at a high level. The Shuttle's ADI is more complicated than the Apollo FDAI, even though they have the same indicator ball and needles.

A diagram of the Space Shuttle's ADI. From Space Shuttle Systems Handbook Vol. 1, 1 G&C DISP 1. (Click for a larger image.)

↩
Multiple photos of the exterior of the Shuttle ADI are available here, from the National Air and Space Museum. There are interior photos of Apollo FDAIs online, but they all appear to be modified for Shuttle simulators. ↩

Reverse engineering the 386 processor's prefetch queue circuitry

In 1985, Intel introduced the groundbreaking 386 processor, the first 32-bit processor in the x86 architecture. To improve performance, the 386 has a 16-byte instruction prefetch queue. The purpose of the prefetch queue is to fetch instructions from memory before they are needed, so the processor usually doesn't need to wait on memory while executing instructions. Instruction prefetching takes advantage of times when the processor is "thinking" and the memory bus would otherwise be unused.

In this article, I look at the 386's prefetch queue circuitry in detail. One interesting circuit is the incrementer, which adds 1 to a pointer to step through memory. This sounds easy enough, but the incrementer uses complicated circuitry for high performance. The prefetch queue uses a large network to shift bytes around so they are properly aligned. It also has a compact circuit to extend signed 8-bit and 16-bit numbers to 32 bits. There aren't any major discoveries in this post, but if you're interested in low-level circuits and dynamic logic, keep reading.

The photo below shows the 386's shiny fingernail-sized silicon die under a microscope. Although it may look like an aerial view of a strangely-zoned city, the die photo reveals the functional blocks of the chip. The Prefetch Unit in the upper left is the relevant block. In this post, I'll discuss the prefetch queue circuitry (highlighted in red), skipping over the prefetch control circuitry to the right. The Prefetch Unit receives data from the Bus Interface Unit (upper right) that communicates with memory. The Instruction Decode Unit receives prefetched instructions from the Prefetch Unit, byte by byte, and decodes the opcodes for execution.

This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version.

The left quarter of the chip consists of stripes of circuitry that appears much more orderly than the rest of the chip. This grid-like appearance arises because each functional block is constructed (for the most part) by repeating the same circuit 32 times, once for each bit, side by side. Vertical data lines run up and down, in groups of 32 bits, connecting the functional blocks. To make this work, each circuit must fit into the same width on the die; this layout constraint forces the circuit designers to develop a circuit that uses this width efficiently without exceeding the allowed width. The circuitry for the prefetch queue uses the same approach: each circuit is 66 µm wide1 and repeated 32 times. As will be seen, fitting the prefetch circuitry into this fixed width requires some layout tricks.

What the prefetcher does

The purpose of the prefetch unit is to speed up performance by reading instructions from memory before they are needed, so the processor won't need to wait to get instructions from memory. Prefetching takes advantage of times when the memory bus is otherwise idle, minimizing conflict with other instructions that are reading or writing data. In the 386, prefetched instructions are stored in a 16-byte queue, consisting of four 32-bit blocks.2

The diagram below zooms in on the prefetcher and shows its main components. You can see how the same circuit (in most cases) is repeated 32 times, forming vertical bands. At the top are 32 bus lines from the Bus Interface Unit. These lines provide the connection between the datapath and external memory, via the Bus Interface Unit. These lines form a triangular pattern as the 32 horizontal lines on the right branch off and form 32 vertical lines, one for each bit. Next are the fetch pointer and the limit register, with a circuit to check if the fetch pointer has reached the limit. Note that the two low-order bits (on the right) of the incrementer and limit check circuit are missing. At the bottom of the incrementer, you can see that some bit positions have a blob of circuitry missing from others, breaking the pattern of repeated blocks. The 16-byte prefetch queue is below the incrementer. Although this memory is the heart of the prefetcher, its circuitry takes up a relatively small area.

A close-up of the prefetcher with the main blocks labeled. At the right, the prefetcher receives control signals.

The bottom part of the prefetcher shifts data to align it as needed. A 32-bit value can be split across two 32-bit rows of the prefetch buffer. To handle this, the prefetcher includes a data shift network to shift and align its data. This network occupies a lot of space, but there is no active circuitry here: just a grid of horizontal and vertical wires.

Finally, the sign extend circuitry converts a signed 8-bit or 16-bit value into a signed 16-bit or 32-bit value as needed. You can see that the sign extend circuitry is highly irregular, especially in the middle. A latch stores the output of the prefetch queue for use by the rest of the datapath.

Limit check

If you've written x86 programs, you probably know about the processor's Instruction Pointer (EIP) that holds the address of the next instruction to execute. As a program executes, the Instruction Pointer moves from instruction to instruction. However, it turns out that the Instruction Pointer doesn't actually exist! Instead, the 386 has an "Advance Instruction Fetch Pointer", which holds the address of the next instruction to fetch into the prefetch queue. But sometimes the processor needs to know the Instruction Pointer value, for instance, to determine the return address when calling a subroutine or to compute the destination address of a relative jump. So what happens? The processor gets the Advance Instruction Fetch Pointer address from the prefetch queue circuitry and subtracts the current length of the prefetch queue. The result is the address of the next instruction to execute, the desired Instruction Pointer value.

The Advance Instruction Fetch Pointer—the address of the next instruction to prefetch—is stored in a register at the top of the prefetch queue circuitry. As instructions are prefetched, this pointer is incremented by the prefetch circuitry. (Since instructions are fetched 32 bits at a time, this pointer is incremented in steps of four and the bottom two bits are always 0.)

But what keeps the prefetcher from prefetching too far and going outside the valid memory range? The x86 architecture infamously uses segments to define valid regions of memory. A segment has a start and end address (known as the base and limit) and memory is protected by blocking accesses outside the segment. The 386 has six active segments; the relevant one is the Code Segment that holds program instructions. Thus, the limit address of the Code Segment controls when the prefetcher must stop prefetching.3 The prefetch queue contains a circuit to stop prefetching when the fetch pointer reaches the limit of the Code Segment. In this section, I'll describe that circuit.

Comparing two values may seem trivial, but the 386 uses a few tricks to make this fast. The basic idea is to use 30 XOR gates to compare the bits of the two registers. (Why 30 bits and not 32? Since 32 bits are fetched at a time, the bottom bits of the address are 00 and can be ignored.) If the two registers match, all the XOR values will be 0, but if they don't match, an XOR value will be 1. Conceptually, connecting the XORs to a 32-input OR gate will yield the desired result: 0 if all bits match and 1 if there is a mismatch. Unfortunately, building a 32-input OR gate using standard CMOS logic is impractical for electrical reasons, as well as inconveniently large to fit into the circuit. Instead, the 386 uses dynamic logic to implement a spread-out NOR gate with one transistor in each column of the prefetcher.

The schematic below shows the implementation of one bit of the equality comparison. The mechanism is that if the two registers differ, the transistor on the right is turned on, pulling the equality bus low. This circuit is replicated 30 times, comparing all the bits: if there is any mismatch, the equality bus will be pulled low, but if all bits match, the bus remains high. The three gates on the left implement XNOR; this circuit may seem overly complicated, but it is a standard way of implementing XNOR. The NOR gate at the right blocks the comparison except during clock phase 2. (The importance of this will be explained below.)

This circuit is repeated 30 times to compare the registers.

The equality bus travels horizontally through the prefetcher, pulled low if any bits don't match. But what pulls the bus high? That's the job of the dynamic circuit below. Unlike regular static gates, dynamic logic is controlled by the processor's clock signals and depends on capacitance in the circuit to hold data. The 386 is controlled by a two-phase clock signal.4 In the first clock phase, the precharge transistor below turns on, pulling the equality bus high. In the second clock phase, the XOR circuits above are enabled, pulling the equality bus low if the two registers don't match. Meanwhile, the CMOS switch turns on in clock phase 2, passing the equality bus's value to the latch. The "keeper" circuit keeps the equality bus held high unless it is explicitly pulled low, to avoid the risk of the voltage on the equality bus slowly dissipating. The keeper uses a weak transistor to keep the bus high while inactive. But if the bus is pulled low, the keeper transistor is overpowered and turns off.

This is the output circuit for the equality comparison. This circuit is located to the right of the prefetcher.

This dynamic logic reduces power consumption and circuit size. Since the bus is charged and discharged during opposite clock phases, you avoid steady current through the transistors. (In contrast, an NMOS processor like the 8086 might use a pull-up on the bus. When the bus is pulled low, would you end up with current flowing through the pull-up and the pull-down transistors. This would increase power consumption, make the chip run hotter, and limit your clock speed.)

The incrementer

After each prefetch, the Advance Instruction Fetch Pointer must be incremented to hold the address of the next instruction to prefetch. Incrementing this pointer is the job of the incrementer. (Because each fetch is 32 bits, the pointer is incremented by 4 each time. But in the die photo, you can see a notch in the incrementer and limit check circuit where the circuitry for the bottom two bits has been omitted. Thus, the incrementer's circuitry increments its value by 1, so the pointer (with two zero bits appended) increases in steps of 4.)

Building an incrementer circuit is straightforward, for example, you can use a chain of 30 half-adders. The problem is that incrementing a 30-bit value at high speed is difficult because of the carries from one position to the next. It's similar to calculating 99999999 + 1 in decimal; you need to tediously carry the 1, carry the 1, carry the 1, and so forth, through all the digits, resulting in a slow, sequential process.

The incrementer uses a faster approach. First, it computes all the carries at high speed, almost in parallel. Then it computes each output bit in parallel from the carries—if there is a carry into a position, it toggles that bit.

Computing the carries is straightforward in concept: if there is a block of 1 bits at the end of the value, all those bits will produce carries, but carrying is stopped by the rightmost 0 bit. For instance, incrementing binary 11011 results in 11100; there are carries from the last two bits, but the zero stops the carries. A circuit to implement this was developed at the University of Manchester in England way back in 1959, and is known as the Manchester carry chain.

In the Manchester carry chain, you build a chain of switches, one for each data bit, as shown below. For a 1 bit, you close the switch, but for a 0 bit you open the switch. (The switches are implemented by transistors.) To compute the carries, you start by feeding in a carry signal at the right The signal will go through the closed switches until it hits an open switch, and then it will be blocked.5 The outputs along the chain give us the desired carry value at each position.

Concept of the Manchester carry chain, 4 bits.

Since the switches in the Manchester carry chain can all be set in parallel and the carry signal blasts through the switches at high speed, this circuit rapidly computes the carries we need. The carries then flip the associated bits (in parallel), giving us the result much faster than a straightforward adder.

There are complications, of course, in the actual implementation. The carry signal in the carry chain is inverted, so a low signal propagates through the carry chain to indicate a carry. (It is faster to pull a signal low than high.) But something needs to make the line go high when necessary. As with the equality circuitry, the solution is dynamic logic. That is, the carry line is precharged high during one clock phase and then processing happens in the second clock phase, potentially pulling the line low.

The next problem is that the carry signal weakens as it passes through multiple transistors and long lengths of wire. The solution is that each segment has a circuit to amplify the signal, using a clocked inverter and an asymmetrical inverter. Importantly, this amplifier is not in the carry chain path, so it doesn't slow down the signal through the chain.

The Manchester carry chain circuit for a typical bit in the incrementer.

The schematic above shows the implementation of the Manchester carry chain for a typical bit. The chain itself is at the bottom, with the transistor switch as before. During clock phase 1, the precharge transistor pulls this segment of the carry chain high. During clock phase 2, the signal on the chain goes through the "clocked inverter" at the right to produce the local carry signal. If there is a carry, the next bit is flipped by the XOR gate, producing the incremented output.6 The "keeper/amplifier" is an asymmetrical inverter that produces a strong low output but a weak high output. When there is no carry, its weak output keeps the carry chain pulled high. But as soon as a carry is detected, it strongly pulls the carry chain low to boost the carry signal.

But this circuit still isn't enough for the desired performance. The incrementer uses a second carry technique in parallel: carry skip. The concept is to look at blocks of bits and allow the carry to jump over the entire block. The diagram below shows a simplified implementation of the carry skip circuit. Each block consists of 3 to 6 bits. If all the bits in a block are 1's, then the AND gate turns on the associated transistor in the carry skip line. This allows the carry skip signal to propagate (from left to right), a block at a time. When it reaches a block with a 0 bit, the corresponding transistor will be off, stopping the carry as in the Manchester carry chain. The AND gates all operate in parallel, so the transistors are rapidly turned on or off in parallel. Then, the carry skip signal passes through a small number of transistors, without going through any logic. (The carry skip signal is like an express train that skips most stations, while the Manchester carry chain is the local train to all the stations.) Like the Manchester carry chain, the implementation of carry skip needs precharge circuits on the lines, a keeper/amplifier, and clocked logic, but I'll skip the details.

An abstracted and simplified carry-skip circuit. The block sizes don't match the 386's circuit.

One interesting feature is the layout of the large AND gates. A 6-input AND gate is a large device, difficult to fit into one cell of the incrementer. The solution is that the gate is spread out across multiple cells. Specifically, the gate uses a standard CMOS NAND gate circuit with NMOS transistors in series and PMOS transistors in parallel. Each cell has an NMOS transistor and a PMOS transistor, and the chains are connected at the end to form the desired NAND gate. (Inverting the output produces the desired AND function.) This spread-out layout technique is unusual, but keeps each bit's circuitry approximately the same size.

The incrementer circuitry was tricky to reverse engineer because of these techniques. In particular, most of the prefetcher consists of a single block of circuitry repeated 32 times, once for each bit. The incrementer, on the other hand, consists of four different blocks of circuitry, repeating in an irregular pattern. Specifically, one block starts a carry chain, a second block continues the carry chain, and a third block ends a carry chain. The block before the ending block is different (one large transistor to drive the last block), making four variants in total. This irregular pattern is visible in the earlier photo of the prefetcher.

The alignment network

The bottom part of the prefetcher rotates data to align it as needed. Unlike some processors, the x86 does not enforce aligned memory accesses. That is, a 32-bit value does not need to start on a 4-byte boundary in memory. As a result, a 32-bit value may be split across two 32-bit rows of the prefetch queue. Moreover, when the instruction decoder fetches one byte of an instruction, that byte may be at any position in the prefetch queue.

To deal with these problems, the prefetcher includes an alignment network that can rotate bytes to output a byte, word, or four bytes with the alignment required by the rest of the processor.

The diagram below shows part of this alignment network. Each bit exiting the prefetch queue (top) has four wires, for rotates of 24, 16, 8, or 0 bits. Each rotate wire is connected to one of the 32 horizontal bit lines. Finally, each horizontal bit line has an output tap, going to the datapath below. (The vertical lines are in the chip's lower M1 metal layer, while the horizontal lines are in the upper M2 metal layer. For this photo, I removed the M2 layer to show the underlying layer. Shadows of the original horizontal lines are still visible.)

Part of the alignment network.

The idea is that by selecting one set of vertical rotate lines, the 32-bit output from the prefetch queue will be rotated left by that amount. For instance, to rotate by 8, bits are sent down the "rotate 8" lines. Bit 0 from the prefetch queue will energize horizontal line 8, bit 1 will energize horizontal line 9, and so forth, with bit 31 wrapping around to horizontal line 7. Since horizontal bit line 8 is connected to output 8, the result is that bit 0 is output as bit 8, bit 1 is output as bit 9, and so forth.

The four possibilities for aligning a 32-bit value. The four bytes above are shifted as specified to produce the desired output below.

For the alignment process, one 32-bit output may be split across two 32-bit entries in the prefetch queue in four different ways, as shown above. These combinations are implemented by multiplexers and drivers. Two 32-bit multiplexers select the two relevant rows in the prefetch queue (blue and green above). Four 32-bit drivers are connected to the four sets of vertical lines, with one set of drivers activated to produce the desired shift. Each byte of each driver is wired to achieve the alignment shown above. For instance, the rotate-8 driver gets its top byte from the "green" multiplexer and the other three bytes from the "blue" multiplexer. The result is that the four bytes, split across two queue rows, are rotated to form an aligned 32-bit value.

Sign extension

The final circuit is sign extension. Suppose you want to add an 8-bit value to a 32-bit value. An unsigned 8-bit value can be extended to 32 bits by simply filling the upper bits with zeroes. But for a signed value, it's trickier. For instance, -1 is the eight-bit value 0xFF, but the 32-bit value is 0xFFFFFFFF. To convert an 8-bit signed value to 32 bits, the top 24 bits must be filled in with the top bit of the original value (which indicates the sign). In other words, for a positive value, the extra bits are filled with 0, but for a negative value, the extra bits are filled with 1. This process is called sign extension.9

In the 386, a circuit at the bottom of the prefetcher performs sign extension for values in instructions. This circuit supports extending an 8-bit value to 16 bits or 32 bits, as well as extending a 16-bit value to 32 bits. This circuit will extend a value with zeros or with the sign, depending on the instruction.

The schematic below shows one bit of this sign extension circuit. It consists of a latch on the left and right, with a multiplexer in the middle. The latches are constructed with a standard 386 circuit using a CMOS switch (see footnote).7 The multiplexer selects one of three values: the bit value from the swap network, 0 for sign extension, or 1 for sign extension. The multiplexer is constructed from a CMOS switch if the bit value is selected and two transistors for the 0 or 1 values. This circuit is replicated 32 times, although the bottom byte only has the latches, not the multiplexer, as sign extension does not modify the bottom byte.

The sign extend circuit associated with bits 31-8 from the prefetcher.

The second part of the sign extension circuitry determines if the bits should be filled with 0 or 1 and sends the control signals to the circuit above. The gates on the left determine if the sign extension bit should be a 0 or a 1. For a 16-bit sign extension, this bit comes from bit 15 of the data, while for an 8-bit sign extension, the bit comes from bit 7. The four gates on the right generate the signals to sign extend each bit, producing separate signals for the bit range 31-16 and the range 15-8.

This circuit determines which bits should be filled with 0 or 1.

The layout of this circuit on the die is somewhat unusual. Most of the prefetcher circuitry consists of 32 identical columns, one for each bit.8 The circuitry above is implemented once, using about 16 gates (buffers and inverters are not shown above). Despite this, the circuitry above is crammed into bit positions 17 through 7, creating irregularities in the layout. Moreover, the implementation of the circuitry in silicon is unusual compared to the rest of the 386. Most of the 386's circuitry uses the two metal layers for interconnection, minimizing the use of polysilicon wiring. However, the circuit above also uses long stretches of polysilicon to connect the gates.

Layout of the sign extension circuitry. This circuitry is at the bottom of the prefetch queue.

The diagram above shows the irregular layout of the sign extension circuitry amid the regular datapath circuitry that is 32 bits wide. The sign extension circuitry is shown in green; this is the circuitry described at the top of this section, repeated for each bit 31-8. The circuitry for bits 15-8 has been shifted upward, perhaps to make room for the sign extension control circuitry, indicated in red. Note that the layout of the control circuitry is completely irregular, since there is one copy of the circuitry and it has no internal structure. One consequence of this layout is the wasted space to the left and right of this circuitry block, the tan regions with no circuitry except vertical metal lines passing through. At the far right, a block of circuitry to control the latches has been wedged under bit 0. Intel's designers go to great effort to minimize the size of the processor die since a smaller die saves substantial money. This layout must have been the most efficient they could manage, but I find it aesthetically displeasing compared to the regularity of the rest of the datapath.

How instructions flow through the chip

Instructions follow a tortuous path through the 386 chip. First, the Bus Interface Unit in the upper right corner reads instructions from memory and sends them over a 32-bit bus (blue) to the prefetch unit. The prefetch unit stores the instructions in the 16-byte prefetch queue.

Instructions follow a twisting path to and from the prefetch queue.

How is an instruction executed from the prefetch queue? It turns out that there are two distinct paths. Suppose you're executing an instruction to add 12345678 to the EAX register. The prefetch queue will hold the five bytes 05 (the opcode), 78, 56, 34, and 12. The prefetch queue provides opcodes to the decoder one byte at a time over the 8-bit bus shown in red. The bus takes the lowest 8 bits from the prefetch queue's alignment network and sends this byte to a buffer (the small square at the head of the red arrow). From there, the opcode travels to the instruction decoder.10 The instruction decoder, in turn, uses large tables (PLAs) to convert the x86 instruction into a 111-bit internal format with 19 different fields.11

The data bytes of an instruction, on the other hand, go from the prefetch queue to the ALU (Arithmetic Logic Unit) through a 32-bit data bus (orange). Unlike the previous buses, this data bus is spread out, with one wire through each column of the datapath. This bus extends through the entire datapath so values can also be stored into registers. For instance, the MOV (move) instruction can store a value from an instruction (an "immediate" value) into a register.

Conclusions

The 386's prefetch queue contains about 7400 transistors, more than an Intel 8080 processor. (And this is just the queue itself; I'm ignoring the prefetch control logic.) This illustrates the rapid advance of processor technology: part of one functional unit in the 386 contains more transistors than an entire 8080 processor from 11 years earlier. And this unit is less than 3% of the entire 386 processor.

Every time I look at an x86 circuit, I see the complexity required to support backward compatibility, and I gain more understanding of why RISC became popular. The prefetcher is no exception. Much of the complexity is due to the 386's support for unaligned memory accesses, requiring a byte shift network to move bytes into 32-bit alignment. Moreover, at the other end of the instruction bus is the complicated instruction decoder that decodes intricate x86 instructions. Decoding RISC instructions is much easier.

In any case, I hope you've found this look at the prefetch circuitry interesting. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates. I've written multiple articles on the 386 previously; a good place to start might be my survey of the 368 dies.

Footnotes and references

The width of the circuitry for one bit changes a few times: while the prefetch queue and segment descriptor cache use a circuit that is 66 µm wide, the datapath circuitry is a bit tighter at 60 µm. The barrel shifter is even narrower at 54.5 µm per bit. Connecting circuits with different widths wastes space, since the wiring to connect the bits requires horizontal segments to adjust the spacing. But it also wastes space to use widths that are wider than needed. Thus, changes in the spacing are rare, where the tradeoffs make it worthwhile. ↩
The Intel 8086 processor had a six-byte prefetch queue, while the Intel 8088 (used in the original IBM PC) had a prefetch queue of just four bytes. In comparison, the 16-byte queue of the 386 seems luxurious. (Some 386 processors, however, are said to only use 12 bytes due to a bug.)

The prefetch queue assumes instructions are executed in linear order, so it doesn't help with branches or loops. If the processor encounters a branch, the prefetch queue is discarded. (In contrast, a modern cache will work even if execution jumps around.) Moreover, the prefetch queue doesn't handle self-modifying code. (It used to be common for code to change itself while executing to squeeze out extra performance.) By loading code into the prefetch queue and then modifying instructions, you could determine the size of the prefetch queue: if the old instruction was executed, it must be in the prefetch queue, but if the modified instruction was executed, it must be outside the prefetch queue. Starting with the Pentium Pro, x86 processors flush the prefetch queue if a write modifies a prefetched instruction. ↩
The prefetch unit generates "linear" addresses that must be translated to physical addresses by the paging unit (ref). ↩
I don't know which phase of the clock is phase 1 and which is phase 2, so I've assigned the numbers arbitrarily. The 386 creates four clock signals internally from a clock input CLK2 that runs at twice the processor's clock speed. The 386 generates a two-phase clock with non-overlapping phases. That is, there is a small gap between when the first phase is high and when the second phase is high. The 386's circuitry is controlled by the clock, with alternate blocks controlled by alternate phases. Since the clock phases don't overlap, this ensures that logic blocks are activated in sequence, allowing the orderly flow of data. But because the 386 uses CMOS, it also needs active-low clocks for the PMOS transistors. You might think that you could simply use the phase 1 clock as the active-low phase 2 clock and vice versa. The problem is that these clock phases overlap when used as active-low; there are times when both clock signals are low. Thus, the two clock phases must be explicitly inverted to produce the two active-low clock phases. I described the 386's clock generation circuitry in detail in this article. ↩
The Manchester carry chain is typically used in an adder, which makes it more complicated than shown here. In particular, a new carry can be generated when two 1 bits are added. Since we're looking at an incrementer, this case can be ignored.

The Manchester carry chain was first described in Parallel addition in digital computers: a new fast ‘carry’ circuit. It was developed at the University of Manchester in 1959 and used in the Atlas supercomputer. ↩
For some reason, the incrementer uses a completely different XOR circuit from the comparator, built from a multiplexer instead of logic. In the circuit below, the two CMOS switches form a multiplexer: if the first input is 1, the top switch turns on, while if the first input is a 0, the bottom switch turns on. Thus, if the first input is a 1, the second input passes through and then is inverted to form the output. But if the first input is a 0, the second input is inverted before the switch and then is inverted again to form the output. Thus, the second input is inverted if the first input is 1, which is a description of XOR.

The implementation of an XOR gate in the incrementer.

I don't see any clear reason why two different XOR circuits were used in different parts of the prefetcher. Perhaps the available space for the layout made a difference. Or maybe the different circuits have different timing or output current characteristics. Or it could just be the personal preference of the designers. ↩
The latch circuit is based on a CMOS switch (or transmission gate) and a weak inverter. Normally, the inverter loop holds the bit. However, if the CMOS switch is enabled, its output overpowers the signal from the weak inverter, forcing the inverter loop into the desired state.

The CMOS switch consists of an NMOS transistor and a PMOS transistor in parallel. By setting the top control input high and the bottom control input low, both transistors turn on, allowing the signal to pass through the switch. Conversely, by setting the top input low and the bottom input high, both transistors turn off, blocking the signal. CMOS switches are used extensively in the 386, to form multiplexers, create latches, and implement XOR. ↩
Most of the 386's control circuitry is to the right of the datapath, rather than awkwardly wedged into the datapath. So why is this circuit different? My hypothesis is that since the circuit needs the values of bit 15 and bit 7, it made sense to put the circuitry next to bits 15 and 7; if this control circuitry were off to the right, long wires would need to run from bits 15 and 7 to the circuitry. ↩
In case this post is getting tedious, I'll provide a lighter footnote on sign extension. The obvious mnemonic for a sign extension instruction is SEX, but that mnemonic was too risque for Intel. The Motorola 6809 processor (1978) used this mnemonic, as did the related 68HC12 microcontroller (1996). However, Steve Morse, architect of the 8086, stated that the sign extension instructions on the 8086 were initially named SEX but were renamed before release to the more conservative CBW and CWD (Convert Byte to Word and Convert Word to Double word).

The DEC PDP-11 was a bit contradictory. It has a sign extend instruction with the mnemonic SXT; the Jargon File claims that DEC engineers almost got SEX as the assembler mnemonic, but marketing forced the change. On the other hand, SEX was the official abbreviation for Sign Extend (see PDP-11 Conventions Manual, PDP-11 Paper Tape Software Handbook) and SEX was used in the microcode for sign extend.

RCA's CDP1802 processor (1976) may have been the first with a SEX instruction, using the mnemonic SEX for the unrelated Set X instruction. See also this Retrocomputing Stack Exchange page. ↩
It seems inconvenient to send instructions all the way across the chip from the Bus Interface Unit to the prefetch queue and then back across to the chip to the instruction decoder, which is next to the Bus Interface Unit. But this was probably the best alternative for the layout, since you can't put everything close to everything. The 32-bit datapath circuitry is on the left, organized into 32 columns. It would be nice to put the Bus Interface Unit other there too, but there isn't room, so you end up with the wide 32-bit data bus going across the chip. Sending instruction bytes across the chip is less of an impact, since the instruction bus is just 8 bits wide. ↩
See "Performance Optimizations of the 80386", Slager, Oct 1986, in Proceedings of ICCD, pages 165-168. ↩

The absurdly complicated circuitry for the 386 processor's registers

The groundbreaking Intel 386 processor (1985) was the first 32-bit processor in the x86 architecture. Like most processors, the 386 contains numerous registers; registers are a key part of a processor because they provide storage that is much faster than main memory. The register set of the 386 includes general-purpose registers, index registers, and segment selectors, as well as registers with special functions for memory management and operating system implementation. In this blog post, I look at the silicon die of the 386 and explain how the processor implements its main registers.

It turns out that the circuitry that implements the 386's registers is much more complicated than one would expect. For the 30 registers that I examine, instead of using a standard circuit, the 386 uses six different circuits, each one optimized for the particular characteristics of the register. For some registers, Intel squeezes register cells together to double the storage capacity. Other registers support accesses of 8, 16, or 32 bits at a time. Much of the register file is "triple-ported", allowing two registers to be read simultaneously while a value is written to a third register. Finally, I was surprised to find that registers don't store bits in order: the lower 16 bits of each register are interleaved, while the upper 16 bits are stored linearly.

The photo below shows the 386's shiny fingernail-sized silicon die under a special metallurgical microscope. I've labeled the main functional blocks. For this post, the Data Unit in the lower left quadrant of the chip is the relevant component. It consists of the 32-bit arithmetic logic unit (ALU) along with the processor's main register bank (highlighted in red at the bottom). The circuitry, called the datapath, can be viewed as the heart of the processor.

This die photo of the 386 shows the location of the registers. Click this image (or any other) for a larger version.

The datapath is built with a regular structure: each register or ALU functional unit is a horizontal stripe of circuitry, forming the horizontal bands visible in the image. For the most part, this circuitry consists of a carefully optimized circuit copied 32 times, once for each bit of the processor. Each circuit for one bit is exactly the same width—60 µm—so the functional blocks can be stacked together like microscopic LEGO bricks. To link these circuits, metal bus lines run vertically through the datapath in groups of 32, allowing data to flow up and down through the blocks. Meanwhile, control lines run horizontally, enabling ALU operations or register reads and writes; the irregular circuitry on the right side of the Data Unit produces the signals for these control lines, activating the appropriate control lines for each instruction.

The datapath is highly structured to maximize performance while minimizing its area on the die. Below, I'll look at how the registers are implemented according to this structure.

The 386's registers

A processor's registers are one of the most visible features of the processor architecture. The 386 processor contains 16 registers for use by application programmers, a small number by modern standards, but large enough for the time. The diagram below shows the eight 32-bit general-purpose registers. At the top are four registers called EAX, EBX, ECX, and EDX. Although these registers are 32-bit registers, they can also be treated as 16 or 8-bit registers for backward compatibility with earlier processors. For instance, the lower half of EAX can be accessed as the 16-bit register AX, while the bottom byte of EAX can be accessed as the 8-bit register AL. Moreover, bits 15-8 can also be accessed as an 8-bit register called AH. In other words, there are four different ways to access the EAX register, and similarly for the other three registers. As will be seen, these features complicate the implementation of the register set.

The general purpose registers in the 386. From 80386 Programmer's Reference Manual, page 2-8.

The bottom half of the diagram shows that the 32-bit EBP, ESI, EDI, and ESP registers can also be treated as 16-bit registers BP, SI, DI, and SP. Unlike the previous registers, these ones cannot be treated as 8-bit registers. The 386 also has six segment registers that define the start of memory segments; these are 16-bit registers. The 16 application registers are rounded out by the status flags and instruction pointer (EIP); they are viewed as 32-bit registers, but their implementation is more complicated. The 386 also has numerous registers for operating system programming, but I won't discuss them here, since they are likely in other parts of the chip.1 Finally, the 386 has numerous temporary registers that are not visible to the programmer but are used by the microcode to perform complex instructions.

The 6T and 8T static RAM cells

The 386's registers are implemented with static RAM cells, a circuit that can hold one bit. These cells are arranged into a grid to provide multiple registers. Static RAM can be contrasted with the dynamic RAM that computers use for their main memory: dynamic RAM holds each bit in a tiny capacitor, while static RAM uses a faster but larger and more complicated circuit. Since main memory holds gigabytes of data, it uses dynamic RAM to provide dense and inexpensive storage. But the tradeoffs are different for registers: the storage capacity is small, but speed is of the essence. Thus, registers use the static RAM circuit that I'll explain below.

The concept behind a static RAM cell is to connect two inverters into a loop. If an inverter has a "0" as input, it will output a "1", and vice versa. Thus, the inverter loop will be stable, with one inverter on and one inverter off, and each inverter supporting the other. Depending on which inverter is on, the circuit stores a 0 or a 1, as shown below. Thus, the pair of inverters provides one bit of memory.

Two inverters in a loop can store a 0 or a 1.

To be useful, however, the inverter loop needs a way to store a bit into it, as well as a way to read out the stored bit. To write a new value into the circuit, two signals are fed in, forcing the inverters to the desired new values. One inverter receives the new bit value, while the other inverter receives the complemented bit value. This may seem like a brute-force way to update the bit, but it works. The trick is that the inverters in the cell are small and weak, while the input signals are higher current, able to overpower the inverters.2 These signals are fed in through wiring called "bitlines"; the bitlines can also be used to read the value stored in the cell.

By adding two pass transistors to the circuit, the cell can be read and written.

To control access to the register, the bitlines are connected to the inverters through pass transistors, which act as switches to control access to the inverter loop.3 When the pass transistors are on, the signals on the write lines can pass through to the inverters. But when the pass transistors are off, the inverters are isolated from the write lines. The pass transistors are turned on by a control signal, called a "wordline" since it controls access to a word of storage in the register. Since each inverter is constructed from two transistors, the circuit above consists of six transistors—thus this circuit is called a "6T" cell.

The 6T cell uses the same bitlines for reading and writing, so you can't read and write to registers simultaneously. But adding two transistors creates an "8T" circuit that lets you read from one register and write to another register at the same time. (In technical terms, the register file is two-ported.) In the 8T schematic below, the two additional transistors (G and H) are used for reading. Transistor G buffers the cell's value; it turns on if the inverter output is high, pulling the read output bitline low.4 Transistor H is a pass transistor that blocks this signal until a read is performed on this register; it is controlled by a read wordline. Note that there are two bitlines for writing (as before) along with one bitline for reading.

Schematic of a storage cell. Each transistor is labeled with a letter.

To construct registers (or memory), a grid is constructed from these cells. Each row corresponds to a register, while each column corresponds to a bit position. The horizontal lines are the wordlines, selecting which word to access, while the vertical lines are the bitlines, passing bits in or out of the registers. For a write, the vertical bitlines provide the 32 bits (along with their complements). For a read, the vertical bitlines receive the 32 bits from the register. A wordline is activated to read or write the selected register. To summarize: each row is a register, data flows vertically, and control signals flow horizontally.

Static memory cells (8T) organized into a grid.

Six register circuits in the 386

The die photo below zooms in on the register circuitry in the lower left corner of the 386 processor. You can see the arrangement of storage cells into a grid, but note that the pattern changes from row to row. This circuitry implements 30 registers: 22 of the registers hold 32 bits, while the bottom ones are 16-bit registers. By studying the die, I determined that there are six different register circuits, which I've arbitrarily labeled (a) to (f). In this section, I'll describe these six types of registers.

The 386's main register bank, at the bottom of the datapath. The numbers show how many bits of the register can be accessed.

I'll start at the bottom with the simplest circuit: eight 16-bit registers that I'm calling type (f). You can see a "notch" on the left side of the register file because these registers are half the width of the other registers (16 bits versus 32 bits). These registers are implemented with the 8T circuit described earlier, making them dual ported: one register can be read while another register is written. As described earlier, three vertical bus lines pass through each bit: one bitline for reading and two bitlines (with opposite polarity) for writing. Each register has two control lines (wordlines): one to select a register for reading and another to select a register for writing.

The photo below shows how four cells of type (f) are implemented on the chip. In this image, the chip's two metal layers have been removed along with most of the polysilicon wiring, showing the underlying silicon. The dark outlines indicate regions of doped silicon, while the stripes across the doped region correspond to transistor gates. I've labeled each transistor with a letter corresponding to the earlier schematic. Observe that the layout of the bottom half is a mirrored copy of the upper half, saving a bit of space. The left and right sides are approximately mirrored; the irregular shape allows separate read and wite wordlines to control the left and right halves without colliding.

Four memory cells of type (f), separated by dotted lines. The small irregular squares are remnants of polysilicon that weren't fully removed.

The 386's register file and datapath are designed with 60 µm of width assigned to each bit. However, the register circuit above is unusual: the image above is 60 µm wide but there are two register cells side-by-side. That is, the circuit crams two bits in 60 µm of width, rather than one. Thus, this dense layout implements two registers per row (with interleaved bits), providing twice the density of the other register circuits.

If you're curious to know how the transistors above are connected, the schematic below shows how the physical arrangement of the transistors above corresponds to two of the 8T memory cells described earlier. Since the 386 has two overlapping layers of metal, it is very hard to interpret a die photo with the metal layers. But see my earlier article if you want these photos.

Schematic of two static cells in the 386, labeled "R" and "L" for "right" and "left". The schematic approximately matches the physical layout.

Above the type (f) registers are 10 registers of type (e), occupying five rows of cells. These registers are the same 8T implementation as before, but these registers are 32 bits wide instead of 16. Thus, the register takes up the full width of the datapath, unlike the previous registers. As before, the double-density circuit implements two registers per row. The silicon layout is identical (apart from being 32 bits wide instead of 16), so I'm not including a photo.

Above those registers are four (d) registers, which are more complex. They are triple-ported registers, so one register can be written while two other registers are read. (This is useful for ALU operations, for instance, since two values can be added and the result written back at the same time.) To support reading a second register, another vertical bus line is added for each bit. Each cell has two more transistors to connect the cell to the new bitline. Another wordline controls the additional read path. Since each cell has two more transistors, there are 10 transistors in total and the circuit is called 10T.

Four cells of type (d). The striped green regions are the remnants of oxide layers that weren't completely removed, and can be ignored.

The diagram above shows four memory cells of type (d). Each of these cells takes the full 60 µm of width, unlike the previous double-density cells. The cells are mirrored horizontally and vertically; this increases the density slightly since power lines can be shared between cells. I've labeled the transistors A through H as before, as well as the two additional transistors I and J for the second read line. The circuit is the same as before, except for the two additional transistors, but the silicon layout is significantly different.

Each of the (d) registers has five control lines. Two control lines select a register for reading, connecting the register to one of the two vertical read buses. The three write lines allow parts of the register to be written independently: the top 16 bits, the next 8 bits, or the bottom 8 bits. This is required by the x86 architecture, where a 32-bit register such as EAX can also be accessed as the 16-bit AX register, the 8-bit AH register, or the 8-bit AL register. Note that reading part of a register doesn't require separate control lines: the register provides all 32 bits and the reading circuit can ignore the bits it doesn't want.

Proceeding upward, the three (c) registers have a similar 10T implementation. These registers, however, do not support partial writes so all 32 bits must be written at once. As a result, these registers only require three control lines (two for reads and one for writes). With fewer control lines, the cells can be fit into less vertical space, so the layout is slightly more compact than the previous type (d) cells. The diagram below shows four type (c) rows above two type (d) rows. Although the cells have the same ten transistors, they have been shifted around somewhat.

Four rows of type (c) above two cells of type (d).

Next are the four (b) registers, which support 16-bit writes and 32-bit writes (but not 8-bit writes). Thus, these registers have four control lines (two for reads and two for writes). The cells take slightly more vertical space than the (c) cells due to the additional control line, but the layout is almost identical.

Finally, the (a) register at the top has an unusual feature: it can receive a copy of the value in the register just below it. This value is copied directly between the registers, without using the read or write buses. This register has 3 control lines: one for read, one for write, and one for copying.

A cell of type (a), which can copy the value in the cell of type (b) below.

The diagram above shows a cell of type (a) above a cell of type (b). The cell of type (a) is based on the standard 8T circuit, but with six additional transistors to copy the value of the cell below. Specifically, two inverters buffer the output from cell (b), one inverter for each side of the cell. These inverters are implemented with transistors I1 through I4.5 Two transistors, S1 and S2, act as a pass-transistor switches between these inverters and the memory cell. When activated by the control line, the switch transistors allow the inverters to overwrite the memory cell with the contents of the cell below. Note that cell (a) takes considerably more vertical space because of the extra transistors.

Speculation on the physical layout of the registers

I haven't determined the mapping between the 386's registers and the 30 physical registers, but I can speculate. First, the 386 has four registers that can be accessed as 8, 16, or 32-bit registers: EAX, EBX, ECX, and EDX. These must map onto the (d) registers, which support these access patterns.

The four index registers (ESP, EBP, ESI, and EDI) can be used as 32-bit registers or 16-bit registers, matching the four (b) registers with the same properties. Which one of these registers can be copied to the type (a) register? Maybe the stack pointer (ESP) is copied as part of interrupt handling.

The register file has eight 16-bit registers, type (f). Since there are six 16-bit segment registers in the 386, I suspect the 16-bit registers are the segment registers and two additional registers. The LOADALL instruction gives some clues, suggesting that the two additional 16-bit registers are LDT (Local Descriptor Table register) and TR (Task Register). Moreover, LOADALL handles 10 temporary registers, matching the 10 registers of type (e) near the bottom of the register file. The three 32-bit registers of type (c) may be the CR0 control register and the DR6 and DR7 debug registers.

The six 16-bit segment registers in the 386.

In this article, I'm only looking at the main register file in the datapath. The 386 presumably has other registers scattered around the chip for various purposes. For instance, the Segment Descriptor Cache contains multiple registers similar to type (e), probably holding cache entries. The processor status flags and the instruction pointer (EIP) may not be implemented as discrete registers.6

To the right of the register file, a complicated block of circuitry uses seven-bit values to select registers. Two values select the registers (or constants) to read, while a third value selects the register to write. I'm currently analyzing this circuitry, which should provide more insight into how the physical registers are assigned.

The shuffle network

There's one additional complication in the register layout. As mentioned earlier, the bottom 16 bits of the main registers can be treated as two 8-bit registers.7 For example, the 8-bit AH and AL registers form the bottom 16 bits of the EAX register. I explained earlier how the registers use multiple write control lines to allow these different parts of the register to be updated separately. However, there is also a layout problem.

To see the problem, suppose you perform an 8-bit ALU operation on the AH register, which is bits 15-8 of the EAX register. These bits must be shifted down to positions 7-0 so they can take part in the ALU operation, and then must be shifted back to positions 15-8 when stored into AH. On the other hand, if you perform an ALU operation on AL (bits 7-0 of EAX), the bits are already in position and don't need to be shifted.

To support the shifting required for 8-bit register operations, the 386's register file physically interleaves the bits of the two lower bytes (but not the high bytes). As a result, bit 0 of AL is next to bit 0 of AH in the register file, and so forth. This allows multiplexers to easily select bits from AH or AL as needed. In other words, each bit of AH and AL is in almost the correct physical position, so an 8-bit shift is not required. (If the bits were in order, each multiplexer would need to be connected to bits that are separated by eight positions, requiring inconvenient wiring.)8

The shuffle network above the register file interleaves the bottom 16 bits.

The photo above shows the shuffle network. Each bit has three bus lines associated with it: two for reads and one for writes, and these all get shuffled. On the left, the lines for the 16 bits pass straight through. On the right, though, the two bytes are interleaved. This shuffle network is located below the ALU and above the register file, so data words are shuffled when stored in the register file and then unshuffled when read from the register file.9

In the photo, the lines on the left aren't quite straight. The reason is that the circuitry above is narrower than the circuitry below. For the most part, each functional block in the datapath is constructed with the same width (60 µm) for each bit. This makes the layout simpler since functional blocks can be stacked on top of each other and the vertical bus wiring can pass straight through. However, the circuitry above the registers (for the barrel shifter) is about 10% narrower (54.5 µm), so the wiring needs to squeeze in and then expand back out.10 There's a tradeoff of requiring more space for this wiring versus the space saved by making the barrel shifter narrower and Intel must have considered the tradeoff worthwhile. (My hypothesis is that since the shuffle network required additional wiring to shuffle the bits, it didn't take up more space to squeeze the wiring together at the same time.)

Conclusions

If you look in a book on processor design, you'll find a description of how registers can be created from static memory cells. However, the 386 illustrates that the implementation in a real processor is considerably more complicated. Instead of using one circuit, Intel used six different circuits for the registers in the 386.

The 386's register circuitry also shows the curse of backward compatibility. The x86 architecture supports 8-bit register accesses for compatibility with processors dating back to 1971. This compatibility requires additional circuitry such as the shuffle network and interleaved registers. Looking at the circuitry of x86 processors makes me appreciate some of the advantages of RISC processors, which avoid much of the ad hoc circuitry of x86 processors.

If you want more information about how the 386's memory cells were implemented, I wrote a lower-level article earlier. I plan to write more about the 386, so follow me on Bluesky (@righto.com) or RSS for updates.

Footnotes and references

The 386 has multiple registers that are only relevant to operating systems programmers (see Chapter 4 of the 386 Programmer's Reference Manual). These include the Global Descriptor Table Register (GDTR), Local Descriptor Table Register (LDTR), Interrupt Descriptor Table Register (IDTR), and Task Register (TR). There are four Control Registers CR0-CR3; CR0 controls coprocessor usage, paging, and a few other things. The six Debug Registers for hardware breakpoints are named DR0-DR3, DR6, and DR7. The two Test Registers for TLB testing are named TR6 and TR7. I expect that these registers are in the 386's Segment Unit and Paging Unit, rather than part of the processing datapath. ↩
Typically the write driver circuit generates a strong low on one of the bitlines, flipping the corresponding inverter to a high output. As soon as one inverter flips, it will force the other inverter into the right state. To support this, the pullup transistors in the inverters are weaker than normal. ↩
The pass transistor passes its signal through or blocks it. In CMOS, this is usually implemented with a transmission gate with an NMOS and a PMOS transistor in parallel. The cell uses only the NMOS transistor, which is much worse at passing a high signal than a low signal. Because there is one NMOS pass transistor on each side of the inverters, one of the transistors will be passing a low signal that will flip the state. ↩
The bitline is typically precharged to a high level for a read, and then the cell pulls the line low for a 0. This is more compact than including circuitry in each cell to pull the line high. ↩
Note that buffering is needed so the (b) cell can write to the (a) cell. If the cells were connected directly, cell (a) could overwrite cell (b) as easily as cell (b) could overwrite cell (a). With the inverters in between, cell (b) won't be affected by cell (a). ↩
In the 8086, the processor status flags are not stored as a physical register, but instead consist of flip-flops scattered throughout the chip (details). The 386 probably has a similar implementation for the flags.

In the 8086, the program counter (instruction pointer) does not exist as such. Instead, the instruction prefetch circuitry has a register holding the current prefetch address. If the program counter address is required (to push a return address or to perform a relative branch, for instance), the program counter value is derived from the prefetch address. If the 386 is similar, the program counter won't have a physical register in the register file. ↩
The x86 architecture combines two 8-bit registers to form a 16-bit register for historical reasons. The TTL-based Datapoint 2200 (1971) system had 8-bit A, B, C, D, E, H, and L registers, with the H and L registers combined to form a 16-bit indexing register for memory accesses. Intel created a microprocessor version of the Datapoint 2200's architecture, called the 8008. Intel's 8080 processor extended the register pairs so BC and DE could also be used as 16-bit registers. The 8086 kept this register design, but changed the 16-bit register names to AX, BX, CX, and DX, with the 8-bit parts called AH, AL, and so forth. Thus, the unusual physical structure of the 386's register file is due to compatibility with a programmable terminal from 1971. ↩
To support 8-bit and 16-bit operations, the 8086 processor used a similar interleaving scheme with the two 8-bit halves of a register interleaved. Since the 8086 was a 16-bit processor, though, its interleaving was simpler than the 32-bit 386. Specifically, the 8086 didn't have the upper 16 bits to deal with. ↩
The 386's constant ROM is located below the shuffle network. Thus, constants are stored with the bits interleaved in order to produce the right results. (This made the ROM contents incomprehensible until I figured out the shuffling pattern, but that's a topic for another article.) ↩
The main body of the datapath (ALU, etc.) has the same 60 µm cell width as the register file. However, the datapath is slightly wider than the register file overall. The reason? The datapath has a small amount of circuitry between bits 7 and 8 and between bits 15 and 16, in order to handle 8-bit and 16-bit operations. As a result, the logical structure of the registers is visible as stripes in the physical layout of the ALU below. (These stripes are also visible in the die photo at the beginning of this article.)

Part of the ALU circuitry, displayed underneath the structure of the EAX register.

↩