Reverse-engineering the classic MK4116 16-kilobit DRAM chip

Back in the late 1970s, the most popular memory chip was Mostek's MK4116, holding a whopping (for the time) 16 kilobits. It provided storage for computers such as the Apple II, TRS-80, ZX Spectrum, Commodore PET, IBM PC, and Xerox Alto as well as video games such as Defender and Missile Command. To see how the chip is implemented I opened one up and reverse-engineered it. I expected the circuitry to be similar to other chips of the era, using standard NMOS gates, but it was much more complex than I expected, built from low-power dynamic logic. The MK4116 also used advanced manufacturing processes to fit 16,384 high-density memory cells on the chip.12

I created the die photo below from multiple microscope images. The white lines are the metal wiring on top of the chip, while the silicon underneath appears dark red. The two large rectangular regions are the 16,384 memory cells, arranged as a 128×128 matrix split in two. In between the two memory arrays are the amplifiers and selection circuits. The control and interface circuitry is at the left and right, connected to the external pins via tiny bond wires.

Die photo of the 4116 memory chip. Click for a larger image.

Die photo of the 4116 memory chip. Click for a larger image.

In dynamic RAM, each bit is stored in a capacitor with the bit's value, 0 or 1, represented by the voltage on the capacitor.3 The advantage of dynamic RAM is that each memory cell is very small, so a lot of data can be stored on one chip.4 The downside of dynamic RAM is that the charge on a capacitor leaks away after a few milliseconds. To avoid losing data, dynamic RAM must be constantly refreshed: bits are read from the capacitors, amplified, and then written back to the capacitors. For the MK4116, all the data must be refreshed every two milliseconds.

The diagram below illustrates four of the 16,384 memory cells. Each memory cell has a capacitor, along with a transistor that connects the capacitor to the associated bit line. To read or write data, a row select line is energized, turning on the transistors in that row. The row's capacitors are connected to the bit lines, allowing the bits in that row to be accessed.

Structure of the memory cells, based on the patent.

Structure of the memory cells, based on the patent.

One of Mostek's key innovations was to multiplex the address pins.6 Earlier memory chips used a separate pin for each address bit; as memory sizes increased, so did the number of address pins. This forced Intel's 4096-bit memory chip, for instance, to use a large, more costly 22-pin package.5 Mostek cut the number of address pins in half by using each address pin twice, first for a "row" address, and then a "column" address. This approach became the industry standard, allowing memory chips to fit into inexpensive 16-pin packages.

Externally, the chip stores a single bit for 16,384 different addresses. (Typically, eight of these chips were used in parallel to store bytes.) Internally, however, the chip is implemented as a 128×128 matrix of storage cells. The row address selects a row of 128 cells7 and then the column address selects one of these 128 cells to read or write.8 Meanwhile, the entire row of 128 cells is refreshed by amplifying the signals and storing them back in the capacitors.

The 4116 die with key blocks labeled. Most of the memory cell area has been cut out.

The 4116 die with key blocks labeled. Most of the memory cell area has been cut out.

The die image above is labeled with the main functional blocks.9 The chip's 16 pins are labeled around the perimeter,10 including the seven address pins (A0-A6). The Row Address Strobe pin (RAS) is used to indicate the row address is ready, while the Column Address Strobe pin (CAS) indicates that the column address is ready. The two memory arrays are in the center; I've cut out most of the cells to keep the diagram compact. The column select circuitry and sense amplifiers are between the two memory arrays. At the right, the row decode circuitry selects a row based on the address pins, while the column address circuitry buffers the address for the column select circuitry. At the left, the clock circuits generate the chip's timing pulses, triggered by the RAS, CAS, and WRITE pins. Finally, the Data Out and Data In pins provide access to the selected data bit.

Memory cell structure

The key to the DRAM chip is the memory storage cell, designed to be as compact as possible. The highly magnified photo below shows some of the storage cells, densely packed together. It's a bit hard to visualize what's going on because the chip is constructed from multiple layers. The bottom layer is the grayish silicon die. On top of the silicon are two layers of polysilicon, a special type of deposited silicon used for transistor gates, capacitors, and wiring. The top layer of the chip is the metal wiring, which was removed for this photo. The photo shows three bit lines in the silicon, with bulb-shaped storage cells connected on either side. Vertical strips of polysilicon (poly 1) over the storage cells implement capacitors: the silicon forms the lower plate, while the polysilicon forms the upper plate. The second layer of polysilicon (poly 2) is arranged in diagonal regions to implement the selection transistors, where square notches in the poly 1 layer allow the poly 2 layer to approach the silicon.

A closeup of the memory chip under the microscope, showing individual storage cells.

A closeup of the memory chip under the microscope, showing individual storage cells.

The cross-section diagram below shows the three-dimensional, layered structure of a memory cell. At the bottom is the silicon (brown); the bit line (dark brown) is made from doped silicon. Above the silicon are the two polysilicon layers (red) and the metal layer (purple), separated by insulating silicon dioxide (gray). At the far left, the poly 1 layer and underlying silicon form a capacitor. In between the capacitor and the bit line, the poly 2 layer forms the gate of the transistor. At the left, the poly 2 layer is connected to the metal of the word line, which turns the transistor on, connecting the capacitor to the bit line.

Cross-section structure of a storage cell. Based on 16K—The new generation dynamic RAM.

Cross-section structure of a storage cell. Based on 16K—The new generation dynamic RAM.

The diagram below illustrates how bits are addressed in the storage matrix. The arrangement is somewhat confusing because columns of cells are offset and interlocked like zippers. A row select line is connected to the centers of diagonal poly 2 regions, so each region controls two transistors on neighboring bit lines. (For instance, in the upper left, the poly region connected to row select 0 forms transistors 0A and 0B.) The result is that each row select line activates 128 cells, one for each bit line in a staggered arrangement.

Arrangement of bits in the matrix. The transistors are labeled according to their corresponding row and bit line.

Arrangement of bits in the matrix. The transistors are labeled according to their corresponding row and bit line.

Low-power circuitry

A key feature of the MK4116 memory chip is that it uses almost no power when it is sitting idle. Although it consumes 462 milliwatts when active, it uses just 20 milliwatts in standby mode. Although low-power circuitry is straightforward to build with modern CMOS technology, the 4116 used earlier NMOS transistors. Most NMOS integrated circuits constructed logic gates with load transistors, a simple technique with the disadvantage of wasting power. Instead, the MK4116 memory chip uses dynamic logic, which is considerably more complex but saves power while idle.

A typical dynamic logic gate (below) operates in two phases. In the first phase, a clock signal turns on the upper transistor, precharging the output to +12 volts, the "1" state. The upper transistor then turns off, but the output remains high due to the capacitance of the wire. In the second phase, the lower transistors can pull the output low. In particular, if either input is 1, the corresponding transistor turns on and pulls the output low, so the circuit implements a NOR gate. This circuit doesn't consume any static power, just a small current to charge the wire capacitance when switching. (The inputs must be carefully timed so they don't overlap with the precharge clock.) The use of dynamic circuitry makes the 4116 much more complex than it would be otherwise since the gates are controlled by clock signals, which need to be generated.

A NOR gate using dynamic logic.

A NOR gate using dynamic logic.

The row select circuitry

The purpose of the row-select circuitry is to decode the 7 address bits and energize the corresponding row select line (out of 128) to read one row of memory. In the first step, 32 5-input NOR gates decode address bits A0 through A4. These NOR gates are implemented in the compact circuit shown below. Each NOR gate takes a different combination of non-inverted and inverted address bits and matches a particular 5-bit address. These NOR gates use dynamic logic, first pulled high and then discharged to ground, except for the selected address which remains high. Next, each NOR output is split into four, based on A5 and A6. The result is that one of 128 row select lines is activated, turning on the transistors for that row in the matrix.

The NOR gates are implemented in several compact blocks; one block of three NOR gates is shown below. Each NOR gate is a horizontal stripe of doped silicon, with ground above and below it. Each NOR gate has transistors (pink stripes) connected to ground alternating above and below it. A transistor will pull the NOR gate low if the connected address line is high. The precharge transistors at the left pull the NOR gates to +12 volts, while the output control transistors control the flow of the decoded outputs to the rest of the circuitry.

Three NOR gates in the row decoder. The vertical yellow strips indicate metal wiring, removed for this photo.

Three NOR gates in the row decoder. The vertical yellow strips indicate metal wiring, removed for this photo.

The small greenish blobs at the end of a transistor gate (pink stripe) are connections (vias) between a transistor gate and an address line. The address lines are represented as vertical yellow stripes (since the metal layer was removed). Note that each transistor gate has an address line at the right and the inverted address line at the left; thus, the NOR gates all have the same basic layout, but with the contacts changed to match a particular address. For instance, the upper NOR gate has transistors connected to A0, A2, A1, A3, and A4, so it will be active for address 00000; any other address will pull it low.

The sense amplifiers

The sense amplifiers are one of the most challenging parts of designing a memory chip. The job of the sense amplifier is to take the tiny voltage from a capacitor and amplify it into a binary 0 or 1.11 The challenge is that even though 12 volts is stored in a capacitor, the signal from the capacitor is very small, is only 100 millivolts or so. (Because the bit line is much larger than the tiny memory cell capacitor, the capacitor causes a very small voltage swing.)12 It is critically important for the sense amplifier to operate accurately, even in the presence of noise or voltage fluctuations, because any error will corrupt the data. The sense amplifier circuit must also be compact and low power since there are 128 sense amplifiers.

The 128 sense amplifiers are in the middle of the die, between the upper and lower memory arrays.

The 128 sense amplifiers are in the middle of the die, between the upper and lower memory arrays.

The chip's 128 sense amplifiers, one for each column, are located between the two memory arrays as shown above. During a read, 128 values in a row are accessed in parallel and amplified by the sense amplifiers. These 128 values are then written back to refresh the values in the capacitor. For a write operation, one of the bits is updated with the new value before they are written back.

The sense amplifier as it appears on the die, and corresponding schematic.

The sense amplifier as it appears on the die, and corresponding schematic.

Each sense amplifier (above) is a very simple circuit. It takes two inputs and compares them, pulling the lower one to 0.13 It is built from two cross-coupled transistors, each trying to pull the other one low. Whichever transistor has the higher voltage to start with will "win", forcing the other side low.14 The sense amplifier is sensitive to very small voltage differentials, allowing it to distinguish the small signals from a storage cell.

Locating the sense amplifiers between the two memory arrays isn't arbitrary, but the key to their operation: this is the "divided bit line" architecture introduced in 1972. The idea is that one input to the sense amp is the voltage from the desired memory cell, while the other input is a threshold voltage from a "dummy cell" in the opposite memory array. Dummy cells are constructed and precharged like real memory cells except the capacitor is half-sized, so they provide a voltage midway between a 0 bit and a 1 bit.3 If the voltage from the real memory cell is lower, the sense amp outputs a 0, and if higher, it outputs a 1.

Dummy cells provide the threshold voltage for deciding if a bit is 0 or 1. The dummy cells are located at the top and bottom of the memory arrays. They are on the same bit lines as real memory cells.

Dummy cells provide the threshold voltage for deciding if a bit is 0 or 1. The dummy cells are located at the top and bottom of the memory arrays. They are on the same bit lines as real memory cells.

The dummy cells are located on the edges of the memory arrays, as shown above. They consist of capacitors and transistors (similar to real memory cells), but with a separate line to charge them. The advantage of the dummy cell approach is that manufacturing differences or fluctuations during operation will (hopefully) affect the real cells and dummy cells equally, so the voltage from the dummy cell will remain at the correct level to distinguish beween a 0 and a 1. Address bit A0 controls which half of the array provides real data to the bit lines and which half connects dummy cells to the bit lines.

The column select circuitry

The purpose of the column select circuitry is to select one column out of the 128-bit row; this is the bit that is read or written. Each column select circuit is twice as wide as a memory cell, so they only decode one of 64 columns. The result is that two bits are selected at a time, and circuitry elsewhere selects one of the two bits. Like the row select circuitry, the column select circuitry is implemented by numerous NOR gates, each matching one address. For column select address bits A0 through A5 select one of 64 lines, selecting two columns at a time. These two bit lines are connected to data lines transmitting the signals to the I/O circuitry. (Since the bit lines for the upper and lower halves of the matrix are separate, there are actually four bit lines selected by the column select circuit.) As with the row select circuitry, dynamic logic is used, controlled by various timing signals. Note that each NOR gate is physically split into two parts with the sense amp in the middle.

Footnote 15

Five of the column decoders, with one highlighted.

Five of the column decoders, with one highlighted.

The schematic below shows how the column decoder works with the sense amplifier. The diagram shows two bit lines and the top half of the column decoder and sense circuitry; it is mirrored for the lower array. At the top, the sense precharge circuit pulls all the bit lines high. At the bottom, the sense amplifiers amplify and refresh the signals as explained above. The column decoder matches a particular 6-bit address, so one of the 64 decoders will activate the associated sense select circuit, connecting the chip's I/O circuitry to four bit lines (two from the upper memory array as shown here and two from the lower memory array).

Schematic of half the column decoder and sense amplifier.

Schematic of half the column decoder and sense amplifier.

At this point, four bit lines have been selected for use and their signals are passed to the input/output circuitry; the column select circuitry only decoded 1-of-64, while there are 128 columns, and each half of the array has separate bit lines. Column address bit A6 provides the final selection between the two columns. The selected bit is sent to the data-out pin for a read. For a write, the value on the data-in pin is sent back through the appropriate wire to overwrite the value in the sense amplifier. This circuitry is implemented using dynamic logic and latches, controlled by various timing signals. Much of the circuitry is duplicated, with one copy for the upper half of the memory array and one copy for the lower half. Row address bit A0 distinguishes which half of the matrix is active and which half is providing dummy data). (Note that row address bit A0 was already used to select a particular row, but the circuitry has "lost track" of which was the real row and which was the dummy row, so it must make the selection again.)

Clock generation

The chip requires many timing signals for the various steps in a memory operations. The memory chip doesn't use an external clock, unlike a CPU, but generates its own timing signals internally. The diagram below illustrates the clock generators, using buffers to create a delay between each successive clock output. The first set of timing signals is triggered by the row-access strobe (RAS), indicating that the computer has put the row address on the address pins. The next set of timing signals is triggered by the column-access strobe (CAS), indicating the column address is on the address pins. Other timing signals are triggered by the WRITE pin.

Conceptual diagram of the clock generation, from
16K—The new generation dynamic RAM.

Conceptual diagram of the clock generation, from 16K—The new generation dynamic RAM.

The real clock circuitry is much more complex than the diagram indicates, consisting of dozens of transistors in multiple chains, feeding back in complex ways to shape the pulses. (Among other things, using dynamic logic requires each buffer to have both an input that pulls it high and an input that pulls it low, forming almost a circular problem.) These gates are mostly built from large transistors, as shown below, to provide enough current to drive the circuitry, and to increase the gate delay sufficiently. The clock circuitry also uses many capacitors, probably bootstrap loads to pull signals up sharper. I'm not going to describe the clocks in detail since it's a complicated mess.

A small part of the clock circuitry.

A small part of the clock circuitry.

Input pins

The chip uses surprisingly complex circuits for the address pins and the data input pin. Mostek's earlier memory chip had problems due to noise margins on the inputs, so the MK4116 uses a complex circuit with an analog threshold, capacitor drive, and multiple controls and latches.

The diagram below shows the threshold generation circuit, which generates a 1.5-volt reference. It uses many tiny transistors in series to generate the voltage level. Conceptually, it is similar to a resistor divider between power and ground to produce an output voltage. However, resistors are both power-hungry and difficult to build in integrated circuits, so transistors are used instead. Since this circuit is always active, the designers needed to minimize its current; this was achieved by using many transistors in series.

The voltage reference used for the address pins. (×18 indicates 18 transistors in series, for instance.)

The voltage reference used for the address pins. (×18 indicates 18 transistors in series, for instance.)

The voltage on the input pin and the threshold voltage are fed into a differential amplifier/comparator, conceptually similar to the sense amplifiers. Each side tries to pull the other side low, ending up with a 1 for the "winning" side and 0 for the "losing" side. Thus, the input is converted into a binary value. The result from the comparator is stored in a latch. Multiple timing signals gate the input signal, precharge the circuitry, and control the latch.

The data-in circuitry: the pin, latch circuit, and voltage reference. This circuitry is in the lower-left corner of the die.

The data-in circuitry: the pin, latch circuit, and voltage reference. This circuitry is in the lower-left corner of the die.

The photo above shows the input circuit for the data-in pin. Next to the pin's bond wire is the threshold circuit and latch; the two capacitors are the large rectangles of metal. The voltage reference circuit is next; the data-in voltage reference is similar to the address voltage reference described above. (I left the metal layer on for this photo; the polysilicon and silicon underneath is obscured by the oxide layer.)

Conclusion

This memory chip was much more complex than I expected. I studied a simple Intel memory chip earlier so I assumed this DRAM would be larger but not much more complicated. Instead, the MK4116 has complex circuitry with over 1000 transistors controlling it, in addition to the 16,384 transistors for the memory cells and about 1500 transistors for the column selects and sense amps. A cause of the complexity is that the design needed to optimize multiple axes: density, speed, and power efficiency.16

The table below shows that each generation of DRAM chips required substantial technological changes and new developments. Memory designers don't just sit around waiting for Moore's Law to increase the memory capacity; they have to constantly develop new techniques because DRAM storage cells are fundamentally analog. Fortunately, DRAM designers have continued to solve memory scaling problems; 16-gigabit DRAMs recently went into production, an amazing factor of a million larger than the 16-kilobit MK4116 DRAM chip of 1976.

DRAM cell evolution from 4 kilobits to 16 megabits. From Impact of Processing Technology on DRAM Sense Amplifier Design (Gealow, 1990).

DRAM cell evolution from 4 kilobits to 16 megabits. From Impact of Processing Technology on DRAM Sense Amplifier Design (Gealow, 1990).

I announce my latest blog posts on Twitter, so follow me @kenshirriff or my RSS feed. Thanks to Mike Braden for suggesting the MK4332 chip to me.

Notes and references

  1. A brief history of memory innovations is here. For detailed information on DRAM circuits, see this 1990 thesis on sense amplifier design. For history, Storage array and sense/refresh circuit for single-transistor memory cells (1972) introduced the concepts of dummy cells and cross-coupled sense amplifiers. Intel's chip is discussed in A 16 384-Bit Dynamic RAM (1976) while Mostek's chip is discussed in A 16K × 1 bit dynamic RAM (1977) and 16K - The new generation dynamic RAM (1977). Inconveniently, I found most of these references after I had this blog post nearly completed. 

  2. An unusual characteristic of the chip is that it doesn't use "buried contacts". The issue is how to connect a polysilicon wire to a silicon circuit. In integrated circuits of the 1960s, polysilicon couldn't be connected to silicon directly, so a via connected the polysilicon wire to the metal layer, which had a short connection to a second via that connected down to the silicon. In 1968 at Fairchild, Federico Faggin invented the buried contact, a way to connect the polysilicon and silicon directly. This was much more convenient, so all the NMOS chips that I have examined use buried contacts.

    However, the 4116 doesn't use buried contacts. Instead, it uses the obsolete connections through the metal layer. It's a mystery why they did this. Perhaps the metal wiring density was low enough that the additional segments weren't a problem and they could eliminate one masking and processing step. (Another theory is maybe there were patent issues, but I'm not aware of any patent on the buried contact.) But this illustrates that technological progress isn't consistently linear. Even an advanced chip like the 4116 can use obsolete techniques in some areas. 

  3. In the MK4116, a 0 bit is represented by storing 12 volts on the capacitor, while a 1 bit is represented by 0 volts on the capacitor. This is backward from what you might expect, but probably saved an inverter somewhere in the circuitry. To avoid confusion, I ignore this in the text. 

  4. Early dynamic RAMs such as the Intel 1103 used three transistors per cell and used separate lines for reading and writing data. Improvements in memory technology shrunk the circuit to a single transistor and a single data line. Static RAM, in comparison, often requires 6 transistors per bit, but has the advantage of not needing to be refreshed. 

  5. For example, Intel's 2107 4096-bit DRAM required 22 pins, as did the 2101 256×4 static RAM chip. It's ironic that Intel used larger packages for these memory chips because a few years earlier, Intel had steadfastly refused to go beyond 16 pins, forcing the Intel 4004 microprocessor to use a 16-pin package. The 8008 microprocessor was barely allowed 18 pins, when 24 pins would have been more convenient. This made the 8008 slower and harder to use. 

  6. Although multiplexing the address pins might seem trivial, Mostek claims that they bet the company on this idea. The problem is how to implement multiplexing without making memory accesses wait while both parts of the address are loaded. (The time to read memory was a key factor in computer design, so every nanosecond counted.) In Mostek's solution, first the row address is put on the address pins, and the row-access strobe (RAS) is activated. While the chip is reading that row from memory, the computer puts the column address on the address pins and activates the column-access strobe (CAS). By the time the 128 bits of the storage row have been read, the column address is available and the desired bit is selected from the row of 128 bits. In other words, reading of the row is overlapped with loading of the column address, so multiplexing doesn't slow the system. However, careful timing is required to make this multiplexing work; much of the chip is devoted to clock circuitry to generate the necessary timing pulses. 

  7. The RAM chip operates on memory a row at a time, and then selects one entry from the row. This isn't the obvious way to access memory. In comparison, magnetic core memory also holds memory cells (cores) in a matrix, but accesses a single cell using X and Y select lines. A key reason for a DRAM to operate a row at a time is so the entire row can be refreshed at once, dramatically reducing the performance overhead from refresh operations. 

  8. You might wonder if it's possible to read multiple bits from a row without repeating the entire row-read operation. The chip designers thought of that and provided several techniques to boost efficiency. The page-read and page-write functions let you rapidly access multiple bits in a 128-bit page (i.e. row). A read-modify-write sequence lets you read a row, modify bits in it, and write it back without repeating the row-read. A RAS-only refresh operation lets you read and refresh a row without providing a column address. The point of this is that the chip designers implemented clever features so customers could squeeze as much performance out of their memory system as possible. See the datasheet for details. 

  9. The block diagram below shows the main functional blocks of the 4116. Many parts of this block diagram didn't make sense to me until after I had reverse-engineered the chip, such as the clock generator, dummy cells, and "1 of 2 data bus select". Many datasheets present a somewhat abstracted view of how the chip operates, but the 4116 datasheet accurately matches the implementation.

    Block diagram of the 4116 memory chip, from the databook.

    Block diagram of the 4116 memory chip, from the databook.

     

  10. One inconvenient feature of the memory chip is it requires three different voltages: +12 volts, +5 volts, and -5 volts. Almost all the circuitry runs on 12 volts. The 5-volt supply is used only to provide a standard TTL voltage level for the data out pin. The -5 volts is a substrate bias, connected to the underlying silicon die to improve the characteristics of the transistors. Later chips implemented a charge pump circuit to generate the bias voltage, eliminating the need for an external bias voltage. Later memory chips also eliminated the need for +12 volts. This simplified use of the chips, since only a single-voltage power supply was required. A less-obvious benefit is that this made two of the chip's 16 pins available for other uses. Specifically, these pins were used as additional address bits in the next two generations of memory chips, the 64-kilobit and 256-kilobit chips. As a side effect, the address pins are in a somewhat scrambled order, due to the location of the available pins. 

  11. It's not a coincidence that the input to the sense amp is very small, just enough to be reliably amplified. This is a consequence of economics: if the DRAM produced a large voltage difference, the designers would shrink the cells to save money. But if the voltage difference was too small for reliability, the designers would need to increase the cells. The result is a design where the voltage difference is just barely large enough to be reliably amplified by advanced circuitry. (We noticed the same thing when using a vintage 1960s IBM core memory (video); we were just barely able to read the core values. The cause is the same: if the cores had produced nice clean pulses, they were larger than they needed to be.) 

  12. When the capacitor is connected to the bit line, the resulting voltage will depend on the relative capacitances of the capacitor and the bit line. The bit line capacitance is said to be 800 fF, while the storage cell has 40 fF capacitance, for a 20:1 ratio. Thus, the resulting voltage will be very close to the +12V precharge voltage on the bit line, but perturbed a few hundred millivolts. 

  13. The sense amplifier can only pull a signal low, not raise it, so you might wonder where the amplification happens. Both sides are precharged to +12 volts and the memory cell capacitance only pulls the sides down by 100 millivolts or so. The "winning" side will remain very close to 12 volts, while the other side is pulled to 0 by the sense amp. Thus a 1 bit is pulled higher by the precharge, while a 0 bit is pulled lower by the sense amp. 

  14. The diagram below shows the sense amplifier voltages during operation of a prototype DRAM sense amp. First, the two sides of the sense amp are precharged to the same voltage. Next, a DRAM storage node is selected on one side and a dummy node on the other. Note that the voltage difference between the two sides is very small, maybe 200 millivolts. Finally, the difference is amplified, forcing the higher side up and the lower side down. In this case, the storage node held a 1 so it started slightly higher. If it held a 0, it would start slightly lower and the two lines would diverge in opposite directions. The point is that the sense amp takes a very small voltage differential and amplifies it into a large binary signal.

    Voltage diagram for a prototype sense amp (not the 4116). Based on  Storage Array and Sense/Refresh Circuit for Single-Transistor Memory Cells, 1972.

    Voltage diagram for a prototype sense amp (not the 4116). Based on Storage Array and Sense/Refresh Circuit for Single-Transistor Memory Cells, 1972.

    One difference between this sense amp and the MK4116 is that this circuit is precharged to a midpoint voltage, while the MK4116's is precharged to +12 volts. In this sense amp, one signal must be pulled high, while in the MK4116 both signals start near +12V and one is forced low.  

  15. Robert Proebsting, co-founder of Mostek and developer of address multiplexing, has an oral history that provide some information on the 4116. He discusses why the column decoder selects one of 64 columns and the selection between the pair happens earlier. The reason is they wanted the noise from the address lines to be equal on both sides of the sense amp, so they have three address line pairs on each side.  

  16. Intel produced 16,384-bit DRAM chips before Mostek, the 2116 and others, but Mostek's chips beat Intel in the marketplace. Interestingly, the internal structure was completely different from the MK4116. The 2116 contained four memory arrays internally and was structured as two independent 8-kilobit memories. This saved on power since the unused half could be left unpowered during a memory access. Moreover, if a 2116 chip had a manufacturing flaw in one half, Intel repackaged it as an 8-kilobit 2108 chip with either the upper or lower half operational. The user had to set address bit A6 appropriately to get the working half. 

Reverse-engineering the carry-lookahead circuit in the Intel 8008 processor

The 8008 was Intel's first 8-bit microprocessor, introduced in 1972. While primitive by today's standards, the 8008 is historically important because it essentially started the microprocessor revolution and is the ancestor of the modern x86 processor family. I've been studying the 8008's silicon die under the microscope and reverse-engineering its circuitry.

The die photo below shows the main functional blocks1 including the registers, instruction decoder, and on-chip stack storage. The 8-bit arithmetic logic unit (ALU) is on the left. Above the ALU is the carry-lookahead generator, which improves performance by computing the carries for addition, before the addition takes place. It's a bit surprising to see carry lookahead implemented in such an early microprocessor. This blog post explains how the carry circuit is implemented.

The Intel 8008 die with key functional blocks labeled. Click for a larger version.

The Intel 8008 die with key functional blocks labeled. Click for a larger version.

Most of what you see in the die photo is the greenish-white wiring of the metal layer on top. Underneath the metal is polysilicon wiring, providing more connections as well as implementing transistors. The chip contains about 3500 tiny transistors, which appear as brighter yellow. The underlying silicon substrate is mostly obscured; it is purplish-gray. Around the edges of the die are 18 rectangular pads; these are connected by tiny bond wires to the external pins of the integrated circuit package (below).

An 8008 integrated circuit in an 18-pin DIP (dual inline package). The package is very scratched, but I didn't see the point in paying for mint condition for a chip I was immediately going to decap.

An 8008 integrated circuit in an 18-pin DIP (dual inline package). The package is very scratched, but I didn't see the point in paying for mint condition for a chip I was immediately going to decap.

The 8008 was sold as a small 18-pin DIP (dual inline package) integrated circuit. 18 pins is an inconveniently small number of pins for a microprocessor, but Intel was committed to small packages at the time.2 In comparison, other early microprocessors typically used 40 pins, making it much easier to connect the data bus, address bus, control signals, and power to the processor.

Addition

The heart of a processor is the arithmetic-logic unit (ALU), the functional block that performs arithmetic operations (such as addition or subtraction) and logical operations (such as AND, OR, and XOR). Addition was the most challenging operation to implement efficiently because of the need for carries.3

Consider how you add two decimal numbers such as 8888 and 1114, with long addition. Starting at the right, you add each pair of digits (8 and 4), write down the sum (2), and pass any carry (1) along to the left. In the next column, you add the pair of digits (8 and 1) along with the carry (1), writing down the sum (0) and passing the carry (1) to the next column. You repeat the process right-to-left, ending up with the result 10002. Note that you have to add each position before you can compute the next position.

Binary numbers can be added in a similar way with a circuit called a ripple-carry adder that was used in many early microprocessors. Each bit is computed by a full adder, which takes two input bits and a carry and produces the sum bit and a carry-out. For instance, adding binary 1 + 1 with no carry-in yields 10, for a sum bit of 0 and a carry-out of 1. Each carry-out is added to the bit position to the left, just like decimal long addition.

The problem with ripple carry is if you add, say, 11111111 + 1, you need to wait as the carry "ripples" through the sum from right to left. This makes addition a slow serial operation instead of a parallel operation. Even though the 8008 only performs addition on 8-bit numbers, this delay would slow the processor too much. The solution was a carry lookahead circuit that rapidly computes the carries for all eight bit positions. Then the sum can be calculated in parallel without waiting for carries to ripple through the bits. According to 8008 designer Hal Feeney, "We built the carry look-ahead logic because we needed the speed as far as the processor is concerned. So carry look ahead seemed like something we could integrate and have fairly low real estate overhead and, as you see, the whole carry look ahead is just a very small portion of the chip."

Implementing carry lookahead

The idea of carry lookahead is that if you can compute all the carry values in advance, then you can rapidly add all the bit positions in parallel. But how can you compute the carries without performing the addition? The solution in the 8008 was to build a separate circuit for each bit position to compute the carry based on the inputs.

The diagram below zooms in on the carry lookahead circuitry and the arithmetic-logic unit (ALU). The two 8-bit arguments and a carry-in arrive at the top. These values flow vertically through the carry lookahead circuit, generating carry values for each bit along the way. Each ALU block receives two input bits and a carry bit and produces one output bit. The carry lookahead has a triangular layout because successive carry bits require more circuitry, as will be explained. The 8-bit ALU has an unusual layout in order to make the most of the triangular space. Almost all microprocessors arrange the ALU in a rectangular block; an 8-bit ALU would have 8 similar slices. But in the 8008, the slices of the ALU are scattered irregularly; some slices are even rotated sideways. I've written about the 8008's ALU before if you want more details.

Closeup of the 8008 die showing the carry lookahead circuitry and the ALU.

Closeup of the 8008 die showing the carry lookahead circuitry and the ALU.

To understand how carry lookahead works, consider three addition cases. First, adding 0+0 cannot generate a carry, even if there is a carry in; the sum is 0 (if there is no carry in) or 1 (with carry in). The second case is 0+1 or 1+0. In this case, there will be a carry out only if there is a carry in. (With no carry-in the result is 1, while with carry-in the result is 10.) This is the "propagate" case, since the carry-in is propagated to carry-out. The final case is 1+1. In this case, there will be a carry-out, regardless of the carry-in. This is the "generate" case, since a new carry is generated.

The circuit below computes the carry-out when adding two bits (X and Y) along with a carry-in. This circuit is built from an OR gate on the left, two AND gates in the middle, and an OR gate on the right. (Although this circuit looks complex, it can be implemented efficiently in hardware.) To see how it operates, consider the three cases. If X and Y are both 0, the carry output will be 0. Otherwise, the first OR gate will output 1. If carry-in is 1, the upper AND gate will output 1 and the carry-out will be 1. (This is the propagate case.) Finally, if both X and Y are 1, the lower AND gate will output 1, and the carry-out will be 1. (This is the generate case.)

This circuit computes the carry-out given the carry-in and two input bits X and Y.

This circuit computes the carry-out given the carry-in and two input bits X and Y.

To compute the carry into a higher-order position, multiple instances of this circuit can be chained together. For instance, the circuit below computes the carry into bit position 2 (C2). The gate block on the left computes C1, the carry into bit position 1, from the carry-in (C0) and low-order bits X0 and Y0, as explained above. The gates on the right apply the same process to the next bits, generating the carry into position 2. For other bit positions, the same principle is used but with additional blocks of gates. For instance, the carry into position 7 is computed by a chain of seven blocks of gates. Since the circuit for each successive bit is one unit longer, the carry structure has the triangular structure seen on the die.

Computing the carry into position 2 requires two stages of carry prediction.

Computing the carry into position 2 requires two stages of carry prediction.

The diagram below shows how the carry circuit for bit 2 is implemented on the die; the circuit for other bits is similar, but with more repeated blocks. In the photograph, the metal wiring on top of the die is silverish. Underneath this, the polysilicon wiring is yellow. At the bottom, the silicon is grayish. The transistors are brighter yellow; several are indicated. The schematic underneath shows the wiring of the transistors; the layout of the schematic is close to the physical layout.

Implementation of the carry lookahead circuit for bit 2.

Implementation of the carry lookahead circuit for bit 2.

I'll give a brief outline of how the circuit works. The 8008 is implemented with a type of transistor called a PMOS transistor. You can think of a PMOS transistor as turning on if the input is 0, and off if the input is 1.4 Instead of standard logic gates, this circuit uses a technique called dynamic logic, which takes advantage of capacitance. In the first step, the precharge signal connects -9 volts to the circuitry, precharging it. In the second step, the input signals (top) are applied, turning on various transistors. If there is a path through the transistors from the +5 supply to the output, the output will be pulled high. Otherwise, the output remains at the precharge level; the capacitance of the wires holds the -9 volts. I won't trace out the entire circuit, but the upper X/Y transistor pairs implement an OR gate since if either one is on, the carry can get through. The lower X/Y transistors implement an AND gate; if both are on, the +5 signal will get through, generating a 1.

You might wonder why this carry lookahead circuit is any faster than a plain ripple-carry adder, since the carry signal has to go through up to seven large gates to generate the last carry bit. The trick is that the entire circuit is electrically a single large gate due to the dynamic design. All the transistors are activated in parallel, and then the 5-volt signal can pass through them all rapidly.5 Although there is still a delay as this signal travels through the circuit, the circuit is faster than the standard ripple carry adder which activates transistors in sequence.

A brief history of carry circuits

The efficient handling of carries was an issue back to the earliest days of mechanical calculation. The mathematician Blaise Pascal created a mechanical calculator in 1645. This calculator used a mechanical ripple carry mechanism powered by gravity that rapidly propagated the carry from one digit to the next (video). Almost two centuries later, Charles Babbage designed the famous difference engine (1819-1842). It used a slow ripple carry; after the addition cycle, spiral levers on a rotating shaft activated each digit's carry in sequence. Babbage spent years designing a better carry mechanism for his ambitious Analytical Engine (1837), developing an "anticipating carriage" to perform all carries in parallel. With the anticipating carriage, each digit wheel had a sliding shaft that moved into position when a digit was 9. When a digit triggered a carry by moving from 9 to 0, it raised the stack of shafts, incrementing all the appropriate digits in parallel (see video).

Detail of Babbage's diagram of the "anticipating carriage" that computes carries in the Analytical Engine.
I'm not sure how this mechanism works. From The Babbage Papers at the Science Museum, London, CC BY-NC-SA 4.0.

Detail of Babbage's diagram of the "anticipating carriage" that computes carries in the Analytical Engine. I'm not sure how this mechanism works. From The Babbage Papers at the Science Museum, London, CC BY-NC-SA 4.0.

The first digital computers used ripple carry. The designer of the Bell Labs relay computer (1939) states that "the carry circuit was complicated" due to the use of binary-coded decimal (BCD). The groundbreaking ENIAC (1946) used decimal counters with ripple carry. Early binary electronic computers such as EDSAC (1949) and SEAC (1950) were serial, operating on one bit at a time, so they computed carries one bit at a time too. Early computers with parallel addition such as the 1950 SWAC (the fastest computer at the time) and the commercial IBM 701 (1952) used ripple carry.

As computers became faster in the 1950s, ripple carry limited performance so alternatives were developed. In 1956, the National Bureau of Standards patented a 53-bit adder using vacuum tubes. This design introduced the important carry-lookahead concept, as well as the idea of using a hierarchy of carry lookahead (two levels in this case). The diagram below illustrates the complexity of this adder.

Diagram of a 53-bit adder from A 1-microsecond adder using one-megacycle circuitry, 1956.

Diagram of a 53-bit adder from A 1-microsecond adder using one-megacycle circuitry, 1956.

The development of supercomputers led to new carry techniques. The transistorized Atlas was built by the University of Manchester, Ferranti and Plessey in 1962. It used the influential Manchester carry chain technique, described in 1959. The Atlas vied with the IBM Stretch (1961) for the title of the world's fastest computer. The Stretch introduced high-speed techniques including the carry-select adder and the patented carry save adder for multiplication.

As with mainframes, microprocessors started with simple adders but required improved carry techniques as performance demands increased. Most early microprocessors used ripple carry, such as the 6502, Z-80, and ARM1. Carry-skip was often used for the program counter (as in the 6502 and Z-80); ripple carry was fast enough for 8-bit words but too slow for the 16-bit program counter. The ALU of the Intel 8086 (1978) used a Manchester carry chain as well as carry skip. The large transistor counts of VLSI chips permitted more complex adders, fed by research in parallel-prefix adders. The DEC Alpha 21064 (1992) combined multiple techniques: Manchester carry chain, carry lookahead, conditional sum, and carry select (details). The Hewlett-Packard PA_8000 (1995) contained over 20 adders for various purposes, including a Ling adder, a type developed at IBM in 1966 (details). The Pentium II (1997) used a 72-bit Kogge-Stone adder while the Pentium 4 (2000) used a Han-Carlson adder.6

This history shows that carry propagation was an important performance problem in the 1950s and remains an issue today with continuing research and improvements. Many different solutions have been developed, first in mainframes and later in microprocessors, growing more complex as technology advances. These approaches have tradeoffs of die area, cost, and speed, so different processors choose different implementations.

Die photo of the Intel 8008 processor. Click for a larger version.

Die photo of the Intel 8008 processor. Click for a larger version.

If you're interested in the 8008, I have other articles about it describing its architecture, its ALU, its on-chip stack, bootstrap loads, and its unusual history. I announce my latest blog posts on Twitter, so follow me at @kenshirriff. I also have an RSS feed.

Notes and references

  1. The functional blocks of the 8008 processor are documented in the datasheet (below). The layout of this diagram closely matches the physical layout on the die. I've highlighted the carry lookahead block.

    Functional blocks of the 8008 processor. From the 8008 datasheet.

    Functional blocks of the 8008 processor. From the 8008 datasheet.

     

  2. According to Federico Faggin's oral history, the 8008 team was lucky to be allowed to even use an 18-pin package for the 8008. "It was a religion in Intel" to use 16-pin packages, even though other manufacturers commonly used 40- or 48-pin packages. When Intel was forced to move to 18-pin packages for the 1103 RAM chip, it "was like the sky had dropped from heaven. I never seen so [many] long faces at Intel". The move to 18 pins was beneficial for the 8008 team, which had been forced to use 16 pins for the earlier 4004. However, even 18 pins was impractically small considering the chip used 14-bit addresses. The result was address and data signals were multiplexed over 8 data pins. This both slowed the processor and made use of the chip more complicated. Intel soon gave up on small packages, using a standard 40-pin package for the 8080 processor in 1974. 

  3. I'm ignoring subtraction in this discussion because it was implemented by addition, adding a two's complement value. Multiplication and division were not implemented by early microprocessors. Interestingly, even the earliest mainframe computers implemented multiplication and division in hardware. 

  4. Most of the "classic" microprocessors were implemented with NMOS transistors. If you're familiar with NMOS gates, everything is backward with PMOS. Although PMOS has worse performance than NMOS, it was easier to manufacture at first, so the Intel 4004 and 8008 used PMOS. PMOS required fairly large negative voltages, which is why the diagram shows -9 volts and +5 volts. 

  5. I'm hand-waving over the timing of the carry lookahead circuit. An accurate analysis of the timing would require considering the capacitance of each stage, which might add an O(n2) term.

    Also note that this carry lookahead circuit is a bit unusual. A typical carry lookahead circuit (as in the 74181 ALU chip) expands out the gates, yielding much larger but flatter circuits to minimize propagation delays. On the other hand, the 8008's circuit has a lot in common with a Manchester carry chain, which uses a similar technique of passing the incoming carry through a chain of pass transistors, or potentially generating a carry at each stage. A Manchester carry chain, however, uses a single N-stage chain rather than the 8008's triangle of separate chains for each bit. A Manchester carry chain can tap each bit's carry from each stage of the chain, so only one chain is required. The 8008's carry circuit, however, lacks the transistors that block a carry from propagating backwards, so its intermediate values may not be valid.

    In any case, the 8008's carry lookahead circuit was sufficiently fast for Intel's needs. 

  6. For more information on various adders, see this presentation, another presentation and Advanced Arithmetic Techniques

Inside the stacked RAM modules used in the Apple III

In 1978, a memory chip stored just 16 kilobits of data. To make a 32-kilobit memory chip, Mostek came up with the idea of putting two 16K chips onto a carrier the size of a standard integrated circuit, creating the first memory module, the MK4332 "RAM-pak". This module allowed computer manufacturers to double the density of their memory systems and by 1982, Mostek had sold over 3 million modules. The Apple III is the best-known system that used these memory modules.

The MK4332 memory module combined two 16-kilobit memory chips on a ceramic substrate.

The MK4332 memory module combined two 16-kilobit memory chips on a ceramic substrate.

This module was built from two 16-kilobit memory chips, constructed from the standard MK4116 dynamic RAM (DRAM) chip packaged in a leadless ceramic chip carrier; these are the golden rectangles on top of the carrier.

You might wonder why customers didn't simply use these surface-mount packages directly, but at the time soldering surface-mount components was still a challenge for many customers. However, mounting two leadless chips on a dual inline-package (DIP) carrier allowed customers to double their memory density while still using their standard through-hole soldering techniques.

The purple carrier holding the chips was a ceramic substrate designed for thermal compatibility with the chips.1 There is no circuitry inside the ceramic carrier except wiring between the chips and the eighteen DIP pins. The two memory chips were wired in parallel except for their two select lines, which were kept separate. This allowed the desired memory chip to be selected. As a result, the MK4332 module has 18 pins, compared to 16 pins for the chips on top. Mostek used the same module design with the next generation of RAM chips, creating a 128-kilobit RAM module (MK4528) from two 64-kilobit RAM chips (MK4564).

Inside the 4116 memory chip

Although you might expect a complex mounting technique, the two 4116 chips are simply soldered onto the substrate with standard reflow techniques. For the photo below, I removed the metal lid from the left chip with a chisel and unsoldered the right chip with a hot air gun. On the left, you can see the rectangular silicon die inside the leadless carrier package. On the right are the 16 solder pads on the ceramic substrate. The wiring between the solder pads and the DIP pins is inside the ceramic substrate.

The MK4332 with the left package opened and the right package unsoldered.

The MK4332 with the left package opened and the right package unsoldered.

I created the die photo below from multiple microscope images. The white lines are the metal wiring on top of the chip, while the silicon underneath appears dark red. The two large rectangular regions are the 16,384 memory cells, arranged as a 128×128 matrix, split in two. The circuitry in between these regions consists of 128 sense amplifiers to amplify the bits read from memory, and selection circuitry to select one bit out of the 128. (Externally, the chip is accessed as 16,384×1, outputting a single bit. Typically, eight of these chips were used to store bytes.) The control and interface circuitry is at the left and right, connected to the external pads via tiny bond wires.

Die photo of the 4116 memory chip. Click for a larger image.

Die photo of the 4116 memory chip. Click for a larger image.

In dynamic RAM, a bit is stored in a capacitor, with a transistor providing access to the capacitor. The value of the bit is represented by the presence or absence of charge on the capacitor. The advantage of dynamic RAM is that each memory cell is very small, constructed from just two components,2 allowing a high memory density. (In comparison, static RAM may require six transistors per cell.) The downside of dynamic RAM is that the charge on a capacitor leaks away after a few milliseconds. To avoid losing data, dynamic RAM must be constantly refreshed: bits are read from the capacitors, amplified, and then written back to the capacitors. For this particular chip, all the data must be refreshed every two milliseconds.

The diagram below illustrates the wiring of the memory cells, showing two of the 128 rows and columns. To read or write data, a row select line is energized. The transistors in that row turn on, connecting that row's capacitors to the data in/out lines. The data from that row is read out of the capacitors and amplified. At that point, the data can either be written back to refresh the row, or a new bit can be written. Note that although the chip accesses 128 bits in parallel internally, the chip provides access to one bit at a time externally, selecting one of the 128 bits to read or write.

Structure of the memory cells, based on the patent.

Structure of the memory cells, based on the patent.

The magnified photo below shows some of the storage cells, densely packed together. It's a bit hard to visualize what's going on because the chip is constructed from multiple layers. The bottom layer is the grayish silicon die. On top of the silicon are two layers of polysilicon. Above this is the metal wiring, which was removed for this photo. The photo shows three sense lines (data in/out) in the silicon, with bulb-shaped storage cells connected on either side. Vertical strips of polysilicon (poly 1) over the storage cells implement capacitors: the silicon forms the lower plate, while the polysilicon forms the upper plate. The second layer of polysilicon (poly 2) is arranged in diagonal regions to implement the selection transistors. Square notches in the poly 1 layer allow the poly 2 layer to approach the silicon to form transistors. Horizontal metal wiring (not visible) is connected to the poly 2 regions to select a row by driving the transistors. Note that the rows are staggered and interlocking (kind of like a zipper) due to the highly-optimized layout. At the time, fitting this much memory on a chip was a challenge that pushed the limits of integrated circuit technology.

A closeup of the memory chip under the microscope, showing individual storage cells.

A closeup of the memory chip under the microscope, showing individual storage cells.

Memory chips in the Apple III

Apple was a major customer of these memory modules, using them in the Apple III computer (1980). The Apple III was marketed as a business computer to follow the popular Apple II. Unfortunately, the Apple III was a business failure due to reliability issues and competition from the IBM PC introduced a year later.

Apple III Plus computer. Photo by Bilby, CC BY 3.0.

Apple III Plus computer. Photo by Bilby, CC BY 3.0.

As was usual for the time, the Apple III's memory board3 was stuffed with memory chips to achieve more capacity. An unusual part of the design is it used three rows of memory chips (instead of a power of two), mixing 16-kilobit and 32-kilobit memory chips to achieve 128 kilobytes of storage. (The Apple III's case was designed before the boards, so the boards had to be designed to fit the available space.) In the photo below, the top row holds MK4332 memory modules, while the bottom two rows hold 16-kilobit MK4116 chips.4

Apple III main memory card. Photo courtesy of DigiBarn, CC BY-NC 3.0

Apple III main memory card. Photo courtesy of DigiBarn, CC BY-NC 3.0

A brief history of memory

Memory is an under-appreciated part of computing. The CPU usually gets the attention, but memory was often the limiting factor. The problem with memory is that storing a single bit is easy, but most approaches are impractical when you try to scale up to thousands or millions of bits.

The early ENIAC computer (1946) used vacuum tubes for storage, but these were bulky and expensive, limiting ENIAC to just 20 words (of 10 digits) stored in its accumulators. Early computers such as EDSAC (1949) used mercury delay lines for memory, sending pulse trains of sound waves through tubes of mercury. Although EDSAC could store 512 words, you had to wait for bits to circulate serially through the mercury. An improvement was the random-access Williams tube which stored data as spots on a cathode-ray tube screen. Although they were temperamental, Williams tubes were used in the Manchester Mark 1 (1949) and the commercial IBM 701 (1952).

The introduction of core memory revolutionized computing, providing fast, cheap, and reliable storage, storing each bit in a tiny magnetized ferrite ring. Core memory was introduced in the Whirlwind computer (1953) and used in most computers of the late 1950s and 1960s. However, since each bit required a separate physical ferrite core, memory sizes were limited to a few megabytes for even the largest customers. For example, memory cabinets for the IBM System/360 (1969) held 256 kilobytes but weighed over a ton each (below).

Magnetic core memory was relatively bulky. This photo shows an IBM System/360 Model 85 installation. The cabinets in the front are IBM 2365 Processor Storage, each holding 256 kilobytes. The double-H cabinet in the center is the CPU. Photo from IBM.

Magnetic core memory was relatively bulky. This photo shows an IBM System/360 Model 85 installation. The cabinets in the front are IBM 2365 Processor Storage, each holding 256 kilobytes. The double-H cabinet in the center is the CPU. Photo from IBM.

Semiconductor memory led to another dramatic shift. At first, semiconductor memory was costly and had very small capacity; Intel's first product was a memory chip holding just 64 bits and costing $99.50. In 1968, Dennard at IBM invented cost-effective dynamic RAM and semiconductor DRAM technology advanced quickly at various companies. Intel introduced the first commercially available DRAM chip in 1970, the i1103 holding 1K bits. This chip was nicknamed the "core killer" because of its impact on the magnetic core memory industry.

Computer storage rapidly moved from core memory to DRAM as the capacity of DRAM increased and the price fell.5 Mostek introduced the 4-kilobit MK4096 chip in 1973, followed by the 16-kilobit MK4116 in 1976. In 1978, Fujitsu introduced the first commercial 64-kilobit DRAM chip and Japan took the lead in DRAM manufacturing.6 Intel left the DRAM industry in 1985 due to decreasing market share and profits, followed by the remaining US DRAM manufacturers.

Fifty years after the introduction of DRAM, it is still the dominant technology for main storage, a remarkably long lifetime. Compared to the 16-kilobit chip I described, Samsung's recent 16-gigabit DRAMs are a factor of a million larger, showing the incredible increase in density. It remains to be seen if anything will challenge the long storage leadership of DRAM.

I announce my latest blog posts on Twitter, so follow me at kenshirriff. I also have an RSS feed. Thanks to Mike Braden for suggesting the MK4332 chip to me.

Notes and references

  1. For details on the construction of the memory modules, see Rectangular chip-carriers double memory-board density, Electronics, 1982. 

  2. Early dynamic RAMs such as the Intel 1103 used three transistors per cell and used separate lines for reading and writing data. Improvements in memory technology shrunk the circuit to a single transistor and a single data line. 

  3. The Apple III memory board pictured is the "12 volt memory board", given that name because the memory chips required 12 volts (as well as +5 and -5). It was upgraded by the "5 volt memory board", which used only a 5 volt supply. The 5 volt memory board used more modern 64-kilobit memory chips (4864) giving it a larger capacity of 128 or 256 kilobytes. Inconveniently, the power supply required a 12-volt load to operate, so the 5-volt memory board has a power resistor to draw 0.4 amps from the otherwise-unused 12-volt supply. Details are in the Apple III reference manual

  4. The Apple III memory board was also available in a lower-cost 96-kilobyte module. In that configuration, the 4332 memory modules were replaced with the 16-kilobit (MK4116) chips used on the rest of the board. One clever feature of the 4332 module is the two "extra" select pins are on the end of the package. The result is that a memory board (such as the Apple III's) can be designed to accept either the 16-pin 16-kilobyte chips or the 18-pin 32-kilobyte modules, depending on how much memory is desired. With the smaller chips, the two extra pins are unused. It's strange, however, that the Apple III memory board only accepted the larger modules in one of the three rows of chips. 

  5. The industry switch from magnetic core memory to semiconductor memory wasn't as straightforward as superior semiconductor memory overthrowing inferior core memory. Instead, there was a time period where they co-existed, due to tradeoffs. For instance, in 1972, a customer could select core memory, semiconductor memory, or a mixture for the D-112 minicomputer (a PDP-8 clone); semiconductor memory was 5 times faster, but core memory supplied four times the capacity per board. By 1973, industry publications were reporting that "Semiconductor memories are taking over data-storage applications". As late as 1980, core memory manufacturers were advertising the benefits of core memory, battling the "myths" that semiconductor was better.

    Was the overthrow of magnetic core by semiconductor memory inevitable? My view is that "technological determinism" acts in some ways; the development of DRAM memory was almost unavoidable following the development of MOS transistors. However, "economic determinism" was more responsible for the success of semiconductor memory: if magnetic core had remained the lower-cost option, it probably would have remained dominant. As a counterexample, CCD (charge-coupled device) memory and bubble memory were hyped as storage technologies of the future, but couldn't achieve the price-performance to dislodge either semiconductor memory or hard disks. 

  6. Note that the capacity of memory chips increased by a factor of 4 each generation (1-, 4-, 16-, 64-kilobit) rather than a factor of 2. The reason is that each address pin was multiplexed to provide two address bits, so each additional address pin resulted in a factor of four increase. By reusing each address pin for both a row address and a column address, the number of address pins was kept low so compact 16-pin packages could be used even as memory sizes expanded to 256-kilobit. Conveniently, as technology improved, memory chips required fewer voltages, freeing up pins formerly used for power. One consequence, though, was the ordering of address pins on the chip was essentially random as new address pins were assigned based on which pins were available, rather than sequentially. The multiplexed address system was introduced in the Mostek MK4096 chip and meant that the 256-kilobit 41256 chip used fewer pins than the original 1-kilobit Intel 1103 (16 pins vs 18).