History of the Computer - Memory Error Correction Codes Part 1 of 2

We have mentioned before, in the history of the computer series, that various forms of error correction are used where the medium is unreliable. This applies mainly to magnetic tape and disks. Because the magnetic coating on the recording surfaces is subject to wear, various codes such as the CRC (Cyclic Redundancy Check) were developed. Data transmission now also uses error correction; previously, error detection would simply trigger a re-transmission.

The need for error correction in memories became more pressing when semiconductor, or chip, memories were introduced in the 1970s. Although they promised much larger capacity in much less space, at a better cost, the early chips were susceptible to failures.

The early introduction of these memory types in mainframes saw the re-introduction of the Hamming code. Richard Hamming, a mathematician who had worked on the Manhattan Project in WWII and later on early computers, devised the code in 1950.

The code was used in chip memories to improve the reliability of the computers so that they could be used without too many failures! It was able to correct a single bit error (SBE). Thus, if one of the bits in a word read out of memory was a 1 instead of a 0, it could be changed back to a 0 on the fly, an operation transparent to the user. The code could also detect, but not correct, Multiple Bit Errors (MBE), also known as MUEs (Multiple Uncorrectable Errors).

Multiple bit errors caused a recovery process to be initiated, causing lost time, a situation frowned upon in computer circles! It was therefore important for the engineers to keep a watchful eye on the error logs.

A repeat occurrence of a particular bit in error indicated a potential multiple bit failure: another bit failing at the same address, at the same time, would produce an uncorrectable error. For this reason a chip showing a single bit error would be replaced at the next maintenance session.

How does the Hamming code work? It can be seen as an extension of a simple parity code, which we have mentioned before. Odd parity counts the number of 1 bits in a character, or word, and sets a parity bit to 1 or 0 to make the total count odd. For example, 1011010 has an even number of 1 bits (four), so a parity bit of 1 would be added to the data written to memory, giving 11011010. Now we can check the data read out of memory to see if the total number of 1 bits is odd or even. If it is even, there is an error.

P101 1010 = even # of 1 bits; P is the parity bit to be set
1101 1010 = odd # of 1 bits with the parity bit set to 1
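
For readers who like to see the idea in code, here is a minimal sketch of the odd parity scheme in Python. It is our own illustration, not part of the original hardware, and the function names are invented for the example.

# A minimal sketch of the odd parity scheme described above.
def odd_parity_bit(data_bits: str) -> str:
    """Return the parity bit that makes the total count of 1 bits odd."""
    return "0" if data_bits.count("1") % 2 == 1 else "1"

def check_odd_parity(word: str) -> bool:
    """A stored word passes the check if its count of 1 bits is odd."""
    return word.count("1") % 2 == 1

data = "1011010"                        # four 1 bits - an even count
stored = odd_parity_bit(data) + data    # parity bit prepended
print(stored)                           # 11011010
print(check_odd_parity(stored))         # True - no error detected
print(check_odd_parity("10011010"))     # False - a flipped bit is detected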

We now go to the next step, and devise a code which will identify the location of a failing bit. The way we do this is to consider a series of overlapping sets of bits, so that each bit is covered by more than one check. We choose these sets in accordance with the binary bit values, or powers of 2: 1, 2, 4, 8 and so on, taking as many check bits as we need to cover the word length. These check bits are inserted in the word written to memory at the corresponding bit positions.

Position: 11 10  9  8  7  6  5  4  3  2  1
Bit:      D7 D6 D5 C8 D4 D3 D2 C4 D1 C2 C1

D1 to D7 are the original data bits, in sequence.
C1, C2, C4 and C8 are the check bits, placed at the power-of-two positions 1, 2, 4 and 8.
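
As an illustration only, here is a small Python sketch of how the check bits might be generated and placed in this layout. It assumes even parity over each check group (a common convention; the article does not specify), takes the 7 data bits left to right as D7 down to D1, and the function and variable names are our own.

# An illustrative sketch of generating and placing the Hamming check bits.
def hamming_encode(data_bits: str) -> str:
    """Encode 7 data bits, given left to right as D7..D1, into 11 bits."""
    d = [int(b) for b in reversed(data_bits)]    # d[0] = D1 ... d[6] = D7
    word = [0] * 12                              # positions 1..11 used
    for bit, pos in zip(d, [3, 5, 6, 7, 9, 10, 11]):
        word[pos] = bit                          # data in non-power-of-2 slots
    for c in (1, 2, 4, 8):                       # check bits C1, C2, C4, C8
        # Ck covers every position whose binary value includes k, e.g.
        # C1 covers 1, 3, 5, 7, 9, 11 and C8 covers 8, 9, 10, 11.
        word[c] = sum(word[p] for p in range(1, 12) if p & c and p != c) % 2
    return "".join(str(word[p]) for p in range(11, 0, -1))   # D7 down to C1

print(hamming_encode("1011010"))    # -> 10101010000

On read-back, the same four checks are recomputed over the stored word; a mismatch pinpoints the failing bit, as we will see in Part 2.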

In Part 2 we will use an example of a bit failure to illustrate the operation.

Tony is an experienced computer engineer. He is currently webmaster of, and a contributor to, http://www.what-why-wisdom.com A set of diagrams accompanying these articles may be seen at http://www.what-why-wisdom.com/history-of-the-computer-0.html An RSS feed is also available: use http://www.what-why-wisdom.com/Educational.xml