History of the Computer - Redundancy Part 2 of 2

Can you see what's coming next? What happens if the controller fails? Ha! - we're ready for that! We put in another controller, and connect the second string of cables to this one instead! Redundant controllers.

Other configurations are also possible, using both interfaces on both controllers; however, these are mainly concerned with system throughput, or with the ability to switch drives between two or more systems. The extra redundancy provided in these cases is more of a bonus than a necessity.

Moving further up the data chain, we need a path between the controller and the I/O (input/output) section of the mainframe. By now you will see that we will use two paths to provide redundancy. You can also work out that, if the I/O unit fails, we have problems not only in talking to our disk drive, but to tapes, printers, datacomms and so on.

This possibility of any one component in the system being able to affect the whole system, or a significant section of it, is known as 'Single Point Sensitivity'. A Single Point of Failure is any single component whose failure can halt, or significantly degrade, the operation of the system.

The way to avoid this, of course, is to duplicate everything: CPUs, I/Os, controllers and so on. The most difficult component to duplicate is memory, as it is the central part of the system, through which everything is controlled.

Multiprocessors have been in operation since the 1960s, and dual paths have been used to access subsystems. In the case of larger systems, for example a 6x4 (6 CPUs, 4 I/Os), four paths were provided to disk and tape controllers to increase throughput and to allow 2 or more operating systems to run on selected components. Units can be combined or removed 'on the fly', and the system carries on working with more or fewer resources.

We have talked about redundancy in hardware, the physical components of the system, achieved by providing alternative hardware. There is also the provision of redundant capabilities in the software, usually in the operating system.

The software is set up to take note of errors detected by the hardware, and automatically 'mark down' a specific component, or access path, depending on a preset 'threshold' of error tolerance. At the same time, the system operations and maintenance personnel are alerted, so that the perceived problem can be rectified.
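The mark-down logic described above can be sketched in a few lines. This is a minimal illustration only: the path names, the threshold value and the alerting mechanism are all assumptions for the example, not taken from any real operating system.

```python
# Sketch of threshold-based 'mark down' of a failing access path.
# ERROR_THRESHOLD and the path names are illustrative assumptions.

ERROR_THRESHOLD = 3  # errors tolerated before a path is marked down


class Path:
    def __init__(self, name):
        self.name = name
        self.errors = 0
        self.online = True

    def record_error(self):
        """Called each time the hardware reports an error on this path."""
        self.errors += 1
        if self.online and self.errors >= ERROR_THRESHOLD:
            self.online = False  # mark the path down
            self.alert()

    def alert(self):
        # In a real system this would notify operations and maintenance staff.
        print(f"ALERT: path {self.name} marked down after {self.errors} errors")


path_a = Path("controller-1/interface-A")
for _ in range(3):
    path_a.record_error()
# path_a.online is now False; the OS would route I/O via a redundant path
```

Once a path is marked down, traffic continues over the surviving path while the engineer investigates, which is exactly the point of building in the redundancy.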

The provision of redundancy in a system gives the maintenance engineer the ability to work on the failing component or path while the system continues operation. We can also test multiple paths to a component to eliminate possible causes of a failure which, we have seen, could lie anywhere in the chain from the drive back to the I/O unit.
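That process of elimination across paths can also be sketched. The topology and the probe function below are assumptions for the example (a real diagnostic would issue actual test I/Os on each path): a component present in every failing path but absent from every working one is the likely culprit.

```python
# Sketch of fault localisation by probing redundant paths to a drive.
# The path list and probe() are illustrative assumptions.

def probe(path, failed_component):
    """Return True if a test I/O down this path succeeds."""
    return failed_component not in path


def localise_fault(paths, failed_component):
    """Suspect any component shared by all failing paths but no working path."""
    working = [set(p) for p in paths if probe(p, failed_component)]
    failing = [set(p) for p in paths if not probe(p, failed_component)]
    suspects = set.intersection(*failing) if failing else set()
    for w in working:
        suspects -= w  # components on a working path are exonerated
    return suspects


# Four paths from two I/O units, through two controllers, to one drive.
paths = [
    ("io-1", "controller-1", "cable-1A", "drive"),
    ("io-1", "controller-2", "cable-2A", "drive"),
    ("io-2", "controller-1", "cable-1B", "drive"),
    ("io-2", "controller-2", "cable-2B", "drive"),
]
print(localise_fault(paths, "controller-1"))
```

With this topology the call above narrows the fault to controller-1, since the drive itself is exonerated by the paths that still work.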

A further redundancy commonly built in to large systems is the power supply. The system power is provided by a generator, driven by a motor, which is in turn driven from mains power. In the event of a mains failure, a diesel engine takes over driving the generator to ensure continuous power. The change-over period is covered by batteries, which are kept charged during normal operation. In another version, the batteries drive the generator at all times, isolating the system from power surges.

The complete system can be duplicated in this way, with more resources being placed in susceptible areas, such as disks and tapes. The cost of additional components is weighed against the requirement for continuous operation, and against what effect a catastrophic failure might have. The ultimate redundancy uses a completely separate duplicate system located at a remote site, possibly hundreds of miles distant (of course, with duplicated links!).

Tony is an experienced computer engineer. He is currently webmaster and contributor to http://www.what-why-wisdom.com . A set of diagrams accompanying these articles may be seen at http://www.what-why-wisdom.com/history-of-the-computer-0.html . RSS feed also available - use http://www.what-why-wisdom.com/Educational.xml