IEEE 1596
Standard for Scalable Coherent Interface (SCI) - IEEE Computer Society Document
contributor author | IEEE - The Institute of Electrical and Electronics Engineers, Inc. | |
date accessioned | 2017-09-04T16:35:15Z | |
date available | 2017-09-04T16:35:15Z | |
date copyright | 03/19/1992 | |
date issued | 1992 | |
identifier other | UZMENAAAAAAAAAAA.pdf | |
identifier uri | https://yse.yabesh.ir/std/handle/yse/98737 | |
description abstract | Foreword The demand for more processing power continues to increase, and apparently has no limit. One can usefully saturate the resources of any computer so easily by merely specifying a finer mesh or higher resolution for the solution of some physical problem (hydrodynamics, for example), that engineers and scientists are desperate for enormously larger computers. To get this kind of computing power, it seems necessary to use a large number of processors cooperatively. Because of the propagation delays introduced when signals cross chip boundaries, the fastest uniprocessor may be on one chip before long. Pipelining and similar large-mainframe tricks are already used extensively on single-chip processors. Vector processors help, but are hard to use efficiently in many applications. Multiprocessors communicating by message passing work well for some applications, but not for all. The shared-memory multiprocessor looks like the best strategy for the future, but a great deal of work will be needed to develop software to use it efficiently. It is important to support both the shared-memory and the message-passing models efficiently (and at the same time) in order to support optimal software for a wide range of problems, especially for a system that dynamically allocates processors and perhaps changes its configuration depending on the nature of its load. SCI started from an attempt to increase the bandwidth of a backplane bus past the limits set by backplane physics in order to meet the needs of new generations of processor chips, some of which can single-handedly saturate the fastest buses. We soon learned that we had to abandon the bus structure to achieve our goals. Backplane performance is limited by physics (distributed capacitances and the speed of light) and by a bus's one-at-a-time nature, an inherent bottleneck.
To gain performance far beyond what buses and backplanes can do, one needs better signaling techniques and the concurrent use of many signaling paths. Rather than using bused backplane wires, SCI is based on point-to-point interconnect technology. This design approach eliminates many of the physics problems and results in much higher speeds. SCI in effect simulates a bus, providing the bus services one expects (and more) without using buses. SCI has turned out to be surprisingly simple, much simpler than many of the alternative designs we explored and much simpler than bus-based systems would be if they tried to approach a comparable size and performance. This simplicity may not be obvious to the first-time reader of this rather thick document, but much of this bulk is due to the large amount of tutorial material necessary to introduce such a new way of doing things (a paradigm shift), and even more is due to the comprehensive executable description of cache behavior under all possible conditions. The switch from a shared backplane bus to a point-to-point interconnect has created many new problems and research topics, which have been resolved in record time by this SCI project. Much research remains to be done on determining optimal ways to use the mechanisms SCI provides. SCI has also required the development of novel allocation and cache-coherence protocols, which has made the project a challenging one indeed, particularly in view of our schedule objectives. Historical Perspective and Acknowledgments Most of the developers of SCI come from high-speed-bus backgrounds, such as Fastbus (IEEE Std 960-1989) or Futurebus (IEEE Std 896.1-1987). 
Paul Sweazey, who was the coordinator of the Futurebus cache coherence task group, initiated a SuperBus Study Group under the IEEE Computer Society's Microprocessor Standards Committee in November 1987 to consider whether something could be done for the next bus generation to avoid the multitude of competing incompatible standards we saw in the 32-bit generation. Futurebus tried to solve that problem, starting in the late 1970s, but could not converge to a single best solution in time to head off the development of many alternatives. The SuperBus Study Group met for less than a year before deciding that there was indeed a way to do better and to achieve the throughput rates that are required for supporting multiple 100-MFLOPS-class processor chips, namely about 1 Gbyte/s per processor. We were particularly urged on by Paul L. Borrill, Futurebus chairman, and John Moussouris (one of the founders of MIPS), who frightened us all by his predictions of immensely powerful processors in the near future—which already are coming true! Our July 1988 Project Authorization Request was approved by the IEEE Standards Board in October. David B. Gustavson was appointed Chairman and David V. James became the logical-task-group coordinator and Vice Chairman. Gustavson also served as physical-task-group coordinator, handled the records and mailings, and shared minutes-taking and editing duties with David James. A Control and Status Register and I/O Architecture effort was started within SCI, based on some significant contributions by David James. When it was recognized as important for other standard buses as well, it was split off as an independent activity shared by Futurebus+, Serial Bus (P1394), and others. In April 1989 this also became an official project, P1212, with David James as chairman. The goal of a uniform CSR architecture has been attempted many times before (e.g., by the Fastbus Software Working Group, chaired by Gustavson), and has proven elusive. 
The reason P1212 has had a more comprehensive success is that David James brought considerable architectural experience to bear, generating sufficient rationale for the various choices so that decisions no longer seem entirely arbitrary. Much of this rationale is a consequence of multiprocessor architectural considerations; without the constraint of efficient multiprocessor interoperability, many CSR design issues would be too arbitrary to be able to achieve timely standardization. The CSR Architecture has become a unifying force for the latest generation of buses, encouraging VME and MULTIBUS® II users to use the CSR architecture as they interface to Futurebus+, thus facilitating a future interface to SCI as system requirements grow. In this way, there is a relatively smooth and well-defined growth path from present-generation single-processor systems through Futurebus+'s several-processor systems with cache coherence, to SCI's many-processor systems. Because of the importance of such a migration path to the future acceptance of SCI, we place high priority on interfacing SCI with other buses. For that reason we include protocol hooks that would not otherwise be needed. In exchange, SCI users will be able to take advantage of the large number of existing I/O interfaces. In March 1989, a Fiber Optic Task Group (SCI-FI) was started, led by Hans Wiggers, and an SCI/Futurebus+ Bridge Task Group was started, led by Mark Williams (a joint appointment with Futurebus+). Throughout the development of SCI, Knut Alnes and Ernst Kristiansen were working on an early implementation, providing input for the details of the specification. They also initiated work at the University of Oslo, by Stein Gjessing and others, on formal verification of the cache-coherence mechanisms. This real implementation effort was extremely valuable to SCI, and greatly accelerated convergence to a practical specification. David James generated documents at an incredible rate.
As the result of his single-handed effort the bulk of the text of this specification first appeared in June 1989. At the same time he was producing two volumes of similar size for the CSR working group! He is convinced that having something on paper produces more productive discussions, and our experience supports that view. In September 1990, the working group requested the initiation of a project to standardize an SCI/VME bridge architecture, P1596.1, chaired by Ernst Kristiansen. The first meeting was held in November. September 1990 also saw the P1212 draft completion and the beginning of its ballot phase. In November 1990 the Fiber Optic part of SCI was given a big boost by Hewlett Packard's decision to release its Gbit/s serial G-link specification for use by SCI. This link is able to transfer the 17th bit that makes possible a transparent synchronous interface with the parallel 16-bit-plus-flag SCI link. The other serial links considered needed occasional extra symbols in place of the flag bit, which made such an interface much more difficult because the serial and parallel clock frequencies could not have a constant ratio. (Subsequently, ways to solve this problem were discovered that are compatible with other encodings, such as 8b/10b, so future link standards could use these if that proves desirable.) In January 1991, the working group voted unanimously to submit the draft specification to the Microprocessor Standards Committee (MSC) for forwarding to the balloting body. This was the only vote taken by the working group, which worked entirely by consensus from start to finish. Our philosophy was that, given choices, we would always take the technically superior way. If superiority was not apparent, an arbitrary choice would be used until it ran into problems. This method worked very well, resulting in rapid progress and a nearly ego-free working environment. 
It helped that this project was at the leading edge of technology, and thus attracted contributors of sufficiently high stature that their egos were under control. It also helped that SCI was not considered a threat to existing commercial interests, but rather a path to new markets. In order to avoid the chaos of the 32-bit-bus world, SCI would have to finish in record time. (Normal development time for a new bus standard that involves new design without major historical constraints has run from eight to twelve years.) To this end, the group worked at a feverish pace, with multiday meetings every month and much work between. Many workers put in nearly full-time (in some cases much more than full-time) effort. One benefit, as the pace increases, is that the progress improves more than proportionally because there is no time between meetings for forgetting. The result is that the work goes faster and has higher quality and coherence, as we hope the reader will agree upon examining this standard. The P1596 Working Group is grateful to all who have participated directly or indirectly in the development of the SCI standard. In the initial design phases, novel concepts were often mistakenly discarded before being resurrected and included in the SCI standard. The working group extends its gratitude to those who had the perseverance to withstand this learning process, and its apologies to those whose contributions were not appreciated properly. Some of the multiprocessor architectural issues in SCI are very esoteric, and we recognize that has been frustrating to newcomers, as it takes a long time to get up to speed. The working group is also grateful for the patience that the experts in various areas have shown while time was spent on other areas. Scope Purpose: To define an interface standard for very high performance multiprocessor systems that supports a coherent shared-memory model scalable to systems with up to 64K nodes. 
This standard is to facilitate assembly of processor, memory, I/O, and bus adaptor cards from multiple vendors into massively parallel systems with throughputs ranging up to more than 10^12 operations per second. Scope: This standard will encompass two levels of interface, defining operation over distances less than 10 m. The physical layer will specify electrical, mechanical, and thermal characteristics of connectors and cards. The logical level will describe the address space, data transfer protocols, cache coherence mechanisms, synchronization primitives, control and status registers, and initialization and error recovery facilities. The preceding statements were those submitted to and approved by the IEEE Standards Board as the definition of the SCI project. These goals have been met and exceeded: support for message-passing was added, and the operating distance is not limited to 10 m. (The intent of that limitation was to make clear that this is not yet-another Local Area Network.) The real distinction between SCI and a network has more to do with the memory-access-based model SCI uses and the distributed cache-coherence model. The practical operating distance depends more on the throughput and performance needed than on any absolute limit built into the specification: very long links would yield unacceptable performance for many users (but perhaps not all). In particular, the fiber-optic physical layer can extend the SCI paradigm over distances long enough to link a computer to its I/O devices, or to link several nearby processors. No arbitrary length limit would be appropriate, but practical considerations including the throughput requirements and the cost of transmitters and receivers will set the lengths that people consider useful. A very-high-priority goal was that SCI be cost-effective for small systems as well as for the massively parallel ones mentioned in the purpose statement above.
SCI's low pin count and simple ring implementation make medium-performance, few-processor systems easier to build with SCI than with bused backplane systems; a two-layer backplane should be sufficient, and three layers should be enough to support the optional geographical addressing mechanism. The SCI interface, complete with transceivers, fits into a single IC package that includes much of the logic needed to support the cache-coherence protocols. This economy for small systems leads to the expectation that SCI processor boards will be built in high volume, making them inexpensive enough to be assembled in large numbers for building supercomputers at low cost. SCI also simplifies the construction of reliable systems. SCI Type 1 modules are well protected against electrostatic discharge and electromagnetic interference, and can be safely inserted while the remainder of the system remains powered. SCI supports live insertion and withdrawal by using a single supply voltage (with on-board conversion as needed) and staggered pin lengths in the connector to guarantee safe sequencing. Note, however, that system software plays an important role in live insertion or removal of a module because the resources provided by that module have to be allocated and deallocated appropriately. In systems where several modules share a ringlet, the removal of one module interrupts all communication via that ringlet, so the resources on those modules also have to be deallocated. A similar situation arises in any system that may have multiple processors resident on one field-replaceable board: all have to be deallocated when any one is replaced. The system software for handling the deallocation and reallocation of these resources is outside SCI's scope. Although SCI does not provide fault tolerance directly in its low-level protocols, it does provide the support needed for implementing fault-tolerant operation in software.
With this recovery software, the SCI coherence protocols are robust and can recover from an arbitrary number of detected transmission failures (packets that are lost or corrupted). The SCI paradigm removes the limits that bus structures place on throughput, but its latency is of course limited by the speed of signal propagation (less than the speed of light). Ever-increasing throughput can be expected as technology improves, but the organization of hardware and software will have to take into account the relatively constant latency (delay between request and response), which is proportional to the physical size of the system. The last generation of buses approached the ultimate limits of performance, leading to the concept of an “ultimate” standard. However, the initially defined SCI physical layers are likely just the first of a series of implementations having higher or lower performance levels. The 1 Gbyte/s link speed specified for the initial ECL/copper-backplane implementation was chosen based on a combination of marketing and engineering considerations. From a marketing point of view, it was necessary to define a territory that did not disturb the markets for present 32-bit standards or present networks, and from an engineering point of view this link speed was near the edge of what available signaling technology and integrated circuit technology could support. New technologies, such as better cables, connectors, transceivers; IC packages with more pins or higher power-dissipation capabilities; or faster ICs, could make it practical or desirable to implement SCI on new physical-layer standards. Such standards, with different link widths or bit rates, will be developed from time to time. However, packet formats and higher level coherence protocols will be the same across all these physical implementations.
That should make the problem of interfacing one SCI system to another relatively simple—SCI already includes the necessary mechanisms to cope easily with speed differences. | |
language | English | |
title | IEEE 1596 | num |
title | Standard for Scalable Coherent Interface (SCI) - IEEE Computer Society Document | en |
type | standard | |
page | 255 | |
status | Active | |
tree | IEEE - The Institute of Electrical and Electronics Engineers, Inc.:;1992 | |
contenttype | fulltext |