http://www.inf.fu-berlin.de/lehre/WS94/RA/RISC-9.html
The RISC Concept - A Survey of Implementations
Authors: Margarita Esponda and Raúl Rojas
Institut für Informatik
Fachbereich Mathematik
Freie Universität Berlin
Takustr. 9, 14193 Berlin
Email: esponda@inf.fu-berlin.de, rojas@inf.fu-berlin.de
Technical report B-91-12
September 1991
includes fourteen pictures [todo: add links]
Abstract
Reduced Instruction Set Computers (RISC) have received much attention in the last few years. The RISC design philosophy has led to a profound re-evaluation of long held beliefs in the computer architecture community. Yet the precise definition of what "RISC design" really means is something which has been obscured by the unfounded claims of some microprocessor manufacturers and by the reductionist definitions found in the popular computer literature. In this paper we define RISC in a hierarchical manner, focusing the analysis on the essential features of this new architectural paradigm. Several RISC architectures are discussed and the relevant data is summarized with the help of Kiviat graphs. The closing section discusses possible future developments in the field of computer architecture.
Contents
1. Introduction
2. The confusion around the RISC concept
3. The RISC concept: a logical reconstruction
4. Comparing RISC with CISC
5. Taxonomy of RISC processors
6. Survey of features of commercial RISC processors
6.1 The MIPS series
6.2 The SPARC family
6.3 The IBM RS/6000
6.4 The Motorola 88000 family
6.5 Intel 860
6.6 Hewlett Packard's Precision Architecture
6.7 The Transputer - A RISC processor?
7. The success of RISC processors
8. Conclusions and the future of RISC
9. Literature
1. Introduction
There seems to be now an overwhelming case in favor of Reduced Instruction Set Computers (RISC) as high performance computing engines. RISC processors, first developed in the eighties, seem predestined to dominate the computer industry in the nineties and to relegate old microprocessor architectures into oblivion. Practically all important computer manufacturers are now offering some kind of RISC system. Computer giants like IBM or Hewlett Packard went to great lengths in order to develop their own RISC processors. Others, like DEC or Siemens, preferred to license one of the already existing designs in order to keep up with the new performance race of the nineties. Yet the current widespread support for the RISC concept was still being put in doubt as recently as 1986, when it was still not completely clear that RISC could outperform Complex Instruction Set Computer (CISC) systems in the general purpose marketplace [Moad 1986]. Just five years later it looks as if the discussion has been closed.
But what does RISC mean? What are the essential features of this new approach to computer architecture? Asking these questions could seem superfluous, but it is not so. As a matter of fact, there is a widespread misunderstanding of what RISC really means and of the way in which the new processors are capable of reaching performance levels previously reserved for much larger systems. The acronym of the new technology is already reductionist: "RISC" is generally interpreted as meaning that a processor should implement only a small instruction set capable of running faster than in traditional designs. Processors with fewer than 100 instructions are qualified in some popular computer journals as being RISC just because of this fact. Microprocessor manufacturers have also contributed to the general confusion by calling old CISC processors RISC designs and by asserting that they are now building them with "RISC concepts" or with a "RISC kernel" [Crawford 1990]. But as we will see in this survey, some of the reputed RISC designs do not correspond to the general characteristics that should be associated with a RISC processor.
In this paper we try to elucidate first of all what is meant when we speak of RISC systems. This is not a purely semantic exercise. Understanding the basic tenets of the RISC design philosophy makes it possible to find out where the performance advantage of the new processors comes from and, more importantly, what type of new features could be expected in the future. We then proceed to consider some of the more publicized RISC or "RISCy" designs and we summarize their characteristics with the help of Kiviat graphs, a graphical tool developed for performance measurement studies of computer systems [Ferrari/Serazzi/Zeigner 1983]. In the last part of this survey we look at the present market penetration of RISC processors and we also consider some of the possible future development paths.
2. The confusion around the RISC concept
The motivation for the design of RISC processors arose from technological developments which gradually changed the architectural parameters traditionally used in the computer industry. Patterson [1985] has already given a detailed account of the prehistory of RISC.
At the abstract architectural level the general trend until the middle of the seventies was the design of ever richer instruction sets which could take some of the burden of interpreting high level computer languages from the compiler to the hardware. The philosophy of the time was to build machines which could diminish the semantic gap between high level languages and the machine language. Many special instructions were included in the instruction set in order to improve the performance of some operations and several machine instructions looked almost like their high-level counterparts. If anything was to be avoided it was, first of all, compiler complexity.
At the implementation level, microcoding provided a general method of implementing increasingly complex instruction sets using a fair amount of hardware. Microcoding also made it possible to develop families of compatible computers which differed only in the underlying technology and performance level, as in the case of the IBM/360 system.
The metrics used to assess the quality of a design corresponded directly to these two architectural levels: the first metric was code density, i.e., the length of compiled programs; the second metric was compiler complexity. Code density should be maximized, compiler complexity should be minimized. Not very long ago Wirth [1986] was still analyzing some microprocessor architectures based exactly on these criteria and denouncing them for being "halfheartedly high-level language oriented."
There were good reasons for microcoded designs in the past. Memory was slow and expensive - therefore compact code was required. There was a need for instructions of high encoded semantic content which could keep the processor running at full speed with a minimum of instruction fetches. Microcode also had an additional advantage: it could be changed in different models of the same computer family, allowing for increased parallel execution of individual instructions in the high end of the family. The transition from the use of core memory (with typical cycle times 10 times slower than semiconductor memory) to the dynamic and static memory chips used now eliminated one of the advantages of microprogramming. Microprograms and real programs could be stored in the same kind of devices with comparable access times. The introduction of cache memories in the early seventies altered the equation again in favor of external programming against microprogramming [Bell 1986].
One of the fundamental elements in the performance equation was still the instruction set used. IBM, DEC and other companies had installed thousands of machines by the seventies and compatibility was the really important issue of every new processor release. The users of IBM products were locked-in with this company due to their high software investment, but IBM was also locked-in with their old abstract computer architecture and instruction set, which still survives today after 26 years of having been introduced!
It is surprising that the winds of innovation first blew inside IBM. The project which is now recognized as the first pioneering RISC architecture was started in 1975 at the IBM Research Center in Yorktown Heights, N.Y. A small computer system, which was originally intended to control a telephone exchange system, evolved into a minicomputer design which challenged the traditional computer architecture wisdom [Hopkins 1987]. John Cocke, an IBM fellow, had noticed that only a small subset of the IBM/360 instruction set was used most of the time and it was this subset which had the biggest impact on execution time. Cocke and his colleagues set themselves the goal of simplifying the instruction set in order to achieve an execution time of one cycle per instruction on average. This objective could only be achieved if instruction execution was pipelined, masking in this way the cycles used for fetching and decoding of the instructions.
Two projects which started some years later finally brought RISC concepts into the mainstream of computer architecture. The first one was led by David Patterson at the University of California at Berkeley and culminated in the definition of the RISC-I and RISC-II processors at the beginning of the eighties. Patterson also coined the RISC acronym. John Hennessy simultaneously led the MIPS project at Stanford, which evolved into a commercial venture some years later. Figure 1 shows a chronology of the RISC processors that will be discussed in this survey.
According to Patterson [1985] RISC processors inaugurated a new set of architectural design principles. Because of this, RISC has been called more a philosophy than a particular architectural recipe. The relevant points of this design philosophy mentioned by Patterson are:
- The instruction set must be kept simple
- Instructions must run at the fastest possible rate (without intermediate interpreting levels like microcode)
- Pipelining is more important than program size
- Compiler technology is a critical ingredient of RISC designs: optimizing compilers must move as much complexity as possible from the hardware into the compilation phase.
(Figure 1)
In this informal account by Patterson there is no clear hierarchy among these four different objectives. Every one of them seems to be equally important for a definition of RISC. We will see in the next section that assuming a clear hierarchy which puts pipelining at the center of the design work leads effortlessly to a listing of all relevant RISC traits.
When RISC is understood as just the name of a bundle of architectural features for processors, the most frequently mentioned are:
- 1) small instruction set
- 2) load/store architecture
- 3) fixed length coding and hardware decoding
- 4) large register set
- 5) delayed branching
- 6) processor throughput of one instruction per cycle on average
The difference between RISC as design philosophy and RISC as a bundle of features is something which remains obscure in the popular computer literature. There is no clear view of the interdependence of the diverse features. Processor throughput, for example, is a dependent variable of decoding time, but not the other way around. We already mentioned that in most cases RISC is understood as meaning just a "small" instruction set. In this spirit some authors have claimed that the first RISC machine was the PDP-8 with only eight basic instructions, and there is also talk of an "ultimate RISC" machine with an instruction set of only one instruction.
There is obviously a widespread misconception of what RISC means and of the reasons for the greater performance of RISC processors. RISC does not mean going "back to the future" (as Gordon Bell [1986] once ironically asked) if that means going back to the old designs. The essence of RISC is constructing parallel machines with a sequential instruction stream. RISC designs exploit instruction level parallelism and the distinguishing feature is an instruction set optimized for a highly regular pipeline flow. This point has not been perceived clearly outside the computer architecture community and this survey tries to elucidate it as its first task. When the essence of RISC has been understood, the absurdity of the claim that the PDP-8 was the first RISC machine becomes obvious. It is also possible to evaluate the claims of microprocessor manufacturers who nowadays speak of their own CISC processors as camouflaged RISC engines. Although the essence of RISC is parallelism, RISC surveys have systematically avoided giving empirical data on the effective level of pipelining achieved with the old and the new architectures [Gimarc/Milutinovic 1987, Horster et al 1986].
3. The RISC concept: a logical reconstruction
Parallel computers seem to be the promise of the future, yet there are few who pause to realize that they are the computer systems that we are using now. The sequential processor belongs to the past of computer technology and today it is used only in small systems or special controllers. The main parallelising method used by modern processors is pipelining.
Uniprocessor systems get their instructions from the main memory in a sequential fashion, but they overlap several phases of the execution path of the received instructions. The execution path of an instruction is the sequence of operations which each instruction must go through in the processor. The phases in the execution path are typically: instruction fetch, decode, operand fetch, ALU execution, memory access and write back of the operation results. In some processors the chain of phases in the execution path can be subdivided still more finely. Others use a coarser subdivision in only three stages (fetch, decode, execute). The number of stages in the execution path is an architectural feature which can be changed according to the intended exploitation of instruction level parallelism.
Pipelining is just the overlapped execution of the different phases of the execution path. Figure 2 shows how a pipeline of depth three is started. It begins by fetching instruction i in the first cycle. In the second cycle instruction i is decoded and instruction i+1 is fetched. In the third cycle instruction i+2 is fetched, instruction i+1 is decoded and instruction i is executed. The pipeline is then full and if it remains so, turning out one instruction execution per cycle, the processor works as a parallel processor capable of speeding up execution by a factor of three. We have now in fact a parallel processor disguised as a sequential one.
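To make the overlap concrete, here is a small Python sketch of our own (not part of the original report) that prints which instruction occupies each stage of a three-stage pipeline in every cycle; the stage names are the generic fetch/decode/execute phases mentioned above.

```python
# Sketch: filling a three-stage pipeline and showing the overlapped phases.
STAGES = ["fetch", "decode", "execute"]

def simulate(num_instructions, depth=3):
    """Print which instruction occupies each pipeline stage in every cycle."""
    total_cycles = num_instructions + depth - 1
    for cycle in range(total_cycles):
        active = []
        for stage in range(depth):
            instr = cycle - stage              # instruction index in this stage
            if 0 <= instr < num_instructions:
                active.append(f"{STAGES[stage]}(i{instr})")
        print(f"cycle {cycle + 1}: " + ", ".join(active))

simulate(5)
# From cycle 3 onward the pipeline is full: one instruction completes per
# cycle, although each individual instruction still needs three cycles.
```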
In real systems there are many reasons for the regular pipeline flow to be interrupted systematically. The penalty for these disruptions is paid in the form of lost or stalled pipeline cycles. The effective parallelism exploited by traditional CISC microprocessors (like the 68030 or Intel 80286) is rarely larger than a factor of 2, and more likely to be near a factor of 1.5. This means that old CISC microprocessors offer a very limited form of instruction level parallelism.
(Figure 2)
The main difference between RISC and CISC is that the instruction set of the first kind of processors was explicitly designed to allow the sustained execution of instructions in one cycle on average. CISC processors (in mainframes) can also approach this objective, but only at the expense of much more hardware logic capable of reproducing what RISC processors achieve through a streamlined design. Some RISC processors, like the SPARC, achieve a sustained speedup of 2.8 running real applications. This means that the SPARC is a parallel engine capable of working on about three instructions simultaneously. Other RISC processors offer similar performance.
The "official" definition of RISC processors should thus be: processors withan instruction set whose individual instructions can be executed in onecycle exploiting pipelining. Pipelined supercomputers and large mainframeshave used pipelining intensively for years, but in a radically different wayas RISC processors [Hwang/Briggs 1985]. In IBM mainframes, for example, theinstruction set was given by "tradition" and pipelining was implemented inspite of an instruction set which was not designed for it. Of course thereare ways to accommodate pipelining, but at a much higher cost. This is thereason why other pipelined mainframes, like the CDC/6600, are seen as theprecursors of RISC machines rather than the IBM/360 behemoths.
In summary: taking pipelining as the starting point, it is easy to deduce all other features of RISC processors. The fundamental question is: what is needed in order to maintain a regular pipeline flow in the processor? The following RISC features constitute the answer:
a) Regular pipeline phases and deep pipelines
First of all the logical levels of the processing pipeline must be defined and each one must be balanced against the others [Hennessy/Patterson 1990]. Going through each pipeline stage must take the same time and all the work done in the execution path should be distributed in the most uniform way. Each pipeline stage takes a complete clock cycle. Typical processors use a clock cycle time at least as large as the time it takes to perform one typical ALU operation. In a processor with a 20 MHz clock rate each cycle lasts 50 nanoseconds. Using standard CMOS technology in the logic components, this is equivalent to about 10 logic levels (each logic level has a delay of 5 ns). It is clear that this restriction imposes a heavy burden on the designer of microprocessors. In each stage of the pipeline a maximum of 10 logic levels can be traversed. The computer architect must try to parallelise each one of the phases internally in order to use a minimum of logic levels. This is easier if the pipeline phases are correctly balanced and if they are as independent from each other as possible, so as not to have to handle signals running from one stage to the other. Typical RISC processors go beyond the classical three level pipeline and use pipelines with four, five or six levels. A deeper pipeline means more potential parallelism but also more coordination problems. We return to this problem later.
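The logic-level budget above follows from simple arithmetic; the snippet below is our own back-of-the-envelope restatement of it (the 5 ns CMOS gate delay is the figure assumed in the text).

```python
# Rough logic-level budget per pipeline stage (illustrative values from the text).
clock_mhz = 20                          # processor clock rate
gate_delay_ns = 5                       # assumed delay of one CMOS logic level
cycle_ns = 1000.0 / clock_mhz           # 50 ns cycle time at 20 MHz
levels_per_stage = cycle_ns / gate_delay_ns
print(cycle_ns, levels_per_stage)       # -> 50.0 ns, 10.0 logic levels per stage
```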
b) Fixed instruction length
In CISC processors, like the VAX, instructions are of variable length and several words have to be fetched until the whole instruction can be completely decoded. This introduces a variable element in the duration of the fetch stage which can stall the pipeline if the decoding stage is waiting for an instruction. Large processors avoid this problem with a prefetch buffer which can store many instructions of the sequential stream. CISC microprocessors also use small prefetch buffers or several words of instruction cache, as is the case with the Motorola 68020.
The simplest technique for avoiding a variable fetch time is to encode each instruction using a fixed one word format. The fetch stage has in this way a fixed duration and one instruction can be issued each cycle to the decoding stage under normal pipeline flow (the branching problem is considered below). The decoding stage does not need to request additional instruction bytes according to the encoding of the instruction and there is no need for any additional control lines between the fetch and decode stages.
c) Hardwired decoding
A fixed instruction format also makes the decoding of instructions easier. Typical RISC processors reserve 6 bits out of 32 for the opcode of the instruction (which makes it possible to encode 64 instructions). The operands and the result are typically held in registers. Each argument is encoded using, for example, 5 bits. Thirty-two registers can be referenced in this way. Decoding of the opcode and access to the register operands can be done simultaneously, which is a very important feature if the operands are to be ready for execution in the next cycle. Figure 3 shows the encoding format of the MIPS processor, a typical RISC engine.
(Figure 3)
Note that in case one of the operands is a constant (that must be stored in or added to a register) it is encoded using an overlapped format. This poses no problem for the decoder, because this constant can be decoded simultaneously with the access to the argument registers. One register too many will be read, but this intermediate read can be discarded without losing any cycles. As can be seen, decoding of a fixed instruction format can be done in parallel in a clock cycle.
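As an illustration of how cheap such decoding is, the following sketch (ours; the exact field layout is only meant to resemble the MIPS-like format described above, with a 6-bit opcode, two 5-bit register fields and an overlapped 16-bit constant) extracts all fields of a 32-bit word with a few shifts and masks, something hardware can do in parallel.

```python
# Decoding a fixed-format 32-bit instruction word with shifts and masks.
# Assumed layout (MIPS-like, illustrative): opcode 6 | rs 5 | rt 5 | immediate 16.
def decode(word):
    opcode    = (word >> 26) & 0x3F     # 6 bits -> up to 64 instructions
    rs        = (word >> 21) & 0x1F     # 5 bits -> one of 32 registers
    rt        = (word >> 16) & 0x1F
    immediate =  word        & 0xFFFF   # overlapped constant field
    return opcode, rs, rt, immediate

print(decode(0x20490005))   # a word resembling "addi r9, r2, 5" in this layout
```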
d) Register to register operations
The execution phase of an instruction should also take at most one clock cycle whenever possible. Arithmetical instructions which access operands in memory do not fulfill this condition because the long latency of memory accesses keeps the ALU waiting several cycles. Register to register operations avoid this inconvenience. This kind of instruction can almost always be executed in one cycle using the 10 levels of logic available in a pipeline stage of a 20 MHz processor. Instructions like integer multiply or divide can be directly implemented in the ALU, but they take several cycles to complete and they inevitably stall the pipeline. Some RISC processors, like the SPARC, do not directly implement multiply and divide. The corresponding routines have to be implemented in software. CISC processors, like the VAX or the 68020, admit register-to-memory operations which have a long latency and introduce large pipeline "bubbles."
e) Load/store architecture
If all operands for arithmetic and logical operations are located in registers, it is obvious that these registers have to be loaded first with the necessary data. This is done in RISC processors using a "load" instruction, which can access bytes, halfwords or complete words. A "store" instruction transfers the contents of registers to memory.
Without special measures the processor must wait after each load instruction for the memory to deliver the desired data - the pipeline stalls. RISC processors avoid this problem using a "delayed" load. The load instruction is executed in one cycle but the result of the load is made available only one or more cycles later. This means that the instruction following the load must avoid using the register being loaded as one of its arguments. In most cases this condition can be enforced by the compiler, which tries to reschedule the instructions so that the load does not have to stop the pipeline. When this rescheduling is not possible, the load stalls the pipeline for as many cycles as the main memory or cache takes to respond.
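The toy Python pass below (our illustration, not any production compiler) shows the kind of rescheduling meant here: when the instruction right after a load needs the loaded register, it tries to move a later, independent instruction into that slot and falls back to a NOP otherwise.

```python
# Toy load-delay-slot scheduler; instructions are (op, dest, sources) tuples.
def schedule_load_delay(prog):
    out = list(prog)
    i = 0
    while i < len(out) - 1:
        op, dest, srcs = out[i]
        nxt = out[i + 1]
        if op == "load" and dest in nxt[2]:        # next instruction needs the load
            for j in range(i + 2, len(out)):
                op_j, dest_j, srcs_j = out[j]
                # the candidate must not read the loaded register, clobber it,
                # or produce a value the dependent instruction reads
                if dest not in srcs_j and dest_j != dest and dest_j not in nxt[2]:
                    out.insert(i + 1, out.pop(j))  # fill the delay slot
                    break
            else:
                out.insert(i + 1, ("nop", None, ()))   # nothing movable: pad
        i += 1
    return out

prog = [("load", "r1", ("mem",)),
        ("add",  "r2", ("r1", "r3")),   # depends on the load
        ("sub",  "r4", ("r5", "r6"))]   # independent: can fill the slot
print(schedule_load_delay(prog))        # load, sub, add
```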
f) Delayed branching
The most complex hazard menacing the uninterrupted pipeline flow is branching. Instructions are fetched sequentially but a taken branch can alter the sequential flow of instructions. After a taken branch a new instruction located at the branch target has to be fetched and the pipeline has to be flushed of now irrelevant instructions. Statistics of real programs have shown that 15% of all instructions for some processors can be branches [Hennessy/Patterson 1990]. Around half of the forward going branches and 90% of the backward going branches are taken. This amounts to many lost pipeline cycles in typical CISC processors, which flush the pipeline after each taken branch.
RISC processors use other strategies. First of all, the branching decision is made very early in the execution path - possibly already in the decode stage. This can be done only if the branching condition tests are very simple, like for example a register compare with zero or a condition flag test. At the end of the decode phase the processor can start fetching instructions from the new target. But in this decode cycle the next instruction after the branch has already been fetched. In order to avoid stall cycles this instruction can be executed. In this case the branch is a delayed branch. From the programmer's point of view the branch is postponed until after the next instruction is executed. The compiler tries to schedule a useful instruction in the location after the branch, which is called the "delay slot." Some RISC processors with very deep pipelines schedule up to two delay slots [McFarling/Hennessy 1986]. More delay slots make the scheduling of useful instructions increasingly complicated and in many cases the compiler ends up writing NOPs in them.
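A tiny interpreter sketch of our own may help to visualize the semantics: after a taken branch the instruction already sitting in the delay slot still executes before control transfers (the instruction names below are invented).

```python
# Toy interpreter with delayed-branch semantics: the instruction after a taken
# branch (the delay slot) still executes before the branch takes effect.
def run(prog, regs):
    pc, pending_target = 0, None
    while pc < len(prog):
        op, *args = prog[pc]
        next_pc = pc + 1
        if pending_target is not None:           # we are executing the delay slot
            next_pc, pending_target = pending_target, None
        if op == "add":
            d, a, b = args
            regs[d] = regs[a] + regs[b]
        elif op == "beqz":                       # branch if register is zero
            r, target = args
            if regs[r] == 0:
                pending_target = target          # takes effect only after the slot
        pc = next_pc
    return regs

prog = [("beqz", "r1", 3),            # taken branch to index 3 ...
        ("add", "r2", "r2", "r3"),    # ... but this delay-slot add still runs
        ("add", "r4", "r4", "r4"),    # skipped by the taken branch
        ("add", "r5", "r2", "r3")]
print(run(prog, {"r1": 0, "r2": 1, "r3": 10, "r4": 1, "r5": 0}))
```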
It must be said in fairness that delayed branching is not strictly a RISC innovation. This kind of branching was used before in microprograms but certainly not in macroinstruction sets.
Another technique borrowed from mainframes is the so called "zero cycle" branching. After each prefetch of a branch, special hardware tries to predict if the branch will be taken or not. The next instruction is then prefetched from the predicted target address. In this case no delay slots are needed. If a special branching processor is included (like in the IBM RS/6000 RISC system) branches can be preprocessed and filtered out so that the arithmetical processor receives only a sequential instruction stream [Oehler/Groves 1990]. A good prediction strategy can keep the pipeline flowing almost without disruption.
g) Software scheduling and optimizing compilers
The interaction between delayed loads and delayed branching can be very complex. The whole benefit of a RISC architecture can be reaped only if the compiler is sophisticated enough to rearrange instructions in the optimal order. RISC architectures try to maximize the synergy between hardware and software. Optimizing compilers are thus not an optional feature of RISC systems but one of their essential components. C compilers especially have become sophisticated enough to outperform hand coding in assembly language. Our own programming experiments using a SPARC workstation brought a run time improvement of at most 3% with hand corrections to the assembly code of C programs. This is very different from the situation with traditional high level compilers for CISC machines, where hand coding can improve compiled code dramatically. Using the same benchmarks as with the SPARC workstation, we were able to speed up compiled code on a MicroVax by almost 100% using hand coding!
h) High memory bandwidth
If instructions are to be fetched, decoded and executed in one cycle steps, a huge memory bandwidth is required. Using a 20 MHz processor and dynamic RAM chips with 100 ns cycle time, some form of intermediate cache is needed, capable of delivering at least one word per cycle. RISC processors depend on a complex memory hierarchy in order to work at full speed. In most of them, separate data and instruction caches try to avoid contention for the system bus when a fetch is overlapped with a register load or store. For this reason most RISC processors include memory management components. A RISC processor without management of a memory hierarchy could hardly outperform a CISC processor, because the latter encodes much more semantic information in each instruction [Flynn et al 1987].
From the above discussion it should be clear that all of the discussed RISC features are part of a common strategy to guarantee an uninterrupted pipeline flow, and in this way, a high level of parallel execution of sequentially coded programs. Fixed word encoding, hardwired decoding, delayed loads, delayed branches, etc., are just ways to achieve a regular pipeline flow. Some of these features could disappear in future RISC designs (for example, in processors with zero cycle branching no delayed branches are necessary) or not be used in others (the floating point units of RISC processors are sometimes microcoded). The essential point will remain the exploitation of instruction level parallelism.
How much instruction level parallelism do typical programs contain? It is not possible to give a definite answer to this question, because it depends on the instruction set used. Instruction sets can be designed with the pipeline flow or with other objectives in mind. Reduced instruction sets have one clear objective: minimizing pipeline stalls, and for this reason they can exploit instruction level parallelism more intensively than CISC processors. There is widespread disagreement in the literature about the instruction level parallelism available in real programs. Some authors calculated in the seventies that a maximum speedup by a factor of 2 could be achieved using this form of parallelism. More recent results suggest that the available average parallelism could be as large as a factor of 5 [Wall 1991]. Other groups have reported experiments in which the available parallelism for processors with multiple execution units fluctuated between 2 and 5.8 instructions per cycle [Butler et al 1991]. With an unbounded machine size it was possible to achieve parallelising rates of 17 to 1165 instructions per cycle! More conservative estimates reckoned that normal pipelined processors were already using almost all of the available parallelism [Jouppi/Wall 1989]. Excessive pipelining can also reduce the overall performance in some cases [Smith/Johnson/Horowitz 1989]. More research is needed about this important problem before an upper limit for the available instruction level parallelism can be agreed upon.
4. Comparing RISC with CISC
There has been much discussion about the relative merits of CISC and RISC architectures. Some argue that many of the techniques used in RISC processors can also be translated to CISC designs. It is possible, for example, to rewire the processor in order to execute most of the simple instructions in one cycle. Or it is possible to use a pipelined microengine, as in the VAX, in order to speed up execution. The microengine could be thought of as a RISC kernel giving all the advantages of this paradigm without its disadvantages.
But the main problem remains unsolved: RISC features can be introduced in CISC processors only at the expense of much more hardware. It is possible, for example, to program the pipeline of a CISC processor to use the dead time between the load and store of one instruction argument in memory. The microengine works in this case following a load/store model, and it dynamically reschedules the operations needed by the macrocode. This dynamic rescheduling is too expensive compared to the software scheduling used in RISC processors. Software scheduling must be done only once and then it runs without complex hardware. Dynamic scheduling needs increasing amounts of logic.
CISC processors can still be made competitive with RISC processors if the cycle time is reduced. There are already prototypes of Intel 80386 microprocessors running at clock frequencies as high as 50 MHz. Such processors can outperform RISC designs running at a slower clock rate.
But RISC processors are better positioned to achieve greater reductions in the clock cycle time in the long run. The cycle time is determined by the following factors: pipelining depth, amount of logic in each stage and the VLSI technology used. If the first and third factors are fixed, it is the amount of logic, i.e., the number of logic levels in each pipeline stage, which determines the clock cycle time. It is much more difficult to reduce the number of logic levels in a complex design than in a simple one. RISC processors can achieve larger reductions in the clock cycle time with a lower investment in design time. Reducing the clock cycle time of CISC processors is not impossible, but much more difficult.
It is also easier for RISC processors to employ faster technologies. Emitter-Coupled Logic (ECL) gates, for example, have a lower delay than CMOS (2 ns instead of 5 ns). The problem is that they are much more power hungry. ECL circuits dissipate around 25 mW per gate, whereas CMOS circuits dissipate only 1 mW running at 20 MHz [Hamacher/Vranesic/Zaky 1990]. It is very difficult to build CISC processors in ECL technology due to the large number of transistors used. ECL chips are not able to dissipate all the power consumed by a CISC design. RISC processors, on the other hand, employ just a fraction of the transistors used by CISC designs. It is possible to build them in ECL technology with fewer technical problems and with a better turnaround time. This has already been done for the MIPS and SPARC series by some chip manufacturers [Brown 1990].
It is also very difficult to increase the pipelining depth in CISC processors. Using RISC technology, it is possible to think about superpipelined processors capable of working with a pipeline of eight or nine stages. This is something being investigated by the designers of the MIPS series.
In summary: the controversy surrounding CISC versus RISC designs cannot be settled just by looking at the present performance differences of the two technologies. If this were the case, then it should be admitted that CISC microprocessors have come nearer to the performance of RISC designs in the last two years [Hennessy 1990]. But the question is which design philosophy will be capable of climbing the performance ladder faster in the next few years. Here RISC designs appear as potentially much faster than CISC processors, which have already come close to their "physiological" limits, whereas RISC is still in its infancy.
5. Taxonomy of RISC processors
A compact but precise discussion of the features of commercial RISC processors presupposes some kind of classification method. A taxonomy of the most important aspects of the architecture is needed. In what follows we develop such a taxonomy considering the most relevant characteristics that should be taken into account when discussing RISC designs.
The simplest method to achieve this is to use a top-down approach, in which successive features are examined by focusing the attention on ever finer subsets of the computer architecture. Following this approach we come to the architectural characteristics discussed below.
Word width
The first important feature of the processor and memory ensemble is the word width used by the processor. Most current RISC processors use a 32 bit internal and external word width. This means that the integer registers, the address and data paths are restricted to this number of bits. There are nevertheless a few RISC processors which already use a partial 64 bit architecture. The Intel 860 processor, for example, has a bus control unit capable of reading or writing 64 bits simultaneously to memory. The IBM RS/6000 processor uses thirty-two 64 bit floating point registers. Probably the first full fledged 64 bit processor will be the MIPS R4000 processor, which could be announced in 1992.
Split or common cache
RISC processors need a cache between them and main memory. But this cache can be a common one, in which instructions and data are mixed, or it can be a split unit, in which two separate caches hold instructions and data respectively. The efficiency of both caching methods is very similar, but the split approach is used in many RISC designs.
On-chip or off-chip cache
Some RISC processors use an on-chip cache because it is faster to access, although it increases the chip complexity and therefore the chip area. Other processors were designed with an off-chip cache in mind (like the SPARC chip), in order to simplify the design of the integer unit. CISC processors, like the Intel 80486, use an on-chip cache in order to cut the performance advantage of RISC processors.
Harvard or Princeton architecture
In systems with a split cache it is possible to use separate data and address buses for each cache separately. In this case an instruction fetch can be handled in parallel with a data access. This is called a Harvard architecture. A Princeton architecture uses a common bus to access data and instruction cache. The Motorola 88000 employs a Harvard architecture, whereas the MIPS R3000 chip uses a Princeton architecture. The MIPS chip multiplexes the use of the common cache bus between the fetch unit and the data unit. It should be noticed that a Harvard architecture does not mean separate buses from the cache to main memory. From the processor to the two cache units two buses are used, but the cache units share a single bus to main memory.
Prefetch buffer
The instruction stream to the processor can be handled with an additional level in the memory hierarchy. Fast prefetch buffers can access the instruction cache sequentially in advance in order to hold several instructions ready to be consumed by the processor. This structure is called a prefetch buffer. Only a few RISC processors use prefetch buffers. The IBM RS/6000 is one of them. It works with a prefetch buffer capable of storing 4 instructions. This kind of buffer is very important for processors which try to achieve the maximal instruction issue rate.
Write buffer
The equivalent to prefetch buffers on the data stream side are write buffers. The processor does not have to wait until some data has been written to the cache. It just issues a write request to the write buffer and special hardware handles the request autonomously.
Coprocessor or multiple units architecture
This is one of the decisive classification criteria for RISC processors. A coprocessor architecture means that the instruction stream is analyzed concurrently by two or more processors (for example an integer processor and a floating point processor). Each processor takes the instructions that it can handle; the others interpret them as NOPs. In this way integer and floating point operations can be executed concurrently in two different processors. The processors can communicate through memory or through special control lines.
A multiple unit architecture means that there is a central decoding facility which starts execution units according to the instruction which has been decoded. The decoding unit, for example, can start an integer addition in the integer unit - one cycle later it can start the floating point multiplication unit, and so on.
The Motorola 88000 and the IBM RS/6000 use a multiple unit architecture, whereas the SPARC and MIPS chip sets use a coprocessor architecture.
Common register file or private registers
In a coprocessor architecture each processor handles its own registers and register interchange is managed through memory. In a multiple unit architecture there are two possibilities: a common register file can be accessed by all execution units or the execution units themselves can work with private registers. A combination of these two extremes is also possible. The Motorola 88000 is a processor with a common register file. The IBM RS/6000 uses private registers in its execution units.
Width and number of internal data paths
The performance of execution units can be enhanced by using more and wider datapaths in the internal architecture of a processor. It makes a performance difference whether 64 bits have to be transferred from the registers in one step or in two 32 bit steps. Two write-back paths to the register file are better than one, mainly in processors with multiple units.
Condition codes
Control of execution flow has traditionally been achieved through the use of condition bits which are set as a side effect of some arithmetical or logical operations. Several RISC processors set condition bits explicitly in one of the general purpose registers. This register can then be tested by the branching instruction. This strategy avoids the problems associated with a long pipeline in which it is not completely clear which instruction last changed the condition codes. IBM solved this problem by multiplying the number of condition bits: up to ten sets of condition codes are available in the IBM RS/6000.
Register renaming and scoreboarding
In RISC processors the management of the register file is an essential feature. There are three different ways to solve the scheduling problem for the usage of registers: the first solution is to schedule registers in software and to avoid collisions through a sophisticated compile time analysis. The second solution relies on the help of a special hardware "scoreboard" that tracks the usage and availability of registers. Whenever a register which is not yet free is requested, the scoreboard locks the request until the register is available. The third solution comes from the mainframe world and was implemented by IBM in the RS/6000 processor: registers are dynamically renamed by the hardware. If two instructions need register R2 to generate a temporary result, one of the two gets access to this register and the other to a "copy" of R2. The results are calculated and the real R2 is updated according to the sequential order of the calling instructions. A full explanation of this technique can be found in the book of Hennessy and Patterson [1990].
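A minimal sketch of the second approach, written by us purely for illustration (the issue/complete interface and the register count are assumptions, not any particular processor's design):

```python
# Minimal register scoreboard: an instruction may issue only when its source
# and destination registers are not still being produced by earlier instructions.
class Scoreboard:
    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs      # True while a result is still pending

    def can_issue(self, dest, sources):
        return not self.busy[dest] and not any(self.busy[r] for r in sources)

    def issue(self, dest, sources):
        if not self.can_issue(dest, sources):
            return False                    # the hardware would stall the pipeline here
        self.busy[dest] = True              # lock the destination register
        return True

    def complete(self, dest):
        self.busy[dest] = False             # result written back, register free again

sb = Scoreboard()
print(sb.issue(2, [1, 3]))    # True: r2 <- r1 op r3 issues
print(sb.issue(4, [2, 5]))    # False: r2 still pending, the consumer must wait
sb.complete(2)
print(sb.issue(4, [2, 5]))    # True: r2 written back, the dependent instruction issues
```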
Pipelining depth of multiple units
In chips with multiple units an important parameter is the pipeline depth of each unit. Floating point units are implemented with a deeper pipeline, taking into account the longer latency of floating point operations. An important question is how the pipelines of different depth are coordinated so as to avoid collisions at the exit of the pipelines, when more than one unit could try to access the register file.
Chaining
Another important question is whether the output of execution units is to be directly connected to the input of other execution units. If this is the case, something similar to the so called "chaining" of vector processors is available. The multiplier, for example, can be directly connected to an adder and in this way the inner product of two vectors can be calculated extremely fast.
Multiple purpose architecture
The last architectural feature of interest is whether the processor being considered exhibits a general purpose architecture or not. A general purpose chip needs to implement interrupts and protection levels and uses a memory management unit. Almost all RISC processors provide these features. The ones that do not provide them have been designed for embedded applications or for simple multiprocessing nodes (like the Transputer).
After this summary of architectural features the structure of real computers can be discussed.
6. Survey of features of commercial RISC processors
In this section we review some of the most important and popular RISC processors. We limit ourselves to summarizing the relevant features of each design. We have also drawn for each processor the corresponding Kiviat graph. This type of graphical representation has been used in other architectural studies [Siewiorek/Bell/Newell 1985] and in many fields in which the representation of several dimensions of data must be handled in just two dimensions. In doing this we tried to make the design of the Kiviat graph as expressive as possible in order to facilitate the comparison of different kinds of processors. It is well known that a graphical approach can be superior to complicated tables when several data dimensions are involved [Tufte 1990].
The variables considered in the comparison of processors are the following: number of pipeline stages, number of addressing modes, number of instructions, method of branch handling, average CPI according to some authors, number of registers, instruction length (fixed or variable) and levels of decoding (one level for hardware decoding, two for microcode, and three for micro plus nanocode). The circle meets the points on the different data axes that could be considered "typical" RISC values. A pipelining depth of four stages, for example, could be considered a normal feature of RISC technology. More pipelining makes the processor potentially faster if the other associated features have adequate values. One single addressing mode is normally associated with a load/store architecture. Several RISC processors use just 6 bits for the encoding of the opcode: this means that only 64 instructions can be encoded. One delayed branch slot could be considered normal in most RISC designs, but there are other alternatives. The IBM RS/6000, for example, uses a powerful branch handling method superior on average to delayed branching, but which is also more hardware intensive. Thirty-two registers are typical for most RISC designs.
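Since the original figures are not reproduced here, the following matplotlib sketch (our own, with invented placeholder values normalized so that 1.0 corresponds to the "typical RISC" circle) shows one way such a Kiviat graph can be drawn over these eight axes.

```python
# Illustrative Kiviat (radar) graph; the values below are placeholders only,
# normalized so that 1.0 corresponds to the "typical RISC" circle.
import numpy as np
import matplotlib.pyplot as plt

axes_labels = ["pipeline stages", "addressing modes", "instructions",
               "branch handling", "CPI", "registers",
               "instruction length", "decoding levels"]
values = [1.25, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0]   # invented example data

angles = np.linspace(0, 2 * np.pi, len(values), endpoint=False)
angles = np.concatenate([angles, angles[:1]])        # close the polygon
values = values + values[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=1.5)                 # processor profile
ax.plot(angles, [1.0] * len(angles), linestyle="--")   # "typical RISC" circle
ax.set_xticks(angles[:-1])
ax.set_xticklabels(axes_labels, fontsize=8)
ax.set_yticklabels([])
plt.show()
```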
With this information in mind we can look now at several commercial RISC processors.
6.1 The MIPS series
The commercial MIPS processor (R2000 or R3000, which differ in the clock rate and implementation but not in the main architectural features) is a spin-off from the experimental designs made at Stanford University in the early eighties. The acronym "MIPS" clearly reveals the design philosophy which was applied: MIPS stands for Microprocessor without Interlocked Pipeline Stages. The objective of the MIPS designers was to produce a RISC processor with deep pipelining and pipeline interlocking controlled by software. If one instruction requires two cycles to complete, it is the duty of the compiler to schedule one NOP instruction following it. In this way the only pipeline bubbles which arise during execution are the NOPs scheduled by the software, and the hardware does not have to stop the pipeline every now and then. This reduces the amount of hardware needed in the processor [Thurner 1990].
Some other interesting concepts were explored at Stanford with MIPS-X, a derivative of the MIPS architecture with additional features [Chow/Horowitz 1987]. Many of them were later adopted in the commercial MIPS processor.
The MIPS R2000 is a 32 bit processor with an off-chip split cache for instructions and data. A write buffer handles all data writes to memory. The R2000 uses a common bus to the external caches - it is a non-Harvard architecture. The MIPS chip set follows a radical coprocessor architecture. The integer CPU is separated from the so called System Control Coprocessor, which is an on-chip cache control. The CPU and floating point unit communicate through memory. There are 32 general purpose integer registers and 16 separate 64 bit floating point registers. The floating point coprocessor contains an add, a divide, and a multiply unit. There are no condition code bits and no scoreboard. Register scheduling is managed by the software [Kane 1987].
Figure 4 shows that the MIPS series approaches the typical RISC circle very closely. The integer pipeline has a depth of five stages and the floating point pipeline a maximal depth of six stages. The Cycles per Instruction (CPI) reported by some studies is 1.7 [Bhandarkar/Clark 1991]. For the ECL version, the R6000, the reported CPI is 1.2 [Haas 1990].
The MIPS processors have only one addressing mode. The compiler optimizes the allocation of registers in order to fully exploit the register file. This is not as efficient as register windows, but the MIPS compiler does a good job at eliminating unnecessary register loads and stores [Cmelik/Kong/Ditzel/Kelly 1991].
The total number of instructions is bounded by the six bits available for the opcode (64 instructions). The processor uses delayed branching with one delay slot.
The processor is fully hardwired, including the floating point unit. The low gate count of the MIPS design also made it a good target for faster chip technology and one ECL processor is already being offered. It was also targeted for a GaAs implementation.
From the data shown it follows that the MIPS series is one of the cleanest RISC designs being offered at the time of this writing [Gross et al 1988].
(Figure 4)
6.2 The SPARC family
The SPARC (Scalable Processor Architecture) can claim to descend from an illustrious lineage. SPARC was derived from the RISC-I and RISC-II processors developed at the University of California at Berkeley in the early eighties. The architecture was defined by Sun Microsystems but it is not a proprietary design. Any interested semiconductor company can get a license to build a SPARC processor in any desired technology. In what follows the design parameters of the Cypress SPARC chips are discussed [Cypress 1990].
The SPARC is a 32 bit processor with an off-chip common cache. Three chips provide the functionality needed: one for the integer unit, one for the floating point unit, and another works as a cache controller and memory management unit. The SPARC design follows the coprocessor architectural paradigm. Floating point unit and integer unit exchange information through memory and through some control lines. There is no prefetch buffer. A common integer register file with two read and one write port is used. The floating point unit provides 32 registers 32 bits wide. Instructions are decoded in parallel by the integer and floating point unit. Floating point instructions are then started when the integer unit sets a control line. Condition codes are used and no scoreboard is available to control the scheduling of registers.
Figure 5 shows that SPARC is also a typical RISC oriented design. There are just two peculiarities that set it apart from other RISC processors. First of all, the SPARC uses the concept of "register windows" in order to eliminate the loads and stores to a stack associated with procedure calls. Instead of pushing arguments onto a stack in memory, the calling procedure copies registers from one register window to the next. Register windows are a hardware oriented method to optimize register allocation. Some critics of register windows point out that the same benefits can be obtained by scheduling registers at compile time. The Berkeley team used register windows because they lacked the compiler expertise needed to implement interprocedural register allocation, as they later pointed out themselves.
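The following toy Python model (ours; the 8 in / 8 local / 8 out window geometry mirrors the SPARC-style scheme described here, everything else is a simplification) illustrates the mechanism: on a call the window pointer slides so that the caller's "out" registers become the callee's "in" registers, and arguments move without any memory traffic.

```python
# Toy model of overlapping register windows: on a call the caller's "out"
# registers become the callee's "in" registers, so arguments need no memory traffic.
class WindowedRegisterFile:
    def __init__(self, windows=8, step=16):
        self.regs = [0] * (windows * step + 8)   # flat physical register file
        self.base = 0                            # current window pointer

    def _index(self, name):
        kind, n = name[0], int(name[1:])         # e.g. "o3" -> ("o", 3)
        offset = {"i": 0, "l": 8, "o": 16}[kind] # in / local / out banks
        return self.base + offset + n

    def read(self, name):
        return self.regs[self._index(name)]

    def write(self, name, value):
        self.regs[self._index(name)] = value

    def call(self):
        self.base += 16                          # caller's outs overlap callee's ins

    def ret(self):
        self.base -= 16

rf = WindowedRegisterFile()
rf.write("o0", 42)      # caller places an argument in an "out" register
rf.call()
print(rf.read("i0"))    # callee sees the same value in its "in" register: 42
```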
Another peculiarity of the SPARC are its "tagged" instructions. Declarative languages like Lisp or Prolog make extensive use of tagged data types. The SPARC provides instructions which make it easier to handle a two bit tag in each word of memory [Cypress 1990]. This feature can speed up Lisp by some percentage points.
The CPI of the SPARC is 1.6, as confirmed by our own measurements. This is not significantly different from the CPI of the MIPS series. In all other architectural respects, the SPARC is very similar to the MIPS machine. Just the number of addressing modes is higher: two in the SPARC compared to just one in the MIPS processor.
(Figure 5)
6.3 The IBM RS/6000
The IBM RS/6000 or POWER architecture (Performance Optimization with Enhanced RISC) contains so many innovations compared to the MIPS and SPARC designs that it is difficult to say that it is still just another RISC processor. The IBM RS/6000 shares with older RISC designs the streamlined approach to pipelined execution. But the instruction set of the IBM processor is large and many special instructions have been provided in order to speed up execution. The POWER chip set is indeed an impressive computing engine.
The RS/6000 is a 32 bit processor. Split external caches are used. The processor follows a Harvard architecture with separate buses for instructions and data. The first surprise is the width of the instruction buffer: 128 bits are read in parallel and stored in a 4 word prefetch buffer. The data bus is 64 bits wide in order to read and store 64 bit floating point data in a single cycle.
The RS/6000 architecture is one of multiple units and consists of three main blocks: one for control and branching, one for integer operations and another for floating point. The branching unit tries to detect branches very early by parsing the prefetch buffer and trying to determine if the branch will be taken or not. The branching unit runs ahead of the other processing units and in many cases it can "absorb" the branch instruction, saving one pipeline slot. Because of this feature IBM names this technique "zero cycle branching" [Oehler/Groves 1990].
The floating point unit provides 32 registers 64 bits wide. The registers can be locked in order to control their utilization by concurrent floating point operations. One addition and one multiplication can be started concurrently. The processor is also capable of performing one multiply-and-add operation in four cycles. This capability is important for the calculation of the scalar product of vectors and other common mathematical functions. All floating point operations comply with the IEEE standard.
The Kiviat graph should be explained more carefully. There are in the IBM RS/6000 two different pipelines: one for the integer (called fixed point) unit and one for the floating point unit. The first two pipeline stages occur in the branching unit. The fixed point unit works with four additional stages and the floating point unit with six [Grohoski 1990]. Integer operations then go through six pipeline stages and floating point operations through eight. This is a level of pipelining uncommon in workstations. Other RISC processors do not employ such deeply pipelined floating point units.
The RS/6000 has one addressing mode and an additional autoincrement mode. The autoincrement mode is more typical of CISC processors, but it was included in the RS/6000 to gain some speed while trying to avoid compromising the pipeline flow [Hall/O'Brien 1991]. The additional addressing mode makes the hardware more complex.
The IBM RS/6000 has no delay slots because it does not need them. Its branching lookahead technique makes them irrelevant. The branching unit also owns special registers, one of which is used for iteration counting. With the help of this register the execution unit does not have to count the number of iterations in a FOR loop, and only serial code is passed over from the branching to the execution units.
The instruction length of the RS/6000 is fixed but some operations are handled in microcode (especially FP operations). There are ten sets of condition codes.
(Figure 6)
One important feature of the RS/6000 is the use of register renaming in the floating point unit. Through it the processor is able to do loop unrolling on the fly and achieves execution rates similar to those of vector processors.
The IBM RS/6000 is a superscalar machine because the execution of floating point and integer operations can be highly overlapped. In some benchmarks the IBM RS/6000 approaches a CPI of almost 1.1 and the geometric average of the CPI measured in 9 of the SPEC benchmarks is 1.6 [Stephens et al 1991].
The complexity of the IBM RS/6000 shows itself in the large number of transistors needed to implement the architecture: more than 2 million just for the logic! The extra memory required in the different units contributes another 4.8 million transistors, but most of them are the ones needed in the caches. This complexity makes it questionable whether the architecture can be scaled up to other technologies (like ECL) which dissipate more energy per gate.
6.4 The Motorola 88000 family
The 88100 processor, the first in the 88000 family, was launched in 1988 as the answer of Motorola to the burgeoning RISC designs [Hennessy/Patterson 1990]. The 88000 family sacrificed compatibility with the older 68000 family for performance. The Kiviat graph below shows the main features of the M88100.
The 88100 is a RISC processor with a 32 bit external and internal architecture. Split caches are handled off-chip by two separate 88200 cache management units. There are separate buses for instruction and data, i.e., the processor follows a Harvard architectural model. There is no prefetch buffer and the processor follows the multiple units approach. There is one integer unit and two floating point units (adder and multiplier). The register file is common to all units and contains 32 registers of 32 bits. Register 0 is hardwired to 0. Registers can contain integer or floating point data. Special function units could be implemented in later incarnations of the architecture. There are no condition codes: status information is handled in registers [Alsup 1990].
The M88100 uses three different addressing modes: register plus offset, register plus register, and register plus scaled register. The last two addressing modes provide easy access to arrays in memory.
The number of instructions is 51 and 12 of them are floating point instructions [Hamacher/Vranesic/Zaky 1990].
The processor uses delayed branches with one branch slot. Normal branches can also be used. Delayed load is also used: the instruction following a load to a register must wait one cycle to use this register. Two general purpose registers are concatenated when 64 bit floating point data is needed.
The 88100 does not have a full fledged scoreboard to control the usage of registers. Each register has instead an "in use" bit, which is set every time the register is waiting to be updated by an instruction which has been started. The processor checks this bit before starting other instructions which update the same register.
(Figure 7)
The processor works with fixed length instructions and hardware decoding. There are only four instruction formats, very similar to the formats of the MIPS R3000 processor. The number of pipeline stages is 4 for integer operations, a more or less typical value for RISC designs. The pipeline depth of the floating point adder is 4, which together with instruction fetch and decode gives a total pipeline depth of 6.
The Motorola architecture does not offer any other surprises: there are no register windows and no deviations from a pure RISC approach. The designers defined a linking convention which allows subroutines to pass parameters through registers, but this is not equivalent to register windows.
The next member of the family, the M88110, will adopt what Motorola calls a symmetric superscalar design and will handle branches with a special unit.
6.5 Intel 860
Intel developed the 80860 processor with embedded applications in mind. It was the first RISC chip of the semiconductor manufacturer and silicon area was not spared - more than one million transistors were used in the final design. The chip has not been a great market success.
(Figure 8)
The I860 is a 32 bit processor built with a Harvard architecture. The bus to the instruction cache is 32 bits wide, and the bus to the data cache is 128 bits wide, making it possible to access four words in parallel. The caches are located on-chip [Bodenkamp 1990].
The chip follows the multiple units paradigm and provides one floating point adder, one floating point multiplier and one special graphics unit. The "RISC core" contains thirty-two 32 bit registers and one ALU. A scoreboard controls the allocation of general purpose registers.
The floating point register file contains 30 registers of 32 bits, which can be used as fifteen 64-bit registers. The adder and multiplier units can be chained to speed up the multiply-add combination needed in linear algebra and graphics.
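The payoff of chaining is easiest to see in a dot product, whose inner loop is exactly one multiplication feeding one addition; the sketch below only shows this operation pattern, not the I860 pipelines themselves.

    # The inner loop of a dot product is a multiply immediately followed
    # by an add; with the adder and multiplier chained, the two operations
    # can overlap instead of running back to back.
    def dot(a, b):
        acc = 0.0
        for x, y in zip(a, b):
            acc += x * y        # the product feeds the adder directly when chained
        return acc

    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0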
The processor uses a fixed instruction format very similar to the MIPS format, decoding is hardwired, and only two addressing modes are provided. The number of instructions is bounded by the six bits provided for the operation code. Intel reports a CPI of 1.1, but it is more probable that the CPI lies around 1.6, the "typical" RISC CPI. The pipelines are not very deep: the floating point and integer pipelines have at most three stages, depending on the unit.
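The CPI figures quoted here are weighted averages over the executed instruction mix. The following small calculation shows how such a number is obtained, using an invented mix rather than measured I860 data.

    # Effective CPI as a weighted average over an (invented) instruction mix:
    # CPI = sum(frequency_i * cycles_i) over all instruction classes.
    mix = {                     # fraction of executed instructions, cycles each
        "alu":    (0.50, 1),
        "load":   (0.20, 2),    # includes an average stall
        "store":  (0.10, 1),
        "branch": (0.15, 2),
        "fp":     (0.05, 4),
    }

    cpi = sum(freq * cycles for freq, cycles in mix.values())
    print(round(cpi, 2))        # 1.5 with these assumed numbers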
The graphics unit provides some common operations needed to handle single pixels in computer graphics.
6.6 Hewlett Packard's Precision Architecture
When Hewlett-Packard charged its computer architects with designing a new processor architecture for the nineties, the goal was to provide a single type of machine for commercial and scientific applications across a large performance range. The new architecture unified the different product lines of HP and was much more powerful than the older machines.
The Precision Architecture (PA) is a RISC design which nevertheless exhibits many characteristics normally found only in larger systems. In this respect the PA is similar to the Power Architecture of IBM.
The Kiviat graph for the PA systems shows its most relevant features. The PA is a load/store architecture with fixed instruction length [Lee 1989]. The number of different instruction formats is larger than in other RISC machines: twelve different combinations of opcode and register or constant fields are possible in a single word (the SPARC and MIPS processors use only four different combinations).
(Figure 9)
The number of different addressing modes is basically two, with two additional modes supporting post- and premodification of an index register. This gives a total of four different addressing modes.
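Post- and premodification behave like the pointer idioms *p++ and *++p in C: the index register is updated as a side effect of the memory access. The sketch below gives the assumed semantics; it is not PA instruction syntax.

    # Illustrative semantics of index-register post- and pre-modification.
    def load_postmod(memory, base, index, delta):
        value = memory[base + index]        # use the address first ...
        return value, index + delta         # ... then update the index

    def load_premod(memory, base, index, delta):
        index = index + delta               # update the index first ...
        return memory[base + index], index  # ... then use the new address

    mem = {100: 'a', 104: 'b', 108: 'c'}
    v, i = load_postmod(mem, 100, 0, 4)     # reads mem[100], index becomes 4
    print(v, i)                             # a 4
    v, i = load_premod(mem, 100, i, 4)      # index becomes 8, reads mem[108]
    print(v, i)                             # c 8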
The opcode field of the PA is six bits wide. This limits the number of possible instructions to 64 at most (although several instructions are offered in variants selected by special bits in the instruction format).
Delayed branches with one slot are used in the PA. The delay slot instruction can be cancelled according to the result of the branch decision.
The number of general purpose registers in the PA is 32. Thirty-two additional special purpose registers are also used to manage interrupts, protection levels, etc.
Some of the above data show that the PA is not a typical RISC design. The most atypical feature, however, is the low degree of pipelining of the first processors offered. Just three pipeline stages are used [Lee 1989], although newer designs can employ a deeper pipeline. The pipeline implements interlocks in hardware, but optimal pipeline flow still requires software scheduling.
The PA tries to achieve a low CPI with superscalar techniques, that is, through the simultaneous execution of scalar and floating point operations. The number of floating point units can vary from one PA machine to another.
In its quest for a low CPI, HP's Precision Architecture employs much more hardware than pure RISC designs. The PA philosophy is nearer to that of the IBM RS/6000 than to pure RISC concepts.
6.7 The Transputer - A RISC processor?
There has been much discussion about the correct classification of the Transputer chip from Inmos. The designers of the Transputer claim it to be a RISC design. They adduce as proof that many instructions can be executed in one cycle and that there are just 16 basic instructions.
The Kiviat graph tells another story. The Transputer is architecturally nearer to CISC processors like the 68020 than to the RISC designs. There are certainly some RISCy features in the Transputer, but they cannot obscure the main facts.
The Transputer is a 32 bit processor with a non-Harvard architecture. The Inmos implementation does not provide a cache, although there is an internal on-chip memory which must be explicitly addressed. The Transputer follows the coprocessor paradigm: the floating point coprocessor operates with its private registers and the integer unit implements a stack model. A unique feature of the Transputer is its four serial links, which make it possible to connect arrays of Transputers with little additional hardware.
(Figure 10)
The number of different instructions in the Transputer is greater than 128. There are 16 basic instructions, but many more exist for floating point operations and the management of concurrent processes. The instructions do not use a fixed coding but a variable instruction length, similar in spirit to the one designed by Wirth for the Lilith machine. Parts of the processor are microcoded, especially the floating point unit.
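The variable-length encoding works by building operands out of one-byte instructions, each carrying a 4-bit function and a 4-bit data field, with a prefix function that accumulates further nibbles. The simplified decoder below is a reconstruction based on the generally published Transputer encoding; negative prefixes and the "operate" escape are omitted, and the constant name is our own.

    # Simplified sketch of Transputer-style decoding: one byte per
    # instruction (4-bit function, 4-bit data); the prefix function
    # accumulates data nibbles so larger operands can be built up.
    PFIX = 0x2                                   # prefix function code (assumed)

    def decode(byte_stream):
        operand = 0
        for byte in byte_stream:
            func, data = byte >> 4, byte & 0xF
            operand |= data
            if func == PFIX:
                operand <<= 4                    # keep building a longer operand
            else:
                yield func, operand              # execute func with the full operand
                operand = 0

    # A 12-bit operand (0x345) needs two prefix bytes before the instruction:
    print([(f, hex(o)) for f, o in decode([0x23, 0x24, 0x75])])   # [(7, '0x345')]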
The Transputer does not use general purpose registers but a stack with only three elements. The addition of two numbers on the stack can be executed in one cycle, but this is not equivalent to the addition of two general purpose registers, which does not destroy its operands. A stack architecture requires many more instructions than a general purpose register architecture for the handling of the standard arithmetical assignments in high level languages. The shallow stack of the Transputer forces this chip to access main memory more frequently. The solution adopted in the Transputer was to provide 4 KB or 8 KB of fast on-chip memory. Yet these additional memory cells are not registers, and most of them are consumed by process management when the chip is used to handle concurrent processes. This on-chip memory is not equivalent to a large register file.
(Figure 11)
The Transputer does not use delay slots and there are about five different addressing modes. The architecture does not define a memory hierarchy, as other RISC designs do.
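To return to the point about destructive stack operations made above: when a value is needed twice, a register machine can keep it in a register, while a three-deep stack machine must fetch it again. The generic sequences below for x = a*b + a*c illustrate this; the mnemonics are invented and belong to neither the Transputer nor any real instruction set.

    # Generic code for x = a*b + a*c on a small register machine and on a
    # three-element stack machine; mnemonics are invented for illustration.
    register_code = [
        "load  r1, a",          # a stays in r1 and can be reused
        "load  r2, b",
        "mul   r3, r1, r2",     # a*b, operands survive
        "load  r2, c",
        "mul   r2, r1, r2",     # a*c, reuses the copy of a in r1
        "add   r3, r3, r2",
        "store x, r3",
    ]

    stack_code = [
        "push a",
        "push b",
        "mul",                  # pops a and b, pushes a*b: a is gone
        "push a",               # a must be fetched from memory again
        "push c",
        "mul",
        "add",
        "pop  x",
    ]

    print(len(register_code), len(stack_code))   # 7 versus 8 instructions

The difference here is only one instruction and one extra memory access, but it grows with every reused operand and with every intermediate result that no longer fits on the shallow stack.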
The most un-RISC-like feature of the chip is that no care was taken to ensure a high degree of pipelining. The Transputer documentation does not mention pipelining at all, and this seems not to have been an issue for the designers. Earlier surveys of RISC processors left just a question mark on this point [Gimarc/Milutinovic 1987]. Some recent articles talk of deep pipelining in the new Transputer chip unveiled in the second quarter of 1991, but the relevant information is not yet available. All that is known is that the new Transputer will be a superscalar design with a pipeline depth of five stages.
We have included the data for the 68020 processor [Motorola 1984] because the shape of its Kiviat graph in some way resembles that of the Transputer. Both chips were released at about the same time and reflect similar architectural decisions, although they are also very different. The CPI of the M68020 is an enormous 6.7 [Serlin 1990].
7. The success of RISC processors
There are a number of factors which have transformed RISC technology into a success in the marketplace. One of the most important is simplicity of design. The original RISC processors contained fewer than 300,000 logic gates, and even today, when more complex designs have appeared, RISC processors are typically much more compact and leaner than CISC processors.
The only exceptions are the new superscalar designs being produced by Hewlett-Packard, IBM and, recently, also by Intel. The chips from Intel have already surpassed the one million transistor mark and are being manufactured using a three-level metal process (the older generation of processors used only two levels of metal). The IBM RS/6000 also uses a massive amount of hardware to provide optimal performance. The table below gives a panoramic view of the complexity of some processors.
The second factor which has made possible the performance gains associated with RISC processors is better compilers. Compiler technology has changed in the last few years and many techniques which just yesterday were considered very sophisticated are now fairly common. Optimizing compilers have driven assembly language programmers to extinction. The synergy between compiler and architecture is a factor which, from now on, will not be disregarded in new processor designs.
Table 1: Transistor count of some processors

Processor            Number of transistors   Technology
SPARC                           75,000       0.8 micron CMOS
MIPS R3000                     110,000       2 micron CMOS
MIPS R6000                     360,000       ECL
M88000                         175,000       CMOS
(M88200)                       750,000       CMOS
Intel 860                    1,000,000       1 micron CMOS
Transputer T800                238,000       2 micron CMOS
Hewlett Packard PA             115,000       1.5 micron NMOS
IBM RS/6000                  2,040,000       0.9 - 1 micron CMOS
Intel 80486                  1,200,000       CMOS
Motorola 68040               1,200,000       CMOS
_______________________________________
Sources: Bode 1990, Gimarc/Milutinovic 1987, Bakoglu 1990.
There is only one problem with RISC processors: there are too many of them! The figure below shows that in the RISC market there are four major players: Sun, Hewlett-Packard, IBM and MIPS. These four companies account for most of the RISC chips sold for workstations. The four architectures are mutually incompatible.
The situation in the workstation market (the RISC playing field for now) is very different from the situation in the mainframe or microcomputer market. In the mainframe world the de facto standard is the old IBM/370 CISC architecture. More than 90% of the mainframes sold conform to this architecture. Similarly, in the microcomputer world more than 90% of the systems are based on the Intel 8088/286/386 architecture. Software is compatible and can be transferred from one machine to another.
(Figure 12)
The problem of the incompatibility of RISC processors is solved by using a standard operating system, i.e., UNIX. But UNIX alone is not enough. If two different designers use the same processor chip, this does not guarantee the compatibility of compiled binary code. Many other factors need to be standardized (for example, which register holds the frame address in relocatable code, etc.). To solve these problems, every one of the companies offering RISC processors has tried to define an "application binary interface" (ABI), which could make object code portable from one machine to the other. For the SPARC, the M88000, the PA and the Intel 80860 such binary interfaces have been proposed. A new consortium built around the MIPS chip has gone a step further: it is trying to define a common architectural platform for microcomputers and workstations. The ACE (Advanced Computing Environment) initiative bundles more than 20 companies as diverse as Compaq, DEC and Siemens.
The nineties are the decade of strategic alliances. At the level of general purpose computers there are two major groups and two major independent players. The latter are Hewlett-Packard and IBM. They are big enough to claim a large portion of the RISC market by themselves. Hewlett-Packard has announced its willingness to license the Precision Architecture, but no "clones" of the PA are yet known. IBM will license the Power Architecture to Apple and Motorola.
The two major groups are built around MIPS and SPARC International. The MIPS group includes the companies mentioned above and others, like Silicon Graphics and Toshiba. The SPARC group consists of companies like Sun, Fujitsu, Philips, Tatung, and Amdahl. It is evident that the MIPS group gathers mostly companies offering technically demanding products, while the SPARC group consists mostly of companies trying to clone the Sun workstations. The MIPS group thus appears technically more sophisticated than the SPARC group, but this could change in the future as more companies join the field.
The M88000 and the Intel 80860 were non-starters in the general purpose market. These chips are being used only in embedded processing or as chips for multiprocessing arrays. The Transputer chip has been more successful, but also only in a restricted class of architectures.
8. Conclusions and the future of RISC
It was argued in this survey that RISC processors can be distinguished from CISC designs mainly on one count: their efficient utilization of instruction pipelining. RISC processors have been defined explicitly with the aim of exploiting the instruction level parallelism available in typical programs. All other RISC features can be logically derived from this initial purpose.
Modern RISC processors have come close to achieving average CPIs of 1.5. The only way to go further down is by designing more aggressive and ambitious processors capable of executing most of the instruction set in less than one cycle. Two alternative paths could be taken: the superpipelined approach or the superscalar one. The designers of the MIPS series of processors are experimenting with the first technique. When the new processor, the 64-bit MIPS R4000, becomes available, it could be similar in design to some highly pipelined machines or vector processors. The R4000 will also include the floating point coprocessor and the cache on-chip. Other groups are taking the superscalar path; that is the case with IBM and Intel. The designers of the SPARC are already working on a superscalar chip.
We can expect to see new superpipelined and superscalar chips in the next five years. Before 1995 the average CPI of real programs could fall below the mark of one cycle per instruction. We will see more and more mainframe technology being used in microprocessors. This has been a constant of the past: many of the features of today's microprocessors were once the exclusive realm of mainframes (caching, pipelining, multiple functional units, branch prediction, etc.). In the future still more technology will migrate from the mainframe world to microprocessors.
RISC has also been called a "scalable architecture" because it is possible to go from one technology to another with practically the same design (from CMOS to ECL, for example). The first mainframes with a reduced instruction set should appear in the next few years. Amdahl has just announced its plans to build a fast SPARC server, and companies like Fujitsu are working on new ECL SPARC chips. Gallium arsenide looks even more promising than ECL, and RISC chips are prime targets for GaAs implementations. By 1995 some processors will switch to a 64 bit architecture capable of handling the huge address spaces which will be typical at the end of this decade [Hennessy 1990].
Although the future belongs to superpipelined and superscalar processors, CISC designs will not disappear quickly, simply because of the enormous installed base of such computers. RISC and CISC will peacefully coexist until CISC adopts so many features of RISC that it will be hard to tell the difference.
Literature:
- [1] Mitch Alsup, "Motorola's 88000 Family Architecture", IEEE Micro, February 1990, pp. 48-66.
- [2] H.B. Bakoglu, G.F. Grohoski, R.K. Montoye, "The IBM RISC System/6000 processor: hardware overview", IBM Journal of Research and Development, Vol. 34, No. 1, January 1990, pp. 12-22.
- [3] Gordon Bell, "RISC: Back to the Future?", Datamation, Vol. 32, No. 11, June 1 1986, pp. 96-108.
- [4] Dileep Bhandarkar and Douglas W. Clark, "Performance from Architecture: Comparing RISC and CISC with Similar Hardware Organisation", Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, April 8-11, 1991, pp. 310-319.
- [5] J. Bodenkamp, "I860 Mikroprocessor", in Arndt Bode (ed), RISC-Architekturen, Reihe Informatik, Band 60, Wissenschaftsverlag, Mannheim, 1990, pp. 431-447.
- [6] Emil W. Brown, "Implementing Sparc in ECL", IEEE Micro, February 1990, pp. 10-21.
- [7] Michael Butler, Tse-Yu Yeh, Yale Patt, Mitch Alsup, Hunter Scales, Michael Shebanow, "Single Instruction Stream Parallelism is Greater than Two", Proceedings of the 18th Annual International Symposium on Computer Architecture, ACM, New York, 1991, pp. 276-286.
- [8] Paul Chow and Mark Horowitz, "Architectural Tradeoffs in the Design of MIPS-X", Proceedings of the 14th Annual International Symposium on Computer Architecture, Pittsburgh, Pennsylvania, 1987, pp. 300-308.
- [9] Robert F. Cmelik, Shing I. Kong, David R. Ditzel and Edmund J. Kelly, "An Analysis of SPARC and MIPS Instruction Set Utilization on the SPEC Benchmarks", Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, April 8-11, 1991, pp. 290-302.
- [10] John H. Crawford, "The i486 CPU: Executing Instructions in One Clock Cycle", IEEE Micro, February 1990, pp. 27-36.
- [11] Cypress Semiconductor, SPARC RISC User's Guide, February 1990.
- [12] Robin W. Edenfield et al., "The 68040 Processor: Part 1, Design and Implementation", IEEE Micro, February 1990, pp. 66-78.
- [13] Joel S. Emer, Douglas W. Clark, "A Characterization of Processor Performance in the VAX-11/780", Proceedings of the 11th Annual International Symposium on Computer Architecture, Ann Arbor, Michigan, 1984, pp. 301-309.
- [14] Domenico Ferrari, Giuseppe Serazzi and Alessandro Zeigner, Measurement and Tuning of Computer Systems, Prentice Hall, London, 1983.
- [15] Michael J. Flynn, Chad L. Mitchell and Johannes M. Mulder, "And Now a Case for More Complex Instruction Sets", Computer, September 1987, pp. 71-83.
- [16] Charles E. Gimarc and Veljko M. Milutinovic, "A Survey of RISC Processors and Computers of the Mid-1980s", Computer, September 1987, pp. 59-69.
- [17] G.F. Grohoski, "Machine organisation of the IBM RISC System/6000 processor", IBM Journal of Research and Development, Vol. 34, No. 1, January 1990, pp. 37-58.
- [18] Thomas R. Gross, John L. Hennessy, Steven A. Przybylski and Christopher Rowen, "Measurement and Evaluation of the MIPS Architecture and Processor", ACM Transactions on Computer Systems, Vol. 6, No. 3, August 1988, pp. 229-257.
- [19] W. Hass, "MIPS RISC-Architektur in ECL-Technik", in Arndt Bode (ed), RISC-Architekturen, Wissenschaftsverlag, Mannheim, 1990.
- [20] Brian Hall, Kevin O'Brien, "Performance Characteristics of Architectural Features of the IBM RISC System/6000", Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, April 8-11, 1991, pp. 303-309.
- [21] Carl Hamacher, Zvonko Vranesic, Safwat Zaky, Computer Organisation, McGraw-Hill, New York, 1990.
- [22] John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, 1990.
- [23] John L. Hennessy, "What will the single architecture of tomorrow look like?", Electronic World News, November 5, 1990, pp. c4-c5.
- [24] M.E. Hopkins, "A Perspective on the 801/Reduced Instruction Set Computer", IBM Systems Journal, Vol. 26, No. 1, 1987, pp. 107-121.
- [25] Patrick Horster, Dietrich Manstetten and Heidrun Pelzer, Das RISC-Konzept, Bericht 118, Rheinisch-Westfälische Technische Hochschule Aachen, June 1986, 94 p.
- [26] Kai Hwang and Faye A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, 1985.
- [27] Norman P. Jouppi, David W. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines", Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Mass., 1989, pp. 272-282.
- [28] Gerry Kane, MIPS R2000 RISC Architecture, Prentice Hall, Englewood Cliffs, 1987.
- [29] Ruby B. Lee, "Precision Architecture", Computer, Vol. 22, No. 1, January 1989, pp. 78-91.
- [30] Scott McFarling and John Hennessy, "Reducing the Cost of Branches", Proceedings of the 13th Annual International Symposium on Computer Architecture, Tokyo, Japan, 1986, pp. 396-403.
- [31] Jeff Moad, "Gambling on RISC", Datamation, Vol. 32, No. 11, June 1 1986, pp. 86-92.
- [32] Motorola, MC68020 32-Bit Microprocessor User's Manual, Prentice-Hall, London, 1984.
- [33] R.R. Oehler and R.D. Groves, "IBM RISC System/6000 processor architecture", IBM Journal of Research and Development, Vol. 34, No. 1, January 1990, pp. 23-36.
- [34] David Patterson, "Reduced Instruction Set Computers", Communications of the ACM, Vol. 28, No. 1, January 1985, pp. 9-21.
- [35] Omri Serlin, "MIPS, Dhrystones and Other Tales", in William Stallings (ed), Reduced Instruction Set Computers, IEEE Computer Society Press, Washington, 1990, pp. 282-296.
- [36] Daniel P. Siewiorek, Gordon Bell and Allen Newell, Computer Structures: Principles and Examples, McGraw-Hill, Auckland, 1985.
- [37] Michael D. Smith, Mike Johnson and Mark A. Horowitz, "Limits on Multiple Instruction Issue", Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Mass., 1989, pp. 290-302.
- [38] Chriss Stephens, Bryce Cogswell, John Heinlein, Gregory Palmer, John P. Shen, "Instruction Level Profiling and Evaluation of the IBM RS/6000", Proceedings of the 18th Annual International Symposium on Computer Architecture, ACM, New York, 1991, pp. 180-189.
- [39] E. Thurner, "Die MIPS Prozessor Familie", in Arndt Bode (ed), RISC-Architekturen, Reihe Informatik, Band 60, Wissenschaftsverlag, Mannheim, 1990, pp. 379-401.
- [40] Edward R. Tufte, The Visual Display of Quantitative Information, Graphics Press, Cheshire, 1990.
- [41] David W. Wall, "Limits of Instruction-Level Parallelism", Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, April 8-11, 1991, pp. 176-189.
- [42] Niklaus Wirth, "Microprocessor Architectures: A Comparison Based on Code Generation by Compiler", Communications of the ACM, Vol. 29, No. 10, 1986, pp. 978-990.