This document contains the solutions to an exam for a computer systems architecture course. It provides instructions for taking the exam, which is closed book but allows for one page of notes. The exam contains multiple choice and short answer questions about topics relating to computer systems including caches, prefetching, memory allocation policies, disk architectures, virtualization, multicore processors, and error detection/correction.
Stanford University
June 9th, 2010
EE282 Final Exam Solutions

Exam Instructions: Answer each of the questions included in the exam. Write all of your answers directly on the examination paper, including any work that you wish to be considered for partial credit. The examination is closed book, but you can make use of one page of notes and a calculator. You may not use a computer or browser of any kind.
On Equations: Wherever possible, make sure to first write the equation with symbolic terms, then the equation rewritten with the numerical values, and then the final solution. Partial credit will be weighted appropriately for each component of the problem, and providing more information improves the likelihood that partial credit can be awarded.
On Writing Code: Unless otherwise stated, for any answers that require code examples or fragments, you should write C-like pseudocode. You do not need to optimize your code unless specifically instructed to do so. Comments for any code are not strictly required on the exam, but are highly recommended. They may help you receive partial credit on a problem, if they help us determine what you were trying to do.
On Time: You will have three hours (180 minutes) to complete this exam. Budget your time and try to leave some at the end to go over your work. The point weightings correspond roughly to the difficulty of each problem. If you find a problem too difficult at first, move on to the other problems and revisit it later.
Name (print) ___________________________________________________________________
THE STANFORD UNIVERSITY HONOR CODE
The Honor Code is an undertaking of the students, individually and collectively: (1) that they will not give or receive aid in examinations; that they will not give or receive unpermitted aid in class work, in the preparation of reports, or in any other work that is to be used by the instructor as the basis of grading; (2) that they will do their share and take an active part in seeing to it that others as well as themselves uphold the spirit and letter of the Honor Code. I acknowledge and accept the Honor Code.
Name (sign) __________________________________________________________
Indicate if the following statements are True or False. Provide a single-sentence justification for your answer. Answers without justification will receive no credit. [3 points per statement]
a) Assume that you are doing a DMA transfer from memory to an I/O device. Also assume that, in addition to reading DRAM, you also need to read the processor caches that may be caching addresses involved in the DRAM transfer. If an L2 cache read produces a hit, then it must be forwarded to the L1 cache for an additional lookup at that level.
False - it needs to be forwarded to the L1 cache only if the cache line is dirty (or suspected to be dirty).
b) Software prefetching can only be implemented if the hardware implements non-blocking caches.
True - otherwise the processor would immediately stall on the prefetch instruction.
c) The memory allocation policies used by software can affect the power consumption of the system.
True - it can affect the number of DRAM banks/ranks/DIMMs/channels that are active in order to serve a program (as opposed to being in standby or low-power modes).
d) The most important metric when building a data center is low energy consumption.
False - total cost of ownership (TCO) is the most important one.
e) When scaling down the voltage and clock frequency of a processor, the right order is to first reduce the power supply voltage and then reduce the clock frequency.
False - you first need to reduce the frequency, as electronics work slower at lower voltages.
f) Using virtually-addressed caches in a processor leads to lower energy consumption compared to a processor with physically-addressed caches.
True - potentially yes, because you can skip translation for accesses that hit in the L1.
g) For RAID-5, the actual number of disk accesses necessary to write a single byte is 2.
False - it's actually 4 (read old value/parity, write new value/parity).
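To make the four accesses concrete, here is a minimal in-memory sketch (ours, not part of the official solution; two plain arrays stand in for real block devices) of the RAID-5 small-write path, using new_parity = old_parity XOR old_data XOR new_data:

#include <stdint.h>

enum { NBLOCKS = 16 };
static uint8_t data_disk[NBLOCKS], parity_disk[NBLOCKS];   // stand-ins for disks

void raid5_small_write(int lba, uint8_t new_data) {
    uint8_t old_data   = data_disk[lba];                    // access 1: read old data
    uint8_t old_parity = parity_disk[lba];                  // access 2: read old parity
    uint8_t new_parity = old_parity ^ old_data ^ new_data;  // parity update in memory
    data_disk[lba]   = new_data;                            // access 3: write new data
    parity_disk[lba] = new_parity;                          // access 4: write new parity
}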
h) RAID-1 improves the performance of read accesses.
True - you can send a read access to either disk.
i) In a virtual machine environment, I/O interrupts are first processed by the virtual machine monitor and then by the interrupt handler of the guest OS.
True - the other way around is unsafe, as an I/O interrupt may actually be for another guest.
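A hedged illustration of why the VMM must run first (all names below are invented, not a real hypervisor API): only the VMM knows which guest owns the interrupting device, so it routes the interrupt before any guest handler runs.

enum { NVECTORS = 256 };
static int irq_owner[NVECTORS];                  // guest id that owns each vector

static void inject_virtual_irq(int guest, int vector) {
    (void)guest; (void)vector;                   // would mark the vector pending in
}                                                // that guest's virtual CPU state

void vmm_handle_irq(int vector) {                // physical interrupt lands here first
    inject_virtual_irq(irq_owner[vector], vector);
}                                                // the guest OS handler runs afterwards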
Provide short answers to the following questions. Typically, a few sentences or a short bulleted list will be sufficient. A long explanation is likely to include some incorrect statement, so keep it short and to the point.
a) Provide one specific example for each of the following "top-10" approaches for improving energy efficiency in computer systems. Each example should be no longer than one sentence. The example can be from any class of systems (notebooks, smartphones, datacenters) and can be a system-level, chip-level, or software technique. [10 points]
- Use energy-efficient technologies: use flash instead of disks.
- Match power to work: dynamic voltage-frequency scaling when load is low.
- Match work to power: reduce the frame rate for video playback when low on battery.
- Piggyback energy events: interrupt coalescing to amortize overheads.
- Special-purpose solutions: use GPUs, DSPs, or other special function units.
- Cross-layer efficiency: workload consolidation and scale-down in datacenters.
- Trade off some other metric: store 2 instead of 3 copies of data in a data center.
- Trade off the uncommon case: provision for a lower performance load to avoid excessive energy costs in power supply or cooling.
- Spend somebody else's power: the client sends computation to the server (assuming the communication cost is lower than the computation cost).
- Spend power to save power: compress data to be able to turn off some memory/disk components.
b) Processor vendors are using the exponentially increasing transistor budgets to include multiple cores per chip. Even if we assume that we have a large number of independent programs or tasks to run in parallel on the cores, what are two factors that may limit the usefulness of a multi-core chip? [4 points - 2 points each]
- Power consumption and power density: you may not be able to provide power or remove heat if all the processors in the chip are working concurrently. For example, the heat removal capabilities are proportional to the area of the chip, so they remain fixed as we put an increasing number of processors in the same space.
- Memory & I/O bandwidth: the collective memory and I/O bandwidth of all the applications may exceed the bandwidth available for off-chip communication. The off-chip bandwidth depends on the number of pins of the chip, which in turn depends on the area of the chip. Hence, the bandwidth does not scale with the number of processors we squeeze into one chip.
c) Certain companies propose that we should operate data centers without active air conditioning (aka air-side economization). This implies that the servers in the data center will be operating at a higher temperature. What is the tradeoff you should study to evaluate if this is a good idea? [4 points]
It is an issue of balancing costs. On the one hand, you have the cost of buying machines. In classical data centers with air conditioning, you pay for new machines every 3 years. Without air conditioning, replacements will come more often. If that extra cost per year is less than what you save from not paying for air conditioning (equipment, energy, etc.), then it's a good idea.
2 points to realize it's a cost issue, 2 points to explain the tradeoff between the cost of HW replacement and the cost of cooling.
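A back-of-the-envelope sketch of the break-even test (the numbers below are made up for illustration; none of them come from the exam):

#include <stdio.h>

int main(void) {
    double fleet_cost      = 200e6;  // assumed: $200M worth of servers
    double life_with_ac    = 3.0;    // years, per the answer above
    double life_without_ac = 2.5;    // assumed: shorter lifetime when running hotter
    double annual_cooling  = 15e6;   // assumed: $15M/year for air conditioning

    double extra_replacement = fleet_cost / life_without_ac
                             - fleet_cost / life_with_ac;   // extra $/year for hardware
    printf("air-side economization %s\n",
           extra_replacement < annual_cooling ? "wins" : "loses");
    return 0;
}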
d) Messages in interconnect networks typically use error-detecting but not error-correcting codes (as is the case in memories and disks). Describe briefly how you can provide error correction in networks without the use of error-correcting codes. What are the advantages of your proposal over just using error-correcting codes for the contents of each message? What are the implementation requirements of your proposal? [6 points]
You can use retransmission to do error correction. Once a message is determined to be incorrect or lost, we can retransmit it. Retransmission requires buffering of messages at the sender, reordering capabilities in the receiver, an acknowledgement protocol, and a timeout mechanism to detect lost messages. The advantages are:
- it works even if you get many errors in one message (more than what a cost-effective error-correcting code can handle)
- it works even if the whole message is lost
2 points to mention retransmission, 2 points to explain a little how it works and its requirements, 1 point for each advantage.
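A minimal sender-side sketch of such a protocol (all names invented for illustration): messages stay buffered in a window until acknowledged, and a timeout scan retransmits anything presumed lost.

#include <stdbool.h>

enum { WINDOW = 8, TIMEOUT = 100 };

struct pending {
    bool in_flight;                         // sent but not yet acknowledged
    long deadline;                          // retransmit once now exceeds this
};
static struct pending window[WINDOW];

static void link_send(int slot) { (void)slot; }  // stub: put the packet on the wire

void on_ack(unsigned seq) {                 // ack from the receiver frees the slot
    window[seq % WINDOW].in_flight = false;
}

void on_tick(long now) {                    // timeout scan: assume the message is lost
    for (int i = 0; i < WINDOW; i++)
        if (window[i].in_flight && now > window[i].deadline) {
            link_send(i);                   // retransmit from the sender buffer
            window[i].deadline = now + TIMEOUT;
        }
}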
e) A system can recover from errors by taking periodic checkpoints of its state and reverting to one of them when an error is detected. List the factors you would consider to select the frequency of checkpointing the system state and the number of active checkpoints maintained by the system. [6 points]
- The latency of error detection
- The storage overhead (for both a single checkpoint and all the checkpoints)
- The time to restore one or more checkpoints for recovery
6 points, 2 for each point.

f) Assume an I/O system with a DMA controller that can support multiple, concurrently active DMA requests. Since all DMA requests go over the same memory bus, some arbitration mechanism is necessary. List the factors that the DMA controller could take into account in arbitrating between the requests and why they are important. Note: you should not explain a specific arbitration policy, but the factors that could be taken into account in various policies. [4 points]
- Channel status: some DMA transfers may be blocked. E.g., one channel may be moving data from memory to disk. Since disks are slow, that DMA channel will be blocked quite often.
- Locality/granularity: due to locality effects, it may be faster to group requests from one channel and execute them back-to-back rather than switch between channels after every request.
- Software-defined priorities
- Fairness (if there are no priorities)
g) A processor uses 32-bit physical addresses and 32-bit virtual addresses with 1-KByte pages. The processor's TLB has 128 entries and is 4-way set-associative. What is the storage, as in the number of SRAM bits (or Kbits = 1024 bits), required to implement the TLB? Assume that each entry includes three permission bits (R, W, X) and that replacement uses a randomized algorithm. [6 points]
The 22-bit VPN is the address for the TLB. The TLB has 128 entries organized in 4 ways of 32 entries each. So, we need to extract a 5-bit index from the VPN in order to select one of these 32 entries. The remaining 22 - 5 = 17 bits of the VPN will be the tag for the TLB.
So, each TLB entry has a valid bit, 17 bits of tag, 22 bits of physical page number (PPN - the translation result), and 3 permission bits: 43 bits in total. There is no need for LRU bits (randomized replacement).
So the total cost of the TLB is 128 entries * 43 bits = 5504 bits = 5.375 Kbits.
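As a quick sanity check of the arithmetic (a sketch of ours; every parameter comes straight from the problem statement):

#include <stdio.h>

int main(void) {
    int entries = 128, ways = 4;
    int va_bits = 32, pa_bits = 32, page_bits = 10;    // 1-KByte pages

    int sets = entries / ways;                         // 32 sets
    int index_bits = 0;
    for (int s = sets; s > 1; s >>= 1) index_bits++;   // log2(32) = 5

    int vpn  = va_bits - page_bits;                    // 22-bit VPN
    int tag  = vpn - index_bits;                       // 17-bit tag
    int ppn  = pa_bits - page_bits;                    // 22-bit PPN
    int bits = 1 /*valid*/ + tag + ppn + 3 /*R,W,X*/;  // 43 bits per entry

    printf("%d bits/entry, %d bits total\n", bits, bits * entries);
    return 0;                                          // 43 bits/entry, 5504 total
}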
-1 point for forgetting the valid bit
-1 point for forgetting the permission bits
-1 for adding other random bits to the entry
-1 if you get the PPN length wrong, -2 if you forget it completely
-2 for a 22-bit tag (-1 if the calculation of the 17-bit tag field is wrong)

Problem 3: But in practice, they are different [33 points]
a) Assume the following C code that scans a linked-list data structure.
current = head;                 // start from the head of the linked list
while (current != NULL) {       // while list is not empty
    process(current->element);  // do some work on the current element
    current = current->next;    // go to the next element
}
Given a system with two processors that share the same first-level data cache, how would you prefetch the linked-list data for the above traversal? Under what conditions would the prefetching scheme be successful? [10 points]
A simple prefetching scheme is to have the second processor execute a "simplified" version of the loop that does no work but prefetches elements for the first processor.
The prefetch loop will look like:

current = head;                 // start from the head of the linked list
while (current != NULL) {       // while list is not empty
    fetch(current->element);    // no work, just fetch into the cache
    current = current->next;    // go to the next element
}
This approach will work well if the process() function in the first processor includes a significant amount of work to hide the memory latency of the miss for each element. This allows the 2nd processor to be a few elements ahead of the first one.
We should also note that if the latency of process() is much higher than that of the miss, the 2nd processor may run too far ahead, causing destructive interference in the cache. It probably makes sense to synchronize the two processors periodically.
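For reference, a compilable restatement of the helper loop (our sketch, assuming GCC or Clang, whose __builtin_prefetch intrinsic stands in for the exam's fetch() primitive):

#include <stddef.h>

struct node { void *element; struct node *next; };

void prefetch_thread(const struct node *head) {
    for (const struct node *cur = head; cur != NULL; cur = cur->next)
        __builtin_prefetch(cur->element);   // pull the element into the shared L1 D-cache
}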
6 points for describing a scheme that seems to work, 4 points for discussing the plus/minus.

b) Assume you are designing a new, large-scale datacenter with 100,000 servers. Your goal is to operate the datacenter with a single technician responsible for hardware repairs. Each server repair takes 1 hour and costs $150 in labor and 10% of the server's cost for replacement parts. Assume a full-time technician works 40 hours a week, 48 weeks out of the year. What is the maximum annual failure rate that you can tolerate for the servers? [4 points]
Assume that the failure rate is x. We want the total time needed to repair servers to be less than the time a full-time technician can work in a year. Repair time < technician's time -> x * 1 hour * 100,000 servers < 48 weeks * 40 hours/week -> x < 0.0192, i.e., a 1.92% annual failure rate.
Assume that you have two choices of servers for your datacenter. Server A costs $2,000 per unit and has an annual failure rate of 0.03. Server B costs $2,500 and has an annual failure rate of 0.015. Server B also provides higher performance, so 90,000 servers will be sufficient for the datacenter. Assuming a 3-year lifetime for servers, which server type should you use for your data center? Show your work. [6 points]
For each type of server, there are two costs to consider: capital expenses (the cost of buying the servers) and operational expenses (the cost of repairing the servers). The operational expenses include replacement parts and repair time.
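Plugging in the numbers as a hedged check (the figures come from the problem statement; the helper function and its name are ours):

#include <stdio.h>

double tco(int servers, double price, double afr, double years) {
    double capex  = servers * price;                 // cost of buying the servers
    double repair = 150.0 + 0.10 * price;            // $150 labor + 10% of cost in parts
    double opex   = servers * afr * years * repair;  // expected repair cost over lifetime
    return capex + opex;
}

int main(void) {
    printf("Server A: $%.1fM\n", tco(100000, 2000.0, 0.030, 3) / 1e6);  // ~ $203.2M
    printf("Server B: $%.1fM\n", tco( 90000, 2500.0, 0.015, 3) / 1e6);  // ~ $226.6M
    return 0;
}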
Obviously, the capital expenses dominate, so server A is the right way to go.

c) Consider the following graph that explores the latency of strided accesses on the cache hierarchy of a well-known microprocessor chip. Given this graph, answer the following questions. Provide a 1-sentence justification for each answer. [13 points]
What is the L1 D-cache line size? 64B
What is the associativity of the L1 D-cache? 4-way