The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.

* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents
1 Introduction
2 Parallel Programs
3 Programming for Performance
4 Workload-Driven Evaluation
5 Shared Memory Multiprocessors
6 Snoop-based Multiprocessor Design
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs
10 Interconnection Network Design
11 Latency Tolerance
12 Future Directions
APPENDIX A Parallel Benchmark Suites
- Abali, B., and C. Aykanat. 1994. Routing Algorithms for IBM SP1. Lecture Notes in Computer Science, Vol. 853. New York: Springer-Verlag, 161-175. Google Scholar
- Abdel-Shafi, H. A., J. Hall, S. V. Adve, and V. S. Adve. 1997. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. Proc. Third Symposium on High Performance Computer Architecture (February). Google Scholar
- Adve, S. V 1993. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. diss., University of Wisconsin-Madison. Available as Tech. Report #1198, University of Wisconsin-Madison, Computer Science (December). Google Scholar
- Adve, S. Y. and K. Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. IEEE Computer 29(12):66-76. Google ScholarDigital Library
- Adve, S. V., K. Gharachorloo, A. Gupta, J. L. Hennessy, and M. Hill. 1993. Sufficient Systems Requirements for Supporting the PLpc Memory Model . Tech. Report #1200, University of Wisconsin-Madison. Computer Science (December). Also available as Tech. Report #CSL-TR-93-595, Stanford University.Google Scholar
- Adve, S. V., and M. Hill. 1990a. Weak Ordering: A New Definition. 1990. Proc. 17th Int'l Symposium on Computer Architecture (May):2-14. Google Scholar
- Adve, S. V., and M. Hill. 1990b. Implementing Sequential Consistency in Cache-Based Systems. Proc. 1990 Int'l Conference on Parallel Processing (August):47-50.Google Scholar
- Adve, S. V., and M. Hill, 1993. A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems 4(6):613-624. Google ScholarDigital Library
- Agarwal, A. 1991. Limit on Interconnection Performance. IEEE Transactions on Parallel and Distributed Systems 2(4):398-412. Google ScholarDigital Library
- Agarwal, A., R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. 1995. The MIT Alewife Machine: Architecture and Performance. Proc. 22nd Int'l Symposium on Computer Architecture (May/June):2-13. Google Scholar
- Agarwal, A., and A. Gupta. 1988. Memory-Reference Characteristics of Multiprocessor Applications Under MACH. Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May):215-225. Google Scholar
- Agarwal, A., B.-H. Lim, D. Kranz, and J. Kubiatowicz. 1990. (April): A Processor Architecture for Multiprocessing. Proc. 17th Annual Int'l Symposium on Computer Architecture (June):104-114. Google Scholar
- Agarwal, A., B.-H Lim, D. Kranz, and J. Kubiatowicz. 1991. LimitLESS Directories: A Scalable Cache Coherence Scheme. Proc. Fourth Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):224-234. Google Scholar
- Agarwal, A., R. Simoni, J. Hennessy, and M. Horowitz. 1988. An Evaluation of Directory Schemes for Cache Coherence. Proc. 15th Int'l Symposium on Computer Architecture (June):280-289. Google Scholar
- Aiken, A., and A. Nicolau. 1988. Optimal Loop Parallelization. Proc. SIGPLAN Conference on Programming Language Design and Implementation (June):308-317. Also published in SIGPLAN Notices 23(7). Google Scholar
- Aimoto, Y., T. Kimura, Y. Yabe, H. Heiuchi, et al. 1996. A 7.68GIPS 3.84GB/S 1W Parallel Image-Processing RAM Integrating a 16Mb DRAM and 128 Processors, International Solid-State Circuits Conference , San Francisco (February):372-373.Google Scholar
- Alexander, T. B., K. G. Robertson, D. T. Lindsay, D. L. Rogers, J. R. Obermeyer, J. R. Keller, K. Y. Oka and M. M. Jones II. 1994. Corporate Business Servers: An Alternative to Mainframes for Business Computing (HP K-Class). Hewlett-Packard Journal (June):8-33.Google Scholar
- Almasi, G. S., and A. Gottlieb. 1989. Highly Parallel Computing. Redwood City. CA: Benjamin/Cummings. Google Scholar
- Alverson, R., D. Callahan, D. Cummings, B. Koblenz, A. Porterfield and B. Smith. 1990. The Tera Computer System. Proc. 1990 Int'l Conference on Supercomputing (June): 1-6. Google Scholar
- Amdahl, G. M. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS 1967 Spring Joint Computer Conference 40:483-485. Google Scholar
- Anderson, J. P., S. A. Hoffman, J. Shifman and R. Williams. 1962. D825-A Multiple-Computer Sysiem for Command and Control. AFIP Proc. FJCC 22:86-96. Google Scholar
- Anderson, J., and M. Lam. 1993. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. Proc. SIGPLAN'93 Conference on Programming Language Design and Implementation (June). Google Scholar
- Anderson, T. E., D. E. Culler, D. Patterson. 1995. A Case for NOW (Networks of Workstations). IEEE Micro 15(1):54-6. Google ScholarDigital Library
- Anderson, T. E., S. S. Owicki, J. P. Saxe and C. P. Thacker. 1992. High Speed Switch Scheduling for Local Area Networks. Proc. ASPLOS V (October):98-110. Google Scholar
- Archibald, J., and J.-L. Baer. 1986. Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model. ACM Transactions on Computer Systems 4(4):273-298. Google ScholarDigital Library
- Arnould, E. A., F. J. Bitz, E. C. Cooper, H. T. Kung, R. D. Sansom, and P. A. Steenkiste. 1989. The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers. Proc. ASPLO III (April):205-216. Google Scholar
- Arpaci, R. H., D. E. Culler, A. Krishnamurthy, S. G. Steinberg, and K. Yelick. 1995. Empirical Evaluation of the Cray-T3D: A Compiler Perspective. Proc. 22nd Int'l Symposium on Computer Architecture (June):320-331. Google Scholar
- Arvind, and D. E. Culler. 1986. Dataflow Architectures. Annual Reviews in Computer Science 1:225-253. Palo Alto, CA: Annual Reviews. Reprinted in Dataflow and Reduction Architectures. Edited by S. S. Thakkar. Los Alamitos, CA: IEEE Computer Society Press, 1987.Google ScholarCross Ref
- Athas, W. C., and C. L. Seitz. 1988. Multicomputers: Message-Passing Concurrent Computers. IEEE Computer 21(8):9-24. Google ScholarDigital Library
- August, M. C., G. M. Brost, C. C. Hsiung and A. J. Schiffleger. 1989. Cray X-MP: The Birth of a Supercomputer. Computer 22(1):45-52. Google ScholarDigital Library
- Baer, J.-L., and T.-F Chen. 1991. An Efficient On-Chip Preloading Scheme to Reduce Data Access Penalty. Proc. Supercomputing '91 (November):176-186. Google Scholar
- Baer, J.-L., and W.-H. Wang. 1988. On the Inclusion Properties for Multi-Level Cache Hierarchies. Proc. 15th Annual Int'l Symposium on Computer Architecture (May):73-80. Google Scholar
- Bailey, D. H. 1990. FFTs in External or Hierarchical Memory Journal of Supercomputing 4(1):23-35. Also published in Proc. Supercomputing '89 (November):234-242. Google Scholar
- Bailey, D. H. 1991. Twelve Ways 10 Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review (August):54-55.Google Scholar
- Bailey, D. H. 1993. Misleading Performance Reporting in the Supercomputing Field. Scientific Programming 1(2):141-151. Also published in Proc Supercomputing '93 . Google ScholarDigital Library
- Bailey, D. H., E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. 1991. The NAS Parallel Benchmarks. Intl. Journal of Supercomputer Applications 5(3);66-73. Also published as Tech. Report RNR-94-007, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center (March 1994).Google Scholar
- Bailey, D. H., E. Barszcz, L. Dagum and H. D. Simon. 1994. NAS Parallel Benchmark Results 3-94. Proc. Scalable High-Performance Computing Conference , Knoxville, TN (May): 111-120.Google Scholar
- Bailey, D., T. Harris, W. Saphir, R. van der Wijngaart, A Woo and M. Yarrow. 1995. The NAS Parallel Benchmarks 2. 0. Report NAS-95-020, Numencal Aerodynamic Simulation Facility. NASA Ames Research Center (December).Google Scholar
- Baker, W. E., R. W. Horst, D. P Sonnier, and W. J. Watson 1995. A Flexible ServerNet-Based Fault-Tolerant Architecture. Proc. 25th Int'l Symposium on Fault-Tolerant Compuling (June). Los Alamitos, CA: IEEE Computer Society Press, 2-11. Google Scholar
- Bakoglu, H. B. 1990. Circuits, Interconnection, and Packaging for VLSI. Reading, MA: Addison-Wesley.Google Scholar
- Ball, J. R., R. C. Bollinger, T. A. Jeeves, R. C. McReynolds, D. H. Shaffer. 1962. On the Use of the Solomon Parallel-Processing Computer. Proc AFIPS Fall Joint Computer Conference 22:137-146. Google Scholar
- Banks, D. and M. Prudence 1993. A High Performance Network Architecture for a PARISC Workstation. IEEE Journal on Selected Areas in Communication 11(2): 191-202. Google ScholarDigital Library
- Barnes, J. E., and P. Hut 1989. Error Analysis of a Tree Code. Astrophysics Journal Supplement 70(June):389-417.Google ScholarCross Ref
- Barosso, L., and M. Dubois. 1993. The Performance of Cache-Coherent Ring-Based Multiprocessors. Proc. 20th Annual Int'l Symposium on Computer Architectures (ISCA) (May):268-277 Google Scholar
- Barosso, L., and M. Dubois. 1995. Performance Evaluation of the Slotted Ring Multiprocessors. IEEE Transactions on Computers 44(7):878-890. Google ScholarDigital Library
- Barroso, L. A., S. Iman, J. Jeong, K. Oner, K. Ramamurthy and M. Dubois. 1995. RPM: A Rapid Prototyping Engine for Multiprocessor Systems. IEEE Computer 28(2):26-34. Google ScholarDigital Library
- Barszcz, E., Fatoohi, R., Venkatakrishnan, V., and Weeratunga, S. 1993. Solution of Regular, Sparse Triangular Linear Systems on Vector and Distributed-Memory Multiprocessors . Tech. Report NAS RNR-93-007. NASA Ames Research Center. Moffett Field, CA (April).Google Scholar
- Barton, E., J. Crownie, and M. McLaren. 1994. Message Passing on the Meiko CS-2. Parallel Computing 20(4):497-507. Google ScholarDigital Library
- Baskett, F. T. Jermoluk, and D. Solomon. 1988. The 4D-MP Graphics Superworkstation: Computing + Graphics = 40 MIPS + 40 MFLOPS and 100,000 Lighted Polygons per Second. Proc. 33rd IEEE Computer Society Int'l Conference--COMPCON '88 (February):468-471.Google Scholar
- Batcher, K. E. 1974. Staran Parallel Processor System Hardware. Proc. AFIPS National Computer Conference , 405-410. Google Scholar
- Batcher, K. E. 1980. Design of a Massively Parallel Processor. IEEE Transactions on Computers C- 29(9):836-840. Google ScholarDigital Library
- Bell, C. G. 1985. Multis: A New Class of Multiprocessor Computers. Science 228:462-467.Google Scholar
- Benes, V. 1965. Mathematical Theory of Connecting Networks and Telephone Traffic. San Diego, CA: Academic Press.Google Scholar
- Bennett, J. E., and M.J. Flynn. 1996a. Latency Tolerance for Dynamic Processors. Tech. Report #CSL-TR-96-687, Computer Systems Laboratory, Stanford University. Google Scholar
- Bennett, J. E., and M. J. Flynn. 1996b. Reducing Cache Miss Rates Using Prediction Caches. Tech. Report #CSL-TR-96- 707. Computer Systems Laboratory, Stanford University. Google Scholar
- Berry, M., D. Chen, P. Koss, et al. 1989. The PERFECT Club Benchmarks: Effective Performance Evaluation of Computers. Int'l Journal of Supercomputer Applications 3(3):5-40. Google ScholarDigital Library
- Bershad, B. N., M. J. Zekauskas, and W. A. Sawdon. 1993. The Midway Distributed Shared Memory System. Proc. COMPCON '93 (February).Google Scholar
- Bhatt, S. M. and C. E. Leiserson. 1983. How to Assemble Tree Machines. ACM Symposium on Theory of Computing (STOC '82) . New York: ACM Press. Google Scholar
- Biagioni, E., E. Cooper, and R. Sansom. 1993. Designing a Practical ATM LAN. IEEE Network (March). Google Scholar
- Bilas, A., L. Iftode, and J. P. Singh. 1998. Evaluation of Hardware Support for Next-Generation Shared Virtual Memory Clusters. Proc. Int'l Conference on Supercomputing (July). Google Scholar
- Bisiani, R., and M. Ravishankar. 1990. PLUS: A Distributed Shared-Memory System. Proc. 17th Int'l Symposium on Computer Architecture (May): 115-124. Google Scholar
- Black, D., R. Rashid, D. Golub, C. Hill, R. Baron. 1989. Translation Lookaside Buffer Consistency: A Software Approach. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems. Boston (April): 113-122. Google Scholar
- Blackford, L. S., J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. 1997. ScaLAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). Google Scholar
- Blelloch, G. 1993. Prefix Sums and Their Applications. In Synthesis of Parallel Algorithms . Edited by J. Reif. San Francisco: Morgan Kaufmann, 35-60.Google Scholar
- Blelloch, G. E., C. E. Leiserson, B. M. Mages, C. G. Plaxton, S.J. Smith, and M. A. Zagha. 1991. Comparison of Sorting Algorithms for the Connection Machine CM-2. Proc. Symposium on Parallel Algorithms ana Architectures (July):3-16. Google Scholar
- Blumrich, M. A., C. Dubnicki, E. W. Felten, K. Li, M.R. Mesarina. 1994. Two Virtual Memory Mapped Network Interface Designs. Proc. Hot Interconnects II Symposium (August).Google Scholar
- Blumrich, M., K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sandberg. 1994. A Virtual Memory Mapped Network Interface for the Shrimp Multicomputer. Proc. 21st Int'l Symposium on Computer Architecture (April): 142-153. Google Scholar
- Boden, N., D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. 1995. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro 15(1):29-38. Google ScholarDigital Library
- Bodin, F., P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr. 1993. Implementing a Parallel C++ Runtime System for Scalable Parallel Sysiems. Proc. Supercomputing '93 (November):588-597. Also in Scientific Programming 2(3). Google Scholar
- Bolt Beranek and Newman Advanced Computers. 1989. TC2000 Technical Product Summary. Cambridge, MA: Bolt Beranek and Newman.Google Scholar
- Bomans, L., and D. Roose. 1989. Benchmarking the iPSC/2 Hypereube Multiprocessor. Concurrency: Practice and Experience , 1(1):3-18.Google ScholarCross Ref
- Borkar, S., R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb. 1990. Supporting Systolic and Memory Communication in iWarp. Proc. 17th Annual Int'l Symposium on Computer Architecture , Seattle, WA (May):70-81. Revised version appears as Tech. Report #CMU-CS-90-197. Carnegie Mellon University. Google Scholar
- Bouknight, W. J., S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick. 1972. The Illiac IV System. Proc. IEEE 60(4):369-388.Google Scholar
- Boyle, J., R. Butler, T. Disz, B. Glickfield, E. Lusk, W. R. Overbeek, J. Patterson, and R. Stevens. 1987. Portable Programs for Parallel Processors. New York: Holt, Rinehart and Winston. Google Scholar
- Brewer, E. A., F. T. Chong, F. T. Leighton. 1994. Scalable Expanders: Exploiting Hierarchical Random Wiring. Proc. 1994 Symposium on the Theory of Computing , Montreal, Canada (May):144-152. Google Scholar
- Brewer, E. A., F. T. Chong, L. T. Liu, J. Kubiatowicz, S. D. Sharma. 1995. Remote Queues: Exposing Network Queues for Atomicity and Optimization. Proc. Seventh Annual Symposium on Parallel Algorithms and Architectures (July):42-53. Google Scholar
- Brewer, E. A., and B. C. Kuszmaul. 1994. How to Get Good Performance from the CM-5 Data Network. Proc. 1994 Int'l Parallel Processing Symposium , Cancun, Mexico (April):858-867. Google Scholar
- Bruno, J., P. R. Cappello. 1988. Implementing the Beam and Warming Method on the Hypercube. Proc. Third Conference on Hypercube Concurrent Computers and Applications , Pasadena, CA, Jan 19-20. Google Scholar
- Burger, D. 1997. System-Level Implications of Processor-Memory integration. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at the Int'l Symposium on Computer Architecture (ISCA) '97 (June).Google Scholar
- Burger, D., J. Goodman, and A. Kagi. 1996. Memory Bandwidth Limitations in Future Microprocessors. Proc. 23rd Annual Symposium on Computer Architecture (May):78-89. Google Scholar
- Burkhardt, H., et al. 1992. Overview of the KSR-1 Computer System. Tech. Report KSR-TR-9202001. Kendall Square Research. Boston (February).Google Scholar
- Butler, M., T-Y. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow. 1991. Single Instruction Stream Parallelism Is Greater Than Two. Proc. Annual Int'l Symposium on Computer Architecture (ISCA), 276-86. Google Scholar
- Callahan, T. and S. C. Goldstein. 1995. NIFDY: A Low Overhead. High Throughput Network Interface. Proc. 22nd Annual Symposium on Computer Architecture (June):230-241. Google Scholar
- Cardoza, W., F. Glover, and W. Snaman Jr. 1996. Design of a TruCluster Multicomputer System for the Digital UNIX Environment. Digital Technical Journal 8(1):5-17. Google ScholarDigital Library
- Carter, J. B., J. K. Bennett, and W. Zwaenepoel. 1991. Implementation and Performance of Munin. Proc. 13th Symposium on Operating Systems Principles (October):152-164. Google Scholar
- Carter, J. B., J. K. Bennett, and W. Zwaenepoel. 1995. Techniques for Reducing Consistency-Related Communication in Distributed Shared-Memory Systems. ACM Transactions of Computer Systems 13(3):205-244. Google ScholarDigital Library
- Catanzaro, B. 1997. Multiprocessor System Architectures: A Technical Survery of Multiprocessor/ Multithreaded Systems Using SPARC, Multi-level Bus Architectures and Solaris (SunOS). Mountain View, CA: Sun Microsystems.Google Scholar
- Cekleov, M., D. Yen, P. Sindhu, J.-M. Frailong, et al. 1993. SPARCcenter 2000: Multiprocessing for the 90s, Digest of Papers. Proc. COMPCON Spring '93. Los Alamitos, CA: IEEE Computer Society Press, 345-353.Google Scholar
- Censier, L., and P. Feautrier. 1978. A New Solution to Cache Coherence Problems in Multiprocessor Systems. IEEE Transaction on Computer Systems C-27(12):1112-1118. Google Scholar
- Chan, K., et al. 1993. Multiprocessor Features of the HP Corporate Business Servers. Proc. COMPCON (Spring):330-337.Google Scholar
- Chandy, K. M., and J. Misra. 1988. Parallel Program Design: A Foundation. Reading. MA: Addison Wesley. Google Scholar
- Chang, P. P. S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. 1991. IMPACT: An Architectural Framework for Multiple-Instruction Issue Processors. Proc. 18th Int'l Symposium on Computer Architecture (ISCA) 19(3):266-275. Google Scholar
- Chen, T.-F., and J.-L. Baer. 1992. Reducing Memory Latency via Non-Blocking and Prefetching Caches. Proc. Fifth Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):51-61. Google Scholar
- Chen, T.-F., and J.-L. Baer. 1994. A Performance Study of Software and Hardware Data Prefetching Schemes. Proc. 21st Annual Symposium on Computer Architecture (April):223-232. Google Scholar
- Cheong, H., and A. Viedenbaum. 1990. Compiler-directed Cache Management in Multiprocessors. IEEE Computer 23(6):39-47. Google ScholarDigital Library
- Chien, A. A., and J. H. Kim. 1992. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. Proc. 19th Annual International Symposium on Computer Architecture (ISCA), Gold Coast, Australia (May):268-277. Google Scholar
- Choi, J.J.J. Dongarra, R. Pozo, and D. W. Walker. 1992. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. Proc. Fourth Symposium on the Frontiers of Massively Parallel Computation, McLean, VA. Los Alamitos, CA: IEEE Computer Society Press, 120-127.Google Scholar
- Chun, B. N., A. M. Mainwaring, and D. E. Culler. 1998. Virtual Network Transport Protocols for Myrinet. IEEE Micro (January):53-63. Google ScholarDigital Library
- Clark, R., and K. Alnes. 1996. An SCI Chipset and Adapter. Symposium Record, Hot Interconnects IV (August):221-235.Google Scholar
- Cohen, D., G. Finn, R. Felderman, and A. DeSchon. 1993. ATOMIC: A Low-Cost, Very High-Speed, Local Communication Architecture. Proc. 1993 Int. Conference on Parallel Processing . Google Scholar
- Convex Computer Corporation. 1993. Exemplar Architecture. Richardson, TX: Convex Computer Corp.Google Scholar
- Corella, F., J. Stone, C. Barton. 1993. A Formal Specification of the PowerPC Shared Memory Architecture. Tech. Report Computer Science RC 18638 (81566), IBM Research Division. T.J. Watson Research Center (January).Google Scholar
- Cornell, J. A. 1972. Parallel Processing of Ballistic Missile Defense Radar Data with PEPE. COMPCON 72, 69-72.Google Scholar
- Cox, A., and R. Fowler. 1993. Adaptive Cache Coherency for Detecting Migratory Shared Data. Proc. 20th Int'l Symposium on Computer Architecture (May):98-108. Google Scholar
- Crowther, W., J. Goodhue, R. Gurwitz, R. Rettberg, and R. Thomas. 1985. The Butterfly Parallel Processor. IEEE Computer Architecture Technical Newsletter , 18-46.Google Scholar
- Culler, D. E. 1994. Multithreading: Fundamental Limits, Potential Gains, and Alternatives. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 97-138.Google Scholar
- Culler, D. E., A. C. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. 1993. Parallel Programming in Split-C. Proc. Supercomputing '93 (November): 262-273. Google Scholar
- Culler, D. E., A. C. Dusseau, R. P. Martin, and K. E. Schauser. 1993. Fast Parallel Sorting under LogP: From Theory to Practice. In Portability and Performance for Parallel Processing . Chapter 4. New York: John Wiley & Sons, 71-98.Google Scholar
- Culler, D. E., R. M. Karp, D. A., Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1993. LogP: Toward A Realistic Model of Parallel Computation. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (May): 1-12. Google Scholar
- Culler, D. E., R. M. Karp, D. A., Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1996. LogP: A Practical Model of Parallel Computation. CACM 39(11):78-85. Google ScholarDigital Library
- Culler, D. E., A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. 1991. Fine-Grain Parallelism with Minimal Hardware Support. Proc. Fourth Int'l Symposium on Arch. Support for Programming Languages and Systems (ASPLOS) (April):164-175. Google Scholar
- Culler, D. E., K. E. Schauser, and T. von Eicken. 1993. Two Fundamental Limits on Dataflow Multithreading. Proc. IFIP WG 10.3 working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism . Orlando, FL. Google Scholar
- Dahlgren, F. 1995. Boosting the Performance of Hybrid Snooping Cache Protocols. Proc 22nd Int'l Symposium on Computer Architecture (June):60-69. Google Scholar
- Dahlgren, F., M. Dubois, and P. Stenstrom. 1994. Combined Performance Gains of Simple Cache Protocol Extensions. Proc. 21st Int'l Symposium on Computer Architecture (April): 187-197. Google Scholar
- Dahlgren, F., M. Dubois, and P. Stenstrom. 1995. Sequential Hardware Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 6(7). Google ScholarDigital Library
- Dally, W. J. 1990a. Virtual-Channel Flow Control. Proc. 17th Annual Int'l Symposium on Computer Architecture (ISCA) , Seattle, WA, (May):60-68. Google Scholar
- Dally, W. J. 1990b. Performance Analysis of k -ary n -cube Interconnection Networks. IEEE-TOC 39(6):775-85. Google ScholarDigital Library
- Dally, W. J., A. Chien, S. Fiske, W. Horwat, J. Keen, J. Larivee, R. Lethin, P Nuth, S. Willis. 1989. The J-Machine: A Fine-Grained Concurrent Computer. Proc IFIP 11th World Computer Congress, Information Processing'89 , 1147-1153.Google Scholar
- Dally, W. J., J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes and P. R. Nuth. 1992. The Message Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro (April):23-39. Google ScholarDigital Library
- Dally, W. J., J. S. Keen, M. D. Noakes. 1993. The J-Machine Architecture and Evaluation. Digest of Papers. COMPCON Spring '93. San Francisco, CA (February):183-188.Google Scholar
- Dally, W. J., and C. Seitz. 1987. Deadlock-Free Message Routing in Multiprocessor Interconnections Networks. IEEE-TOC C-36(5):547-553. Google Scholar
- Denning, P. J. 1968. The Working Set Model for Program Behavior. Communications of the ACM 11(5):323-333. Google ScholarDigital Library
- Dennis, J. B. 1980. Dataflow Supercomputers. IEEE Computer 13(11):93-100. Google ScholarDigital Library
- Digital Equipment Corporation. 1992. Alpha Architecture Handbook . Maynard, MA: Digital Equipment Corp.Google Scholar
- Dijkstra, E. W. 1965. Solution of a Problem in Concurrent Programming Control. Communications of the ACM 8(9):569. Google ScholarDigital Library
- Dijkstra, E. W., and C. S. Sholten. 1968. Termination Detection for Diffusing Computations. Information Processing Letters 1:1-4.Google Scholar
- Dongarra, J. J. 1990. Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment . Tech. Report CS-89-85. University of Tennessee, Computer Science Dept. (March). Google Scholar
- Dongarra, J. J. 1994. Performance of Various Computers Using Standard Linear Equation Software . Tech. Repon CS-89-85. University of Tennessee, Computer Science Dept. (November); current report available from [email protected] Google Scholar
- Dongarra, J. J., J. Martin, and J. Worlton. 1987. Computer Benchmarking: Paths and Pitfalls. IEEE Spectrum (July):38. Google Scholar
- Dongarra, J. J., and D. W. Walker. 1995. Software Libraries for Linear Algebra Computations on High performance Computers. SIAM Review 37:151-180. Google ScholarDigital Library
- Dongarra, J. J., and W. Genlzsch, eds. 1993. Computer Benchmarks. Amsterdam: Elsevier Science B. V., North-Holland. Google Scholar
- Dubnicki, C. L. Iftode, E. W. Felten, K. Li. 1996. Software Support for Virtual Memory-Mapped Communication. Tenth Int'l Parallel Processing Symposium (April). Google Scholar
- Dubnicki, C., and T. LeBlanc. 1992. Adjustable Block Size Coherent Caches. Proc. 19th Annual Int'l Symposium on Computer Architecture (May):170-180. Google Scholar
- Dubois, M., and C. Scheurich, 1990. Memory Access Dependencies in Shared-Memory Multi-processors. IEEE Transactions on Software Engtneering 16(6):660-673. Google ScholarDigital Library
- Dubois, M., C. Scheurich, and F. Briggs. 1986. Memory Access Buffering in Multiprocessors. Proc. 13th Int'l Symposium on Computer Architecture (June):434-442. Google Scholar
- Dubois, M., J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Slenstrom. 1993. The Detection and Elimination of Useless Misses in Multiprocessors. Proc. 20th Int'l Symposium on Computer Architecture (May):88-97. Google Scholar
- Dubois, M., J.-C. Wang, L. A. Barroso, K. Chen and Y.-S. Chen. 1991. Delayed Consistency and Its Effects on the Miss Rate of Parallel Programs. Proc. Supercomputing '91 (November): 197-206. Google Scholar
- Dunigan, T. H. 1988. Performance of a Second Generation Hypereube Tech. Report ORNL/TM-10881, Oak Ridge National Lab. (November).Google Scholar
- Dunning, D., G. Regnier, G. McAlpine, D. Camaron, B. Shubert, F. Berry, A. M. Merriti, E. Gronke and C. Dodd. 1998. The Virtual Interface Architecture. IEEE Micro 18(2). Google Scholar
- Dusseau, A. C., D. E. Culler, K. E. Schauser, and R. P. Martin. 1996. Fast Parallel Sorting under LogP: Experience with the CM-5. IEEE Transactions on Parallel and Distributed Systems 7(8): 791-805. Google ScholarDigital Library
- Dwarkadas, S., P. Keleher, A. L. Cox, and W. Zwaenepoel. 1993. Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology. Proc. 20th Int'l Symposium on Computer Architecture (May):144-155. Google Scholar
- Eggers, S., and R. Katz. 1988. A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation. Proc. 15th Annual Int'l Symposium on Computer Architecture (May):373-382. Google Scholar
- Eggers, S., and R. Katz. 1989a. The Effect of Sharing on the Cache and Bus Performance of Parallel Programs. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (May):257-270. Google Scholar
- Eggers, S., and R. Katz. 1989b. Evaluating the Performance of Four Snooping Cache Coherency Protocols. Proc. 16th Annual Int'l Symposium on Computer Architecture (May):2-15. Google Scholar
- Eigenmann, R., and S. Hassanzadeh. 1996. Benchmarking with Real Industrial Applications: The SPEC High Performance Group. IEEE Computational Science and Engineering (spring). Google Scholar
- Elliott, D. G., W. M. Snelgrove, and M. Stumm. 1992. Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP. Custom Integrated Circuits Conference , Boston, MA (May):30.6.1-30.6.4.Google Scholar
- Elliott, D. G., M. Stumm, and W. M. Snelgrove. 1997. Computational RAM: The Case for SIMD Computing in Memory . Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at Annual International Symposium on Computer Architecture (ISCA) '97 (June).Google Scholar
- Erlichson, A., B. Nayfeh, J. P. Singh, and Oyekunle Olukotun. 1995. The Benefits of Clustering in Cache-Coherent Multiprocessors: An Application-Driven Investigation. Proc. Supercomputing '95 (November).
- Erlichson, A., N. Nuckolls, G. Chesson, and J. L. Hennessy. 1996. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):210-220.
- Falsafi, B., A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. 1994. Application-Specific Protocols for User-Level Shared Memory. Proc. Supercomputing '94 (November):380-389.
- Falsafi, B., and D. A. Wood. 1997. Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA. Proc. 24th Int'l Symposium on Computer Architecture (June):229-240.
- Farkas, K., Z. Vranesic, and M. Stumm. 1992. Cache Consistency in Hierarchical Ring-Based Multiprocessors. Proc. Supercomputing '92 (November).
- Feigel, C. P. 1994. TI Introduces Four-Processor DSP Chip. Microprocessor Report (March):28.
- Felderman, R., et al. 1994. Atomic: A High Speed Local Communication Architecture. Journal of High Speed Networks 3(1):1-29.
- Fenwick, D. M., D. J. Foley, W. B. Gist, S. R. VanDoren, and D. Wissell. 1995. The AlphaServer 8000 Series: High-End Server Platform Development. Digital Technical Journal 7(1):43-65.
- Flanagan, J. L. 1994. Technologies for Multimedia Communications. IEEE Proceedings 82(4):590-603.
- Flynn, M. J. 1972. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computing C-21(September):948-960.
- Fortune, S., and J. Wyllie. 1978. Parallelism in Random Access Machines. Proc. 10th ACM Symposium on Theory of Computing (May).
- Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. 1988. Solving Problems on Concurrent Processors, vol. 1. Englewood Cliffs, NJ: Prentice Hall.
- Frailong, J.-L., et al. 1993. The Next Generation SPARC Multiprocessing System Architecture. Proc. COMPCON (spring):475-480.
- Frank, S., H. Burkhardt III, and J. Rothnie. 1993. The KSR1: Bridging the Gap between Shared Memory and MPPs. Proc. COMPCON, Digest of Papers (spring):285-294.
- Fu, J. W. C., and J. H. Patel. 1991. Data Prefetching in Multiprocessor Vector Cache Memories. Proc. 18th Annual Symposium on Computer Architecture (May):54-63.
- Fu, J. W. C., J. H. Patel, and B. L. Janssens. 1992. Stride Directed Prefetching in Scalar Processors. Proc. 25th Annual Int'l Symposium on Microarchitecture (December):102-110.
- Fuchs, H., G. Abram, and E. Grant. 1983. Near Real-Time Shaded Display of Rigid Objects. Proc. SIGGRAPH.
- Galles, M., and E. Williams. 1993. Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. Proc. 27th Hawaii Int'l Conference on System Sciences Vol. I: Architecture (January). Also in SGI Challenge. Edited by T. N. Mudge and B. D. Shriver. Los Alamitos, CA: IEEE Computer Society Press, 1994, 134-143.
- Geist, A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. 1994. PVM 3.0 Users' Guide and Reference Manual. Tech Report ORNL/TM-12187. Oak Ridge, TN: Oak Ridge National Laboratory (February), http://wwweece.ksu.edu/pvm3/ug.ps.
- Geist, A., A. Beguelin, J. Dongarra, R. Manchek, W. Jiang, and V. Sunderam. 1994. PVM: A Users' Guide and Tutorial for Networked Parallel Computing. Cambridge, MA: MIT Press.
- Geist, G. A., and V. S. Sunderam. 1992. Network Based Concurrent Computing on the PVM System. Journal of Concurrency: Practice and Experience 4(4):293-311.
- Gharachorloo, K. 1995. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. diss., Computer Systems Laboratory, Stanford University (December). Also published as Tech. Report #CSL-TR-95-685.
- Gharachorloo, K., S. Adve, A. Gupta, M. Hill, and J. L. Hennessy. 1992. Programming for Different Memory Consistency Models. Journal of Parallel and Distributed Computing 15(4):399-407.
- Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1991a. Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors. Proc. 4th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):245-257.
- Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1991b. Two Techniques to Enhance the Performance of Memory Consistency Models. Proc. Int'l Conference on Parallel Processing (August):1355-1364.
- Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1992. Hiding Memory Latency Using Dynamic Scheduling in Shared Memory Multiprocessors. Proc. 19th Int'l Symposium on Computer Architecture (May):22-33.
- Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. 1990. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. Proc. 17th Int'l Symposium on Computer Architecture (May):15-26.
- Gillett, R. 1996. Memory Channel Network for PCI. IEEE Micro 16(1):12-18.
- Gillett, R., M. Collins, and D. Pimm. 1996. Overview of Network Memory Channel for PCI. Proc. IEEE Spring COMPCON '96 (February).
- Gillett, R., and R. Kaufmann. 1997. Using Memory Channel Network. IEEE Micro 17(1):19-25.
- Glass, C. J., and L. M. Ni. 1992. The Turn Model for Adaptive Routing. Proc. Annual International Symposium on Computer Architecture (ISCA) (May):278-287.
- Godiwala, N. D., and B. A. Maskas. 1995. The Second-Generation Processor Module for AlphaServer 2100 Systems. Digital Technical Journal 7(1).
- Gokhale, M., B. Holmes, and K. Iobst. 1995. Processing in Memory: The Terasys Massively Parallel PIM Array. IEEE Computer 28(3):23-31.
- Goldschmidt, S. R. 1993. Simulation of Multiprocessors: Speed and Accuracy. Ph.D. diss., Stanford University (June).
- Golub, G., and C. Van Loan. 1997. Matrix Computations 3e. Baltimore, MD: Johns Hopkins University Press.
- Goodman, J. R. 1983. Using Cache Memory to Reduce Processor-Memory Traffic. Proc. 10th Annual Int'l Symposium on Computer Architecture (June):124-131.
- Goodman, J. R. 1987. Coherency for Multiprocessor Virtual Address Caches. Proc. Second Int'l Conference on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA (October):72-81.
- Goodman, J. R. 1989. Cache Consistency and Sequential Consistency. Tech. Report #1006, University of Wisconsin-Madison, Computer Science Dept. (February).
- Goodman, J. R., M. K. Vernon, and P. J. Woest. 1989. Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):64-75.
- Gottlieb, A., R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. 1983. The NYU Ultracomputer--Designing an MIMD Shared Memory Parallel Computer. IEEE Transactions on Computers C-32(2):175-189.
- Gottlieb, A., and C. P. Kruskal. 1984. Complexity Results for Permuting Data and Other Computations on Parallel Processors. Journal of the ACM 31(April):193-209.
- Gottlieb, A., B. Lubachevsky, and L. Rudolph. 1983. Basic Techniques for the Efficient Coordination of Large Numbers of Cooperating Sequential Processes. ACM Transactions on Programming Languages and Systems 5(2).
- Grafe, V. G., and J. E. Hoch. 1990. The Epsilon-2 Hybrid Dataflow Architecture. Proc. COMPCON Spring '90, San Francisco, CA (March):88-93.
- Grahn, H., and P. Stenstrom. 1996. Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection. Journal of Parallel and Distributed Computing 39(2):168-180.
- Grahn, H., P. Stenstrom, and M. Dubois. 1995. Implementation and Evaluation of Update-Based Protocols under Relaxed Memory Consistency Models. Future Generation Computer Systems 11(3):247-271.
- Graunke, G., and S. Thakkar. 1990. Synchronization Algorithms for Shared Memory Multiprocessors. IEEE Computer 23(6):60-69.
- Gray, J. 1991. The Benchmark Handbook for Database and Transaction Processing Systems. San Francisco: Morgan Kaufmann.
- Green, S. A., and D. J. Paddon. 1990. A Highly Flexible Multiprocessor Solution for Ray Tracing. The Visual Computer 6:62-73.
- Greenberg, R. I., and C. E. Leiserson. 1989. Randomized Routing on Fat-Trees. Advances in Computing Research 5:345-374.
- Greenwald, M., and D. R. Cheriton. 1996. The Synergy between Non-Blocking Synchronization and Operating System Structure. Proc. Second Symposium on Operating System Design and Implementation, USENIX, Seattle (October):123-136.
- Gropp, W., E. Lusk, and A. Skjellum. 1994. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press.
- Groscup, W. 1992. The Intel Paragon XP/S Supercomputer. Proc. Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology (November):262-273.
- Gunther, K. D. 1981. Prevention of Deadlocks in Packet-Switched Data Transport Systems. IEEE Transactions on Communication C-29(4):512-24.
- Gupta, A., J. L. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. 1991. Comparative Evaluation of Latency Reducing and Tolerating Techniques. Proc. 18th Int'l Symposium on Computer Architecture (May):254-263.
- Gupta, A., and W.-D. Weber. 1992. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers 41(7):794-810.
- Gupta, A., W.-D. Weber, and T. Mowry. 1990. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache-Coherence Schemes. Proc. Int'l Conference on Parallel Processing I (August):312-321.
- Gurd, J. R., C. C. Kirkham, and I. Watson. 1985. The Manchester Prototype Dataflow Computer. Communications of the ACM 28(1):34-52.
- Gustafson, J. L. 1988. Reevaluating Amdahl's Law. Communications of the ACM 31(5):532-533.
- Gustafson, J. L., and Q. O. Snell. 1994. HINT: A New Way to Measure Computer Performance. Tech. Report, Ames Laboratory, U.S. Dept. of Energy, Ames, IA.
- Gustavson, D. 1992. The Scalable Coherent Interface and Related Standards Projects. IEEE Micro 12(1):10-22.
- Gwennap, L. 1994a. Microprocessors Head Toward MP on a Chip. Microprocessor Report (May).
- Gwennap, L. 1994b. PA-7200 Enables Inexpensive MP Systems. Microprocessor Report (March).
- Hagersten, E. 1992. Toward Scalable Cache Only Memory Architectures. Ph.D. diss., Swedish Institute of Computer Science (October).
- Hagersten, E., A. Landin, and S. Haridi. 1992. DDM--A Cache Only Memory Architecture. IEEE Computer 25(9):44-54.
- Hanrahan, P., D. Salzman, and L. A. Aupperle. 1991. A Rapid Hierarchical Radiosity Algorithm. Proc. SIGGRAPH (July).
- Hayashi, K., T. Doi, T. Horie, Y. Koyanagi, O. Shiraki, N. Imamura, T. Shimizu, H. Ishihata, and T. Shindo. 1994. AP1000+: Architectural Support of PUT/GET Interface for Parallelizing Compiler. ACM SIGPLAN Notices 29(11):196.
- Heinlein, J., R. P. Bosch, Jr., K. Gharachorloo, M. Rosenblum, and A. Gupta. 1997. Coherent Block Data Transfer in the FLASH Multiprocessor. Proc. 11th Int'l Parallel Processing Symposium (April).
- Heinlein, J., K. Gharachorloo, S. Dresser, and A. Gupta. 1994. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):38-50.
- Heinrich, M., J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. P. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, A. Gupta, M. Rosenblum, and J. L. Hennessy. 1994. The Performance Impact of Flexibility on the Stanford FLASH Multiprocessor. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):274-285.
- Hennessy, J. L., and N. Jouppi. 1991. Computer Technology and Architecture: An Evolving Interaction. IEEE Computer 24(9):18-29.
- Hennessy, J. L., and D. A. Patterson. 1996. Computer Architecture: A Quantitative Approach. 2nd ed. San Francisco: Morgan Kaufmann.
- Herlihy, M. P. 1988. Impossibility and Universality Results for Wait-Free Synchronization. Seventh ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (August):276-290.
- Herlihy, M. P. 1991. Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems 13(1):124-149.
- Herlihy, M. P. 1993. A Methodology for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems 15(5):745-770.
- Herlihy, M. P., and J. E. B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. Proc. 20th Annual Symposium on Computer Architecture, San Diego, CA (May):289-301.
- Herlihy, M. P., and J. Wing. 1987. Axioms for Concurrent Objects. Proc. 14th ACM Symposium on Principles of Programming Languages (January):13-26.
- Hernquist, L. 1987. Performance Characteristics of Tree Codes. Astrophysics Journal Supplement 64(August):715-734.
- Hey, A. J. G. 1991. The Genesis Distributed Memory Benchmarks. Parallel Computing 17:1111-1130.
- High Performance Fortran Forum. 1993. High Performance Fortran Language Specification. Scientific Programming 2(1):1-270.
- Hill, M. D., S. J. Eggers, J. R. Larus, G. S. Taylor, G. Adams, B. K. Bose, G. A. Gibson, P. M. Hansen, J. Keller, S. I. Kong, C. G. Lee, D. Lee, J. M. Pendleton, S. A. Ritchie, D. A. Wood, B. G. Zorn, P. N. Hilfinger, D. A. Hodges, R. H. Katz, J. Ousterhout, and D. A. Patterson. 1986. Design Decisions in SPUR. IEEE Computer 19(10):8-22. Also in Computers for Artificial Intelligence Processing. Edited by B. W. Wah and C. V. Ramamoorthy. New York: John Wiley and Sons, 273-299.
- Hill, M. D., and A. J. Smith. 1989. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers C-38(12):1612-1630.
- Hillis, W. D. 1985. The Connection Machine. Cambridge, MA: MIT Press.
- Hillis, W. D., and G. L. Steele. 1986. Data Parallel Algorithms. Communications of the ACM 29(12):1170-1183.
- Hillis, W. D., and L. W. Tucker. 1993. The CM-5 Connection Machine: A Scalable Supercomputer. Communications of the ACM 36(11):31-40.
- Hirata, H., K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. 1992. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. Proc. 19th Int'l Symposium on Computer Architecture (May):136-145.
- Hoare, C. A. R. 1978. Communicating Sequential Processes. Communications of the ACM 21(8):666-677.
- Hockney, R. W., and C. R. Jesshope. 1988. Parallel Computers 2. London: Adam Hilger.
- Holt, C., M. Heinrich, J. P. Singh, E. Rothberg, and J. L. Hennessy. 1995. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Tech. Report #CSL-TR-95-660, Computer Systems Laboratory, Stanford University (January).
- Homewood, M., and M. McLaren. 1993. Meiko CS-2 Interconnect Elan--Elite Design. Hot Interconnects (August).
- Horie, T., K. Hayashi, T. Shimizu, and H. Ishihata. 1993. Improving the AP1000 Parallel Computer Performance with Message Passing. Proc. 20th Annual Int'l Symposium on Computer Architecture (May):314-325.
- Horowitz, M. 1997. Limits of Electrical Signalling. Hot Interconnects Keynote (August).
- Horst, R. 1995. TNet: A Reliable System Area Network. IEEE Micro 15(1):37-45.
- Horst, R. W., and T. C. K. Chou. 1985. An Architecture for High Volume Transaction Processing. Proc. 12th Annual Int'l Symposium on Computer Architecture (June):240-245. Boston, MA. (Tandem NonStop II).
- Horst, R. W., R. L. Harris, and R. L. Jardine. 1990. Multiple Instruction Issue in the NonStop Cyclone Processor. Proc. Annual International Symposium on Computer Architecture (ISCA), 216-226.
- Hristea, C., D. Lenoski, and J. Keen. 1997. Measuring Memory Hierarchy Performance of Cache Coherent Multiprocessors Using Micro Benchmarks. Proc. SC97 (November; all-Web conference proceeding).
- Hunt, D. 1996. Advanced Features of the 64-Bit PA-8000. Palo Alto, CA: Hewlett Packard Corp.
- IEEE Computer Society. 1993. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Standard 1596-1992. Washington, DC: IEEE Computer Society.
- IEEE Computer Society. 1995. IEEE Standard for Cache Optimization for Large Numbers of Processors Using the Scalable Coherent Interface (SCI) Draft 0.35 (September). Washington, DC: IEEE Computer Society.
- Iftode, L., C. Dubnicki, E. W. Felten, and K. Li. 1996. Improving Release-Consistent Shared Virtual Memory Using Automatic Update. Proc. Second Symposium on High Performance Computer Architecture (February):14-25.
- Iftode, L., J. P. Singh, and K. Li. 1996a. Understanding Application Performance on Shared Virtual Memory Systems. Proc. 23rd Int'l Symposium on Computer Architecture (April):122-133.
- Iftode, L., J. P. Singh, and K. Li. 1996b. Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Proc. Symposium on Parallel Algorithms and Architectures (June).
- Intel Corporation. 1994. i750, i860, i960 Processors and Related Products. Santa Clara, CA: Intel Corp.
- Intel Corporation. 1996. Pentium® Pro Family Developers Manual. Santa Clara, CA: Intel Corp.
- Jeremiassen, T. E., and S. J. Eggers. 1991. Eliminating False Sharing. Proc. 1991 Int'l Conference on Parallel Processing (August):377-381.
- Jiang, D., H. Shan, and J. P. Singh. 1997. Application Restructuring and Performance Portability on Shared Virtual Memory and Hardware-Coherent Multiprocessors. Proc. Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (June):217-229.
- Jiang, D., and J. P. Singh. 1998. A Methodology and an Evaluation of the SGI Origin2000. Proc. SIGMETRICS Conference on Measurement and Modeling of Computer Systems (June).
- Joe, T. 1995. COMA-F: A Non-Hierarchical Cache Only Memory Architecture. Ph.D. diss., Computer Systems Laboratory, Stanford University (March).
- Joe, T., and J. L. Hennessy. 1994. Evaluating the Memory Overhead Required for COMA Architectures. Proc. 21st Int'l Symposium on Computer Architecture (April):82-93.
- Joerg, C. F. 1994. Design and Implementation of a Packet Switched Routing Chip. Tech. Report MIT/LCS/TR-482, MIT Laboratory for Computer Science (August).
- Joerg, C. F., and A. Boughton. 1991. The Monsoon Interconnection Network. Proc. ICCD (October).
- Johnson, M. 1991. Superscalar Microprocessor Design. Englewood Cliffs, NJ: Prentice Hall.
- Jordan, H. F. 1985. HEP Architecture, Programming, and Performance. In Parallel MIMD Computation: The HEP Supercomputer and Its Applications. Edited by J. S. Kowalik. Cambridge, MA: MIT Press, 8.
- Jouppi, N. P. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. Proc. 17th Annual Symposium on Computer Architecture (June):364-373.
- Jouppi, N. P., and P. Ranganathan. 1997. The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at the Annual Int'l Symposium on Computer Architecture (ISCA) '97 (June).
- Jouppi, N. P., and D. Wall. 1989. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. ASPLOS III, 272-282.
- Kagi, A., D. Burger, and J. R. Goodman. 1997. Efficient Synchronization: Let Them Eat QOLB. Proc. 24th Int'l Symposium on Computer Architecture (ISCA) (June):170-180.
- Karlin, A. R., M. S. Manasse, L. Rudolph, and D. D. Sleator. 1986. Competitive Snoopy Caching. Proc. 27th Annual IEEE Symposium on Foundations of Computer Science.
- Karol, M., M. Hluchyj, and S. Morgan. 1987. Input versus Output Queueing on a Space Division Packet Switch. IEEE Transactions on Communications 35(12):1347-1356.
- Karp, R., U. Vazirani, and V. Vazirani. 1990. An Optimal Algorithm for On-Line Bipartite Matching. Proc. 22nd ACM Symposium on the Theory of Computing (May):352-358.
- Kaxiras, S. 1996. Kiloprocessor Extensions to SCI. Proc. 10th Int'l Parallel Processing Symposium.
- Kaxiras, S., and J. Goodman. The GLOW Cache Coherence Protocol Extensions for Widely Shared Data. Proc. Int'l Conference on Supercomputing (May):35-43.
- Keeton, K. K., T. E. Anderson, and D. A. Patterson. 1995. LogP Quantified: The Case for Low-Overhead Local Area Networks. Hot Interconnects III: Symposium on High Performance Interconnects (August).
- Keleher, P., A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. 1994. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Proc. Winter USENIX Conference (January):115-132.
- Keleher, P., A. L. Cox, and W. Zwaenepoel. 1992. Lazy Consistency for Software Distributed Shared Memory. Proc. 19th Int'l Symposium on Computer Architecture (May):13-21.
- Kermani, P., and L. Kleinrock. 1979. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks 3 (September):267-286.
- Kessler, R. E., and J. L. Schwarzmeier. 1993. Cray T3D: A New Dimension for Cray Research. Proc. Papers, COMPCON Spring '93, San Francisco (February):176-182.
- Knuth, D. E. 1966. Additional Comments on a Problem in Concurrent Programming Control. Communications of the ACM 9(5):321-322.
- Koelbel, C., D. Loveman, R. Schreiber, G. Steele, and M. Zosel. 1994. The High Performance Fortran Handbook. Cambridge, MA: MIT Press.
- Koeninger, R. K., M. Furtney, and M. Walker. 1994. A Shared Memory MPP from Cray Research. Digital Technical Journal 6(2):8-21.
- Kogge, P. M. 1994. EXECUBE--A New Architecture for Scalable MPPs. 1994 Int'l Conference on Parallel Processing (August):177-184.
- Kontothanassis, L. I., G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. Scott. 1997. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. Proc. 24th Int'l Symposium on Computer Architecture (June).
- Kontothanassis, L. I., and M. L. Scott. 1996. Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory. Proc. Second Symposium on High Performance Computer Architecture (February):166-177.
- Konstantinidou, S., and L. Snyder. 1991. Chaos Router: Architecture and Performance. Proc. 18th Annual Symposium on Computer Architecture (May):212-221.
- Krishnamurthy, A., K. E. Schauser, C. J. Scheiman, R. Y. Wang, D. E. Culler, and K. Yelick. 1996. Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines. ACM SIGPLAN Notices 31(9):37-48.
- Krishnamurthy, A., and K. A. Yelick. 1994. Optimizing Parallel SPMD Programs. Seventh Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY (August).
- Krishnamurthy, A., and K. A. Yelick. 1995. Optimizing Parallel Programs with Explicit Synchronization. Programming Language Design and Implementation, 196-204.
- Krishnamurthy, A., and K. A. Yelick. 1996. Analyses and Optimizations for Shared Address Space Programs. JPDC 38(2):130-144.
- Kroft, D. 1981. Lockup-Free Instruction Fetch/Prefetch Cache Organization. Proc. Eighth Int'l Symposium on Computer Architecture (May):81-87.
- Kronenberg, N. R. H. Levy, and W. D. Strecker. 1986. Vax Clusters: A Closely-Coupled Distributed System. ACM Transactions on Computer Systems 4(2):130-146.
- Kruskal, C. P., and M. Snir. 1983. The Performance of Multistage Interconnection Networks for Multiprocessors. IEEE Transactions on Computers C-32(12):1091-1098.
- Kubiatowicz, J., and A. Agarwal. 1993. The Anatomy of a Message in the Alewife Multiprocessor. Proc. Int'l Conference on Supercomputing (July):195-206.
- Kuehn, J. T., and B. J. Smith. 1988. The Horizon Supercomputing System: Architecture and Software. Proc. Supercomputing '88 (November):28-34.
- Kumar, M. 1992. Unique Design Concepts in GF11 and Their Impact on Performance. IBM Journal of Research and Development 36(6):990-1000.
- Kumar, V., A. Grama, A. Gupta, and G. Karypis. 1994. Introduction to Parallel Computing: Design and Analysis of Algorithms. Redwood City, CA: Benjamin/Cummings Publishing Company.
- Kumar, V., and A. Gupta. 1991. Analysis of Scalability of Parallel Algorithms and Architectures: A Survey. Proc. Int'l Conference on Supercomputing (June):396-405.
- Kung, H. T., R. Sansom, S. Schlick, P. A. Steenkiste, M. Arnould, F. J. Bitz, F. Christianson, E. C. Cooper, O. Menzilcioglu, D. Ombres, and B. Zill. 1989. Network-Based Multicomputers: An Emerging Parallel Architecture. Proc. Supercomputing '91 Conference (November):664-673.
- Kurihara, K., D. Chaiken, and A. Agarwal. 1991. Latency Tolerance through Multithreading in Large-Scale Multiprocessors. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):91-101.
- Kuskin, J., D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. 1994. The Stanford FLASH Multiprocessor. Proc. 21st Int'l Symposium on Computer Architecture (April):302-313.
- Lam, M. S., and R. P. Wilson. 1992. Limits of Control Flow on Parallelism. Proc. 19th Annual Int'l Symposium on Computer Architecture (May):46-57.
- Lamport, L. 1979. How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers C-28(9):690-691.
- Larus, J. R., B. Richards, and G. Viswanathan. 1996. Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language. In Parallel Programming Using C++. Edited by G. V. Wilson and P. Lu. Cambridge, MA: MIT Press.
- Laudon, J. 1994. Architectural and Implementation Tradeoffs in Multiple-Context Processors. Ph.D. diss., Stanford University, Stanford, California. Also published as Tech. Report #CSL-TR-94-634, Computer Systems Laboratory, Stanford University (May).
- Laudon, J., A. Gupta, and M. Horowitz. 1994. Architectural and Implementation Tradeoffs in the Design of Multiple-Context Processors. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. A. Iannucci. Dordrecht, Netherlands; Norwell, MA: Kluwer Academic Publishers, 167-200.
- Laudon, J. P., and D. Lenoski. 1997. The SGI Origin: A ccNUMA Highly Scalable Server. Proc. 24th Int'l Symposium on Computer Architecture.
- Lawton, J. V., J. J. Brosnan, M. P. Doyle, S. D. O'Riordain, and T. G. Reddin. 1996. Building a High-Performance Message-Passing System for MEMORY CHANNEL Clusters. Digital Technical Journal 8(2):96-116.
- Lee, C. G. 1989. Multi-Step Gradual Rounding. IEEE Transactions on Computers 38(4):595-600.
- Lee, R. L., A. Y. Kwok, and F. A. Briggs. 1991. The Floating Point Performance of a Superscalar SPARC Processor. Proc. 4th Symposium on Architectural Support for Programming Languages and Operating Systems (April):28-37.
- Leighton, F. T. 1992. Introduction to Parallel Algorithms and Architectures. San Francisco: Morgan Kaufmann.
- Leiserson, C. E. 1985. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers C-34(10):892-901.
- Leiserson, C. E., Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, and R. Zak. 1996. The Network Architecture of the Connection Machine CM-5. Journal of Parallel and Distributed Computing 33(2):145-158. Also in Proc. Fourth Symposium on Parallel Algorithms and Architectures '92 (June):272-285.
- Lenoski, D. 1992. The Stanford DASH Multiprocessor. Ph.D. diss., Computer Systems Laboratory, Stanford University.
- Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta, and J. L. Hennessy. 1990. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proc. 17th Int'l Symposium on Computer Architecture (May):148-159.
- Lenoski, D., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. 1992. The DASH Prototype: Implementation and Performance. Proc. 19th Int'l Symposium on Computer Architecture, Gold Coast, Australia (May):92-103.
- Lenoski, D., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. 1993. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems 4(1):41-61.
- Li, K., and P. Hudak. 1989. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems 7(4):321-359.
- Li, S.-Y. 1988. Theory of Periodic Contention and Its Application to Packet Switching. Proc. INFOCOM '88 (March):320-325.
- Lim, B.-H., and A. Agarwal. 1994. Reactive Synchronization Algorithms for Multiprocessors. Proc. Sixth Int'l Conference on Architectural Support for Programming Languages and Operating Systems, 25-35.
- Linder, D., and J. Harden. 1991. An Adaptive Fault Tolerant Wormhole Strategy for k-ary n-cubes. IEEE Transactions on Computers C-40(1):2-12.
- Lipton, R., and J. Sandberg. 1988. PRAM: A Scalable Shared Memory. Tech. Report #CS-TR-180-88, Computer Science Dept., Princeton University (September).
- Litzkow, M., M. Livny, and M. W. Mutka. 1988. Condor--A Hunter of Idle Workstations. Proc. Eighth Int'l Conference of Distributed Computing Systems (June):104-111.
- Lo, J. L., S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen. 1997. Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems (August).
- Lonergan, W., and P. King. 1961. Design of the B 5000 System. Datamation 7(5):28-32.
- Lovett, T., and R. Clapp. 1996. STiNG: A CC-NUMA Computer System for the Commercial Marketplace. Proc. 23rd Int'l Symposium on Computer Architecture (May):308-317.
- Luk, C.-K., and T. C. Mowry. 1996. Compiler-Based Prefetching for Recursive Data Structures. Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (October):222-233.
- Lukowsky, J., and S. Polit. 1997 (date accessed). IP Packet Switching on the GIGAswitch/FDDI System. http://www.networks.digital.com:80/dr/techart/gsfip-mn.hlml.
- Mainwaring, A., B. Chun, S. Schleimer, and D. Wilkerson. 1997. System Area Network Mapping. Proc. Ninth Annual ACM Symposium on Parallel Algorithms and Architecture, Newport, RI (June):116-126.
- Mainwaring, A., and D. E. Culler. 1996. Active Message Applications Programming Interface and Communication Subsystem Organization. Tech. Report CSD-96-918, University of California at Berkeley.
- Martin, R. 1994. HPAM: An Active Message Layer of a Network of Workstations. Presented at Hot Interconnects II (August).
- Massalin, H., and C. Pu. 1991. A Lock-Free Multiprocessor OS Kernel. Tech. Report CUCS-005-01, Columbia University, Computer Science Dept. (October).
- Matelan, N. 1985. The FLEX/32 Multicomputer. Proc. 12th Annual Int'l Symposium on Computer Architecture, Boston, MA. (Flex) (June):209-213.
- May, C., E. Silha, R. Simpson, and H. Warren, eds. 1994. The PowerPC Architecture: A Specification for a New Family of RISC Processors. San Francisco: Morgan Kaufmann.
- McCreight, E. 1984. The Dragon Computer System: An Early Overview . Tech. Report, Xerox Corp. (September).Google Scholar
- Mellor-Crummey, J. and M. Scott. 1991. Algorithms for Scalable Synchronization on Shared Memory Mutiprocessors. ACM Transactions on Computer Systems 9(1):21-65. Google ScholarDigital Library
- Melvin, S., and Y. Patt. 1991. Exploiting Fine-Grained Parallelism through a Combination of Hardware and Software Techniques. Proc. Annual Int'l Symposium on Computer Architecture (ISCA) , 287-296. Google Scholar
- Michael, M., and M. Scott. 1996. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proc. 15th Annual ACM Symposium on Principles of Distributed Computing , Philadelphia, PA (May): 267-276. Google Scholar
- Minnich, R., D. Burns, and F. Hady. 1995. The Memory-Integrated Network Interface. IEEE Micro 15(1): 11-20. Google ScholarDigital Library
- MIPS Technologies. 1991. MIPS R4000 Users Manual . Mountain View, CA: MIPS Technologies.Google Scholar
- MIPS Technologies. 1996. RI0000 Microprocessor User's Manual, Version 1.1 (January). Mountain View, CA: MIPS Technologies.Google Scholar
- Miyoshi, H.; M. Fukuda, T. Iwamiya, T. Nakamura, M. Tuchiya, M. Yoshida, K. Yamamoto, Y. Yamamoto, S. Ogawa, Y. Matsuo, T. Yamane, M. Takamura, M. Ikeda, S. Okada, Y. Sakamoto, T. Kitamura, H. Hatama, M. Kishimoto, M. Arnould, F. J. Bitz, E. C. Cooper, H. T. Kung, R. Sansom, S. Schlick, P. A. Steenkiste, and B. Zill. 1994. Development and Achievement of NAL Numerical Wind Tunnel (NWT) for CFD Computations. Proc. Supercomputing '94 , Washington, DC (November):685-692. Google Scholar
- Mowry, T. C. 1994. Tolerating Latency through Software-Controlled Data Prefetching . Ph.D. diss., Computer Systems Laboratory, Stanforcf University. Also published as Tech. Report #CSL-TR-94-628. Computer Systems laboratory, Stanford University (June). Google Scholar
- MPI Forum. 1993. Document for a Standard Message-Passing Interface . Tech. Report CS-93-214. University of Tennessee, Knoxville, Computer Science Dept. (November). Google Scholar
- MPI Forum. 1994. MPI: A Message Passing Interface. Int'l Journal of Supercomputing Applications 8(3/4). Special Issue on MPI. (updated 5/95). Also published in Proc. Supercomputing '93 Conference (May). Los Alamitos, CA: IEEE Computer Society Press. 878-883. Updated spec at http://www.mcs.anl.gov/mpi/. Google Scholar
- Mukherjee, S., and M. Hill. 1997. A Case for Making Network Interfaces Less Peripheral. Hot Interconnects (August).Google Scholar
- NAS Parallel Benchmarks. 1998 (date accessed). http://science.nas.nasa.gov/Software/NPB/.Google Scholar
- Nayfeh, B. A., L. Hammond, K. Olukoton. 1996. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. Proc. 23rd Annual Int'l Symposium on Computer Architecture (May). New York: ACM Press, 67-77. Google Scholar
- Nestle, E., and A. Inselberg. 1985. The Synapse N+1 System: Architectural Characieristics and Performance Data of a Tightly-Coupled Multiprocessor System. Proc. 12th Annual Int'l Symposium on Computer Architecture , Boston, MA (Synapse) (June):233-239. Google Scholar
- Ngai, J., and C. Seitz. 1989. A Framework for Adaptive Routing in Multicomputer Networks. Proc. 1989 Symposium on Parallel Algorithms and Architectures (June):2-10. Google Scholar
- Nickolls, J. R. 1990. The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer. COMPCON Spring '90, Digest of Papers , San Francisco, CA (February/March):25-28.Google Scholar
- Nikhil, R. S., and Arvind. 1989. Can Dataflow Subsume von Neumann Computing? Proc. 16th Annual Int'l Symposium on Computer Architecture (May):262-72. Google Scholar
- Nikhil, R., G. Papadopoulos, and Arvind. 1993. *T: A Multithreaded Massively Parallel Architecture. Proc. Annual Int'l Symposium on Computer Architecture (ISCA) '93 (May):156-167. Google Scholar
- Noakes, M. D., D. A. Wallach and W.J. Dally. 1993. The J-Machine Multicomputer: An Architectural Evaluation. Proc. 20th Int'l Symposium on Computer Architecture (May):224-235. Google Scholar
- Nuth, P., and W.J. Dally. 1992. The J-Machine Network. Proc. Int'l Conference on Computer Design: VLSI in Computers and Processors (October). Google Scholar
- Nuth, P., and W.J. Dally. 1995. The Named-State Register File: Implementation and Performance. Proc. First Int'l Symposium on High-Performance Computer Architecture (January):4-13. Google Scholar
- O'Krafka, B., and A. Newton. 1990. An Empirical Evaluation of Two Memory-Efficient Directory Methods. Proc. 17th Int'l Symposium on Computer Architecture (May):138-147. Google Scholar
- Office of Science and Technology Policy. 1993. Grand Challenges 1993: High Performance Computing and Communications, A Report by the Committee on Physical, Mathematical, and Engineering Sciences . Washington, DC: Office of Science and Technology Policy.Google Scholar
- Ohara, M. 1996. Producer-Oriented versus Consumer-Oriented Prefetching: A Comparison and Analysis of Parallel Application Programs . Ph.D. diss., Computer Systems Laboratory. Stanford University. Available as Tech. Report #CSL-TR-96-695, Stanford University (June). Google Scholar
- Olukotun, K., B. A. Nayfeh, L. Hammond, K. Wilson and K. Chang. 1996. The Case for a Single-Chip Multiprocessor. Proc. ASPLOS (October):2-11. Google Scholar
- Omondi, A. R. 1994. Ideas for the Design of Multithreaded Pipelines. In Multithreaded Computer Architecture: A Summary of the State of the Art . Edited by R. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 1994. See also A. R. Omondi, Design of a High Performance Instruction Pipeline. Computer Systems Science and Engineering 6(1):13-29 (1991).Google Scholar
- Pacheco, P. 1996. Parallel Programming with MPI . San Francisco: Morgan Kaufmann. Google Scholar
- Padegs, A. 1981. System/360 and Beyond. IBM Journal of Research and Development 25(5):377-390. Google ScholarDigital Library
- Pai, V. S., P. Ranganathan, S. V. Adve, and T. Harton. 1996. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (October):12-23. Google Scholar
- Papadimitriou, C. H. 1979. The Serializability of Concurrent Database Updates. Journal of the ACM 26(4):631-653. Google ScholarDigital Library
- Papadopoulos, G. M., and D. E. Culler. 1990. Monsoon: An Explicit Token-Store Architecture. Proc. 17th Annual Int'l Symposium on Computer Architecture , Seattle, WA (May):82-91. Google Scholar
- Papamarcos, M., and J. Patel. 1984. A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proc. 11th Annual Int'l Symposium on Computer Architecture (June):348-354. Google Scholar
- PARKBENCH Committee. 1994. Public International Benchmarks for Parallel Computers. Scientific Programming 3(2). Also published as Tech. Report CS93-213, University of Tennessee, Knoxville, Dept. of Computer Science (November). Google Scholar
- Patterson, D. A. 1995. Microprocessors in 2020. Scientific American (September).Google Scholar
- Patterson, D. A., T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A Case for Intelligent RAM. IEEE Micro 17(2):34-44. Google ScholarDigital Library
- Peterson, L., and B. Davie. 1996. Computer Networks . San Francisco: Morgan Kaufmann. Google Scholar
- Pfeiffer, W., S. Hotovy, N. Nystrom, D. Rudy, T. Sterling, M. Straka. 1995 (date accessed). JNNIE: The Joint NSF-NASA Initiative on Evaluation, http://www.tc.cornell.edu/JNNIE/finrep/jnnie.html.Google Scholar
- Pfister, G. F. 1995. In Search of Clusters--The Coming Battle for Lowly Parallel Computing . Englewood Cliffs. NJ: Prentice Hall. Google Scholar
- Pfister, G. F., W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliff, E. A. Melton, V. A. Norton, and J. Weiss. 1985. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. Proc. Int'l Conference on Parallel Processing (August):264-771.Google Scholar
- Pfister, G. F., and V. A. Norton. 1985. Hot Spot Contention and Combining Multistage Interconnection Networks. IEEE Transactions on Computers C-34(10).Google ScholarCross Ref
- Pierce, P. 1988. The NX/2 Operating System. Proc. Third Conference on Hypereube Concurrent Computers and Applications (January):384-390. Google Scholar
- Pierce, P., and G. Reenter. 1994. The Paragon Implementation of the NX Message Passing Interface. Proc. Scalable High-Performance Computing Conference (May):184-90.Google Scholar
- Porter, R. E. 1960. Datamation 6(1):8-14.Google Scholar
- Przybylski, S., M. Horowitz, J. L. Hennessy. 1988. Performance Tradeoffs in Cache Design. Proc. 15th Annual Symposium on Computer Architecture (May):290-298. Google Scholar
- Ranganathan, P. V. S. Pai, H. Abdel-Shafi, and S. V. Adve. 1997. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. Proc. 24th Int'l Symposium on Computer Architecture (June). Google Scholar
- Ratner, J. 1985. Concurrent Processing: A New Direction in Scientific Computing. Proc. 1985 National Computing Conference , 835.Google Scholar
- Reddaway, S. F. 1973. DAP--A Distributed Array Processor. First Annual Int'l Symposium on Computer Architecture (Dccember):61-65. Google Scholar
- Reinhardt, S. K., M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood. 1993. The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers. Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May):48-60. Google Scholar
- Reinhardt, S. K., J. R. Larus, and D. A. Wood. 1994. Tempest and Typhoon: User-Level Shared Memory. Proc. 21st Int'l Symposium on Computer Architecture (April):325-337. Google Scholar
- Reinhardt, S. K., R. W. Pfile, and D. A. Wood. 1996. Decoupled Hardware Support for Distributed Shared Memory. Proc. 23rd Int'l Symposium on Computer Architecture (May):34-43. Google Scholar
- Rettberg, R., W. Crowther, P. Carvey, and R. Tomlinson. 1990. The Monarch Parallel Processor Hardware Design. IEEE Computer (April):18-30. Google ScholarDigital Library
- Rettberg, R., and R. Thomas. 1986. Contention is No Obstacle to Shared-Memory Multiprocessing. Communications of the ACM 29(12):1202-1212. Google ScholarDigital Library
- Rinard, M. C., D. J. Scales, and M. S. Lam. 1993. Jade: A High-Level. Machine-Independent Language for Parallel Programming. IEEE Computer 26(6). Google Scholar
- Rodgers, D. 1985. Improvements on Multiprocessor System Design. Proc. 12th Annual Int'l Symposium on Computer Architecture , Boston, MA (Sequent B8000) (June):225-231. Google Scholar
- Rosenblum, M., S. A. Herrod, E. Witchel, and A. Gupta. 1995. Complete Computer Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology 3(4). Google Scholar
- Rosenburg, B. 1989. Low-Synchronization Translation Lookaside Buffer Consistency in Large-Scale Shared-Memory Multiprocessors. Proc. Symposium on Operating Sysrems Principles (December). Google Scholar
- Rothberg, E., J. P. Singh, and A. Gupta. 1993. Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors. Proc. 20th Int'l Symposium on Computer Architecture (May):14-25. Google Scholar
- Russel, R. M. 1978. The CRAY-1 Computer System. Communications of the ACM 21(1):63-72. Google ScholarDigital Library
- Saavedra-Barrera, R. H., D. E. Culler, T. von Eicken. 1990. Analysis of Multithreaded Architectures for Parallel Computing. Second Annual ACM Symposium on Parallel Algorithms and Architectures (July): 169-178. Google Scholar
- Saavedra, R. H., R. S. Gaines, and M.J. Carlton. 1993. Micro Benchmark Analysis of the KSR1. Proc. Supercomputing '93 , Portland, OR (November):202-213. Google Scholar
- Saavedra, R. H., and A. J. Smith. 1996. Analysis of Benchmark Characteristics and Benchmark Performance Prediction. ACM Transactions on Computer Systems 14(4);344-384. Google ScholarDigital Library
- Sakai, S., Y. Kodama and Y. Yamaguchi 1991. Prototype Implementation of a Highly Parallel Dataflow Machine EM4. Proc. Fifth Int'l Parallel Processing Symposium . Anaheim, CA (April/May):278-286. Google Scholar
- Salmon, J. 1990. Parallel Hierarchical N-body Methods . Ph.D. diss., California Institute of Technology. Google Scholar
- Salmon, J. K., M. S. Warren, and G. S. Winckelmans. 1994. Fast Parallel Tree codes for Gravitational and Fluid Dynamical N-body Problems. Intl. Journal of Supercomputer Applications 8:129-142. Google ScholarDigital Library
- Samanta, R., A. Bilas, L. Iftode, and J. R Singh. 1998. Home-Based SVM Protocols for SMP Clusters: Design, Simulations, Implementation, and Performance. Proc. 23rd Annual Int'l Symposium on Computer Architecture (February).Google Scholar
- Saulsbury, A., F. Pong, and A. Nowatzyk 1996. Missing the Memory Wall: The Case for Processor/Memory Integration. Proc. 23rd Annual Int'l Symposium on Computer Architecture (May):90-101. Google Scholar
- Saulsbury, A., T. Wilkinson, J. Carter, and A. Landin. 1995. An Argument For Simple COMA Proc. First IEEE Sympostum on High Performance Computer Architecture (January):276-285. Google Scholar
- Savage, J. 1985. Parallel Processing as a Language Design Problem. Proc. 12th Annual Int'l Symposium on Computer Architecture , Boston, MA (Myrias 4000) (June):221-224. Google Scholar
- Scales, D. J., K. Gharachorloo and C. A. Thekkath. 1996. Shasta: A Low Overhead. Software-Only Approach for Supporting Fine-Grain Shared Memory. Proc. Seventh Int'l Conference on Architectural Support for Programming, Languages and Operating Systems (October):174-185. Google Scholar
- Scales, D. J., and M. S. Lam. 1994. The Design and Evaluation of a Shared Object System for Distributed Memory Machines. Proc. First Symposium on Operating System Design and Implementation (November):101-114. Google Scholar
- Schanin, D.J. 1986. The Design and Development of a Very High Speed System Bus--The Encore Multimax Nanobus. In Proc. Fall Joint Computer Conference (Encore) , Dallas, TX (November). Edited by H. S. Stone. Los Alamitos: IEEE Computer Society Press, 410-418. Google Scholar
- Schauser, K. E., and C. J. Scheiman. 1995. Experience with Active Messages on the Meiko CS-2. Proc. Ninth Int'l Symposium on Parallel Processing (IPPS'95) (April):140-149. Google Scholar
- Scheurich, C. and M. Dubois. 1987. Correct Memory Operation of Cache-Based Multiprocessors. Proc. 14th Int'l Symposium on Computer Architecture (June):234-243. Google Scholar
- Schoinas, I., B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. 1994. Fine-Grain Access Control for Distributed Shared Memory. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):297-306. Google Scholar
- Schroeder, M. D., A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite and C. P. Thacker. 1991. Autonet: A High-Speed. Self-Configuring Local Area Network Using Point-to-Point Links. IEEE Journal on Selected Areas in Communications 9(8):1318-1335. Google Scholar
- Schwiebert, L., and D. N. Jayasimha. 1995. A Universal Proof Technique for Deadlock-Free Routing in Interconnection Networks. Symposium on Parallel Algorithms and Architecture (July):175-184. Google Scholar
- Scott, S. 1991. A Cache-Coherence Mechanism for Scalable Shared-Memory Multiprocessors. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):49-59.Google Scholar
- Scott, S. 1996. Synchronization and Communication in the T3E Multiprocessor. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):26-36, Cambridge. MA. Google Scholar
- Scott, S., and J. R. Goodman. 1993. Performance of Pruning Cache Directories for Large-Scale Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 4(5):520-534. Google ScholarDigital Library
- Scott, S. 1994. The Impact of Pipelined Channels on k-ary n-Cube Networks. IEEE Transactions on Parallel and Distributed Systems 5(1):2-16. Google ScholarDigital Library
- Scott, S., M. Vernon, and J. R. Goodman. 1992. Performance of the SCI Ring. Proc. 19th Int'l Symposium on Computer Architecture (May):403-414. Google Scholar
- Seitz, C. L. 1984. Concurrent VLSI Architectures. IEEE Transactions on Computers 33(12):1247-1265. Google ScholarDigital Library
- Seitz, C. L. 1985. The Cosmic Cube. Communications of the ACM 28(1):22-33. Google ScholarDigital Library
- Seitz, C. L., and W.-K. Su. 1993. A Family of Routing and Communication Chips Based on Mosaic. Proc. of Univ. of Washington Symposium on Integrated Systems . Cambridge, MA: MIT Press, 320-337. Google Scholar
- Shah, G., J. Nieplocha, J. Mirza, C. Kim, R. Harrison, R. K. Govindaraju, K. Gildea, P. DiNicola, and C. Bender. 1998. Performance and Experience with LAPI--A New High-Performance Communicalion Library for the IBM RS/6000 SP. Twelfth Int'l Parallel Processing Symposium (March):260-266. Google Scholar
- Shasha, D., and M. Snir. 1988. Efficient and Correct Execution of Parallel Programs that Share Memory. ACM Transactions on Programming Languages and Operating Systems 10(2):282-312. Google ScholarDigital Library
- Shimada, T., K. Hiraki and K. Nishida. 1984. An Architecture of a Data Flow Machine and Its Evaluation. Proc. COMPCON '84 , 486-90.Google Scholar
- Simoni, R., and M. Horowitz. 1991. Dynamic Pointer Allocation for Scalable Cache Coherence Directories. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):72-81.Google Scholar
- Sindhu, R.J.-M. Frailong, and M. Cekleov. 1991. Formal Specification of Memory Models . Tech. Report (PARC) CSL-91-11. Xerox Corp., Palo Alto Research Center, Palo Alto, CA.Google Scholar
- Sindhu, P., et al. 1993. XDBus: A High-Performance, Consistent, Packet Switched VLSI Bus. Proc. COMPCON (Spring):338-344.Google Scholar
- Singh, J. P. 1993. Parallel Hierarchical N-body Methods and Their Implications for Multiprocessors . Ph.D. diss., Tech. Report #CSL-TR-93-565. Stanford University (March). Google Scholar
- Singh, J. P. 1998. Some Aspects of Controlling Scheduling in Handware Control Prefetching . To be published as Tech. Report, Princeton University, Computer Science Dept.Google Scholar
- Singh, J. P., A. Gupta, and M. Levoy. 1994. Parallel Visualization Algorithms: Performance and Architectural Implications. IEEE Computer 27(6). Google Scholar
- Singh, J. P., J. L. Hennessy and A. Gupta. 1993. Scaling Parallel Programs for Multiprocessors: Methodology and Examples. IEEE Computer 26(7):42-50. Google ScholarDigital Library
- Singh, J. P., J. L. Hennessy and A. Gupta. 1995. Implications of Parallel Hierarchical N-body Applications for Multiprocessors. ACM Transactions on Computer Systems (May). Google Scholar
- Singh, J. P., C. Holt, T. Totsuka, A. Gupta, and J. L. Hennessy. 1995. Load Balancing and Data Locality in Hierarchial N-body Methods: Barnes-Hut, Fast Multipole and Radiosity. Journal of Parallel and Distributed Computing (June). Google ScholarDigital Library
- Singh, J. P., T. Joe, A. Gupta, and J. L. Hennessy. 1993. An Empirical Comparison of the KSR-1 and DASH Multiprocessors. Proc. Supercomputing '93 (November). Google Scholar
- Singh, J. P., E. Rothberg, and A. Gupta. 1994. Modeling Communication in Parallel Algorithms: A Fruitful Interaction between Theory and Systems? Proc. 10th Annual ACM Symposium on Parallel Algorithms and Architectures . Google Scholar
- Singh, J. P. W-D. Weber, and A. Gupta. 1992. SPLASH: The Stanford Parallel. Applications for SHared Memory. Computer Architecture News 20(1):5-44. Google ScholarDigital Library
- Sites, R. L. ed. 1992 Alpha Architecture Reference Manual . Hudson. MA: Digital Press, Digital Equipment Corp. Google Scholar
- Slater, M. 1994. Intel Unveils Multiprocessor System Specification. Microprocessor Report (May):12-14.Google Scholar
- Slotnick, D. L. 1967. Unconventional Systems. Proc. AFIPS Spring Joint Computer Conference 30:477-481. Google Scholar
- Slotnick, D. L., W. C. Borck, and R. C. McReynolds. 1962. The Solomon Computer. Proc. AFIPS Fall Joint Computer Conference 22:97-107. Google Scholar
- Smith, A. J. 1982. Cache Memories. ACM Computing Surveys 14(3):473-530. Google ScholarDigital Library
- Smith, B. J. 1981. Architecture and Applications of the HEP Multiprocessor Computer System. Proc. SPIE: Real-Time Signal Processing IV 298(August):241-248.Google Scholar
- Smithm B. J. 1985. The Architecture of HEP. In Parallel MIMD Computation. The HEP Supercomputer and Its Applications . Edited by J.S. Kowalik. Cambridge, MA: MIT Press. 41-55. Google Scholar
- Smith, M. D., M. Johnson, and M. A. Horowitz. 1989. Limits on Multiple Instruction Issue. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems , 290-302, Apr. Google Scholar
- Snir, M., S. Otto, S. H. Lederman, D. Walker, and J. Dongarra. 1995. MPI: The Complete Reference . Cambridge, MA: MIT Press. Google ScholarDigital Library
- Sohi, G., S. Breach, and T. N. Vijaykumar. 1995 Multiscalar Processors. Proc 22nd Annual Int'l Symposium on Computer Architecture (June):414-425. Google Scholar
- SPEC (Standard Performance Evaluation Corporation). 1995 (date accessed). http://www.specbench.org/. (SPEC Benchmark Suite Release 1.0., 1989).Google Scholar
- Spertus, E., S. C. Goldstein, K. E. Schauser, T. von Eicken, D. E. Culler, W. J. Dally. 1993. Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5. Proc. 20th Annual Symposium on Computer Architecture (May):302-313. Google Scholar
- Stenstrom, P. T. Joe and A. Gupia. 1992. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures. Proc. 19th Int'l Symposium on Computer Architecture (May):80-91. Google Scholar
- Stets, R., S. Dwarkadas, N. Hardavellas, G. Hunt, L. Koniothanassis, S. Parthasarathy and M. Scott 1997. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. Proc. 16th ACM Symposium on Operating Systems Principles (October). Google Scholar
- Stone, H. S. 1970. A Logic-in-Memory Computer. IEEE Transactions on Computers C-19(1):73-78. Google ScholarDigital Library
- Stunkel, C. B., D. G. Shea, D. G. Grice, P. H. Hochschild and M. Tsao. 1994. The SP-1 High Performance Swiich. Proc. Scalable High Performance Computing Conference (May): 150-157 Knoxville, TN.Google Scholar
- Stunkel, C. B., et al. 1998 (date accessed). The SP2 Communication Subsystem . http://ibm.tc.cornell.edu/ibm/pps/doc/css/css.ps.Google Scholar
- SUN Microsystems. 1991. The SPARC Architecture Manual . #800-199-12. Version 8 (January). Mountain View, CA: SUN Microsystems.Google Scholar
- Sunderam, V. S. 1990. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience 2(4):315-339. Google ScholarDigital Library
- Sunderam, V. S., J. Dongarra, A. Geist, and R. Manchek. 1994. The PVM Concurrent Computing System: Evolution, Experiences, and Trends. Parallel Computing 20(4):531-547. Google ScholarDigital Library
- Swan, R.J., A. Bechtolsheim, K.-W. Lai, and J. K. Ousterhout. 1977. The Implementation of the CM* Multi-Microprocessor. Proc. AFIPS Conference/National Computer Conference (46):645-655. Google Scholar
- Swan, R. J., S. H. Fuller, and D. R Siewiorek. 1977. CM*--A Modular, Multi-Microprocessor. Proc. AFIPS Conference/National Computer Conference (46):637-44. Google Scholar
- Sweazey, P., and A.J. Smith. 1986. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. Proc. 13th Int'l Symposium on Computer Architecture (May):414-423. Google Scholar
- Tamir, Y., and G. L. Frazier. 1988. High-Performance Multi-Queue Buffers for VLSI Communication Switches. Proc. 15th Annual Int'l Symposium on Computer Architecture , 343-354. Google Scholar
- Tanenbaum, A. S., and A. S. Woodhull. 1997. Operating System Design and Implementation 2nd ed. Englewood Cliffs, NJ: Prentice Hall. Google Scholar
- Tang, C. 1976. Cache Design in a Tightly Coupled Multiprocessor System. Proc. AFIPS Conference (June):749-753. Google Scholar
- Teller, P. 1990. Translation-Lookaside Buffer Consistency. IEEE Computer 23(6):26-36. Google ScholarDigital Library
- Thacker, C. L. Stewart, and E. Satterthwaite, Jr. 1988. Firefly: A Multiprocessor Workstation. IEEE Transactions on Computers 37(8):909-20. Google ScholarDigital Library
- Thapar, M., and B. Delagi. 1990. Stanford Distributed-Directory Protocol. IEEE Computer 23(6):78-80. Google ScholarDigital Library
- Thekkath, R., A. P. Singh, J. P. Singh, J. Hennessy and S. John. 1997. An Application-Driven Evaluation of the Convex Exemplar SP-1200. Proc. Int'l Parallel Processing Symposium (June).Google Scholar
- Thompson, M., J. Barton, T. Jermoluk, and J. Wagner. 1988. Translation Lookaside Buffer Synchronization in a Multiprocessor System. Proc. USENIX Technical Conference (February).Google Scholar
- Thornton, J. E. 1964. Parallel Operation in the Control Data 6600. AFIPS Proc. Fall Joint Computer Conference , Part 2 26:33-40. Reprinted in Siework, Bell, and Newell. 1982. Computer Structures: Principles and Examples . New York; McGraw-Hill. Google Scholar
- Tomasulo, R. M. 1967. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development 11(1):25-33. Google ScholarDigital Library
- Torrellas, J., M. S. Lam, and J. L. Hennessy. 1994. False Sharing and Spatial Locality in Multiprocessor Caches. IEEE Transactions on Computers 43(6):651-663. Google ScholarDigital Library
- Transaction Processing Council. 1998. http://www.tpc.orgGoogle Scholar
- Traw, C., and J. Smith. 1991. A High-Performance Host Interface for ATM Networks. Proc. ACM SIGCOMM Conference (September):317-325. Google Scholar
- Traylor, R., and D. Dunning. 1992. Routing Chip Set for Intel Paragon Parallel Supercomputer. Proc. Hot Chips '92 Symposium (August).Google Scholar
- Tucker, L. W., and A. Mainwaring. 1994. CMMD: Active Messages on the CM-5. Parallel Computing 20(4):481-496. Google ScholarDigital Library
- Tucker, L. W., and G. G. Robertson. 1988. Architecture and Applications of the Connection Machine. IEEE Computer 21(8):26-38. Google ScholarDigital Library
- Tucker, S. 1986. The IBM 3090 System: An Overview. IBM Systems Journal 25(1):4-19. Google ScholarDigital Library
- Tullsen, D. M., and S.J. Eggers. 1993. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. Proc. 20th Annual Symposium on Computer Architecture (May):278-288. Google Scholar
- Tullsen, D. M., S.J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. 1996. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. 23rd Int'l Symposium on Computer Architecture (May): 191-202. Google Scholar
- Tullsen, D. M., S.J. Eggers. and H. M. Levy 1995. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proc. 20th Annual Symposium on Computer Architecture (June):278-288. Google Scholar
- Turner, J. S. 1988. Design of a Broadcast Packet Switching Network. IEEE Transactions on Communication 36(6):734-743.Google ScholarCross Ref
- Valiant, L. G. 1990. A Bridging Model for Parallel Computation. Communications of the ACM 33(8):103-111. Google ScholarDigital Library
- Valois, J. 1995. Lock-Free Linked Lists Using Compare-and-Swap. Proc. 14th Annual ACM Symposium on Principles of Distributed Computing , Ottawa, Canada (August):214-222. Google Scholar
- Vick, C. R., and J. A. Cornell 1978. PEPE Architecture--Present and Future. Proc. AFIPS Conference 47:981-1002.Google Scholar
- von Eicken, T. A. Basu and V. Buch. 1995. Low-Latency Communication Over ATM Using Active Messages. IEEE Micro 15(1):46-53. Google ScholarDigital Library
- von Eicken, T., D. E. Culler, S. C. Goldstein, and K. E. Schauser. 1992. Active Messages: A Mechanism for Integrated Communication and Computation. Proc. 19th Annual Int'l Symposium on Computer Architecture , Gold Coast, Australia (May) 256-266. Google Scholar
- Vranesic, Z., M. Stumm, D. Lewis, and R. White 1991. Hector: A Hierarchically Structured Shared Memory Multiprocessor. IEEE Computer 24(1):72-78. Google ScholarDigital Library
- Wall, D. W. 1991. Limits of Instruction-Level Parallelism. ASPLOS IV (April):176-188. Google Scholar
- Wallach, D. A. 1992. PHD: A Hierarchical Cache Coherence Protocol . S.M. thesis. Massachusetts Institute of Technology. Also available as Tech. Report #1389. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, MA (August).Google Scholar
- Wang, W-H., J.-L. Baer and H. M. Levy. 1989. Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy. Proc. 16th Annual Int'l Symposium on Computer Architecture (June):140-148. Google Scholar
- Warren, M. S., and J. K. Salmon. 1993. A Parallel Hashed Oct-Tree N-body Algorithm. Proc. Supercomputing '93 . Washington, DC: IEEE Computer Society, 12-21. Google Scholar
- Weaver, D., and T. Germond, eds. 1994. The SPARC Architecture Manual . SPARC International, Version 9. Englewood Cliffs, NJ: Prentice Hall. Google Scholar
- Weber, W.-D. 1993. Scalable Directories for Cache-Coherent Shared-Memory Multiprocessors Ph.D. diss., Computer Systems Laboratory, Stanford University (January). Also available as Tech. Report #CSL-TR-93-557. Stanford University.Google Scholar
- Weber, W.-D., S. Gold, P. Helland, T. Shimizu, T. Wicki and W. Wilcke. 1997. The Mercury Interconnect Architecture: A Cost-Effective Infrastructure for High-Performance Servers. Proc. 24th Int'l Symposium on Computer Architecture (June):98-107. Google Scholar
- Weiss, S. and J. Smith. 1994. Power and PowerPC . San Francisco: Morgan Kaufmann. Google Scholar
- Widdoes, L., Jr., and S. Correll. 1980. The S-1 Project: Developing High Performance Computers. Proc. COMPCON (Spring):282-291.Google Scholar
- Wilson, A., Jr. 1987. Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors. Proc. 14th Int'l Symposium on Computer Architecture (June):244-252. Google Scholar
- Wolf, M. E., and M. S. Lam. 1991. A Data Locality Optimizing Algorithm. Proc. ACM SIGPLAN'91 Conference on Programming Language Design and Implementation (June):30-44. Google Scholar
- Wolfe, M. 1989. Optimizing Supercompilers for Supercomputers . Cambridge, MA, MIT Press. Google Scholar
- Woo, S. C., J. P. Singh, and J. L. Hennessy. 1994. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):219-229, San Jose, CA. Google Scholar
- Woo, S. C. M. Ohara, E. J. Torrie, J. P. Singh, and A. Gupta. 1995. The SPLASH-2 Programs; Characterization and Methodological Considerations. Proc. 22nd Annual Int'l Symposium on Computer Architecture (June):24-36. Google Scholar
- Wood, D. A., S. Chandra, B. Falsafi, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, S. S. Mukherjee, S. Palacharla, and S. K. Reinhardt. 1993. Mechanisms for Cooperative Shared Memory. Proc. 20th Annual Symposium on Computer Architecture (May):156-167.
- Wood, D. A., S. J. Eggers, G. Gibson, M. D. Hill, J. M. Pendleton, S. A. Ritchie, G. S. Taylor, R. H. Katz, and D. A. Patterson. 1986. An In-Cache Address Translation Mechanism. Proc. 13th Annual Symposium on Computer Architecture (June):358-365.
- Wood, D. A., and M. D. Hill. 1995. Cost-Effective Parallel Computing. IEEE Computer 28(2):69-72.
- Woodbury, P., A. Wilson, B. Shein, I. Gertner, P. Y. Chen, J. Bartlett, and Z. Aral. 1989. Shared Memory Multiprocessors: The Right Approach to Parallel Processing. Proc. COMPCON (Spring):72-80.
- Wulf, W., R. Levin, and C. Pierson. 1975. Overview of the Hydra Operating System Development. Proc. 5th Symposium on Operating Systems Principles (November):122-131.
- Yamashita, N., T. Kimura, Y. Fujita, Y. Aimoto, T. Manaba, S. Okazaki, K. Nakamura, and M. Yamashina. 1994. A 3.84GIPS Integrated Memory Array Processor LSI with 64 Processing Elements and 2Mb SRAM. Int'l Solid-State Circuits Conference, San Francisco (February):260-261.
- Zekauskas, M. J., W. A. Sawdon, and B. N. Bershad. 1994. Software Write Detection for a Distributed Shared Memory. Proc. Operating Systems Design and Implementation Symposium (November):87-100.
- Zhang, Z., and J. Torrellas. 1995. Speeding Up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching. Proc. 22nd Annual Symposium on Computer Architecture (May):188-199.
- Zhou, Y., L. Iftode, and K. Li. 1996. Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems. Proc. Operating Systems Design and Implementation Symposium (October).
Cited By
- Wang F and Hsu C Rainbow Connection Number in Pyramid Networks Proceedings of the 11th International Conference on Computer Modeling and Simulation, (147-150)
- León-Sandoval E and Barbosa-Santillán L Data intensive parallel tree algorithm patterns based on GPUs Proceedings of the 2018 International Conference on Data Science and Information Technology, (69-73)
- Cairo M and Rizzi R The complexity of simulation and matrix multiplication Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, (2203-2214)
- Qin H, Liu Z, Liu Y and Zhong H (2017). An object-oriented MATLAB toolbox for automotive body conceptual design using distributed parallel optimization, Advances in Engineering Software, 106:C, (19-32), Online publication date: 1-Apr-2017.
- Kantert J, Tomforde S, Scharrer R, Weber S, Edenhofer S and Müller-Schloer C (2017). Identification and classification of agent behaviour at runtime in open, trust-based organic computing systems, Journal of Systems Architecture: the EUROMICRO Journal, 75:C, (68-78), Online publication date: 1-Apr-2017.
- Huang C, Kumar R, Elver M, Grot B and Nagarajan V C3D The 49th Annual IEEE/ACM International Symposium on Microarchitecture, (1-12)
- Ros A and Kaxiras S Racer The 49th Annual IEEE/ACM International Symposium on Microarchitecture, (1-13)
- Fernández-Pascual R, Ros A and Acacio M Optimization of a Linked Cache Coherence Protocol for Scalable Manycore Coherence Proceedings of the 29th International Conference on Architecture of Computing Systems -- ARCS 2016 - Volume 9637, (100-112)
- Dimakopoulou M, Eranian S, Koziris N and Bambos N Reliable and efficient performance monitoring in linux Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-13)
- Kuiper G, Geuns S and Bekooij M Utilization Improvement by Enforcing Mutual Exclusive Task Execution in Modal Stream Processing Applications Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, (28-37)
- Zhang G, Horn W and Sanchez D Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems Proceedings of the 48th International Symposium on Microarchitecture, (13-25)
- Daya B, Chen C, Subramanian S, Kwon W, Park S, Krishna T, Holt J, Chandrakasan A and Peh L SCORPIO Proceedings of the 41st annual international symposium on Computer architecture, (25-36)
- Daya B, Chen C, Subramanian S, Kwon W, Park S, Krishna T, Holt J, Chandrakasan A and Peh L (2014). SCORPIO, ACM SIGARCH Computer Architecture News, 42:3, (25-36), Online publication date: 16-Oct-2014.
- Liu C and Yang C Exploiting heterogeneity in MPSoCs to prevent potential trojan propagation across malicious IPs Proceedings of the 24th edition of the great lakes symposium on VLSI, (335-340)
- Carpenter A, Hu J, Kocabas O, Huang M and Wu H Enhancing effective throughput for transmission line-based bus Proceedings of the 39th Annual International Symposium on Computer Architecture, (165-176)
- Carpenter A, Hu J, Kocabas O, Huang M and Wu H (2012). Enhancing effective throughput for transmission line-based bus, ACM SIGARCH Computer Architecture News, 40:3, (165-176), Online publication date: 5-Sep-2012.
- Schor L, Bacivarov I, Rai D, Yang H, Kang S and Thiele L Scenario-based design flow for mapping streaming applications onto on-chip many-core systems Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems, (71-80)
- Nychis G, Fallin C, Moscibroda T, Mutlu O and Seshan S (2012). On-chip networks from a networking perspective, ACM SIGCOMM Computer Communication Review, 42:4, (407-418), Online publication date: 24-Sep-2012.
- Nychis G, Fallin C, Moscibroda T, Mutlu O and Seshan S On-chip networks from a networking perspective Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication, (407-418)
- Xue J, Garg A, Ciftcioglu B, Hu J, Wang S, Savidis I, Jain M, Berman R, Liu P, Huang M, Wu H, Friedman E, Wicks G and Moore D (2010). An intra-chip free-space optical interconnect, ACM SIGARCH Computer Architecture News, 38:3, (94-105), Online publication date: 19-Jun-2010.
- Kim H, Ahn J and Kim J Replication-aware leakage management in chip multiprocessors with private L2 cache Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, (135-140)
- Xue J, Garg A, Ciftcioglu B, Hu J, Wang S, Savidis I, Jain M, Berman R, Liu P, Huang M, Wu H, Friedman E, Wicks G and Moore D An intra-chip free-space optical interconnect Proceedings of the 37th annual international symposium on Computer architecture, (94-105)
- Li X and Hammami O (2009). An automatic design flow for data parallel and pipelined signal processing applications on embedded multiprocessor with NoC, International Journal of Reconfigurable Computing, 2009, (2-2), Online publication date: 1-Jan-2009.
- Ophelders F, Bekooij M and Corporaal H A tuneable software cache coherence protocol for heterogeneous MPSoCs Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, (383-392)
- Zeng H, Yourst M, Ghose K and Ponomarev D MPTLsim Proceedings of the 46th Annual Design Automation Conference, (226-231)
- Moonen A, Bekooij M, van den Berg R and van Meerbergen J Cache aware mapping of streaming applications on a multiprocessor system-on-chip Proceedings of the conference on Design, automation and test in Europe, (300-305)
- Bijlsma T, Bekooij M, Jansen P and Smit G Communication between nested loop programs via circular buffers in an embedded multiprocessor system Proceedings of the 11th international workshop on Software & compilers for embedded systems, (33-42)
- Poletti F, Poggiali A, Bertozzi D, Benini L, Marchal P, Loghi M and Poncino M (2007). Energy-Efficient Multiprocessor Systems-on-Chip for Embedded Computing, IEEE Transactions on Computers, 56:5, (606-621), Online publication date: 1-May-2007.
- Moreira O, Valente F and Bekooij M Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor Proceedings of the 7th ACM & IEEE international conference on Embedded software, (57-66)
- Cameron K, Ge R and Sun X (2007). $\log_{\rm n}{\rm P}$ and $\log_{3}{\rm P}$, IEEE Transactions on Computers, 56:3, (314-327), Online publication date: 1-Mar-2007.
- Tumeo A, Monchiero M, Palermo G, Ferrandi F and Sciuto D A design kit for a fully working shared memory multiprocessor on FPGA Proceedings of the 17th ACM Great Lakes symposium on VLSI, (219-222)
- Stuijk S, Basten T, Geilen M and Corporaal H Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs Proceedings of the 44th annual Design Automation Conference, (777-782)
- Acacio M, Gonzalez J, Garcia J and Duato J (2005). A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 16:1, (67-79), Online publication date: 1-Jan-2005.
- Bekooij M, Parmar S and van Meerbergen J Performance guarantees by simulation of process Proceedings of the 2005 workshop on Software and compilers for embedded systems, (10-19)
- Bhunia S, Datta A, Banerjee N and Roy K (2005). GAARP, IEEE Transactions on Computers, 54:6, (752-766), Online publication date: 1-Jun-2005.
- Bilardi G, Pietracaprina A, Pucci G, Schifano F and Tripiccione R The potential of on-chip multiprocessing for QCD machines Proceedings of the 12th international conference on High Performance Computing, (386-397)
- Brown J and Wen Z Toward an application support layer Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics, (912-919)
- Chaudhuri M and Heinrich M (2004). SMTp, ACM SIGARCH Computer Architecture News, 32:2, (124), Online publication date: 2-Mar-2004.
- Chaudhuri M and Heinrich M SMTp Proceedings of the 31st annual international symposium on Computer architecture
- Teo Y and Onggo B Formalization and strictness of simulation event orderings Proceedings of the eighteenth workshop on Parallel and distributed simulation, (89-96)
- Dongarra J, Foster I, Fox G, Gropp W, Kennedy K, Torczon L and White A References Sourcebook of parallel computing, (729-789)
- Goel A, Roychoudhury A and Mitra T (2003). Compactly representing parallel program executions, ACM SIGPLAN Notices, 38:10, (191-202), Online publication date: 1-Oct-2003.
- Goel A, Roychoudhury A and Mitra T Compactly representing parallel program executions Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, (191-202)
- Lepak K and Lipasti M (2002). Temporally silent stores, ACM SIGARCH Computer Architecture News, 30:5, (30-41), Online publication date: 1-Dec-2002.
- Mauer C, Hill M and Wood D Full-system timing-first simulation Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, (108-116)
- Mauer C, Hill M and Wood D (2002). Full-system timing-first simulation, ACM SIGMETRICS Performance Evaluation Review, 30:1, (108-116), Online publication date: 1-Jun-2002.
- Lepak K and Lipasti M (2002). Temporally silent stores, ACM SIGPLAN Notices, 37:10, (30-41), Online publication date: 1-Oct-2002.
- Lepak K and Lipasti M Temporally silent stores Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, (30-41)
- Lepak K and Lipasti M (2002). Temporally silent stores, ACM SIGOPS Operating Systems Review, 36:5, (30-41), Online publication date: 1-Dec-2002.
- Beaumont O, Boudet V and Robert Y A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors Proceedings of the 16th International Parallel and Distributed Processing Symposium
- Sorin D, Plakal M, Condon A, Hill M, Martin M and Wood D (2002). Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol, IEEE Transactions on Parallel and Distributed Systems, 13:6, (556-578), Online publication date: 1-Jun-2002.
- Martin M, Sorin D, Cain H, Hill M and Lipasti M Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, (328-337)
- Li T and John L (2001). ADir_pNB, IEEE Transactions on Computers, 50:9, (921-934), Online publication date: 1-Sep-2001.
- Kwak H, Lee B, Hurson A, Yoon S and Hahn W (1999). Effects of Multithreading on Cache Performance, IEEE Transactions on Computers, 48:2, (176-184), Online publication date: 1-Feb-1999.