Parallel Computer Architecture: A Hardware/Software Approach
September 1998
Publisher:
  • Morgan Kaufmann Publishers Inc.
  • 340 Pine Street, Sixth Floor
  • San Francisco
  • CA
  • United States
ISBN: 978-0-08-057307-6
Published: 29 September 1998
Pages: 1056
Abstract

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.

  • Synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
  • Presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
  • Describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case studies
  • Illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents

  1. Introduction
  2. Parallel Programs
  3. Programming for Performance
  4. Workload-Driven Evaluation
  5. Shared Memory Multiprocessors
  6. Snoop-based Multiprocessor Design
  7. Scalable Multiprocessors
  8. Directory-based Cache Coherence
  9. Hardware-Software Tradeoffs
  10. Interconnection Network Design
  11. Latency Tolerance
  12. Future Directions
  Appendix A. Parallel Benchmark Suites

References

  1. Abali, B., and C. Aykanat. 1994. Routing Algorithms for IBM SP1. Lecture Notes in Computer Science, Vol. 853. New York: Springer-Verlag, 161-175. Google ScholarGoogle Scholar
  2. Abdel-Shafi, H. A., J. Hall, S. V. Adve, and V. S. Adve. 1997. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. Proc. Third Symposium on High Performance Computer Architecture (February). Google ScholarGoogle Scholar
  3. Adve, S. V 1993. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. diss., University of Wisconsin-Madison. Available as Tech. Report #1198, University of Wisconsin-Madison, Computer Science (December). Google ScholarGoogle Scholar
  4. Adve, S. Y. and K. Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. IEEE Computer 29(12):66-76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Adve, S. V., K. Gharachorloo, A. Gupta, J. L. Hennessy, and M. Hill. 1993. Sufficient Systems Requirements for Supporting the PLpc Memory Model . Tech. Report #1200, University of Wisconsin-Madison. Computer Science (December). Also available as Tech. Report #CSL-TR-93-595, Stanford University.Google ScholarGoogle Scholar
  6. Adve, S. V., and M. Hill. 1990a. Weak Ordering: A New Definition. 1990. Proc. 17th Int'l Symposium on Computer Architecture (May):2-14. Google ScholarGoogle Scholar
  7. Adve, S. V., and M. Hill. 1990b. Implementing Sequential Consistency in Cache-Based Systems. Proc. 1990 Int'l Conference on Parallel Processing (August):47-50.Google ScholarGoogle Scholar
  8. Adve, S. V., and M. Hill, 1993. A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems 4(6):613-624. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Agarwal, A. 1991. Limit on Interconnection Performance. IEEE Transactions on Parallel and Distributed Systems 2(4):398-412. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Agarwal, A., R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. 1995. The MIT Alewife Machine: Architecture and Performance. Proc. 22nd Int'l Symposium on Computer Architecture (May/June):2-13. Google ScholarGoogle Scholar
  11. Agarwal, A., and A. Gupta. 1988. Memory-Reference Characteristics of Multiprocessor Applications Under MACH. Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May):215-225. Google ScholarGoogle Scholar
  12. Agarwal, A., B.-H. Lim, D. Kranz, and J. Kubiatowicz. 1990. (April): A Processor Architecture for Multiprocessing. Proc. 17th Annual Int'l Symposium on Computer Architecture (June):104-114. Google ScholarGoogle Scholar
  13. Agarwal, A., B.-H Lim, D. Kranz, and J. Kubiatowicz. 1991. LimitLESS Directories: A Scalable Cache Coherence Scheme. Proc. Fourth Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):224-234. Google ScholarGoogle Scholar
  14. Agarwal, A., R. Simoni, J. Hennessy, and M. Horowitz. 1988. An Evaluation of Directory Schemes for Cache Coherence. Proc. 15th Int'l Symposium on Computer Architecture (June):280-289. Google ScholarGoogle Scholar
  15. Aiken, A., and A. Nicolau. 1988. Optimal Loop Parallelization. Proc. SIGPLAN Conference on Programming Language Design and Implementation (June):308-317. Also published in SIGPLAN Notices 23(7). Google ScholarGoogle Scholar
  16. Aimoto, Y., T. Kimura, Y. Yabe, H. Heiuchi, et al. 1996. A 7.68GIPS 3.84GB/S 1W Parallel Image-Processing RAM Integrating a 16Mb DRAM and 128 Processors, International Solid-State Circuits Conference , San Francisco (February):372-373.Google ScholarGoogle Scholar
  17. Alexander, T. B., K. G. Robertson, D. T. Lindsay, D. L. Rogers, J. R. Obermeyer, J. R. Keller, K. Y. Oka and M. M. Jones II. 1994. Corporate Business Servers: An Alternative to Mainframes for Business Computing (HP K-Class). Hewlett-Packard Journal (June):8-33.Google ScholarGoogle Scholar
  18. Almasi, G. S., and A. Gottlieb. 1989. Highly Parallel Computing. Redwood City. CA: Benjamin/Cummings. Google ScholarGoogle Scholar
  19. Alverson, R., D. Callahan, D. Cummings, B. Koblenz, A. Porterfield and B. Smith. 1990. The Tera Computer System. Proc. 1990 Int'l Conference on Supercomputing (June): 1-6. Google ScholarGoogle Scholar
  20. Amdahl, G. M. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS 1967 Spring Joint Computer Conference 40:483-485. Google ScholarGoogle Scholar
  21. Anderson, J. P., S. A. Hoffman, J. Shifman and R. Williams. 1962. D825-A Multiple-Computer Sysiem for Command and Control. AFIP Proc. FJCC 22:86-96. Google ScholarGoogle Scholar
  22. Anderson, J., and M. Lam. 1993. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. Proc. SIGPLAN'93 Conference on Programming Language Design and Implementation (June). Google ScholarGoogle Scholar
  23. Anderson, T. E., D. E. Culler, D. Patterson. 1995. A Case for NOW (Networks of Workstations). IEEE Micro 15(1):54-6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Anderson, T. E., S. S. Owicki, J. P. Saxe and C. P. Thacker. 1992. High Speed Switch Scheduling for Local Area Networks. Proc. ASPLOS V (October):98-110. Google ScholarGoogle Scholar
  25. Archibald, J., and J.-L. Baer. 1986. Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model. ACM Transactions on Computer Systems 4(4):273-298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Arnould, E. A., F. J. Bitz, E. C. Cooper, H. T. Kung, R. D. Sansom, and P. A. Steenkiste. 1989. The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers. Proc. ASPLO III (April):205-216. Google ScholarGoogle Scholar
  27. Arpaci, R. H., D. E. Culler, A. Krishnamurthy, S. G. Steinberg, and K. Yelick. 1995. Empirical Evaluation of the Cray-T3D: A Compiler Perspective. Proc. 22nd Int'l Symposium on Computer Architecture (June):320-331. Google ScholarGoogle Scholar
  28. Arvind, and D. E. Culler. 1986. Dataflow Architectures. Annual Reviews in Computer Science 1:225-253. Palo Alto, CA: Annual Reviews. Reprinted in Dataflow and Reduction Architectures. Edited by S. S. Thakkar. Los Alamitos, CA: IEEE Computer Society Press, 1987.Google ScholarGoogle ScholarCross RefCross Ref
  29. Athas, W. C., and C. L. Seitz. 1988. Multicomputers: Message-Passing Concurrent Computers. IEEE Computer 21(8):9-24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. August, M. C., G. M. Brost, C. C. Hsiung and A. J. Schiffleger. 1989. Cray X-MP: The Birth of a Supercomputer. Computer 22(1):45-52. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Baer, J.-L., and T.-F Chen. 1991. An Efficient On-Chip Preloading Scheme to Reduce Data Access Penalty. Proc. Supercomputing '91 (November):176-186. Google ScholarGoogle Scholar
  32. Baer, J.-L., and W.-H. Wang. 1988. On the Inclusion Properties for Multi-Level Cache Hierarchies. Proc. 15th Annual Int'l Symposium on Computer Architecture (May):73-80. Google ScholarGoogle Scholar
  33. Bailey, D. H. 1990. FFTs in External or Hierarchical Memory Journal of Supercomputing 4(1):23-35. Also published in Proc. Supercomputing '89 (November):234-242. Google ScholarGoogle Scholar
  34. Bailey, D. H. 1991. Twelve Ways 10 Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review (August):54-55.Google ScholarGoogle Scholar
  35. Bailey, D. H. 1993. Misleading Performance Reporting in the Supercomputing Field. Scientific Programming 1(2):141-151. Also published in Proc Supercomputing '93 . Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Bailey, D. H., E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. 1991. The NAS Parallel Benchmarks. Intl. Journal of Supercomputer Applications 5(3);66-73. Also published as Tech. Report RNR-94-007, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center (March 1994).Google ScholarGoogle Scholar
  37. Bailey, D. H., E. Barszcz, L. Dagum and H. D. Simon. 1994. NAS Parallel Benchmark Results 3-94. Proc. Scalable High-Performance Computing Conference , Knoxville, TN (May): 111-120.Google ScholarGoogle Scholar
  38. Bailey, D., T. Harris, W. Saphir, R. van der Wijngaart, A Woo and M. Yarrow. 1995. The NAS Parallel Benchmarks 2. 0. Report NAS-95-020, Numencal Aerodynamic Simulation Facility. NASA Ames Research Center (December).Google ScholarGoogle Scholar
  39. Baker, W. E., R. W. Horst, D. P Sonnier, and W. J. Watson 1995. A Flexible ServerNet-Based Fault-Tolerant Architecture. Proc. 25th Int'l Symposium on Fault-Tolerant Compuling (June). Los Alamitos, CA: IEEE Computer Society Press, 2-11. Google ScholarGoogle Scholar
  40. Bakoglu, H. B. 1990. Circuits, Interconnection, and Packaging for VLSI. Reading, MA: Addison-Wesley.Google ScholarGoogle Scholar
  41. Ball, J. R., R. C. Bollinger, T. A. Jeeves, R. C. McReynolds, D. H. Shaffer. 1962. On the Use of the Solomon Parallel-Processing Computer. Proc AFIPS Fall Joint Computer Conference 22:137-146. Google ScholarGoogle Scholar
  42. Banks, D. and M. Prudence 1993. A High Performance Network Architecture for a PARISC Workstation. IEEE Journal on Selected Areas in Communication 11(2): 191-202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Barnes, J. E., and P. Hut 1989. Error Analysis of a Tree Code. Astrophysics Journal Supplement 70(June):389-417.Google ScholarGoogle ScholarCross RefCross Ref
  44. Barosso, L., and M. Dubois. 1993. The Performance of Cache-Coherent Ring-Based Multiprocessors. Proc. 20th Annual Int'l Symposium on Computer Architectures (ISCA) (May):268-277 Google ScholarGoogle Scholar
  45. Barosso, L., and M. Dubois. 1995. Performance Evaluation of the Slotted Ring Multiprocessors. IEEE Transactions on Computers 44(7):878-890. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Barroso, L. A., S. Iman, J. Jeong, K. Oner, K. Ramamurthy and M. Dubois. 1995. RPM: A Rapid Prototyping Engine for Multiprocessor Systems. IEEE Computer 28(2):26-34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Barszcz, E., Fatoohi, R., Venkatakrishnan, V., and Weeratunga, S. 1993. Solution of Regular, Sparse Triangular Linear Systems on Vector and Distributed-Memory Multiprocessors . Tech. Report NAS RNR-93-007. NASA Ames Research Center. Moffett Field, CA (April).Google ScholarGoogle Scholar
  48. Barton, E., J. Crownie, and M. McLaren. 1994. Message Passing on the Meiko CS-2. Parallel Computing 20(4):497-507. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Baskett, F. T. Jermoluk, and D. Solomon. 1988. The 4D-MP Graphics Superworkstation: Computing + Graphics = 40 MIPS + 40 MFLOPS and 100,000 Lighted Polygons per Second. Proc. 33rd IEEE Computer Society Int'l Conference--COMPCON '88 (February):468-471.Google ScholarGoogle Scholar
  50. Batcher, K. E. 1974. Staran Parallel Processor System Hardware. Proc. AFIPS National Computer Conference , 405-410. Google ScholarGoogle Scholar
  51. Batcher, K. E. 1980. Design of a Massively Parallel Processor. IEEE Transactions on Computers C- 29(9):836-840. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Bell, C. G. 1985. Multis: A New Class of Multiprocessor Computers. Science 228:462-467.Google ScholarGoogle Scholar
  53. Benes, V. 1965. Mathematical Theory of Connecting Networks and Telephone Traffic. San Diego, CA: Academic Press.Google ScholarGoogle Scholar
  54. Bennett, J. E., and M.J. Flynn. 1996a. Latency Tolerance for Dynamic Processors. Tech. Report #CSL-TR-96-687, Computer Systems Laboratory, Stanford University. Google ScholarGoogle Scholar
  55. Bennett, J. E., and M. J. Flynn. 1996b. Reducing Cache Miss Rates Using Prediction Caches. Tech. Report #CSL-TR-96- 707. Computer Systems Laboratory, Stanford University. Google ScholarGoogle Scholar
  56. Berry, M., D. Chen, P. Koss, et al. 1989. The PERFECT Club Benchmarks: Effective Performance Evaluation of Computers. Int'l Journal of Supercomputer Applications 3(3):5-40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Bershad, B. N., M. J. Zekauskas, and W. A. Sawdon. 1993. The Midway Distributed Shared Memory System. Proc. COMPCON '93 (February).Google ScholarGoogle Scholar
  58. Bhatt, S. M. and C. E. Leiserson. 1983. How to Assemble Tree Machines. ACM Symposium on Theory of Computing (STOC '82) . New York: ACM Press. Google ScholarGoogle Scholar
  59. Biagioni, E., E. Cooper, and R. Sansom. 1993. Designing a Practical ATM LAN. IEEE Network (March). Google ScholarGoogle Scholar
  60. Bilas, A., L. Iftode, and J. P. Singh. 1998. Evaluation of Hardware Support for Next-Generation Shared Virtual Memory Clusters. Proc. Int'l Conference on Supercomputing (July). Google ScholarGoogle Scholar
  61. Bisiani, R., and M. Ravishankar. 1990. PLUS: A Distributed Shared-Memory System. Proc. 17th Int'l Symposium on Computer Architecture (May): 115-124. Google ScholarGoogle Scholar
  62. Black, D., R. Rashid, D. Golub, C. Hill, R. Baron. 1989. Translation Lookaside Buffer Consistency: A Software Approach. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems. Boston (April): 113-122. Google ScholarGoogle Scholar
  63. Blackford, L. S., J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. 1997. ScaLAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). Google ScholarGoogle Scholar
  64. Blelloch, G. 1993. Prefix Sums and Their Applications. In Synthesis of Parallel Algorithms . Edited by J. Reif. San Francisco: Morgan Kaufmann, 35-60.Google ScholarGoogle Scholar
  65. Blelloch, G. E., C. E. Leiserson, B. M. Mages, C. G. Plaxton, S.J. Smith, and M. A. Zagha. 1991. Comparison of Sorting Algorithms for the Connection Machine CM-2. Proc. Symposium on Parallel Algorithms ana Architectures (July):3-16. Google ScholarGoogle Scholar
  66. Blumrich, M. A., C. Dubnicki, E. W. Felten, K. Li, M.R. Mesarina. 1994. Two Virtual Memory Mapped Network Interface Designs. Proc. Hot Interconnects II Symposium (August).Google ScholarGoogle Scholar
  67. Blumrich, M., K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sandberg. 1994. A Virtual Memory Mapped Network Interface for the Shrimp Multicomputer. Proc. 21st Int'l Symposium on Computer Architecture (April): 142-153. Google ScholarGoogle Scholar
  68. Boden, N., D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. 1995. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro 15(1):29-38. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Bodin, F., P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr. 1993. Implementing a Parallel C++ Runtime System for Scalable Parallel Sysiems. Proc. Supercomputing '93 (November):588-597. Also in Scientific Programming 2(3). Google ScholarGoogle Scholar
  70. Bolt Beranek and Newman Advanced Computers. 1989. TC2000 Technical Product Summary. Cambridge, MA: Bolt Beranek and Newman.Google ScholarGoogle Scholar
  71. Bomans, L., and D. Roose. 1989. Benchmarking the iPSC/2 Hypereube Multiprocessor. Concurrency: Practice and Experience , 1(1):3-18.Google ScholarGoogle ScholarCross RefCross Ref
  72. Borkar, S., R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb. 1990. Supporting Systolic and Memory Communication in iWarp. Proc. 17th Annual Int'l Symposium on Computer Architecture , Seattle, WA (May):70-81. Revised version appears as Tech. Report #CMU-CS-90-197. Carnegie Mellon University. Google ScholarGoogle Scholar
  73. Bouknight, W. J., S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick. 1972. The Illiac IV System. Proc. IEEE 60(4):369-388.Google ScholarGoogle Scholar
  74. Boyle, J., R. Butler, T. Disz, B. Glickfield, E. Lusk, W. R. Overbeek, J. Patterson, and R. Stevens. 1987. Portable Programs for Parallel Processors. New York: Holt, Rinehart and Winston. Google ScholarGoogle Scholar
  75. Brewer, E. A., F. T. Chong, F. T. Leighton. 1994. Scalable Expanders: Exploiting Hierarchical Random Wiring. Proc. 1994 Symposium on the Theory of Computing , Montreal, Canada (May):144-152. Google ScholarGoogle Scholar
  76. Brewer, E. A., F. T. Chong, L. T. Liu, J. Kubiatowicz, S. D. Sharma. 1995. Remote Queues: Exposing Network Queues for Atomicity and Optimization. Proc. Seventh Annual Symposium on Parallel Algorithms and Architectures (July):42-53. Google ScholarGoogle Scholar
  77. Brewer, E. A., and B. C. Kuszmaul. 1994. How to Get Good Performance from the CM-5 Data Network. Proc. 1994 Int'l Parallel Processing Symposium , Cancun, Mexico (April):858-867. Google ScholarGoogle Scholar
  78. Bruno, J., P. R. Cappello. 1988. Implementing the Beam and Warming Method on the Hypercube. Proc. Third Conference on Hypercube Concurrent Computers and Applications , Pasadena, CA, Jan 19-20. Google ScholarGoogle Scholar
  79. Burger, D. 1997. System-Level Implications of Processor-Memory integration. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at the Int'l Symposium on Computer Architecture (ISCA) '97 (June).Google ScholarGoogle Scholar
  80. Burger, D., J. Goodman, and A. Kagi. 1996. Memory Bandwidth Limitations in Future Microprocessors. Proc. 23rd Annual Symposium on Computer Architecture (May):78-89. Google ScholarGoogle Scholar
  81. Burkhardt, H., et al. 1992. Overview of the KSR-1 Computer System. Tech. Report KSR-TR-9202001. Kendall Square Research. Boston (February).Google ScholarGoogle Scholar
  82. Butler, M., T-Y. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow. 1991. Single Instruction Stream Parallelism Is Greater Than Two. Proc. Annual Int'l Symposium on Computer Architecture (ISCA), 276-86. Google ScholarGoogle Scholar
  83. Callahan, T. and S. C. Goldstein. 1995. NIFDY: A Low Overhead. High Throughput Network Interface. Proc. 22nd Annual Symposium on Computer Architecture (June):230-241. Google ScholarGoogle Scholar
  84. Cardoza, W., F. Glover, and W. Snaman Jr. 1996. Design of a TruCluster Multicomputer System for the Digital UNIX Environment. Digital Technical Journal 8(1):5-17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. Carter, J. B., J. K. Bennett, and W. Zwaenepoel. 1991. Implementation and Performance of Munin. Proc. 13th Symposium on Operating Systems Principles (October):152-164. Google ScholarGoogle Scholar
  86. Carter, J. B., J. K. Bennett, and W. Zwaenepoel. 1995. Techniques for Reducing Consistency-Related Communication in Distributed Shared-Memory Systems. ACM Transactions of Computer Systems 13(3):205-244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Catanzaro, B. 1997. Multiprocessor System Architectures: A Technical Survery of Multiprocessor/ Multithreaded Systems Using SPARC, Multi-level Bus Architectures and Solaris (SunOS). Mountain View, CA: Sun Microsystems.Google ScholarGoogle Scholar
  88. Cekleov, M., D. Yen, P. Sindhu, J.-M. Frailong, et al. 1993. SPARCcenter 2000: Multiprocessing for the 90s, Digest of Papers. Proc. COMPCON Spring '93. Los Alamitos, CA: IEEE Computer Society Press, 345-353.Google ScholarGoogle Scholar
  89. Censier, L., and P. Feautrier. 1978. A New Solution to Cache Coherence Problems in Multiprocessor Systems. IEEE Transaction on Computer Systems C-27(12):1112-1118. Google ScholarGoogle Scholar
  90. Chan, K., et al. 1993. Multiprocessor Features of the HP Corporate Business Servers. Proc. COMPCON (Spring):330-337.Google ScholarGoogle Scholar
  91. Chandy, K. M., and J. Misra. 1988. Parallel Program Design: A Foundation. Reading. MA: Addison Wesley. Google ScholarGoogle Scholar
  92. Chang, P. P. S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. 1991. IMPACT: An Architectural Framework for Multiple-Instruction Issue Processors. Proc. 18th Int'l Symposium on Computer Architecture (ISCA) 19(3):266-275. Google ScholarGoogle Scholar
  93. Chen, T.-F., and J.-L. Baer. 1992. Reducing Memory Latency via Non-Blocking and Prefetching Caches. Proc. Fifth Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):51-61. Google ScholarGoogle Scholar
  94. Chen, T.-F., and J.-L. Baer. 1994. A Performance Study of Software and Hardware Data Prefetching Schemes. Proc. 21st Annual Symposium on Computer Architecture (April):223-232. Google ScholarGoogle Scholar
  95. Cheong, H., and A. Viedenbaum. 1990. Compiler-directed Cache Management in Multiprocessors. IEEE Computer 23(6):39-47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Chien, A. A., and J. H. Kim. 1992. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. Proc. 19th Annual International Symposium on Computer Architecture (ISCA), Gold Coast, Australia (May):268-277. Google ScholarGoogle Scholar
  97. Choi, J.J.J. Dongarra, R. Pozo, and D. W. Walker. 1992. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. Proc. Fourth Symposium on the Frontiers of Massively Parallel Computation, McLean, VA. Los Alamitos, CA: IEEE Computer Society Press, 120-127.Google ScholarGoogle Scholar
  98. Chun, B. N., A. M. Mainwaring, and D. E. Culler. 1998. Virtual Network Transport Protocols for Myrinet. IEEE Micro (January):53-63. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. Clark, R., and K. Alnes. 1996. An SCI Chipset and Adapter. Symposium Record, Hot Interconnects IV (August):221-235.Google ScholarGoogle Scholar
  100. Cohen, D., G. Finn, R. Felderman, and A. DeSchon. 1993. ATOMIC: A Low-Cost, Very High-Speed, Local Communication Architecture. Proc. 1993 Int. Conference on Parallel Processing . Google ScholarGoogle Scholar
  101. Convex Computer Corporation. 1993. Exemplar Architecture. Richardson, TX: Convex Computer Corp.Google ScholarGoogle Scholar
  102. Corella, F., J. Stone, C. Barton. 1993. A Formal Specification of the PowerPC Shared Memory Architecture. Tech. Report Computer Science RC 18638 (81566), IBM Research Division. T.J. Watson Research Center (January).Google ScholarGoogle Scholar
  103. Cornell, J. A. 1972. Parallel Processing of Ballistic Missile Defense Radar Data with PEPE. COMPCON 72, 69-72.Google ScholarGoogle Scholar
  104. Cox, A., and R. Fowler. 1993. Adaptive Cache Coherency for Detecting Migratory Shared Data. Proc. 20th Int'l Symposium on Computer Architecture (May):98-108. Google ScholarGoogle Scholar
  105. Crowther, W., J. Goodhue, R. Gurwitz, R. Rettberg, and R. Thomas. 1985. The Butterfly Parallel Processor. IEEE Computer Architecture Technical Newsletter , 18-46.Google ScholarGoogle Scholar
  106. Culler, D. E. 1994. Multithreading: Fundamental Limits, Potential Gains, and Alternatives. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 97-138.Google ScholarGoogle Scholar
  107. Culler, D. E., A. C. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. 1993. Parallel Programming in Split-C. Proc. Supercomputing '93 (November): 262-273. Google ScholarGoogle Scholar
  108. Culler, D. E., A. C. Dusseau, R. P. Martin, and K. E. Schauser. 1993. Fast Parallel Sorting under LogP: From Theory to Practice. In Portability and Performance for Parallel Processing . Chapter 4. New York: John Wiley & Sons, 71-98.Google ScholarGoogle Scholar
  109. Culler, D. E., R. M. Karp, D. A., Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1993. LogP: Toward A Realistic Model of Parallel Computation. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (May): 1-12. Google ScholarGoogle Scholar
  110. Culler, D. E., R. M. Karp, D. A., Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1996. LogP: A Practical Model of Parallel Computation. CACM 39(11):78-85. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Culler, D. E., A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. 1991. Fine-Grain Parallelism with Minimal Hardware Support. Proc. Fourth Int'l Symposium on Arch. Support for Programming Languages and Systems (ASPLOS) (April):164-175. Google ScholarGoogle Scholar
  112. Culler, D. E., K. E. Schauser, and T. von Eicken. 1993. Two Fundamental Limits on Dataflow Multithreading. Proc. IFIP WG 10.3 working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism . Orlando, FL. Google ScholarGoogle Scholar
  113. Dahlgren, F. 1995. Boosting the Performance of Hybrid Snooping Cache Protocols. Proc 22nd Int'l Symposium on Computer Architecture (June):60-69. Google ScholarGoogle Scholar
  114. Dahlgren, F., M. Dubois, and P. Stenstrom. 1994. Combined Performance Gains of Simple Cache Protocol Extensions. Proc. 21st Int'l Symposium on Computer Architecture (April): 187-197. Google ScholarGoogle Scholar
  115. Dahlgren, F., M. Dubois, and P. Stenstrom. 1995. Sequential Hardware Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 6(7). Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. Dally, W. J. 1990a. Virtual-Channel Flow Control. Proc. 17th Annual Int'l Symposium on Computer Architecture (ISCA) , Seattle, WA, (May):60-68. Google ScholarGoogle Scholar
  117. Dally, W. J. 1990b. Performance Analysis of k -ary n -cube Interconnection Networks. IEEE-TOC 39(6):775-85. Google ScholarGoogle ScholarDigital LibraryDigital Library
  118. Dally, W. J., A. Chien, S. Fiske, W. Horwat, J. Keen, J. Larivee, R. Lethin, P Nuth, S. Willis. 1989. The J-Machine: A Fine-Grained Concurrent Computer. Proc IFIP 11th World Computer Congress, Information Processing'89 , 1147-1153.Google ScholarGoogle Scholar
  119. Dally, W. J., J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes and P. R. Nuth. 1992. The Message Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro (April):23-39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  120. Dally, W. J., J. S. Keen, M. D. Noakes. 1993. The J-Machine Architecture and Evaluation. Digest of Papers. COMPCON Spring '93. San Francisco, CA (February):183-188.Google ScholarGoogle Scholar
  121. Dally, W. J., and C. Seitz. 1987. Deadlock-Free Message Routing in Multiprocessor Interconnections Networks. IEEE-TOC C-36(5):547-553. Google ScholarGoogle Scholar
  122. Denning, P. J. 1968. The Working Set Model for Program Behavior. Communications of the ACM 11(5):323-333. Google ScholarGoogle ScholarDigital LibraryDigital Library
  123. Dennis, J. B. 1980. Dataflow Supercomputers. IEEE Computer 13(11):93-100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  124. Digital Equipment Corporation. 1992. Alpha Architecture Handbook . Maynard, MA: Digital Equipment Corp.Google ScholarGoogle Scholar
  125. Dijkstra, E. W. 1965. Solution of a Problem in Concurrent Programming Control. Communications of the ACM 8(9):569. Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Dijkstra, E. W., and C. S. Sholten. 1968. Termination Detection for Diffusing Computations. Information Processing Letters 1:1-4.Google ScholarGoogle Scholar
  127. Dongarra, J. J. 1990. Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment . Tech. Report CS-89-85. University of Tennessee, Computer Science Dept. (March). Google ScholarGoogle Scholar
  128. Dongarra, J. J. 1994. Performance of Various Computers Using Standard Linear Equation Software . Tech. Repon CS-89-85. University of Tennessee, Computer Science Dept. (November); current report available from [email protected] Google ScholarGoogle Scholar
  129. Dongarra, J. J., J. Martin, and J. Worlton. 1987. Computer Benchmarking: Paths and Pitfalls. IEEE Spectrum (July):38. Google ScholarGoogle Scholar
  130. Dongarra, J. J., and D. W. Walker. 1995. Software Libraries for Linear Algebra Computations on High performance Computers. SIAM Review 37:151-180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  131. Dongarra, J. J., and W. Genlzsch, eds. 1993. Computer Benchmarks. Amsterdam: Elsevier Science B. V., North-Holland. Google ScholarGoogle Scholar
  132. Dubnicki, C. L. Iftode, E. W. Felten, K. Li. 1996. Software Support for Virtual Memory-Mapped Communication. Tenth Int'l Parallel Processing Symposium (April). Google ScholarGoogle Scholar
  133. Dubnicki, C., and T. LeBlanc. 1992. Adjustable Block Size Coherent Caches. Proc. 19th Annual Int'l Symposium on Computer Architecture (May):170-180. Google ScholarGoogle Scholar
  134. Dubois, M., and C. Scheurich, 1990. Memory Access Dependencies in Shared-Memory Multi-processors. IEEE Transactions on Software Engtneering 16(6):660-673. Google ScholarGoogle ScholarDigital LibraryDigital Library
  135. Dubois, M., C. Scheurich, and F. Briggs. 1986. Memory Access Buffering in Multiprocessors. Proc. 13th Int'l Symposium on Computer Architecture (June):434-442. Google ScholarGoogle Scholar
  136. Dubois, M., J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Slenstrom. 1993. The Detection and Elimination of Useless Misses in Multiprocessors. Proc. 20th Int'l Symposium on Computer Architecture (May):88-97. Google ScholarGoogle Scholar
  137. Dubois, M., J.-C. Wang, L. A. Barroso, K. Chen and Y.-S. Chen. 1991. Delayed Consistency and Its Effects on the Miss Rate of Parallel Programs. Proc. Supercomputing '91 (November): 197-206. Google ScholarGoogle Scholar
  138. Dunigan, T. H. 1988. Performance of a Second Generation Hypercube. Tech. Report ORNL/TM-10881, Oak Ridge National Lab. (November).
  139. Dunning, D., G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd. 1998. The Virtual Interface Architecture. IEEE Micro 18(2).
  140. Dusseau, A. C., D. E. Culler, K. E. Schauser, and R. P. Martin. 1996. Fast Parallel Sorting under LogP: Experience with the CM-5. IEEE Transactions on Parallel and Distributed Systems 7(8): 791-805. Google ScholarGoogle ScholarDigital LibraryDigital Library
  141. Dwarkadas, S., P. Keleher, A. L. Cox, and W. Zwaenepoel. 1993. Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology. Proc. 20th Int'l Symposium on Computer Architecture (May):144-155. Google ScholarGoogle Scholar
  142. Eggers, S., and R. Katz. 1988. A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation. Proc. 15th Annual Int'l Symposium on Computer Architecture (May):373-382. Google ScholarGoogle Scholar
  143. Eggers, S., and R. Katz. 1989a. The Effect of Sharing on the Cache and Bus Performance of Parallel Programs. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (May):257-270. Google ScholarGoogle Scholar
  144. Eggers, S., and R. Katz. 1989b. Evaluating the Performance of Four Snooping Cache Coherency Protocols. Proc. 16th Annual Int'l Symposium on Computer Architecture (May):2-15. Google ScholarGoogle Scholar
  145. Eigenmann, R., and S. Hassanzadeh. 1996. Benchmarking with Real Industrial Applications: The SPEC High Performance Group. IEEE Computational Science and Engineering (spring). Google ScholarGoogle Scholar
  146. Elliott, D. G., W. M. Snelgrove, and M. Stumm. 1992. Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP. Custom Integrated Circuits Conference , Boston, MA (May):30.6.1-30.6.4.Google ScholarGoogle Scholar
  147. Elliott, D. G., M. Stumm, and W. M. Snelgrove. 1997. Computational RAM: The Case for SIMD Computing in Memory . Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at Annual International Symposium on Computer Architecture (ISCA) '97 (June).Google ScholarGoogle Scholar
  148. Erlichson, A., B. Nayfeh, J. P. Singh and Oyekunle Olukotun. 1995. The Benefits of Clustering in Cache-Coherent Multiprocessors: An Application-Driven Investigation. Proc. Supercomputing 95 (November). Google ScholarGoogle Scholar
  149. Erlichson, A., N. Nuckolls, G. Chesson, and J. L. Hennessy. 1996. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):210-220. Google ScholarGoogle Scholar
  150. Falsafi, B., A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. 1994. Application-Specific Protocols for User-Level Shared Memory. Proc. Supercomputing '94 (November):380-389. Google ScholarGoogle Scholar
  151. Falsafi, B. and D. A. Wood. 1997. Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA. Proc. 24th Int'l Symposium on Computer Architecture (June):229-240. Google ScholarGoogle Scholar
  152. Farkas, K., Z. Vranesic, and M. Stumm. 1992. Cache Consistency in Hierarchical Ring-Based Multiprocessors. Proc. Supercomputing '92 (November).
  153. Feigel, C. P. 1994. TI Introduces Four-Processor DSP Chip. Microprocessor Report (March):28.Google ScholarGoogle Scholar
  154. Felderman, R., et al. 1994. Atomic: A High Speed Local Communication Architecture. Journal of High Speed Networks 3(1):1-29.
  155. Fenwick, D. M., D. J. Foley, W. B. Gist, S. R. VanDoren, and D. Wissell. 1995. The AlphaServer 8000 Series: High-End Server Platform Development. Digital Technical Journal 7(1):43-65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  156. Flanagan, J. L. 1994. Technologies for Multimedia Communications. IEEE Proceedings 82(4):590-603.Google ScholarGoogle ScholarCross RefCross Ref
  157. Flynn, M. J. 1972. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computing C-21(September):948-960.
  158. Fortune, S., and J. Wyllie. 1978. Parallelism in Random Access Machines. Proc. 10th ACM Symposium on Theory of Computing (May). Google ScholarGoogle Scholar
  159. Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. 1988. Solving Problems on Concurrent Processors, vol. 1. Englewood Cliffs, NJ: Prentice Hall.
  160. Frailong, J.-L. et al. 1993. The Next Generation SPARC Multiprocessing System Architecture. Proc. COMPCON (spring):475-480.Google ScholarGoogle Scholar
  161. Frank, S., H. Burkhardt III, and J. Rothnie. 1993. The KSR1: Bridging the Gap between Shared Memory and MPPs. Proc. COMPCON, Digest of Papers (spring):285-294.
  162. Fu, J. W. C., and J. H. Patel. 1991. Data Prefetching in Multiprocessor Vector Cache Memories. Proc. 18th Annual Symposium on Computer Architecture (May):54-63. Google ScholarGoogle Scholar
  163. Fu, J. W. C., J. H. Patel, and B. L. Janssens. 1992. Stride Directed Prefetching in Scalar Processors. Proc. 25th Annual Int'l Symposium on Microarchitecture (December): 102-110. Google ScholarGoogle Scholar
  164. Fuchs, H., G. Abram, and E. Grant. 1983. Near Real-Time Shaded Display of Rigid Objects. Proc. SIGGRAPH . Google ScholarGoogle Scholar
  165. Galles, M., and E. Williams. 1993. Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. Proc. 27th Hawaii Int'l Conference on System Sciences Vol. I: Architecture (January). Also in SGI Challenge . Edited by T. N. Mudge and B. D. Shriver. Los Alamitos, CA: IEEE Computer Society Press, 1994, 134-143.Google ScholarGoogle Scholar
  166. Geist, A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. 1994. PVM 3.0 Users' Guide and Reference Manual. Tech. Report ORNL/TM-12187. Oak Ridge, TN: Oak Ridge National Laboratory (February), http://wwweece.ksu.edu/pvm3/ug.ps.
  167. Geist, A., A. Beguelin, J. Dongarra, R. Manchek, W. Jiang, and V. Sunderam. 1994. PVM: A Users' Guide and Tutorial for Networked Parallel Computing. Cambridge, MA: MIT Press.
  168. Geist, G. A., and V. S. Sunderam. 1992. Network Based Concurrent Computing on the PVM System. Journal of Concurrency: Practice and Experience 4(4):293-311.
  169. Gharachorloo, K. 1995. Memory Consistency Models for Shared-Memory Multiprocessors , Ph.D. diss., Computer Systems Laboratory. Stanford University (December). Also published as Tech. Report #CSL-TR-95-685. Google ScholarGoogle Scholar
  170. Gharachorloo, K., S. Adve, A. Gupta, M. Hill, and J. L. Hennessy. 1992. Programming for Different Memory Consistency Models. Journal of Parallel and Distributed Computing 15(4):399-407.Google ScholarGoogle ScholarCross RefCross Ref
  171. Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1991a. Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors. Proc. 4th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):245-257. Google ScholarGoogle Scholar
  172. Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1991b. Two Techniques to Enhance the Performance of Memory Consistency Models. Proc. Int'l Conference on Parallel Processing (August): 1355-1364.Google ScholarGoogle Scholar
  173. Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1992. Hiding Memory Latency Using Dynamic Scheduling in Shared Memory Multiprocessors. Proc. 19th Int'l Symposium on Computer Architecture (May):22-33. Google ScholarGoogle Scholar
  174. Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. 1990. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. Proc. 17th Int'l Symposium on Computer Architecture (May):15-26. Google ScholarGoogle Scholar
  175. Gillett, R. 1996. Memory Channel Network for PCI. IEEE Micro 16(1):12-18.
  176. Gillett, R., M. Collins, and D. Pimm. 1996. Overview of Network Memory Channel for PCI. Proc. IEEE Spring COMPCON '96 (February). Google ScholarGoogle Scholar
  177. Gillett, R., and R. Kaufmann. 1997. Using Memory Channel Network. IEEE Micro 17(1):19-25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  178. Glass, C. J., and L. M. Ni. 1992. The Turn Model for Adaptive Routing. Proc. Annual International Symposium on Computer Architecture (ISCA) (May): 278-287. Google ScholarGoogle Scholar
  179. Godiwala, N. D., and B. A. Maskas. 1995. The Second-Generation Processor Module for AlphaServer 2100 Systems. Digital Technical Journal 7(1). Google ScholarGoogle Scholar
  180. Gokhale, M., B. Holmes, and K. Iobst. 1995. Processing in Memory: The Terasys Massively Parallel PIM Array. IEEE Computer 28(3):23-31. Google ScholarGoogle ScholarDigital LibraryDigital Library
  181. Goldschmidt, S. R. 1993. Simulation of Multiprocessors: Speed and Accuracy . Ph.D. diss., Stanford University (June). Google ScholarGoogle Scholar
  182. Golub, G., and C. Van Loan. 1997. Matrix Computations 3e . Baltimore, MD: Johns Hopkins University Press.Google ScholarGoogle Scholar
  183. Goodman, J. R. 1983. Using Cache Memory to Reduce Processor-Memory Traffic. Proc. 10th Annual Int'l Symposium on Computer Architecture (June): 124-131. Google ScholarGoogle Scholar
  184. Goodman, J. R. 1987. Coherency for Multiprocessor Virtual Address Caches. Proc. Second Int'l Conference on Architectural Support for Programming Languages and Operating Systems. Palo Alto, CA (October):72-81.
  185. Goodman, J. R. 1989. Cache Consistency and Sequential Consistency. Tech. Report #1006, University of Wisconsin-Madison, Computer Science Dept. (February).
  186. Goodman, J. R., M. K. Vernon, and P. J. Woest. 1989. Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):64-75.
  187. Gottlieb, A., R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. 1983. The NYU Ultracomputer--Designing an MIMD Shared Memory Parallel Computer. IEEE Transactions on Computers C-32(2):175-189. Google ScholarGoogle ScholarDigital LibraryDigital Library
  188. Gottlieb, A., and C. P. Kruskal. 1984. Complexity Results for Permuting Data and Other Computations on Parallel Processors. Journal of the ACM 31(April):193-209. Google ScholarGoogle ScholarDigital LibraryDigital Library
  189. Gottlieb, A., B. Lubachevsky, and L. Rudolph. 1983. Basic Techniques for the Efficient Coordination of Large Numbers of Cooperating Sequential Processes. ACM Transactions on Programming Languages and Systems 5(2).
  190. Grafe, V. G., and J. E. Hoch. 1990. The Epsilon-2 Hybrid Dataflow Architecture. Proc. COMPCON Spring '90 , San Francisco, CA (March):88-93.Google ScholarGoogle Scholar
  191. Grahn, H., and P. Stenstrom. 1996. Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection. Journal of Parallel and Distributed Computing 39(2):168-180.
  192. Grahn, H., P. Stenstrom and M. Dubois. 1995. Implementation and Evaluation of Update-Based Protocols under Relaxed Memory Consistency Models. Future Generation Computer Systems 11(3):247-271. Google ScholarGoogle ScholarDigital LibraryDigital Library
  193. Graunke, G., and S. Thakkar. 1990. Synchronization Algorithms for Shared Memory Multiprocessors. IEEE Computer 23(6):60-69.
  194. Gray, J. 1991. The Benchmark Handbook for Database and Transaction Processing Systems . San Francisco: Morgan Kaufmann. Google ScholarGoogle Scholar
  195. Green, S. A., and D. J. Paddon. 1990. A Highly Flexible Multiprocessor Solution for Ray Tracing. The Visual Computer 6:62-73.Google ScholarGoogle ScholarCross RefCross Ref
  196. Greenberg, R. I. and C. E. Leiserson. 1989. Randomized Routing on Fat-Trees. Advances in Computing Research 5:345-374.Google ScholarGoogle Scholar
  197. Greenwald, M., and D. R. Cheriton. 1996. The Synergy between Non-Blocking Synchronization and Operating System Structure. Proc. Second Symposium on Operating System Design and Implementation, USENIX , Seattle (October): 123-136. Google ScholarGoogle Scholar
  198. Gropp, W., E. Lusk, and A. Skjellum. 1994. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press.
  199. Groscup, W. 1992. The Intel Paragon XP/S Supercomputer. Proc. Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology (November):262-273.Google ScholarGoogle Scholar
  200. Gunther, K. D. 1981. Prevention of Deadlocks in Packet-Switched Data Transport Systems. IEEE Transactions on Communication C-29(4):512-24.Google ScholarGoogle ScholarCross RefCross Ref
  201. Gupta, A., J. L. Hennessy, K. Gharachorloo, T. Mowry and W.-D. Weber. 1991. Comparative Evaluation of Latency Reducing and Tolerating Techniques. Proc. 18th Int'l Symposium on Computer Architecture (May):254-263. Google ScholarGoogle Scholar
  202. Gupta, A., and W.-D. Weber. 1992. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers 41(7):794-810. Google ScholarGoogle ScholarDigital LibraryDigital Library
  203. Gupta, A., W.-D. Weber, and T. Mowry. 1990. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache-Coherence Schemes. Proc. Int'l Conference on Parallel Processing I (August):312-321.Google ScholarGoogle Scholar
  204. Gurd, J. R., C. C. Kirkham, and I. Watson. 1985. The Manchester Prototype Dataflow Computer. Communications of the ACM 28(1):34-52.
  205. Gustafson, J. L. 1988. Reevaluating Amdahl's Law. Communications of the ACM 31(5):532-533. Google ScholarGoogle ScholarDigital LibraryDigital Library
  206. Gustafson, J. L., and Q. O. Snell. 1994. HINT: A New Way to Measure Computer Performance . Tech. Report. Ames Laboratory, U.S. Dept. of Energy. Ames, IA.Google ScholarGoogle Scholar
  207. Gustavson, D. 1992. The Scalable Coherence Interface and Related Standards Projects. IEEE Micro 12(1):10-22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  208. Gwennap, L. 1994a. Microprocessors Head Toward MP on a Chip. Microprocessor Report (May).Google ScholarGoogle Scholar
  209. Gwennap, L. 1994b. PA-7200 Enables Inexpensive MP Systems. Microprocessor Report (March).Google ScholarGoogle Scholar
  210. Hagersten, E. 1992. Toward Scalable Cache Only Memory Architectures . Ph.D. diss., Swedish Institute of Computer Science (October).Google ScholarGoogle Scholar
  211. Hagersten, E., A. Landin, and S. Haridi. 1992. DDM--A Cache Only Memory Architecture. IEEE Computer 25(9):44-54.
  212. Hanrahan, P., D. Salzman and L. A. Aupperle. 1991. A Rapid Hierarchical Radiosity Algorithm. Proc. SIGGRAPH (July). Google ScholarGoogle Scholar
  213. Hayashi, K., T. Doi, T. Horie, Y. Koyanagi, O. Shiraki, N. Imamura, T. Shimizu, H. Ishihata, and T. Shindo. 1994. AP1000+: Architectural Support of PUT/GET Interface for Parallelizing Compiler. ACM SIGPLAN Notices 29(11):196.
  214. Heinlein, J., R. P. Bosch, Jr., K. Gharachorloo, M. Rosenblum, and A. Gupta. 1997. Coherent Block Data Transfer in the FLASH Multiprocessor. Proc. 11th Int'l Parallel Processing Symposium (April). Google ScholarGoogle Scholar
  215. Heinlein, J., K. Gharachorloo, S. Dresser, and A. Gupta. 1994. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):38-50. Google ScholarGoogle Scholar
  216. Heinrich, M., J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. P. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, A. Gupta, M. Rosenblum, and J. L. Hennessy. 1994. The Performance Impact of Flexibility on the Stanford FLASH Multiprocessor. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):274-285. Google ScholarGoogle Scholar
  217. Hennessy, J. L., and N. Jouppi. 1991. Computer Technology and Architecture: An Evolving Interaction. IEEE Computer 24(9): 18-29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  218. Hennessy, J. L., and D. A. Patterson. 1996. Computer Architecture: A Quantitative Approach . 2nd ed. San Francisco: Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  219. Herlihy, M. P. 1988. Impossibility and Universality Results for Wait-Free Synchronization. Seventh ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (August):276-290.
  220. Herlihy, M. P. 1991. Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems 13(1):124-149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  221. Herlihy, M. P. 1993. A Methodology for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems 15(5):745-770. Google ScholarGoogle ScholarDigital LibraryDigital Library
  222. Herlihy, M. P., and J. E. B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. Proc. 20th Annual Symposium on Computer Architecture , San Diego, CA (May):289-301. Google ScholarGoogle Scholar
  223. Herlihy, M. P., and J. Wing. 1987. Axioms for Concurrent Objects. Proc. 14th ACM Symposium on Principles of Programming Languages (January): 13-26. Google ScholarGoogle Scholar
  224. Hernquist, L. 1987. Performance Characteristics of Tree Codes. Astrophysics Journal Supplement 64(August):715-734.Google ScholarGoogle ScholarCross RefCross Ref
  225. Hey, A. J. G. 1991. The Genesis Distributed Memory Benchmarks. Parallel Computing 17:1111-1130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  226. High Performance Fortran Forum. 1993. High Performance Fortran Language Specification. Scientific Programming 2(1): 1-270.Google ScholarGoogle Scholar
  227. Hill, M. D., S. J. Eggers, J. R. Larus, G. S. Taylor, G. Adams, B. K. Bose, G. A. Gibson, P. M. Hansen, J. Keller, S. I. Kong, C. G. Lee, D. Lee, J. M. Pendleton, S. A. Ritchie, D. A. Wood, B. G. Zorn, P. N. Hilfinger, D. A. Hodges, R. H. Katz, J. Ousterhout, and D. A. Patterson. 1986. Design Decisions in SPUR. IEEE Computer 19(10):8-22. Also in Computers for Artificial Intelligence Processing. Edited by B. W. Wah and C. V. Ramamoorthy. New York: John Wiley and Sons, 273-299.
  228. Hill, M. D., and A. J. Smith. 1989. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers C-38(12):1612-1630. Google ScholarGoogle Scholar
  229. Hillis, W. D. 1985. The Connection Machine . Cambridge, MA: MIT Press. Google ScholarGoogle Scholar
  230. Hillis, W. D., and G. L. Steele. 1986. Data Parallel Algorithms. Communications of the ACM 29(12):1170-1183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  231. Hillis, W. D., and L. W. Tucker. 1993. The CM-5 Connection Machine: A Scalable Supercomputer. Communications of the ACM 36(11):31-40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  232. Hirata, H., K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa. 1992. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. Proc. 19th Int'l Symposium on Computer Architecture (May):136-145. Google ScholarGoogle Scholar
  233. Hoare, C. A. R. 1978. Communicating Sequential Processes. Communications of the ACM 21(8):666-667. Google ScholarGoogle ScholarDigital LibraryDigital Library
  234. Hockney, R. W. and C. R. Jesshope. 1988. Parallel Computers 2. London: Adam Hilger.Google ScholarGoogle Scholar
  235. Holt, C., M. Heinrich, J. P. Singh, E. Rothberg, and J. L. Hennessy. 1995. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Tech. Report #CSL-TR-95-660, Computer Systems Laboratory, Stanford University (January).
  236. Homewood, M., and M. McLaren. 1993. Meiko CS-2 Interconnect Elan--Elite Design. Hot Interconnects (August).
  237. Horie, T., K. Hayashi, T. Shimizu, and H. Ishihata. 1993. Improving the AP1000 Parallel Computer Performance with Message Passing. Proc. 20th Annual Int'l Symposium on Computer Architecture (May):314-325.
  238. Horowitz, M. 1997. Limits of Electrical Signalling. Hot Interconnects Keynote (August).Google ScholarGoogle Scholar
  239. Horst, R. 1995. TNet: A Reliable System Area Network. IEEE Micro 15(1):37-45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  240. Horst, R. W., and T. C. K. Chou. 1985. An Architecture for High Volume Transaction Processing. Proc. 12th Annual Int'l Symposium on Computer Architecture (June):240-245. Boston, MA. (Tandem NonStop II).
  241. Horst, R. W., R. L. Harris, and R. L. Jardine. 1990. Multiple Instruction Issue in the NonStop Cyclone Processor. Proc. Annual International Symposium on Computer Architecture (ISCA) , 216-226. Google ScholarGoogle Scholar
  242. Hristea, C., D. Lenoski, and J. Keen. 1997. Measuring Memory Hierarchy Performance of Cache Coherent Multiprocessors Using Micro Benchmarks. Proc. SC97 (November; all-Web conference proceeding).
  243. Hunt, D. 1996. Advanced Features of the 64-Bit PA-8000. Palo Alto, CA: Hewlett Packard Corp.
  244. IEEE Computer Society. 1993. IEEE Standard for Scalable Coherent Interface (SCI) . IEEE Standard 1596-1992. Washington, DC: IEEE Computer Society.Google ScholarGoogle Scholar
  245. IEEE Computer Society. 1995. IEEE Standard for Cache Optimization for Large Numbers of Processors Using the Scalable Coherent Interface (SCI) Draft 0.35 (September). Washington, DC: IEEE Computer Society.Google ScholarGoogle Scholar
  246. Iftode, L., C. Dubnicki, E. W. Felten and K. Li. 1996. Improving Release-Consistent Shared Virtual Memory Using Automatic Update. Proc. Second Symposium on High Performance Computer Architecture (February): 14-25. Google ScholarGoogle Scholar
  247. Iftode, L., J. P. Singh, and K. Li. 1996a. Understanding Application Performance on Shared Virtual Memory Systems. Proc. 23rd Int'l Symposium on Computer Architecture (April): 122-133. Google ScholarGoogle Scholar
  248. Iftode, L., J. P. Singh, and K. Li. 1996b. Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Proc. Symposium on Parallel Algorithms and Architectures (June). Google ScholarGoogle Scholar
  249. Intel Corporation. 1994. i750, i860, i960 Processors and Related Products. Santa Clara, CA: Intel Corp.
  250. Intel Corporation. 1996. Pentium® Pro Family Developers Manual. Santa Clara, CA: Intel Corp.
  251. Jeremiassen, T. E., and S. J. Eggers. 1991. Eliminating False Sharing. Proc. 1991 Int'l Conference on Parallel Processing (August):377-381.
  252. Jiang, D., H. Shan, and J. P. Singh. 1997. Application Restructuring and Performance Portability on Shared Virtual Memory and Hardware-Coherent Multiprocessors. Proc. Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (June):217-229.
  253. Jiang, D., and J. P. Singh. 1998. A Methodology and an Evaluation of the SGI Origin2000. Proc. SIGMETRICS Conference on Measurement and Modeling of Computer Systems (June). Google ScholarGoogle Scholar
  254. Joe, T. 1995. COMA-F: A Non-Hierarchical Cache Only Memory Architecture . Ph.D. diss., Computer Systems Laboratory, Stanford University (March). Google ScholarGoogle Scholar
  255. Joe, T., and J. L. Hennessy. 1994. Evaluating the Memory Overhead Required for COMA Architectures. Proc. 21st Int'l Symposium on Computer Architecture (April):82-93. Google ScholarGoogle Scholar
  256. Joerg, C. F. 1994. Design and Implementation of a Packet Switched Routing Chip . Tech. Report MIT/LCS/TR-482, MIT Laboratory for Computer Science (August). Google ScholarGoogle Scholar
  257. Joerg, C. F., and A. Boughton. 1991. The Monsoon Interconnection Network. Proc. ICCD (October). Google ScholarGoogle Scholar
  258. Johnson, M. 1991. Superscalar Microprocessor Design . Englewood Cliffs, NJ: Prentice Hall.Google ScholarGoogle Scholar
  259. Jordan, H. F. 1985. HEP Architecture, Programming, and Performance. In Parallel MIMD Computation: The HEP Supercomputer and Its Applications. Edited by J. S. Kowalik. Cambridge, MA: MIT Press, 8.
  260. Jouppi, N. P. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. Proc. 17th Annual Symposium on Computer Architecture (June):364-373. Google ScholarGoogle Scholar
  261. Jouppi, N. P., and P. Ranganathan. 1997. The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance . Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at the Annual Int'l Symposium on Computer Architecture (ISCA) '97 (June).Google ScholarGoogle Scholar
  262. Jouppi, N. P. and D. Wall. 1989. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. ASPLOS III , 272-282. Google ScholarGoogle Scholar
  263. Kagi, A., D. Burger, and J. R. Goodman. 1997. Efficient Synchronization: Let Them Eat QOLB. Proc. 24th Int'l Symposium on Computer Architecture (ISCA) (June): 170-180. Google ScholarGoogle Scholar
  264. Karlin, A. R., M. S. Manasse, L. Rudolph and D. D. Sleator. 1986. Competitive Snoopy Caching. Proc. 27th Annual IEEE Symposium on Foundations of Computer Science . Google ScholarGoogle Scholar
  265. Karol, M., M. Hluchyj and S. Morgan. 1987. Input versus Output Queueing on a Space Division Packet Switch. IEEE Transactions on Communications 35(12):1347-1356.Google ScholarGoogle ScholarCross RefCross Ref
  266. Karp, R., U. Vazirani and V. Vazirani. 1990. An Optimal Algorithm for On-Line Bipartite Matching. Proc. 22nd ACM Symposium on the Theory of Computing (May):352-358. Google ScholarGoogle Scholar
  267. Kaxiras, S. 1996. Kiloprocessor Extensions to SCI. Proc. 10th Int'l Parallel Processing Symposium . Google ScholarGoogle Scholar
  268. Kaxiras, S., and J. Goodman. The GLOW Cache Coherence Protocol Extensions for Widely Shared Data. Proc. Int'l Conference on Supercomputing (May):35-43. Google ScholarGoogle Scholar
  269. Keeton, K. K., T. E. Anderson, and D. A. Patterson. 1995. LogP Quantified: The Case for Low-Overhead Local Area Networks. Hot Interconnects III: Symposium on High Performance Interconnects (August).
  270. Keleher, P., A. L. Cox, S. Dwarkadas and W. Zwaenepoel. 1994. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Proc. Winter USENIX Conference (January):15-132. Google ScholarGoogle Scholar
  271. Keleher, P., A. L. Cox, and W. Zwaenepoel. 1992. Lazy Consistency for Software Distributed Shared Memory. Proc. 19th Int'l Symposium on Computer Architecture (May): 13-21. Google ScholarGoogle Scholar
  272. Kermani, P. and L. Kleinrock. 1979. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks 3 (September):267-286.Google ScholarGoogle Scholar
  273. Kessler, R. E., and J. L. Schwarzmeier. 1993. Cray T3D: A New Dimension for Cray Research. Proc. Papers, COMPCON Spring'93 , San Francisco (February):176-182.Google ScholarGoogle Scholar
  274. Knuth, D. E. 1966. Additional Comments on a Problem in Concurrent Programming Control. Communications of the ACM 9(5):321-322. Google ScholarGoogle ScholarDigital LibraryDigital Library
  275. Koelbel, C., D. Loveman, R. Schreiber, G. Steele, and M. Zosel. 1994. The High Performance Fortran Handbook. Cambridge, MA: MIT Press.
  276. Koeninger, R. K., M. Furtney, and M. Walker. 1994. A Shared Memory MPP from Cray Research. Digital Technical Journal 6(2):8-21.Google ScholarGoogle Scholar
  277. Kogge, P. M. 1994. EXECUBE--A New Architecture for Scalable MPPs. 1994 Int'l Conference on Parallel Processing (August):177-184. Google ScholarGoogle Scholar
  278. Kontothanassis, L. I., G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. Scott. 1997. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. Proc. 24th Int'l Symposium on Computer Architecture (June). Google ScholarGoogle Scholar
  279. Kontothanassis, L. I., and M. L. Scott. 1996. Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory. Proc. Second Symposium on High Performance Computer Architecture (February): 166-177. Google ScholarGoogle Scholar
  280. Kostantantindou, S., and L. Snyder. 1991. Chaos Router: Architecture and Performance. Proc. 18th Annual Symposium on Computer Architecture (May):212-221. Google ScholarGoogle Scholar
  281. Krishnamurthy, A., K. E. Schauser, C. J. Scheiman, R. Y. Wang, D. E. Culler, and K. Yelick. 1996. Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines. ACM SIGPLAN Notices 31(9):37-48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  282. Krishnamurthy, A., and K. A. Yelick. 1994. Optimizing Parallel SPMD Programs. Seventh Annual Workshop on Languages and Compilers for Parallel Computing . Ithaca, NY (August). Google ScholarGoogle Scholar
  283. Krishnamurthy, A., and K. A. Yelick. 1995. Optimizing Parallel Programs with Explicit Synchronization. Programming Language Design and Implementation, 196-204.
  284. Krishnamurthy, A., and K. A. Yelick. 1996. Analyses and Optimizations for Shared Address Space Programs. JPDC 38(2):130-144. Google ScholarGoogle ScholarDigital LibraryDigital Library
  285. Kroft, D. 1981. Lockup-Free Instruction Fetch/Prefetch Cache Organization. Proc. Eighth Int'l Symposium on Computer Architecture (May):81-87. Google ScholarGoogle Scholar
  286. Kronenberg, N. R. H. Levy, and W. D. Strecker. 1986. Vax Clusters: A Closely-Coupled Distributed System. ACM Transactions on Computer Systems 4(2): 130-146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  287. Kruskal, C. P., and M. Snir. 1983. The Performance of Multistage Interconnection Networks for Multiprocessors. IEEE Transactions on Computers C-32(12):1091-1098. Google ScholarGoogle Scholar
  288. Kubiatowicz, J., and A. Agarwal. 1993. The Anatomy of a Message in the Alewife Multiprocessor. Proc. Int'l Conference on Supercomputing (July): 195-206. Google ScholarGoogle Scholar
  289. Kuehn, J. T., and B. J. Smith. 1988. The Horizon Supercomputing System: Architecture and Software. Proc. Supercomputing '88 (November):28-34. Google ScholarGoogle Scholar
  290. Kumar, M. 1992. Unique Design Concepts in GF11 and Their Impact on Performance. IBM Journal of Research and Development 36(6):990-1000.
  291. Kumar, V., A. Grama, A. Gupta, and G. Karypis. 1994. Introduction to Parallel Computing: Design and Analysis of Algorithms. Redwood City, CA: Benjamin/Cummings Publishing Company.
  292. Kumar, V., and A. Gupta. 1991. Analysis of Scalability of Parallel Algorithms and Architectures: A Survey. Proc. Int'l Conference on Supercomputing (June):396-405. Google ScholarGoogle Scholar
  293. Kung, H. T. R. Sansom, S. Schlick, P. A. Steenkiste, M. Arnould, F. J. Bitz, F. Christianson, E. C. Cooper, O. Menzilcioglu, D Ombres, and B. Zill. 1989. Network-Based Multicomputers: An Emerging Parallel Architecture. Proc. Supercomputing '91 Conference (November):664-673. Google ScholarGoogle Scholar
  294. Kurihara, K., D. Chaiken, and A. Agarwal. 1991. Latency Tolerance through Multithreading in Large-Scale Multiprocessors. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):91-101.
  295. Kuskin, J., D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. 1994. The Stanford FLASH Multiprocessor. Proc. 21st Int'l Symposium on Computer Architecture (April):302-313.
  296. Lam, M. S., and R. P. Wilson. 1992. Limits of Control Flow on Parallelism. Proc. 19th Annual Int'l Symposium on Computer Architecture (May):46-57.
  297. Lamport, L. 1979. How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers C-28(9):690-691.
  298. Larus, J. R., B. Richards, and G. Viswanathan. 1996. Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language. In Parallel Programming Using C++. Edited by G. V. Wilson and P. Lu. Cambridge, MA: MIT Press.
  299. Laudon, J., A. 1994. Architectural and Implementation Tradeoffs in Multiple-Context Processors. Ph.D. diss., Stanford University, Stanford, California. Also published as Tech. Report #CSL-TR-94-634. Computer Systems Laboratory, Stanford University (May).
  300. Laudon, J., A. Gupta, and M. Horowitz. 1994. Architectural and Implementation Tradeoffs in the Design of Multiple-Context Processors. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. A. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 167-200.
  301. Laudon, J. P., and D. Lenoski. 1997. The SGI Origin: A ccNUMA Highly Scalable Server. Proc. 24th Int'l Symposium on Computer Architecture.
  302. Lawton, J. V., J. J. Brosnan, M. P. Doyle, S. D. O'Riordain, and T. G. Reddin. 1996. Building a High-Performance Message-Passing System for MEMORY CHANNEL Clusters. Digital Technical Journal 8(2):96-116.
  303. Lee, C. G. 1989. Multi-Step Gradual Rounding. IEEE Transactions on Computers 38(4):595-600.
  304. Lee, R. L., A. Y. Kwok, and F. A. Briggs. 1991. The Floating point Performance of a Superscalar SPARC Processor. Proc. 4th Symposium on Architectural Support for Programming Languages and Operating Systems (April):28-37.
  305. Leighton, F. T. 1992. Introduction to Parallel Algorithms and Architectures. San Francisco: Morgan Kaufmann.
  306. Leiserson, C. E. 1985. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers C-34(10):892-901.
  307. Leiserson, C. E., Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, and R. Zak. 1996. The Network Architecture of the Connection Machine CM-5. Journal of Parallel and Distributed Computing 33(2):145-158. Also in Proc. Fourth Symposium on Parallel Algorithms and Architectures '92 (June):272-285.
  308. Lenoski, D. 1992. The Stanford DASH Multiprocessor. Ph.D. diss., Computer Systems Laboratory, Stanford University.
  309. Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta, and J. L. Hennessy. 1990. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proc. 17th Int'l Symposium on Computer Architecture (May):148-159.
  310. Lenoski, D., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. 1992. The DASH Prototype: Implementation and Performance. Proc. 19th Int'l Symposium on Computer Architecture, Gold Coast, Australia (May):92-103.
  311. Lenoski, D., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. 1993. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems 4(1):41-61.
  312. Li, K., and P. Hudak. 1989. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems 7(4):321-359.
  313. Li, S.-Y. 1988. Theory of Periodic Contention and Its Application to Packet Switching. Proc. INFOCOM '88 (March):320-325.
  314. Lim, B.-H., and A. Agarwal. 1994. Reactive Synchronization Algorithms for Multiprocessors. Proc. Sixth Int'l Conference on Architectural Support for Programming Languages and Operating Systems, 25-35.
  315. Linder, D., and J. Harden. 1991. An Adaptive Fault Tolerant Wormhole Strategy for k-ary n-cubes. IEEE Transactions on Computers C-40(1):2-12.
  316. Lipton, R., and J. Sandberg. 1988. PRAM: A Scalable Shared Memory. Tech. Report #CS-TR-180-88, Computer Science Dept., Princeton University (September).
  317. Litzkow, M., M. Livny, and M. W. Mutka. 1988. Condor--A Hunter of Idle Workstations. Proc. Eighth Int'l Conference of Distributed Computing Systems (June):104-111.
  318. Lo, J. L., S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen. 1997. Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems (August).
  319. Lonergan, W., and P. King. 1961. Design of the B 5000 System. Datamation 7(5):28-32.
  320. Lovett, T., and R. Clapp. 1996. STiNG: A CC-NUMA Computer System for the Commercial Marketplace. Proc. 23rd Int'l Symposium on Computer Architecture (May):308-317.
  321. Luk, C.-K., and T. C. Mowry. 1996. Compiler-Based Prefetching for Recursive Data Structures. Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (October):222-233.
  322. Lukowsky, J., and S. Polit. 1997 (date accessed). IP Packet Switching on the GIGAswitch/FDDI System. http://www.networks.digital.com:80/dr/techart/gsfip-mn.html.
  323. Mainwaring, A., B. Chun, S. Schleimer, and D. Wilkerson. 1997. System Area Network Mapping. Proc. Ninth Annual ACM Symposium on Parallel Algorithms and Architecture, Newport, RI (June):116-126.
  324. Mainwaring, A., and D. E. Culler. 1996. Active Message Applications Programming Interface and Communication Subsystem Organization. Tech. Report CSD-96-918. University of California at Berkeley.
  325. Martin, R. 1994. HPAM: An Active Message Layer of a Network of Workstations. Presented at Hot Interconnects II (August).
  326. Massalin, H., and C. Pu. 1991. A Lock-Free Multiprocessor OS Kernel. Tech. Report CUCS-005-01, Columbia University, Computer Science Dept. (October).
  327. Matelan, N. 1985. The FLEX/32 Multicomputer. Proc. 12th Annual Int'l Symposium on Computer Architecture, Boston, MA (Flex) (June):209-213.
  328. May, C., E. Silha, R. Simpson, and H. Warren, eds. 1994. The PowerPC Architecture: A Specification for a New Family of RISC Processors. San Francisco: Morgan Kaufmann.
  329. McCreight, E. 1984. The Dragon Computer System: An Early Overview. Tech. Report, Xerox Corp. (September).
  330. Mellor-Crummey, J., and M. Scott. 1991. Algorithms for Scalable Synchronization on Shared Memory Multiprocessors. ACM Transactions on Computer Systems 9(1):21-65.
  331. Melvin, S., and Y. Patt. 1991. Exploiting Fine-Grained Parallelism through a Combination of Hardware and Software Techniques. Proc. Annual Int'l Symposium on Computer Architecture (ISCA), 287-296.
  332. Michael, M., and M. Scott. 1996. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proc. 15th Annual ACM Symposium on Principles of Distributed Computing, Philadelphia, PA (May):267-276.
  333. Minnich, R., D. Burns, and F. Hady. 1995. The Memory-Integrated Network Interface. IEEE Micro 15(1):11-20.
  334. MIPS Technologies. 1991. MIPS R4000 Users Manual. Mountain View, CA: MIPS Technologies.
  335. MIPS Technologies. 1996. R10000 Microprocessor User's Manual, Version 1.1 (January). Mountain View, CA: MIPS Technologies.
  336. Miyoshi, H., M. Fukuda, T. Iwamiya, T. Nakamura, M. Tuchiya, M. Yoshida, K. Yamamoto, Y. Yamamoto, S. Ogawa, Y. Matsuo, T. Yamane, M. Takamura, M. Ikeda, S. Okada, Y. Sakamoto, T. Kitamura, H. Hatama, M. Kishimoto, M. Arnould, F. J. Bitz, E. C. Cooper, H. T. Kung, R. Sansom, S. Schlick, P. A. Steenkiste, and B. Zill. 1994. Development and Achievement of NAL Numerical Wind Tunnel (NWT) for CFD Computations. Proc. Supercomputing '94, Washington, DC (November):685-692.
  337. Mowry, T. C. 1994. Tolerating Latency through Software-Controlled Data Prefetching. Ph.D. diss., Computer Systems Laboratory, Stanford University. Also published as Tech. Report #CSL-TR-94-628. Computer Systems Laboratory, Stanford University (June).
  338. MPI Forum. 1993. Document for a Standard Message-Passing Interface. Tech. Report CS-93-214. University of Tennessee, Knoxville, Computer Science Dept. (November).
  339. MPI Forum. 1994. MPI: A Message Passing Interface. Int'l Journal of Supercomputing Applications 8(3/4). Special Issue on MPI. (updated 5/95). Also published in Proc. Supercomputing '93 Conference (May). Los Alamitos, CA: IEEE Computer Society Press, 878-883. Updated spec at http://www.mcs.anl.gov/mpi/.
  340. Mukherjee, S., and M. Hill. 1997. A Case for Making Network Interfaces Less Peripheral. Hot Interconnects (August).
  341. NAS Parallel Benchmarks. 1998 (date accessed). http://science.nas.nasa.gov/Software/NPB/.
  342. Nayfeh, B. A., L. Hammond, and K. Olukotun. 1996. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. Proc. 23rd Annual Int'l Symposium on Computer Architecture (May). New York: ACM Press, 67-77.
  343. Nestle, E., and A. Inselberg. 1985. The Synapse N+1 System: Architectural Characteristics and Performance Data of a Tightly-Coupled Multiprocessor System. Proc. 12th Annual Int'l Symposium on Computer Architecture, Boston, MA (Synapse) (June):233-239.
  344. Ngai, J., and C. Seitz. 1989. A Framework for Adaptive Routing in Multicomputer Networks. Proc. 1989 Symposium on Parallel Algorithms and Architectures (June):2-10.
  345. Nickolls, J. R. 1990. The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer. COMPCON Spring '90, Digest of Papers, San Francisco, CA (February/March):25-28.
  346. Nikhil, R. S., and Arvind. 1989. Can Dataflow Subsume von Neumann Computing? Proc. 16th Annual Int'l Symposium on Computer Architecture (May):262-72.
  347. Nikhil, R., G. Papadopoulos, and Arvind. 1993. *T: A Multithreaded Massively Parallel Architecture. Proc. Annual Int'l Symposium on Computer Architecture (ISCA) '93 (May):156-167.
  348. Noakes, M. D., D. A. Wallach, and W. J. Dally. 1993. The J-Machine Multicomputer: An Architectural Evaluation. Proc. 20th Int'l Symposium on Computer Architecture (May):224-235.
  349. Nuth, P., and W. J. Dally. 1992. The J-Machine Network. Proc. Int'l Conference on Computer Design: VLSI in Computers and Processors (October).
  350. Nuth, P., and W. J. Dally. 1995. The Named-State Register File: Implementation and Performance. Proc. First Int'l Symposium on High-Performance Computer Architecture (January):4-13.
  351. O'Krafka, B., and A. Newton. 1990. An Empirical Evaluation of Two Memory-Efficient Directory Methods. Proc. 17th Int'l Symposium on Computer Architecture (May):138-147.
  352. Office of Science and Technology Policy. 1993. Grand Challenges 1993: High Performance Computing and Communications, A Report by the Committee on Physical, Mathematical, and Engineering Sciences. Washington, DC: Office of Science and Technology Policy.
  353. Ohara, M. 1996. Producer-Oriented versus Consumer-Oriented Prefetching: A Comparison and Analysis of Parallel Application Programs. Ph.D. diss., Computer Systems Laboratory, Stanford University. Available as Tech. Report #CSL-TR-96-695, Stanford University (June).
  354. Olukotun, K., B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. 1996. The Case for a Single-Chip Multiprocessor. Proc. ASPLOS (October):2-11.
  355. Omondi, A. R. 1994. Ideas for the Design of Multithreaded Pipelines. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 1994. See also A. R. Omondi, Design of a High Performance Instruction Pipeline. Computer Systems Science and Engineering 6(1):13-29 (1991).
  356. Pacheco, P. 1996. Parallel Programming with MPI. San Francisco: Morgan Kaufmann.
  357. Padegs, A. 1981. System/360 and Beyond. IBM Journal of Research and Development 25(5):377-390.
  358. Pai, V. S., P. Ranganathan, S. V. Adve, and T. Harton. 1996. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (October):12-23.
  359. Papadimitriou, C. H. 1979. The Serializability of Concurrent Database Updates. Journal of the ACM 26(4):631-653.
  360. Papadopoulos, G. M., and D. E. Culler. 1990. Monsoon: An Explicit Token-Store Architecture. Proc. 17th Annual Int'l Symposium on Computer Architecture, Seattle, WA (May):82-91.
  361. Papamarcos, M., and J. Patel. 1984. A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proc. 11th Annual Int'l Symposium on Computer Architecture (June):348-354.
  362. PARKBENCH Committee. 1994. Public International Benchmarks for Parallel Computers. Scientific Programming 3(2). Also published as Tech. Report CS93-213, University of Tennessee, Knoxville, Dept. of Computer Science (November).
  363. Patterson, D. A. 1995. Microprocessors in 2020. Scientific American (September).
  364. Patterson, D. A., T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A Case for Intelligent RAM. IEEE Micro 17(2):34-44.
  365. Peterson, L., and B. Davie. 1996. Computer Networks. San Francisco: Morgan Kaufmann.
  366. Pfeiffer, W., S. Hotovy, N. Nystrom, D. Rudy, T. Sterling, and M. Straka. 1995 (date accessed). JNNIE: The Joint NSF-NASA Initiative on Evaluation. http://www.tc.cornell.edu/JNNIE/finrep/jnnie.html.
  367. Pfister, G. F. 1995. In Search of Clusters--The Coming Battle for Lowly Parallel Computing. Englewood Cliffs, NJ: Prentice Hall.
  368. Pfister, G. F., W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliff, E. A. Melton, V. A. Norton, and J. Weiss. 1985. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. Proc. Int'l Conference on Parallel Processing (August):264-771.
  369. Pfister, G. F., and V. A. Norton. 1985. Hot Spot Contention and Combining Multistage Interconnection Networks. IEEE Transactions on Computers C-34(10).
  370. Pierce, P. 1988. The NX/2 Operating System. Proc. Third Conference on Hypercube Concurrent Computers and Applications (January):384-390.
  371. Pierce, P., and G. Regnier. 1994. The Paragon Implementation of the NX Message Passing Interface. Proc. Scalable High-Performance Computing Conference (May):184-90.
  372. Porter, R. E. 1960. Datamation 6(1):8-14.
  373. Przybylski, S., M. Horowitz, and J. L. Hennessy. 1988. Performance Tradeoffs in Cache Design. Proc. 15th Annual Symposium on Computer Architecture (May):290-298.
  374. Ranganathan, P., V. S. Pai, H. Abdel-Shafi, and S. V. Adve. 1997. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. Proc. 24th Int'l Symposium on Computer Architecture (June).
  375. Ratner, J. 1985. Concurrent Processing: A New Direction in Scientific Computing. Proc. 1985 National Computing Conference, 835.
  376. Reddaway, S. F. 1973. DAP--A Distributed Array Processor. First Annual Int'l Symposium on Computer Architecture (December):61-65.
  377. Reinhardt, S. K., M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood. 1993. The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers. Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May):48-60.
  378. Reinhardt, S. K., J. R. Larus, and D. A. Wood. 1994. Tempest and Typhoon: User-Level Shared Memory. Proc. 21st Int'l Symposium on Computer Architecture (April):325-337.
  379. Reinhardt, S. K., R. W. Pfile, and D. A. Wood. 1996. Decoupled Hardware Support for Distributed Shared Memory. Proc. 23rd Int'l Symposium on Computer Architecture (May):34-43.
  380. Rettberg, R., W. Crowther, P. Carvey, and R. Tomlinson. 1990. The Monarch Parallel Processor Hardware Design. IEEE Computer (April):18-30.
  381. Rettberg, R., and R. Thomas. 1986. Contention is No Obstacle to Shared-Memory Multiprocessing. Communications of the ACM 29(12):1202-1212.
  382. Rinard, M. C., D. J. Scales, and M. S. Lam. 1993. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer 26(6).
  383. Rodgers, D. 1985. Improvements on Multiprocessor System Design. Proc. 12th Annual Int'l Symposium on Computer Architecture, Boston, MA (Sequent B8000) (June):225-231.
  384. Rosenblum, M., S. A. Herrod, E. Witchel, and A. Gupta. 1995. Complete Computer Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology 3(4).
  385. Rosenburg, B. 1989. Low-Synchronization Translation Lookaside Buffer Consistency in Large-Scale Shared-Memory Multiprocessors. Proc. Symposium on Operating Systems Principles (December).
  386. Rothberg, E., J. P. Singh, and A. Gupta. 1993. Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors. Proc. 20th Int'l Symposium on Computer Architecture (May):14-25.
  387. Russel, R. M. 1978. The CRAY-1 Computer System. Communications of the ACM 21(1):63-72.
  388. Saavedra-Barrera, R. H., D. E. Culler, and T. von Eicken. 1990. Analysis of Multithreaded Architectures for Parallel Computing. Second Annual ACM Symposium on Parallel Algorithms and Architectures (July):169-178.
  389. Saavedra, R. H., R. S. Gaines, and M. J. Carlton. 1993. Micro Benchmark Analysis of the KSR1. Proc. Supercomputing '93, Portland, OR (November):202-213.
  390. Saavedra, R. H., and A. J. Smith. 1996. Analysis of Benchmark Characteristics and Benchmark Performance Prediction. ACM Transactions on Computer Systems 14(4):344-384.
  391. Sakai, S., Y. Kodama, and Y. Yamaguchi. 1991. Prototype Implementation of a Highly Parallel Dataflow Machine EM4. Proc. Fifth Int'l Parallel Processing Symposium, Anaheim, CA (April/May):278-286.
  392. Salmon, J. 1990. Parallel Hierarchical N-body Methods. Ph.D. diss., California Institute of Technology.
  393. Salmon, J. K., M. S. Warren, and G. S. Winckelmans. 1994. Fast Parallel Tree codes for Gravitational and Fluid Dynamical N-body Problems. Intl. Journal of Supercomputer Applications 8:129-142.
  394. Samanta, R., A. Bilas, L. Iftode, and J. P. Singh. 1998. Home-Based SVM Protocols for SMP Clusters: Design, Simulations, Implementation, and Performance. Proc. 23rd Annual Int'l Symposium on Computer Architecture (February).
  395. Saulsbury, A., F. Pong, and A. Nowatzyk. 1996. Missing the Memory Wall: The Case for Processor/Memory Integration. Proc. 23rd Annual Int'l Symposium on Computer Architecture (May):90-101.
  396. Saulsbury, A., T. Wilkinson, J. Carter, and A. Landin. 1995. An Argument For Simple COMA. Proc. First IEEE Symposium on High Performance Computer Architecture (January):276-285.
  397. Savage, J. 1985. Parallel Processing as a Language Design Problem. Proc. 12th Annual Int'l Symposium on Computer Architecture, Boston, MA (Myrias 4000) (June):221-224.
  398. Scales, D. J., K. Gharachorloo, and C. A. Thekkath. 1996. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):174-185.
  399. Scales, D. J., and M. S. Lam. 1994. The Design and Evaluation of a Shared Object System for Distributed Memory Machines. Proc. First Symposium on Operating System Design and Implementation (November):101-114.
  400. Schanin, D. J. 1986. The Design and Development of a Very High Speed System Bus--The Encore Multimax Nanobus. In Proc. Fall Joint Computer Conference (Encore), Dallas, TX (November). Edited by H. S. Stone. Los Alamitos: IEEE Computer Society Press, 410-418.
  401. Schauser, K. E., and C. J. Scheiman. 1995. Experience with Active Messages on the Meiko CS-2. Proc. Ninth Int'l Symposium on Parallel Processing (IPPS'95) (April):140-149.
  402. Scheurich, C., and M. Dubois. 1987. Correct Memory Operation of Cache-Based Multiprocessors. Proc. 14th Int'l Symposium on Computer Architecture (June):234-243.
  403. Schoinas, I., B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. 1994. Fine-Grain Access Control for Distributed Shared Memory. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):297-306.
  404. Schroeder, M. D., A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker. 1991. Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links. IEEE Journal on Selected Areas in Communications 9(8):1318-1335.
  405. Schwiebert, L., and D. N. Jayasimha. 1995. A Universal Proof Technique for Deadlock-Free Routing in Interconnection Networks. Symposium on Parallel Algorithms and Architecture (July):175-184.
  406. Scott, S. 1991. A Cache-Coherence Mechanism for Scalable Shared-Memory Multiprocessors. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):49-59.
  407. Scott, S. 1996. Synchronization and Communication in the T3E Multiprocessor. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA (October):26-36.
  408. Scott, S., and J. R. Goodman. 1993. Performance of Pruning Cache Directories for Large-Scale Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 4(5):520-534.
  409. Scott, S. 1994. The Impact of Pipelined Channels on k-ary n-Cube Networks. IEEE Transactions on Parallel and Distributed Systems 5(1):2-16.
  410. Scott, S., M. Vernon, and J. R. Goodman. 1992. Performance of the SCI Ring. Proc. 19th Int'l Symposium on Computer Architecture (May):403-414.
  411. Seitz, C. L. 1984. Concurrent VLSI Architectures. IEEE Transactions on Computers 33(12):1247-1265.
  412. Seitz, C. L. 1985. The Cosmic Cube. Communications of the ACM 28(1):22-33.
  413. Seitz, C. L., and W.-K. Su. 1993. A Family of Routing and Communication Chips Based on Mosaic. Proc. of Univ. of Washington Symposium on Integrated Systems. Cambridge, MA: MIT Press, 320-337.
  414. Shah, G., J. Nieplocha, J. Mirza, C. Kim, R. Harrison, R. K. Govindaraju, K. Gildea, P. DiNicola, and C. Bender. 1998. Performance and Experience with LAPI--A New High-Performance Communication Library for the IBM RS/6000 SP. Twelfth Int'l Parallel Processing Symposium (March):260-266.
  415. Shasha, D., and M. Snir. 1988. Efficient and Correct Execution of Parallel Programs that Share Memory. ACM Transactions on Programming Languages and Systems 10(2):282-312.
  416. Shimada, T., K. Hiraki, and K. Nishida. 1984. An Architecture of a Data Flow Machine and Its Evaluation. Proc. COMPCON '84, 486-90.
  417. Simoni, R., and M. Horowitz. 1991. Dynamic Pointer Allocation for Scalable Cache Coherence Directories. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):72-81.
  418. Sindhu, P., J.-M. Frailong, and M. Cekleov. 1991. Formal Specification of Memory Models. Tech. Report (PARC) CSL-91-11. Xerox Corp., Palo Alto Research Center, Palo Alto, CA.
  419. Sindhu, P., et al. 1993. XDBus: A High-Performance, Consistent, Packet Switched VLSI Bus. Proc. COMPCON (Spring):338-344.
  420. Singh, J. P. 1993. Parallel Hierarchical N-body Methods and Their Implications for Multiprocessors. Ph.D. diss., Tech. Report #CSL-TR-93-565. Stanford University (March).
  421. Singh, J. P. 1998. Some Aspects of Controlling Scheduling in Hardware Control Prefetching. To be published as Tech. Report, Princeton University, Computer Science Dept.
  422. Singh, J. P., A. Gupta, and M. Levoy. 1994. Parallel Visualization Algorithms: Performance and Architectural Implications. IEEE Computer 27(6).
  423. Singh, J. P., J. L. Hennessy, and A. Gupta. 1993. Scaling Parallel Programs for Multiprocessors: Methodology and Examples. IEEE Computer 26(7):42-50.
  424. Singh, J. P., J. L. Hennessy, and A. Gupta. 1995. Implications of Parallel Hierarchical N-body Applications for Multiprocessors. ACM Transactions on Computer Systems (May).
  425. Singh, J. P., C. Holt, T. Totsuka, A. Gupta, and J. L. Hennessy. 1995. Load Balancing and Data Locality in Hierarchical N-body Methods: Barnes-Hut, Fast Multipole and Radiosity. Journal of Parallel and Distributed Computing (June).
  426. Singh, J. P., T. Joe, A. Gupta, and J. L. Hennessy. 1993. An Empirical Comparison of the KSR-1 and DASH Multiprocessors. Proc. Supercomputing '93 (November).
  427. Singh, J. P., E. Rothberg, and A. Gupta. 1994. Modeling Communication in Parallel Algorithms: A Fruitful Interaction between Theory and Systems? Proc. 10th Annual ACM Symposium on Parallel Algorithms and Architectures.
  428. Singh, J. P., W.-D. Weber, and A. Gupta. 1992. SPLASH: The Stanford Parallel Applications for SHared Memory. Computer Architecture News 20(1):5-44.
  429. Sites, R. L., ed. 1992. Alpha Architecture Reference Manual. Hudson, MA: Digital Press, Digital Equipment Corp.
  430. Slater, M. 1994. Intel Unveils Multiprocessor System Specification. Microprocessor Report (May):12-14.
  431. Slotnick, D. L. 1967. Unconventional Systems. Proc. AFIPS Spring Joint Computer Conference 30:477-481.
  432. Slotnick, D. L., W. C. Borck, and R. C. McReynolds. 1962. The Solomon Computer. Proc. AFIPS Fall Joint Computer Conference 22:97-107.
  433. Smith, A. J. 1982. Cache Memories. ACM Computing Surveys 14(3):473-530.
  434. Smith, B. J. 1981. Architecture and Applications of the HEP Multiprocessor Computer System. Proc. SPIE: Real-Time Signal Processing IV 298(August):241-248.
  435. Smith, B. J. 1985. The Architecture of HEP. In Parallel MIMD Computation: The HEP Supercomputer and Its Applications. Edited by J. S. Kowalik. Cambridge, MA: MIT Press, 41-55.
  436. Smith, M. D., M. Johnson, and M. A. Horowitz. 1989. Limits on Multiple Instruction Issue. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):290-302.
  437. Snir, M., S. Otto, S. H. Lederman, D. Walker, and J. Dongarra. 1995. MPI: The Complete Reference. Cambridge, MA: MIT Press.
  438. Sohi, G., S. Breach, and T. N. Vijaykumar. 1995. Multiscalar Processors. Proc. 22nd Annual Int'l Symposium on Computer Architecture (June):414-425.
  439. SPEC (Standard Performance Evaluation Corporation). 1995 (date accessed). http://www.specbench.org/. (SPEC Benchmark Suite Release 1.0., 1989).
  440. Spertus, E., S. C. Goldstein, K. E. Schauser, T. von Eicken, D. E. Culler, and W. J. Dally. 1993. Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5. Proc. 20th Annual Symposium on Computer Architecture (May):302-313.
  441. Stenstrom, P., T. Joe, and A. Gupta. 1992. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures. Proc. 19th Int'l Symposium on Computer Architecture (May):80-91.
  442. Stets, R., S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. 1997. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. Proc. 16th ACM Symposium on Operating Systems Principles (October).
  443. Stone, H. S. 1970. A Logic-in-Memory Computer. IEEE Transactions on Computers C-19(1):73-78.
  444. Stunkel, C. B., D. G. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao. 1994. The SP-1 High Performance Switch. Proc. Scalable High Performance Computing Conference (May):150-157, Knoxville, TN.
  445. Stunkel, C. B., et al. 1998 (date accessed). The SP2 Communication Subsystem. http://ibm.tc.cornell.edu/ibm/pps/doc/css/css.ps.
  446. SUN Microsystems. 1991. The SPARC Architecture Manual. #800-199-12. Version 8 (January). Mountain View, CA: SUN Microsystems.
  447. Sunderam, V. S. 1990. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience 2(4):315-339.
  448. Sunderam, V. S., J. Dongarra, A. Geist, and R. Manchek. 1994. The PVM Concurrent Computing System: Evolution, Experiences, and Trends. Parallel Computing 20(4):531-547.
  449. Swan, R.J., A. Bechtolsheim, K.-W. Lai, and J. K. Ousterhout. 1977. The Implementation of the CM* Multi-Microprocessor. Proc. AFIPS Conference/National Computer Conference (46):645-655.
  450. Swan, R. J., S. H. Fuller, and D. R. Siewiorek. 1977. CM*--A Modular, Multi-Microprocessor. Proc. AFIPS Conference/National Computer Conference (46):637-44.
  451. Sweazey, P., and A.J. Smith. 1986. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. Proc. 13th Int'l Symposium on Computer Architecture (May):414-423.
  452. Tamir, Y., and G. L. Frazier. 1988. High-Performance Multi-Queue Buffers for VLSI Communication Switches. Proc. 15th Annual Int'l Symposium on Computer Architecture, 343-354.
  453. Tanenbaum, A. S., and A. S. Woodhull. 1997. Operating System Design and Implementation 2nd ed. Englewood Cliffs, NJ: Prentice Hall.
  454. Tang, C. 1976. Cache Design in a Tightly Coupled Multiprocessor System. Proc. AFIPS Conference (June):749-753.
  455. Teller, P. 1990. Translation-Lookaside Buffer Consistency. IEEE Computer 23(6):26-36.
  456. Thacker, C., L. Stewart, and E. Satterthwaite, Jr. 1988. Firefly: A Multiprocessor Workstation. IEEE Transactions on Computers 37(8):909-20.
  457. Thapar, M., and B. Delagi. 1990. Stanford Distributed-Directory Protocol. IEEE Computer 23(6):78-80.
  458. Thekkath, R., A. P. Singh, J. P. Singh, J. Hennessy and S. John. 1997. An Application-Driven Evaluation of the Convex Exemplar SP-1200. Proc. Int'l Parallel Processing Symposium (June).
  459. Thompson, M., J. Barton, T. Jermoluk, and J. Wagner. 1988. Translation Lookaside Buffer Synchronization in a Multiprocessor System. Proc. USENIX Technical Conference (February).
  460. Thornton, J. E. 1964. Parallel Operation in the Control Data 6600. AFIPS Proc. Fall Joint Computer Conference, Part 2 26:33-40. Reprinted in Siewiorek, Bell, and Newell. 1982. Computer Structures: Principles and Examples. New York: McGraw-Hill.
  461. Tomasulo, R. M. 1967. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development 11(1):25-33.
  462. Torrellas, J., M. S. Lam, and J. L. Hennessy. 1994. False Sharing and Spatial Locality in Multiprocessor Caches. IEEE Transactions on Computers 43(6):651-663.
  463. Transaction Processing Council. 1998. http://www.tpc.org
  464. Traw, C., and J. Smith. 1991. A High-Performance Host Interface for ATM Networks. Proc. ACM SIGCOMM Conference (September):317-325.
  465. Traylor, R., and D. Dunning. 1992. Routing Chip Set for Intel Paragon Parallel Supercomputer. Proc. Hot Chips '92 Symposium (August).
  466. Tucker, L. W., and A. Mainwaring. 1994. CMMD: Active Messages on the CM-5. Parallel Computing 20(4):481-496.
  467. Tucker, L. W., and G. G. Robertson. 1988. Architecture and Applications of the Connection Machine. IEEE Computer 21(8):26-38.
  468. Tucker, S. 1986. The IBM 3090 System: An Overview. IBM Systems Journal 25(1):4-19.
  469. Tullsen, D. M., and S.J. Eggers. 1993. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. Proc. 20th Annual Symposium on Computer Architecture (May):278-288.
  470. Tullsen, D. M., S.J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. 1996. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. 23rd Int'l Symposium on Computer Architecture (May):191-202.
  471. Tullsen, D. M., S.J. Eggers, and H. M. Levy. 1995. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proc. 22nd Annual Int'l Symposium on Computer Architecture (June):392-403.
  472. Turner, J. S. 1988. Design of a Broadcast Packet Switching Network. IEEE Transactions on Communication 36(6):734-743.
  473. Valiant, L. G. 1990. A Bridging Model for Parallel Computation. Communications of the ACM 33(8):103-111.
  474. Valois, J. 1995. Lock-Free Linked Lists Using Compare-and-Swap. Proc. 14th Annual ACM Symposium on Principles of Distributed Computing, Ottawa, Canada (August):214-222.
  475. Vick, C. R., and J. A. Cornell. 1978. PEPE Architecture--Present and Future. Proc. AFIPS Conference 47:981-1002.
  476. von Eicken, T., A. Basu, and V. Buch. 1995. Low-Latency Communication Over ATM Using Active Messages. IEEE Micro 15(1):46-53.
  477. von Eicken, T., D. E. Culler, S. C. Goldstein, and K. E. Schauser. 1992. Active Messages: A Mechanism for Integrated Communication and Computation. Proc. 19th Annual Int'l Symposium on Computer Architecture, Gold Coast, Australia (May):256-266.
  478. Vranesic, Z., M. Stumm, D. Lewis, and R. White. 1991. Hector: A Hierarchically Structured Shared Memory Multiprocessor. IEEE Computer 24(1):72-78.
  479. Wall, D. W. 1991. Limits of Instruction-Level Parallelism. ASPLOS IV (April):176-188.
  480. Wallach, D. A. 1992. PHD: A Hierarchical Cache Coherence Protocol. S.M. thesis. Massachusetts Institute of Technology. Also available as Tech. Report #1389. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, MA (August).
  481. Wang, W.-H., J.-L. Baer, and H. M. Levy. 1989. Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy. Proc. 16th Annual Int'l Symposium on Computer Architecture (June):140-148.
  482. Warren, M. S., and J. K. Salmon. 1993. A Parallel Hashed Oct-Tree N-body Algorithm. Proc. Supercomputing '93. Washington, DC: IEEE Computer Society, 12-21.
  483. Weaver, D., and T. Germond, eds. 1994. The SPARC Architecture Manual. SPARC International, Version 9. Englewood Cliffs, NJ: Prentice Hall.
  484. Weber, W.-D. 1993. Scalable Directories for Cache-Coherent Shared-Memory Multiprocessors. Ph.D. diss., Computer Systems Laboratory, Stanford University (January). Also available as Tech. Report #CSL-TR-93-557. Stanford University.
  485. Weber, W.-D., S. Gold, P. Helland, T. Shimizu, T. Wicki and W. Wilcke. 1997. The Mercury Interconnect Architecture: A Cost-Effective Infrastructure for High-Performance Servers. Proc. 24th Int'l Symposium on Computer Architecture (June):98-107.
  486. Weiss, S. and J. Smith. 1994. Power and PowerPC. San Francisco: Morgan Kaufmann.
  487. Widdoes, L., Jr., and S. Correll. 1980. The S-1 Project: Developing High Performance Computers. Proc. COMPCON (Spring):282-291.
  488. Wilson, A., Jr. 1987. Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors. Proc. 14th Int'l Symposium on Computer Architecture (June):244-252.
  489. Wolf, M. E., and M. S. Lam. 1991. A Data Locality Optimizing Algorithm. Proc. ACM SIGPLAN'91 Conference on Programming Language Design and Implementation (June):30-44.
  490. Wolfe, M. 1989. Optimizing Supercompilers for Supercomputers. Cambridge, MA: MIT Press.
  491. Woo, S. C., J. P. Singh, and J. L. Hennessy. 1994. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):219-229, San Jose, CA.
  492. Woo, S. C., M. Ohara, E. J. Torrie, J. P. Singh, and A. Gupta. 1995. The SPLASH-2 Programs: Characterization and Methodological Considerations. Proc. 22nd Annual Int'l Symposium on Computer Architecture (June):24-36.
  493. Wood, D. A., S. Chandra, B. Falsafi, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, S. S. Mukherjee, S. Palacharla, and S. K. Reinhardt. 1993. Mechanisms for Cooperative Shared Memory. Proc. 20th Annual Symposium on Computer Architecture (May):156-167.
  494. Wood, D. A., S. J. Eggers, G. Gibson, M. D. Hill, J. M. Pendleton, S. A. Ritchie, G. S. Taylor, R. H. Katz, and D. A. Patterson. 1986. An In-Cache Address Translation Mechanism. Proc. 13th Annual Symposium on Computer Architecture (June):358-365.
  495. Wood, D. A., and M. D. Hill. 1995. Cost-Effective Parallel Computing. IEEE Computer 28(2):69-72.
  496. Woodbury, P., A. Wilson, B. Shein, I. Gertner, P.Y. Chen, J. Bartlett, and Z. Aral. 1989. Shared Memory Multiprocessors: The Right Approach to Parallel Processing. Proc. COMPCON (Spring):72-80.
  497. Wulf, W., R. Levin, and C. Pierson. 1975. Overview of the Hydra Operating System Development. Proc. 5th Symposium on Operating Systems Principles (November):122-131.
  498. Yamashita, N., T. Kimura, Y. Fujita, Y. Aimoto, T. Manaba, S. Okazaki, K. Nakamura, and M. Yamashina. 1994. A 3.84GIPS Integrated Memory Array Processor LSI with 64 Processing Elements and 2Mb SRAM. Int'l Solid-State Circuits Conference, San Francisco (February):260-261.
  499. Zekauskas, M. J., W. A. Sawdon, and B. N. Bershad. 1994. Software Write Detection for a Distributed Shared Memory. Proc. Operating Systems Design and Implementation Symposium (November):87-100.
  500. Zhang, Z., and J. Torrellas. 1995. Speeding Up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching. Proc. 22nd Annual Symposium on Computer Architecture (May):188-199.
  501. Zhou, Y., L. Iftode, and K. Li. 1996. Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems. Proc. Operating Systems Design and Implementation Symposium (October).

Cited By

  1. Wang F and Hsu C Rainbow Connection Number in Pyramid Networks Proceedings of the 11th International Conference on Computer Modeling and Simulation, (147-150)
  2. León-Sandoval E and Barbosa-Santillán L Data intensive parallel tree algorithm patterns based on GPUs Proceedings of the 2018 International Conference on Data Science and Information Technology, (69-73)
  3. Cairo M and Rizzi R The complexity of simulation and matrix multiplication Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, (2203-2214)
  4. Qin H, Liu Z, Liu Y and Zhong H (2017). An object-oriented MATLAB toolbox for automotive body conceptual design using distributed parallel optimization, Advances in Engineering Software, 106:C, (19-32), Online publication date: 1-Apr-2017.
  5. Kantert J, Tomforde S, Scharrer R, Weber S, Edenhofer S and Müller-Schloer C (2017). Identification and classification of agent behaviour at runtime in open, trust-based organic computing systems, Journal of Systems Architecture: the EUROMICRO Journal, 75:C, (68-78), Online publication date: 1-Apr-2017.
  6. Huang C, Kumar R, Elver M, Grot B and Nagarajan V C3D The 49th Annual IEEE/ACM International Symposium on Microarchitecture, (1-12)
  7. Ros A and Kaxiras S Racer The 49th Annual IEEE/ACM International Symposium on Microarchitecture, (1-13)
  8. Fernández-Pascual R, Ros A and Acacio M Optimization of a Linked Cache Coherence Protocol for Scalable Manycore Coherence Proceedings of the 29th International Conference on Architecture of Computing Systems -- ARCS 2016 - Volume 9637, (100-112)
  9. Dimakopoulou M, Eranian S, Koziris N and Bambos N Reliable and efficient performance monitoring in linux Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-13)
  10. Kuiper G, Geuns S and Bekooij M Utilization Improvement by Enforcing Mutual Exclusive Task Execution in Modal Stream Processing Applications Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, (28-37)
  11. Zhang G, Horn W and Sanchez D Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems Proceedings of the 48th International Symposium on Microarchitecture, (13-25)
  12. Daya B, Chen C, Subramanian S, Kwon W, Park S, Krishna T, Holt J, Chandrakasan A and Peh L SCORPIO Proceeding of the 41st annual international symposium on Computer architecture, (25-36)
  13. Daya B, Chen C, Subramanian S, Kwon W, Park S, Krishna T, Holt J, Chandrakasan A and Peh L (2014). SCORPIO, ACM SIGARCH Computer Architecture News, 42:3, (25-36), Online publication date: 16-Oct-2014.
  14. Liu C and Yang C Exploiting heterogeneity in MPSoCs to prevent potential trojan propagation across malicious IPs Proceedings of the 24th edition of the great lakes symposium on VLSI, (335-340)
  15. Carpenter A, Hu J, Kocabas O, Huang M and Wu H Enhancing effective throughput for transmission line-based bus Proceedings of the 39th Annual International Symposium on Computer Architecture, (165-176)
  16. Carpenter A, Hu J, Kocabas O, Huang M and Wu H (2012). Enhancing effective throughput for transmission line-based bus, ACM SIGARCH Computer Architecture News, 40:3, (165-176), Online publication date: 5-Sep-2012.
  17. Schor L, Bacivarov I, Rai D, Yang H, Kang S and Thiele L Scenario-based design flow for mapping streaming applications onto on-chip many-core systems Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems, (71-80)
  18. Nychis G, Fallin C, Moscibroda T, Mutlu O and Seshan S (2012). On-chip networks from a networking perspective, ACM SIGCOMM Computer Communication Review, 42:4, (407-418), Online publication date: 24-Sep-2012.
  19. Nychis G, Fallin C, Moscibroda T, Mutlu O and Seshan S On-chip networks from a networking perspective Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication, (407-418)
  20. Xue J, Garg A, Ciftcioglu B, Hu J, Wang S, Savidis I, Jain M, Berman R, Liu P, Huang M, Wu H, Friedman E, Wicks G and Moore D (2010). An intra-chip free-space optical interconnect, ACM SIGARCH Computer Architecture News, 38:3, (94-105), Online publication date: 19-Jun-2010.
  21. Kim H, Ahn J and Kim J Replication-aware leakage management in chip multiprocessors with private L2 cache Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, (135-140)
  22. Xue J, Garg A, Ciftcioglu B, Hu J, Wang S, Savidis I, Jain M, Berman R, Liu P, Huang M, Wu H, Friedman E, Wicks G and Moore D An intra-chip free-space optical interconnect Proceedings of the 37th annual international symposium on Computer architecture, (94-105)
  23. Li X and Hammami O (2009). An automatic design flow for data parallel and pipelined signal processing applications on embedded multiprocessor with NoC, International Journal of Reconfigurable Computing, 2009, (2-2), Online publication date: 1-Jan-2009.
  24. Ophelders F, Bekooij M and Corporaal H A tuneable software cache coherence protocol for heterogeneous MPSoCs Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, (383-392)
  25. Zeng H, Yourst M, Ghose K and Ponomarev D MPTLsim Proceedings of the 46th Annual Design Automation Conference, (226-231)
  26. Moonen A, Bekooij M, van den Berg R and van Meerbergen J Cache aware mapping of streaming applications on a multiprocessor system-on-chip Proceedings of the conference on Design, automation and test in Europe, (300-305)
  27. Bijlsma T, Bekooij M, Jansen P and Smit G Communication between nested loop programs via circular buffers in an embedded multiprocessor system Proceedings of the 11th international workshop on Software & compilers for embedded systems, (33-42)
  28. Poletti F, Poggiali A, Bertozzi D, Benini L, Marchal P, Loghi M and Poncino M (2007). Energy-Efficient Multiprocessor Systems-on-Chip for Embedded Computing, IEEE Transactions on Computers, 56:5, (606-621), Online publication date: 1-May-2007.
  29. Moreira O, Valente F and Bekooij M Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor Proceedings of the 7th ACM & IEEE international conference on Embedded software, (57-66)
  30. Cameron K, Ge R and Sun X (2007). $\log_{\rm n}{\rm P}$ and $\log_{3}{\rm P}$, IEEE Transactions on Computers, 56:3, (314-327), Online publication date: 1-Mar-2007.
  31. Tumeo A, Monchiero M, Palermo G, Ferrandi F and Sciuto D A design kit for a fully working shared memory multiprocessor on FPGA Proceedings of the 17th ACM Great Lakes symposium on VLSI, (219-222)
  32. Stuijk S, Basten T, Geilen M and Corporaal H Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs Proceedings of the 44th annual Design Automation Conference, (777-782)
  33. Acacio M, Gonzalez J, Garcia J and Duato J (2005). A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 16:1, (67-79), Online publication date: 1-Jan-2005.
  34. Bekooij M, Parmar S and van Meerbergen J Performance guarantees by simulation of process Proceedings of the 2005 workshop on Software and compilers for embedded systems, (10-19)
  35. Bhunia S, Datta A, Banerjee N and Roy K (2005). GAARP, IEEE Transactions on Computers, 54:6, (752-766), Online publication date: 1-Jun-2005.
  36. Bilardi G, Pietracaprina A, Pucci G, Schifano F and Tripiccione R The potential of on-chip multiprocessing for QCD machines Proceedings of the 12th international conference on High Performance Computing, (386-397)
  37. Brown J and Wen Z Toward an application support layer Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics, (912-919)
  38. Chaudhuri M and Heinrich M (2004). SMTp, ACM SIGARCH Computer Architecture News, 32:2, (124), Online publication date: 2-Mar-2004.
  39. Chaudhuri M and Heinrich M SMTp Proceedings of the 31st annual international symposium on Computer architecture
  40. Teo Y and Onggo B Formalization and strictness of simulation event orderings Proceedings of the eighteenth workshop on Parallel and distributed simulation, (89-96)
  41. Dongarra J, Foster I, Fox G, Gropp W, Kennedy K, Torczon L and White A References Sourcebook of parallel computing, (729-789)
  42. Goel A, Roychoudhury A and Mitra T (2003). Compactly representing parallel program executions, ACM SIGPLAN Notices, 38:10, (191-202), Online publication date: 1-Oct-2003.
  43. Goel A, Roychoudhury A and Mitra T Compactly representing parallel program executions Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, (191-202)
  44. Lepak K and Lipasti M (2002). Temporally silent stores, ACM SIGARCH Computer Architecture News, 30:5, (30-41), Online publication date: 1-Dec-2002.
  45. Mauer C, Hill M and Wood D Full-system timing-first simulation Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, (108-116)
  46. Mauer C, Hill M and Wood D (2002). Full-system timing-first simulation, ACM SIGMETRICS Performance Evaluation Review, 30:1, (108-116), Online publication date: 1-Jun-2002.
  47. Lepak K and Lipasti M (2002). Temporally silent stores, ACM SIGPLAN Notices, 37:10, (30-41), Online publication date: 1-Oct-2002.
  48. Lepak K and Lipasti M Temporally silent stores Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, (30-41)
  49. Lepak K and Lipasti M (2002). Temporally silent stores, ACM SIGOPS Operating Systems Review, 36:5, (30-41), Online publication date: 1-Dec-2002.
  50. Beaumont O, Boudet V and Robert Y A Realistic Model and an Efficient Heuristic for Scheduling with Heterogeneous Processors Proceedings of the 16th International Parallel and Distributed Processing Symposium
  51. Sorin D, Plakal M, Condon A, Hill M, Martin M and Wood D (2002). Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol, IEEE Transactions on Parallel and Distributed Systems, 13:6, (556-578), Online publication date: 1-Jun-2002.
  52. Martin M, Sorin D, Cain H, Hill M and Lipasti M Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, (328-337)
  53. Li T and John L (2001). ADir_pNB, IEEE Transactions on Computers, 50:9, (921-934), Online publication date: 1-Sep-2001.
  54. Kwak H, Lee B, Hurson A, Yoon S and Hahn W (1999). Effects of Multithreading on Cache Performance, IEEE Transactions on Computers, 48:2, (176-184), Online publication date: 1-Feb-1999.
Contributors
  • Google LLC
  • Princeton University
  • Microsoft Research

Recommendations