Parallel Computer Architecture: A Hardware/Software Approach

Publisher: Morgan Kaufmann Publishers Inc., 340 Pine Street, Sixth Floor, San Francisco, CA, United States

ISBN: 978-0-08-057307-6

Published: 29 September 1998

Pages: 1056

Abstract

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.

* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents

1 Introduction
2 Parallel Programs
3 Programming for Performance
4 Workload-Driven Evaluation
5 Shared Memory Multiprocessors
6 Snoop-based Multiprocessor Design
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs
10 Interconnection Network Design
11 Latency Tolerance
12 Future Directions
APPENDIX A Parallel Benchmark Suites

References

Abali, B., and C. Aykanat. 1994. Routing Algorithms for IBM SP1. Lecture Notes in Computer Science, Vol. 853. New York: Springer-Verlag, 161-175. Google Scholar
Abdel-Shafi, H. A., J. Hall, S. V. Adve, and V. S. Adve. 1997. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. Proc. Third Symposium on High Performance Computer Architecture (February). Google Scholar
Adve, S. V 1993. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. diss., University of Wisconsin-Madison. Available as Tech. Report #1198, University of Wisconsin-Madison, Computer Science (December). Google Scholar
Adve, S. Y. and K. Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. IEEE Computer 29(12):66-76. Google ScholarDigital Library
Adve, S. V., K. Gharachorloo, A. Gupta, J. L. Hennessy, and M. Hill. 1993. Sufficient Systems Requirements for Supporting the PLpc Memory Model . Tech. Report #1200, University of Wisconsin-Madison. Computer Science (December). Also available as Tech. Report #CSL-TR-93-595, Stanford University.Google Scholar
Adve, S. V., and M. Hill. 1990a. Weak Ordering: A New Definition. 1990. Proc. 17th Int'l Symposium on Computer Architecture (May):2-14. Google Scholar
Adve, S. V., and M. Hill. 1990b. Implementing Sequential Consistency in Cache-Based Systems. Proc. 1990 Int'l Conference on Parallel Processing (August):47-50.Google Scholar
Adve, S. V., and M. Hill, 1993. A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems 4(6):613-624. Google ScholarDigital Library
Agarwal, A. 1991. Limit on Interconnection Performance. IEEE Transactions on Parallel and Distributed Systems 2(4):398-412. Google ScholarDigital Library
Agarwal, A., R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. 1995. The MIT Alewife Machine: Architecture and Performance. Proc. 22nd Int'l Symposium on Computer Architecture (May/June):2-13. Google Scholar
Agarwal, A., and A. Gupta. 1988. Memory-Reference Characteristics of Multiprocessor Applications Under MACH. Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May):215-225. Google Scholar
Agarwal, A., B.-H. Lim, D. Kranz, and J. Kubiatowicz. 1990. (April): A Processor Architecture for Multiprocessing. Proc. 17th Annual Int'l Symposium on Computer Architecture (June):104-114. Google Scholar
Agarwal, A., B.-H Lim, D. Kranz, and J. Kubiatowicz. 1991. LimitLESS Directories: A Scalable Cache Coherence Scheme. Proc. Fourth Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):224-234. Google Scholar
Agarwal, A., R. Simoni, J. Hennessy, and M. Horowitz. 1988. An Evaluation of Directory Schemes for Cache Coherence. Proc. 15th Int'l Symposium on Computer Architecture (June):280-289. Google Scholar
Aiken, A., and A. Nicolau. 1988. Optimal Loop Parallelization. Proc. SIGPLAN Conference on Programming Language Design and Implementation (June):308-317. Also published in SIGPLAN Notices 23(7). Google Scholar
Aimoto, Y., T. Kimura, Y. Yabe, H. Heiuchi, et al. 1996. A 7.68GIPS 3.84GB/S 1W Parallel Image-Processing RAM Integrating a 16Mb DRAM and 128 Processors, International Solid-State Circuits Conference , San Francisco (February):372-373.Google Scholar
Alexander, T. B., K. G. Robertson, D. T. Lindsay, D. L. Rogers, J. R. Obermeyer, J. R. Keller, K. Y. Oka and M. M. Jones II. 1994. Corporate Business Servers: An Alternative to Mainframes for Business Computing (HP K-Class). Hewlett-Packard Journal (June):8-33.Google Scholar
Almasi, G. S., and A. Gottlieb. 1989. Highly Parallel Computing. Redwood City. CA: Benjamin/Cummings. Google Scholar
Alverson, R., D. Callahan, D. Cummings, B. Koblenz, A. Porterfield and B. Smith. 1990. The Tera Computer System. Proc. 1990 Int'l Conference on Supercomputing (June): 1-6. Google Scholar
Amdahl, G. M. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS 1967 Spring Joint Computer Conference 40:483-485. Google Scholar
Anderson, J. P., S. A. Hoffman, J. Shifman and R. Williams. 1962. D825-A Multiple-Computer Sysiem for Command and Control. AFIP Proc. FJCC 22:86-96. Google Scholar
Anderson, J., and M. Lam. 1993. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. Proc. SIGPLAN'93 Conference on Programming Language Design and Implementation (June). Google Scholar
Anderson, T. E., D. E. Culler, D. Patterson. 1995. A Case for NOW (Networks of Workstations). IEEE Micro 15(1):54-6. Google ScholarDigital Library
Anderson, T. E., S. S. Owicki, J. P. Saxe and C. P. Thacker. 1992. High Speed Switch Scheduling for Local Area Networks. Proc. ASPLOS V (October):98-110. Google Scholar
Archibald, J., and J.-L. Baer. 1986. Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model. ACM Transactions on Computer Systems 4(4):273-298. Google ScholarDigital Library
Arnould, E. A., F. J. Bitz, E. C. Cooper, H. T. Kung, R. D. Sansom, and P. A. Steenkiste. 1989. The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers. Proc. ASPLO III (April):205-216. Google Scholar
Arpaci, R. H., D. E. Culler, A. Krishnamurthy, S. G. Steinberg, and K. Yelick. 1995. Empirical Evaluation of the Cray-T3D: A Compiler Perspective. Proc. 22nd Int'l Symposium on Computer Architecture (June):320-331. Google Scholar
Arvind, and D. E. Culler. 1986. Dataflow Architectures. Annual Reviews in Computer Science 1:225-253. Palo Alto, CA: Annual Reviews. Reprinted in Dataflow and Reduction Architectures. Edited by S. S. Thakkar. Los Alamitos, CA: IEEE Computer Society Press, 1987.Google ScholarCross Ref
Athas, W. C., and C. L. Seitz. 1988. Multicomputers: Message-Passing Concurrent Computers. IEEE Computer 21(8):9-24. Google ScholarDigital Library
August, M. C., G. M. Brost, C. C. Hsiung and A. J. Schiffleger. 1989. Cray X-MP: The Birth of a Supercomputer. Computer 22(1):45-52. Google ScholarDigital Library
Baer, J.-L., and T.-F Chen. 1991. An Efficient On-Chip Preloading Scheme to Reduce Data Access Penalty. Proc. Supercomputing '91 (November):176-186. Google Scholar
Baer, J.-L., and W.-H. Wang. 1988. On the Inclusion Properties for Multi-Level Cache Hierarchies. Proc. 15th Annual Int'l Symposium on Computer Architecture (May):73-80. Google Scholar
Bailey, D. H. 1990. FFTs in External or Hierarchical Memory Journal of Supercomputing 4(1):23-35. Also published in Proc. Supercomputing '89 (November):234-242. Google Scholar
Bailey, D. H. 1991. Twelve Ways 10 Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review (August):54-55.Google Scholar
Bailey, D. H. 1993. Misleading Performance Reporting in the Supercomputing Field. Scientific Programming 1(2):141-151. Also published in Proc Supercomputing '93 . Google ScholarDigital Library
Bailey, D. H., E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. 1991. The NAS Parallel Benchmarks. Intl. Journal of Supercomputer Applications 5(3);66-73. Also published as Tech. Report RNR-94-007, Numerical Aerodynamic Simulation Facility, NASA Ames Research Center (March 1994).Google Scholar
Bailey, D. H., E. Barszcz, L. Dagum and H. D. Simon. 1994. NAS Parallel Benchmark Results 3-94. Proc. Scalable High-Performance Computing Conference , Knoxville, TN (May): 111-120.Google Scholar
Bailey, D., T. Harris, W. Saphir, R. van der Wijngaart, A Woo and M. Yarrow. 1995. The NAS Parallel Benchmarks 2. 0. Report NAS-95-020, Numencal Aerodynamic Simulation Facility. NASA Ames Research Center (December).Google Scholar
Baker, W. E., R. W. Horst, D. P Sonnier, and W. J. Watson 1995. A Flexible ServerNet-Based Fault-Tolerant Architecture. Proc. 25th Int'l Symposium on Fault-Tolerant Compuling (June). Los Alamitos, CA: IEEE Computer Society Press, 2-11. Google Scholar
Bakoglu, H. B. 1990. Circuits, Interconnection, and Packaging for VLSI. Reading, MA: Addison-Wesley.Google Scholar
Ball, J. R., R. C. Bollinger, T. A. Jeeves, R. C. McReynolds, D. H. Shaffer. 1962. On the Use of the Solomon Parallel-Processing Computer. Proc AFIPS Fall Joint Computer Conference 22:137-146. Google Scholar
Banks, D. and M. Prudence 1993. A High Performance Network Architecture for a PARISC Workstation. IEEE Journal on Selected Areas in Communication 11(2): 191-202. Google ScholarDigital Library
Barnes, J. E., and P. Hut 1989. Error Analysis of a Tree Code. Astrophysics Journal Supplement 70(June):389-417.Google ScholarCross Ref
Barosso, L., and M. Dubois. 1993. The Performance of Cache-Coherent Ring-Based Multiprocessors. Proc. 20th Annual Int'l Symposium on Computer Architectures (ISCA) (May):268-277 Google Scholar
Barosso, L., and M. Dubois. 1995. Performance Evaluation of the Slotted Ring Multiprocessors. IEEE Transactions on Computers 44(7):878-890. Google ScholarDigital Library
Barroso, L. A., S. Iman, J. Jeong, K. Oner, K. Ramamurthy and M. Dubois. 1995. RPM: A Rapid Prototyping Engine for Multiprocessor Systems. IEEE Computer 28(2):26-34. Google ScholarDigital Library
Barszcz, E., Fatoohi, R., Venkatakrishnan, V., and Weeratunga, S. 1993. Solution of Regular, Sparse Triangular Linear Systems on Vector and Distributed-Memory Multiprocessors . Tech. Report NAS RNR-93-007. NASA Ames Research Center. Moffett Field, CA (April).Google Scholar
Barton, E., J. Crownie, and M. McLaren. 1994. Message Passing on the Meiko CS-2. Parallel Computing 20(4):497-507. Google ScholarDigital Library
Baskett, F. T. Jermoluk, and D. Solomon. 1988. The 4D-MP Graphics Superworkstation: Computing + Graphics = 40 MIPS + 40 MFLOPS and 100,000 Lighted Polygons per Second. Proc. 33rd IEEE Computer Society Int'l Conference--COMPCON '88 (February):468-471.Google Scholar
Batcher, K. E. 1974. Staran Parallel Processor System Hardware. Proc. AFIPS National Computer Conference , 405-410. Google Scholar
Batcher, K. E. 1980. Design of a Massively Parallel Processor. IEEE Transactions on Computers C- 29(9):836-840. Google ScholarDigital Library
Bell, C. G. 1985. Multis: A New Class of Multiprocessor Computers. Science 228:462-467.Google Scholar
Benes, V. 1965. Mathematical Theory of Connecting Networks and Telephone Traffic. San Diego, CA: Academic Press.Google Scholar
Bennett, J. E., and M.J. Flynn. 1996a. Latency Tolerance for Dynamic Processors. Tech. Report #CSL-TR-96-687, Computer Systems Laboratory, Stanford University. Google Scholar
Bennett, J. E., and M. J. Flynn. 1996b. Reducing Cache Miss Rates Using Prediction Caches. Tech. Report #CSL-TR-96- 707. Computer Systems Laboratory, Stanford University. Google Scholar
Berry, M., D. Chen, P. Koss, et al. 1989. The PERFECT Club Benchmarks: Effective Performance Evaluation of Computers. Int'l Journal of Supercomputer Applications 3(3):5-40. Google ScholarDigital Library
Bershad, B. N., M. J. Zekauskas, and W. A. Sawdon. 1993. The Midway Distributed Shared Memory System. Proc. COMPCON '93 (February).Google Scholar
Bhatt, S. M. and C. E. Leiserson. 1983. How to Assemble Tree Machines. ACM Symposium on Theory of Computing (STOC '82) . New York: ACM Press. Google Scholar
Biagioni, E., E. Cooper, and R. Sansom. 1993. Designing a Practical ATM LAN. IEEE Network (March). Google Scholar
Bilas, A., L. Iftode, and J. P. Singh. 1998. Evaluation of Hardware Support for Next-Generation Shared Virtual Memory Clusters. Proc. Int'l Conference on Supercomputing (July). Google Scholar
Bisiani, R., and M. Ravishankar. 1990. PLUS: A Distributed Shared-Memory System. Proc. 17th Int'l Symposium on Computer Architecture (May): 115-124. Google Scholar
Black, D., R. Rashid, D. Golub, C. Hill, R. Baron. 1989. Translation Lookaside Buffer Consistency: A Software Approach. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems. Boston (April): 113-122. Google Scholar
Blackford, L. S., J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. 1997. ScaLAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). Google Scholar
Blelloch, G. 1993. Prefix Sums and Their Applications. In Synthesis of Parallel Algorithms . Edited by J. Reif. San Francisco: Morgan Kaufmann, 35-60.Google Scholar
Blelloch, G. E., C. E. Leiserson, B. M. Mages, C. G. Plaxton, S.J. Smith, and M. A. Zagha. 1991. Comparison of Sorting Algorithms for the Connection Machine CM-2. Proc. Symposium on Parallel Algorithms ana Architectures (July):3-16. Google Scholar
Blumrich, M. A., C. Dubnicki, E. W. Felten, K. Li, M.R. Mesarina. 1994. Two Virtual Memory Mapped Network Interface Designs. Proc. Hot Interconnects II Symposium (August).Google Scholar
Blumrich, M., K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sandberg. 1994. A Virtual Memory Mapped Network Interface for the Shrimp Multicomputer. Proc. 21st Int'l Symposium on Computer Architecture (April): 142-153. Google Scholar
Boden, N., D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. 1995. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro 15(1):29-38. Google ScholarDigital Library
Bodin, F., P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr. 1993. Implementing a Parallel C++ Runtime System for Scalable Parallel Sysiems. Proc. Supercomputing '93 (November):588-597. Also in Scientific Programming 2(3). Google Scholar
Bolt Beranek and Newman Advanced Computers. 1989. TC2000 Technical Product Summary. Cambridge, MA: Bolt Beranek and Newman.Google Scholar
Bomans, L., and D. Roose. 1989. Benchmarking the iPSC/2 Hypereube Multiprocessor. Concurrency: Practice and Experience , 1(1):3-18.Google ScholarCross Ref
Borkar, S., R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb. 1990. Supporting Systolic and Memory Communication in iWarp. Proc. 17th Annual Int'l Symposium on Computer Architecture , Seattle, WA (May):70-81. Revised version appears as Tech. Report #CMU-CS-90-197. Carnegie Mellon University. Google Scholar
Bouknight, W. J., S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick. 1972. The Illiac IV System. Proc. IEEE 60(4):369-388.Google Scholar
Boyle, J., R. Butler, T. Disz, B. Glickfield, E. Lusk, W. R. Overbeek, J. Patterson, and R. Stevens. 1987. Portable Programs for Parallel Processors. New York: Holt, Rinehart and Winston. Google Scholar
Brewer, E. A., F. T. Chong, F. T. Leighton. 1994. Scalable Expanders: Exploiting Hierarchical Random Wiring. Proc. 1994 Symposium on the Theory of Computing , Montreal, Canada (May):144-152. Google Scholar
Brewer, E. A., F. T. Chong, L. T. Liu, J. Kubiatowicz, S. D. Sharma. 1995. Remote Queues: Exposing Network Queues for Atomicity and Optimization. Proc. Seventh Annual Symposium on Parallel Algorithms and Architectures (July):42-53. Google Scholar
Brewer, E. A., and B. C. Kuszmaul. 1994. How to Get Good Performance from the CM-5 Data Network. Proc. 1994 Int'l Parallel Processing Symposium , Cancun, Mexico (April):858-867. Google Scholar
Bruno, J., P. R. Cappello. 1988. Implementing the Beam and Warming Method on the Hypercube. Proc. Third Conference on Hypercube Concurrent Computers and Applications , Pasadena, CA, Jan 19-20. Google Scholar
Burger, D. 1997. System-Level Implications of Processor-Memory integration. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at the Int'l Symposium on Computer Architecture (ISCA) '97 (June).Google Scholar
Burger, D., J. Goodman, and A. Kagi. 1996. Memory Bandwidth Limitations in Future Microprocessors. Proc. 23rd Annual Symposium on Computer Architecture (May):78-89. Google Scholar
Burkhardt, H., et al. 1992. Overview of the KSR-1 Computer System. Tech. Report KSR-TR-9202001. Kendall Square Research. Boston (February).Google Scholar
Butler, M., T-Y. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow. 1991. Single Instruction Stream Parallelism Is Greater Than Two. Proc. Annual Int'l Symposium on Computer Architecture (ISCA), 276-86. Google Scholar
Callahan, T. and S. C. Goldstein. 1995. NIFDY: A Low Overhead. High Throughput Network Interface. Proc. 22nd Annual Symposium on Computer Architecture (June):230-241. Google Scholar
Cardoza, W., F. Glover, and W. Snaman Jr. 1996. Design of a TruCluster Multicomputer System for the Digital UNIX Environment. Digital Technical Journal 8(1):5-17. Google ScholarDigital Library
Carter, J. B., J. K. Bennett, and W. Zwaenepoel. 1991. Implementation and Performance of Munin. Proc. 13th Symposium on Operating Systems Principles (October):152-164. Google Scholar
Carter, J. B., J. K. Bennett, and W. Zwaenepoel. 1995. Techniques for Reducing Consistency-Related Communication in Distributed Shared-Memory Systems. ACM Transactions of Computer Systems 13(3):205-244. Google ScholarDigital Library
Catanzaro, B. 1997. Multiprocessor System Architectures: A Technical Survery of Multiprocessor/ Multithreaded Systems Using SPARC, Multi-level Bus Architectures and Solaris (SunOS). Mountain View, CA: Sun Microsystems.Google Scholar
Cekleov, M., D. Yen, P. Sindhu, J.-M. Frailong, et al. 1993. SPARCcenter 2000: Multiprocessing for the 90s, Digest of Papers. Proc. COMPCON Spring '93. Los Alamitos, CA: IEEE Computer Society Press, 345-353.Google Scholar
Censier, L., and P. Feautrier. 1978. A New Solution to Cache Coherence Problems in Multiprocessor Systems. IEEE Transaction on Computer Systems C-27(12):1112-1118. Google Scholar
Chan, K., et al. 1993. Multiprocessor Features of the HP Corporate Business Servers. Proc. COMPCON (Spring):330-337.Google Scholar
Chandy, K. M., and J. Misra. 1988. Parallel Program Design: A Foundation. Reading. MA: Addison Wesley. Google Scholar
Chang, P. P. S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. 1991. IMPACT: An Architectural Framework for Multiple-Instruction Issue Processors. Proc. 18th Int'l Symposium on Computer Architecture (ISCA) 19(3):266-275. Google Scholar
Chen, T.-F., and J.-L. Baer. 1992. Reducing Memory Latency via Non-Blocking and Prefetching Caches. Proc. Fifth Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):51-61. Google Scholar
Chen, T.-F., and J.-L. Baer. 1994. A Performance Study of Software and Hardware Data Prefetching Schemes. Proc. 21st Annual Symposium on Computer Architecture (April):223-232. Google Scholar
Cheong, H., and A. Viedenbaum. 1990. Compiler-directed Cache Management in Multiprocessors. IEEE Computer 23(6):39-47. Google ScholarDigital Library
Chien, A. A., and J. H. Kim. 1992. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. Proc. 19th Annual International Symposium on Computer Architecture (ISCA), Gold Coast, Australia (May):268-277. Google Scholar
Choi, J.J.J. Dongarra, R. Pozo, and D. W. Walker. 1992. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. Proc. Fourth Symposium on the Frontiers of Massively Parallel Computation, McLean, VA. Los Alamitos, CA: IEEE Computer Society Press, 120-127.Google Scholar
Chun, B. N., A. M. Mainwaring, and D. E. Culler. 1998. Virtual Network Transport Protocols for Myrinet. IEEE Micro (January):53-63. Google ScholarDigital Library
Clark, R., and K. Alnes. 1996. An SCI Chipset and Adapter. Symposium Record, Hot Interconnects IV (August):221-235.Google Scholar
Cohen, D., G. Finn, R. Felderman, and A. DeSchon. 1993. ATOMIC: A Low-Cost, Very High-Speed, Local Communication Architecture. Proc. 1993 Int. Conference on Parallel Processing . Google Scholar
Convex Computer Corporation. 1993. Exemplar Architecture. Richardson, TX: Convex Computer Corp.Google Scholar
Corella, F., J. Stone, C. Barton. 1993. A Formal Specification of the PowerPC Shared Memory Architecture. Tech. Report Computer Science RC 18638 (81566), IBM Research Division. T.J. Watson Research Center (January).Google Scholar
Cornell, J. A. 1972. Parallel Processing of Ballistic Missile Defense Radar Data with PEPE. COMPCON 72, 69-72.Google Scholar
Cox, A., and R. Fowler. 1993. Adaptive Cache Coherency for Detecting Migratory Shared Data. Proc. 20th Int'l Symposium on Computer Architecture (May):98-108. Google Scholar
Crowther, W., J. Goodhue, R. Gurwitz, R. Rettberg, and R. Thomas. 1985. The Butterfly Parallel Processor. IEEE Computer Architecture Technical Newsletter , 18-46.Google Scholar
Culler, D. E. 1994. Multithreading: Fundamental Limits, Potential Gains, and Alternatives. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 97-138.Google Scholar
Culler, D. E., A. C. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. 1993. Parallel Programming in Split-C. Proc. Supercomputing '93 (November): 262-273. Google Scholar
Culler, D. E., A. C. Dusseau, R. P. Martin, and K. E. Schauser. 1993. Fast Parallel Sorting under LogP: From Theory to Practice. In Portability and Performance for Parallel Processing . Chapter 4. New York: John Wiley & Sons, 71-98.Google Scholar
Culler, D. E., R. M. Karp, D. A., Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1993. LogP: Toward A Realistic Model of Parallel Computation. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (May): 1-12. Google Scholar
Culler, D. E., R. M. Karp, D. A., Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1996. LogP: A Practical Model of Parallel Computation. CACM 39(11):78-85. Google ScholarDigital Library
Culler, D. E., A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. 1991. Fine-Grain Parallelism with Minimal Hardware Support. Proc. Fourth Int'l Symposium on Arch. Support for Programming Languages and Systems (ASPLOS) (April):164-175. Google Scholar
Culler, D. E., K. E. Schauser, and T. von Eicken. 1993. Two Fundamental Limits on Dataflow Multithreading. Proc. IFIP WG 10.3 working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism . Orlando, FL. Google Scholar
Dahlgren, F. 1995. Boosting the Performance of Hybrid Snooping Cache Protocols. Proc 22nd Int'l Symposium on Computer Architecture (June):60-69. Google Scholar
Dahlgren, F., M. Dubois, and P. Stenstrom. 1994. Combined Performance Gains of Simple Cache Protocol Extensions. Proc. 21st Int'l Symposium on Computer Architecture (April): 187-197. Google Scholar
Dahlgren, F., M. Dubois, and P. Stenstrom. 1995. Sequential Hardware Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 6(7). Google ScholarDigital Library
Dally, W. J. 1990a. Virtual-Channel Flow Control. Proc. 17th Annual Int'l Symposium on Computer Architecture (ISCA) , Seattle, WA, (May):60-68. Google Scholar
Dally, W. J. 1990b. Performance Analysis of k -ary n -cube Interconnection Networks. IEEE-TOC 39(6):775-85. Google ScholarDigital Library
Dally, W. J., A. Chien, S. Fiske, W. Horwat, J. Keen, J. Larivee, R. Lethin, P Nuth, S. Willis. 1989. The J-Machine: A Fine-Grained Concurrent Computer. Proc IFIP 11th World Computer Congress, Information Processing'89 , 1147-1153.Google Scholar
Dally, W. J., J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes and P. R. Nuth. 1992. The Message Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro (April):23-39. Google ScholarDigital Library
Dally, W. J., J. S. Keen, M. D. Noakes. 1993. The J-Machine Architecture and Evaluation. Digest of Papers. COMPCON Spring '93. San Francisco, CA (February):183-188.Google Scholar
Dally, W. J., and C. Seitz. 1987. Deadlock-Free Message Routing in Multiprocessor Interconnections Networks. IEEE-TOC C-36(5):547-553. Google Scholar
Denning, P. J. 1968. The Working Set Model for Program Behavior. Communications of the ACM 11(5):323-333. Google ScholarDigital Library
Dennis, J. B. 1980. Dataflow Supercomputers. IEEE Computer 13(11):93-100. Google ScholarDigital Library
Digital Equipment Corporation. 1992. Alpha Architecture Handbook . Maynard, MA: Digital Equipment Corp.Google Scholar
Dijkstra, E. W. 1965. Solution of a Problem in Concurrent Programming Control. Communications of the ACM 8(9):569. Google ScholarDigital Library
Dijkstra, E. W., and C. S. Sholten. 1968. Termination Detection for Diffusing Computations. Information Processing Letters 1:1-4.Google Scholar
Dongarra, J. J. 1990. Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment . Tech. Report CS-89-85. University of Tennessee, Computer Science Dept. (March). Google Scholar
Dongarra, J. J. 1994. Performance of Various Computers Using Standard Linear Equation Software . Tech. Repon CS-89-85. University of Tennessee, Computer Science Dept. (November); current report available from [email protected] Google Scholar
Dongarra, J. J., J. Martin, and J. Worlton. 1987. Computer Benchmarking: Paths and Pitfalls. IEEE Spectrum (July):38. Google Scholar
Dongarra, J. J., and D. W. Walker. 1995. Software Libraries for Linear Algebra Computations on High performance Computers. SIAM Review 37:151-180. Google ScholarDigital Library
Dongarra, J. J., and W. Genlzsch, eds. 1993. Computer Benchmarks. Amsterdam: Elsevier Science B. V., North-Holland. Google Scholar
Dubnicki, C. L. Iftode, E. W. Felten, K. Li. 1996. Software Support for Virtual Memory-Mapped Communication. Tenth Int'l Parallel Processing Symposium (April). Google Scholar
Dubnicki, C., and T. LeBlanc. 1992. Adjustable Block Size Coherent Caches. Proc. 19th Annual Int'l Symposium on Computer Architecture (May):170-180. Google Scholar
Dubois, M., and C. Scheurich, 1990. Memory Access Dependencies in Shared-Memory Multi-processors. IEEE Transactions on Software Engtneering 16(6):660-673. Google ScholarDigital Library
Dubois, M., C. Scheurich, and F. Briggs. 1986. Memory Access Buffering in Multiprocessors. Proc. 13th Int'l Symposium on Computer Architecture (June):434-442. Google Scholar
Dubois, M., J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Slenstrom. 1993. The Detection and Elimination of Useless Misses in Multiprocessors. Proc. 20th Int'l Symposium on Computer Architecture (May):88-97. Google Scholar
Dubois, M., J.-C. Wang, L. A. Barroso, K. Chen and Y.-S. Chen. 1991. Delayed Consistency and Its Effects on the Miss Rate of Parallel Programs. Proc. Supercomputing '91 (November): 197-206. Google Scholar
Dunigan, T. H. 1988. Performance of a Second Generation Hypereube Tech. Report ORNL/TM-10881, Oak Ridge National Lab. (November).Google Scholar
Dunning, D., G. Regnier, G. McAlpine, D. Camaron, B. Shubert, F. Berry, A. M. Merriti, E. Gronke and C. Dodd. 1998. The Virtual Interface Architecture. IEEE Micro 18(2). Google Scholar
Dusseau, A. C., D. E. Culler, K. E. Schauser, and R. P. Martin. 1996. Fast Parallel Sorting under LogP: Experience with the CM-5. IEEE Transactions on Parallel and Distributed Systems 7(8): 791-805. Google ScholarDigital Library
Dwarkadas, S., P. Keleher, A. L. Cox, and W. Zwaenepoel. 1993. Evaluation of Release Consistent Software Distributed Shared Memory on Emerging Network Technology. Proc. 20th Int'l Symposium on Computer Architecture (May):144-155. Google Scholar
Eggers, S., and R. Katz. 1988. A Characterization of Sharing in Parallel Programs and Its Application to Coherency Protocol Evaluation. Proc. 15th Annual Int'l Symposium on Computer Architecture (May):373-382. Google Scholar
Eggers, S., and R. Katz. 1989a. The Effect of Sharing on the Cache and Bus Performance of Parallel Programs. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (May):257-270. Google Scholar
Eggers, S., and R. Katz. 1989b. Evaluating the Performance of Four Snooping Cache Coherency Protocols. Proc. 16th Annual Int'l Symposium on Computer Architecture (May):2-15. Google Scholar
Eigenmann, R., and S. Hassanzadeh. 1996. Benchmarking with Real Industrial Applications: The SPEC High Performance Group. IEEE Computational Science and Engineering (spring). Google Scholar
Elliott, D. G., W. M. Snelgrove, and M. Stumm. 1992. Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP. Custom Integrated Circuits Conference, Boston, MA (May):30.6.1-30.6.4.
Elliott, D. G., M. Stumm, and W. M. Snelgrove. 1997. Computational RAM: The Case for SIMD Computing in Memory. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at Annual International Symposium on Computer Architecture (ISCA) '97 (June).
Erlichson, A., B. Nayfeh, J. P. Singh and Oyekunle Olukotun. 1995. The Benefits of Clustering in Cache-Coherent Multiprocessors: An Application-Driven Investigation. Proc. Supercomputing 95 (November).
Erlichson, A., N. Nuckolls, G. Chesson, and J. L. Hennessy. 1996. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):210-220.
Falsafi, B., A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. 1994. Application-Specific Protocols for User-Level Shared Memory. Proc. Supercomputing '94 (November):380-389.
Falsafi, B. and D. A. Wood. 1997. Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA. Proc. 24th Int'l Symposium on Computer Architecture (June):229-240.
Farkas, K., Z. Vranesic, and M. Stumm. 1992. Cache Consistency in Hierarchical Ring-Based Multiprocessors. Proc. Supercomputing '92 (November).
Feigel, C. P. 1994. TI Introduces Four-Processor DSP Chip. Microprocessor Report (March):28.
Felderman, R., et al. 1994. Atomic: A High Speed Local Communication Architecture. Journal of High Speed Networks 3(1):1-29.
Fenwick, D. M., D. J. Foley, W. B. Gist, S. R. VanDoren, and D. Wissell. 1995. The AlphaServer 8000 Series: High-End Server Platform Development. Digital Technical Journal 7(1):43-65.
Flanagan, J. L. 1994. Technologies for Multimedia Communications. IEEE Proceedings 82(4):590-603.
Flynn, M. J. 1972. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computers C-21(September):948-960.
Fortune, S., and J. Wyllie. 1978. Parallelism in Random Access Machines. Proc. 10th ACM Symposium on Theory of Computing (May).
Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. 1988. Solving Problems on Concurrent Processors, vol. 1. Englewood Cliffs, NJ: Prentice Hall.
Frailong, J.-L., et al. 1993. The Next Generation SPARC Multiprocessing System Architecture. Proc. COMPCON (spring):475-480.
Frank, S., H. Burkhardt III, and J. Rothnie. 1993. The KSR1: Bridging the Gap between Shared Memory and MPPs. Proc. COMPCON, Digest of Papers (spring):285-294.
Fu, J. W. C., and J. H. Patel. 1991. Data Prefetching in Multiprocessor Vector Cache Memories. Proc. 18th Annual Symposium on Computer Architecture (May):54-63.
Fu, J. W. C., J. H. Patel, and B. L. Janssens. 1992. Stride Directed Prefetching in Scalar Processors. Proc. 25th Annual Int'l Symposium on Microarchitecture (December):102-110.
Fuchs, H., G. Abram, and E. Grant. 1983. Near Real-Time Shaded Display of Rigid Objects. Proc. SIGGRAPH.
Galles, M., and E. Williams. 1993. Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor. Proc. 27th Hawaii Int'l Conference on System Sciences Vol. I: Architecture (January). Also in SGI Challenge. Edited by T. N. Mudge and B. D. Shriver. Los Alamitos, CA: IEEE Computer Society Press, 1994, 134-143.
Geist, A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. 1994. PVM 3.0 Users' Guide and Reference Manual. Tech. Report ORNL/TM-12187. Oak Ridge, TN: Oak Ridge National Laboratory (February), http://wwweece.ksu.edu/pvm3/ug.ps.
Geist, A., A. Beguelin, J. Dongarra, R. Manchek, W. Jiang, and V. Sunderam. 1994. PVM: A Users' Guide and Tutorial for Networked Parallel Computing. Cambridge, MA: MIT Press.
Geist, G. A., and V. S. Sunderam. 1992. Network Based Concurrent Computing on the PVM System. Journal of Concurrency: Practice and Experience 4(4):293-311.
Gharachorloo, K. 1995. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. diss., Computer Systems Laboratory, Stanford University (December). Also published as Tech. Report #CSL-TR-95-685.
Gharachorloo, K., S. Adve, A. Gupta, M. Hill, and J. L. Hennessy. 1992. Programming for Different Memory Consistency Models. Journal of Parallel and Distributed Computing 15(4):399-407.
Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1991a. Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors. Proc. 4th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):245-257.
Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1991b. Two Techniques to Enhance the Performance of Memory Consistency Models. Proc. Int'l Conference on Parallel Processing (August):1355-1364.
Gharachorloo, K., A. Gupta, and J. L. Hennessy. 1992. Hiding Memory Latency Using Dynamic Scheduling in Shared Memory Multiprocessors. Proc. 19th Int'l Symposium on Computer Architecture (May):22-33.
Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. 1990. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. Proc. 17th Int'l Symposium on Computer Architecture (May):15-26.
Gillett, R. 1996. Memory Channel Network for PCI. IEEE Micro 16(1):12-18.
Gillett, R., M. Collins, and D. Pimm. 1996. Overview of Network Memory Channel for PCI. Proc. IEEE Spring COMPCON '96 (February).
Gillett, R., and R. Kaufmann. 1997. Using Memory Channel Network. IEEE Micro 17(1):19-25.
Glass, C. J., and L. M. Ni. 1992. The Turn Model for Adaptive Routing. Proc. Annual International Symposium on Computer Architecture (ISCA) (May):278-287.
Godiwala, N. D., and B. A. Maskas. 1995. The Second-Generation Processor Module for AlphaServer 2100 Systems. Digital Technical Journal 7(1).
Gokhale, M., B. Holmes, and K. Iobst. 1995. Processing in Memory: The Terasys Massively Parallel PIM Array. IEEE Computer 28(3):23-31.
Goldschmidt, S. R. 1993. Simulation of Multiprocessors: Speed and Accuracy. Ph.D. diss., Stanford University (June).
Golub, G., and C. Van Loan. 1997. Matrix Computations 3e. Baltimore, MD: Johns Hopkins University Press.
Goodman, J. R. 1983. Using Cache Memory to Reduce Processor-Memory Traffic. Proc. 10th Annual Int'l Symposium on Computer Architecture (June):124-131.
Goodman, J. R. 1987. Coherency for Multiprocessor Virtual Address Caches. Proc. Second Int'l Conference on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA (October):72-81.
Goodman, J. R. 1989. Cache Consistency and Sequential Consistency. Tech. Report #1006, University of Wisconsin-Madison, Computer Science Dept. (February).
Goodman, J. R., M. K. Vernon, and P. J. Woest. 1989. Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems (April):64-75.
Gottlieb, A., R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. 1983. The NYU Ultracomputer--Designing an MIMD Shared Memory Parallel Computer. IEEE Transactions on Computers C-32(2):175-189.
Gottlieb, A., and C. P. Kruskal. 1984. Complexity Results for Permuting Data and Other Computations on Parallel Processors. Journal of the ACM 31(April):193-209.
Gottlieb, A., B. Lubachevsky, and L. Rudolph. 1983. Basic Techniques for the Efficient Coordination of Large Numbers of Cooperating Sequential Processes. ACM Transactions on Programming Languages and Systems 5(2).
Grafe, V. G., and J. E. Hoch. 1990. The Epsilon-2 Hybrid Dataflow Architecture. Proc. COMPCON Spring '90, San Francisco, CA (March):88-93.
Grahn, H., and P. Stenstrom. 1996. Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection. Journal of Parallel and Distributed Computing 39(2):168-180.
Grahn, H., P. Stenstrom and M. Dubois. 1995. Implementation and Evaluation of Update-Based Protocols under Relaxed Memory Consistency Models. Future Generation Computer Systems 11(3):247-271.
Graunke, G., and S. Thakkar. 1990. Synchronization Algorithms for Shared Memory Multiprocessors. IEEE Computer 23(6):60-69.
Gray, J. 1991. The Benchmark Handbook for Database and Transaction Processing Systems. San Francisco: Morgan Kaufmann.
Green, S. A., and D. J. Paddon. 1990. A Highly Flexible Multiprocessor Solution for Ray Tracing. The Visual Computer 6:62-73.
Greenberg, R. I. and C. E. Leiserson. 1989. Randomized Routing on Fat-Trees. Advances in Computing Research 5:345-374.
Greenwald, M., and D. R. Cheriton. 1996. The Synergy between Non-Blocking Synchronization and Operating System Structure. Proc. Second Symposium on Operating System Design and Implementation, USENIX, Seattle (October):123-136.
Gropp, W., E. Lusk, and A. Skjellum. 1994. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press.
Groscup, W. 1992. The Intel Paragon XP/S Supercomputer. Proc. Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology (November):262-273.
Gunther, K. D. 1981. Prevention of Deadlocks in Packet-Switched Data Transport Systems. IEEE Transactions on Communication C-29(4):512-24.
Gupta, A., J. L. Hennessy, K. Gharachorloo, T. Mowry and W.-D. Weber. 1991. Comparative Evaluation of Latency Reducing and Tolerating Techniques. Proc. 18th Int'l Symposium on Computer Architecture (May):254-263.
Gupta, A., and W.-D. Weber. 1992. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers 41(7):794-810.
Gupta, A., W.-D. Weber, and T. Mowry. 1990. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache-Coherence Schemes. Proc. Int'l Conference on Parallel Processing I (August):312-321.
Gurd, J. R., C. C. Kirkham and I. Watson. 1985. The Manchester Prototype Dataflow Computer. Communications of the ACM 28(1):34-52.
Gustafson, J. L. 1988. Reevaluating Amdahl's Law. Communications of the ACM 31(5):532-533.
Gustafson, J. L., and Q. O. Snell. 1994. HINT: A New Way to Measure Computer Performance. Tech. Report, Ames Laboratory, U.S. Dept. of Energy, Ames, IA.
Gustavson, D. 1992. The Scalable Coherence Interface and Related Standards Projects. IEEE Micro 12(1):10-22.
Gwennap, L. 1994a. Microprocessors Head Toward MP on a Chip. Microprocessor Report (May).
Gwennap, L. 1994b. PA-7200 Enables Inexpensive MP Systems. Microprocessor Report (March).
Hagersten, E. 1992. Toward Scalable Cache Only Memory Architectures. Ph.D. diss., Swedish Institute of Computer Science (October).
Hagersten, E., A. Landin, and S. Haridi. 1992. DDM--A Cache Only Memory Architecture. IEEE Computer 25(9):44-54.
Hanrahan, P., D. Salzman and L. A. Aupperle. 1991. A Rapid Hierarchical Radiosity Algorithm. Proc. SIGGRAPH (July).
Hayashi, K., T. Doi, T. Horie, Y. Koyanagi, O. Shiraki, N. Imamura, T. Shimizu, H. Ishihata and T. Shindo. 1994. AP1000+: Architectural Support of PUT/GET Interface for Parallelizing Compiler. ACM SIGPLAN Notices 29(11):196.
Heinlein, J., R. P. Bosch, Jr., K. Gharachorloo, M. Rosenblum, and A. Gupta. 1997. Coherent Block Data Transfer in the FLASH Multiprocessor. Proc. 11th Int'l Parallel Processing Symposium (April).
Heinlein, J., K. Gharachorloo, S. Dresser, and A. Gupta. 1994. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):38-50.
Heinrich, M., J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. P. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, A. Gupta, M. Rosenblum, and J. L. Hennessy. 1994. The Performance Impact of Flexibility on the Stanford FLASH Multiprocessor. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):274-285.
Hennessy, J. L., and N. Jouppi. 1991. Computer Technology and Architecture: An Evolving Interaction. IEEE Computer 24(9):18-29.
Hennessy, J. L., and D. A. Patterson. 1996. Computer Architecture: A Quantitative Approach. 2nd ed. San Francisco: Morgan Kaufmann.
Herlihy, M. P. 1988. Impossibility and Universality Results for Wait-Free Synchronization. Seventh ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (August):276-290.
Herlihy, M. P. 1991. Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems 13(1):124-149.
Herlihy, M. P. 1993. A Methodology for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems 15(5):745-770.
Herlihy, M. P., and J. E. B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. Proc. 20th Annual Symposium on Computer Architecture, San Diego, CA (May):289-301.
Herlihy, M. P., and J. Wing. 1987. Axioms for Concurrent Objects. Proc. 14th ACM Symposium on Principles of Programming Languages (January):13-26.
Hernquist, L. 1987. Performance Characteristics of Tree Codes. Astrophysics Journal Supplement 64(August):715-734.
Hey, A. J. G. 1991. The Genesis Distributed Memory Benchmarks. Parallel Computing 17:1111-1130.
High Performance Fortran Forum. 1993. High Performance Fortran Language Specification. Scientific Programming 2(1):1-270.
Hill, M. D., S. J. Eggers, J. R. Larus, G. S. Taylor, G. Adams, B. K. Bose, G. A. Gibson, P. M. Hansen, J. Keller, S. I. Kong, C. G. Lee, D. Lee, J. M. Pendleton, S. A. Ritchie, D. A. Wood, B. G. Zorn, P. N. Hilfinger, D. A. Hodges, R. H. Katz, J. Ousterhout, and D. A. Patterson. 1986. Design Decisions in SPUR. IEEE Computer 19(10):8-22. Also in Computers for Artificial Intelligence Processing. Edited by B. W. Wah and C. V. Ramamoorthy. New York: John Wiley and Sons, 273-299.
Hill, M. D., and A. J. Smith. 1989. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers C-38(12):1612-1630.
Hillis, W. D. 1985. The Connection Machine. Cambridge, MA: MIT Press.
Hillis, W. D., and G. L. Steele. 1986. Data Parallel Algorithms. Communications of the ACM 29(12):1170-1183.
Hillis, W. D., and L. W. Tucker. 1993. The CM-5 Connection Machine: A Scalable Supercomputer. Communications of the ACM 36(11):31-40.
Hirata, H., K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa. 1992. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. Proc. 19th Int'l Symposium on Computer Architecture (May):136-145.
Hoare, C. A. R. 1978. Communicating Sequential Processes. Communications of the ACM 21(8):666-667.
Hockney, R. W. and C. R. Jesshope. 1988. Parallel Computers 2. London: Adam Hilger.
Holt, C., M. Heinrich, J. P. Singh, E. Rothberg and J. L. Hennessy. 1995. The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors. Tech. Report #CSL-TR-95-660, Computer Systems Laboratory, Stanford University (January).
Homewood, M., and M. McLaren. 1993. Meiko CS-2 Interconnect Elan--Elite Design. Hot Interconnects (August).
Horie, T., K. Hayashi, T. Shimizu, and H. Ishihata. 1993. Improving the AP1000 Parallel Computer Performance with Message Passing. Proc. 20th Annual Int'l Symposium on Computer Architecture (May):314-325.
Horowitz, M. 1997. Limits of Electrical Signalling. Hot Interconnects Keynote (August).
Horst, R. 1995. TNet: A Reliable System Area Network. IEEE Micro 15(1):37-45.
Horst, R. W. and T. C. K. Chou. 1985. An Architecture for High Volume Transaction Processing. Proc. 12th Annual Int'l Symposium on Computer Architecture (June):240-245. Boston MA. (Tandem NonStop II).
Horst, R. W., R. L. Harris, and R. L. Jardine. 1990. Multiple Instruction Issue in the NonStop Cyclone Processor. Proc. Annual International Symposium on Computer Architecture (ISCA), 216-226.
Hristea, C., D. Lenoski and J. Keen. 1997. Measuring Memory Hierarchy Performance of Cache Coherent Multiprocessors Using Micro Benchmarks. Proc. SC97 (November; all-Web conference proceeding).
Hunt, D. 1996. Advanced Features of the 64-Bit PA-8000. Palo Alto, CA: Hewlett Packard Corp.
IEEE Computer Society. 1993. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Standard 1596-1992. Washington, DC: IEEE Computer Society.
IEEE Computer Society. 1995. IEEE Standard for Cache Optimization for Large Numbers of Processors Using the Scalable Coherent Interface (SCI) Draft 0.35 (September). Washington, DC: IEEE Computer Society.
Iftode, L., C. Dubnicki, E. W. Felten and K. Li. 1996. Improving Release-Consistent Shared Virtual Memory Using Automatic Update. Proc. Second Symposium on High Performance Computer Architecture (February):14-25.
Iftode, L., J. P. Singh, and K. Li. 1996a. Understanding Application Performance on Shared Virtual Memory Systems. Proc. 23rd Int'l Symposium on Computer Architecture (April):122-133.
Iftode, L., J. P. Singh, and K. Li. 1996b. Scope Consistency: A Bridge between Release Consistency and Entry Consistency. Proc. Symposium on Parallel Algorithms and Architectures (June).
Intel Corporation. 1994. i750, i860, i960 Processors and Related Products. Santa Clara, CA: Intel Corp.
Intel Corporation. 1996. Pentium® Pro Family Developers Manual. Santa Clara, CA: Intel Corp.
Jeremiassen, T. E., and S. J. Eggers. 1991. Eliminating False Sharing. Proc. 1991 Int'l Conference on Parallel Processing (August):377-381.
Jiang, D., H. Shan, and J. P. Singh. 1997. Application Restructuring and Performance Portability on Shared Virtual Memory and Hardware-Coherent Multiprocessors. Proc. Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (June):217-229.
Jiang, D., and J. P. Singh. 1998. A Methodology and an Evaluation of the SGI Origin2000. Proc. SIGMETRICS Conference on Measurement and Modeling of Computer Systems (June).
Joe, T. 1995. COMA-F: A Non-Hierarchical Cache Only Memory Architecture. Ph.D. diss., Computer Systems Laboratory, Stanford University (March).
Joe, T., and J. L. Hennessy. 1994. Evaluating the Memory Overhead Required for COMA Architectures. Proc. 21st Int'l Symposium on Computer Architecture (April):82-93.
Joerg, C. F. 1994. Design and Implementation of a Packet Switched Routing Chip. Tech. Report MIT/LCS/TR-482, MIT Laboratory for Computer Science (August).
Joerg, C. F., and A. Boughton. 1991. The Monsoon Interconnection Network. Proc. ICCD (October).
Johnson, M. 1991. Superscalar Microprocessor Design. Englewood Cliffs, NJ: Prentice Hall.
Jordan, H. F. 1985. HEP Architecture, Programming, and Performance. In Parallel MIMD Computation: The HEP Supercomputer and Its Applications. Edited by J. S. Kowalik. Cambridge, MA: MIT Press, 8.
Jouppi, N. P. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. Proc. 17th Annual Symposium on Computer Architecture (June):364-373.
Jouppi, N. P., and P. Ranganathan. 1997. The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember. Presented at the Annual Int'l Symposium on Computer Architecture (ISCA) '97 (June).
Jouppi, N. P. and D. Wall. 1989. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. ASPLOS III, 272-282.
Kagi, A., D. Burger, and J. R. Goodman. 1997. Efficient Synchronization: Let Them Eat QOLB. Proc. 24th Int'l Symposium on Computer Architecture (ISCA) (June):170-180.
Karlin, A. R., M. S. Manasse, L. Rudolph and D. D. Sleator. 1986. Competitive Snoopy Caching. Proc. 27th Annual IEEE Symposium on Foundations of Computer Science.
Karol, M., M. Hluchyj and S. Morgan. 1987. Input versus Output Queueing on a Space Division Packet Switch. IEEE Transactions on Communications 35(12):1347-1356.
Karp, R., U. Vazirani and V. Vazirani. 1990. An Optimal Algorithm for On-Line Bipartite Matching. Proc. 22nd ACM Symposium on the Theory of Computing (May):352-358.
Kaxiras, S. 1996. Kiloprocessor Extensions to SCI. Proc. 10th Int'l Parallel Processing Symposium.
Kaxiras, S., and J. Goodman. The GLOW Cache Coherence Protocol Extensions for Widely Shared Data. Proc. Int'l Conference on Supercomputing (May):35-43.
Keeton, K. K., T. E. Anderson, and D. A. Patterson. 1995. LogP Quantified: The Case for Low-Overhead Local Area Networks. Hot Interconnects III: Symposium on High Performance Interconnects (August).
Keleher, P., A. L. Cox, S. Dwarkadas and W. Zwaenepoel. 1994. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Proc. Winter USENIX Conference (January):15-132.
Keleher, P., A. L. Cox, and W. Zwaenepoel. 1992. Lazy Consistency for Software Distributed Shared Memory. Proc. 19th Int'l Symposium on Computer Architecture (May):13-21.
Kermani, P. and L. Kleinrock. 1979. Virtual Cut-Through: A New Computer Communication Switching Technique. Computer Networks 3 (September):267-286.
Kessler, R. E., and J. L. Schwarzmeier. 1993. Cray T3D: A New Dimension for Cray Research. Proc. Papers, COMPCON Spring '93, San Francisco (February):176-182.
Knuth, D. E. 1966. Additional Comments on a Problem in Concurrent Programming Control. Communications of the ACM 9(5):321-322.
Koelbel, C., D. Loveman, R. Schreiber, G. Steele, and M. Zosel. 1994. The High Performance Fortran Handbook. Cambridge, MA: MIT Press.
Koeninger, R. K., M. Furtney, and M. Walker. 1994. A Shared Memory MPP from Cray Research. Digital Technical Journal 6(2):8-21.
Kogge, P. M. 1994. EXECUBE--A New Architecture for Scalable MPPs. 1994 Int'l Conference on Parallel Processing (August):177-184.
Kontothanassis, L. I., G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. Scott. 1997. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. Proc. 24th Int'l Symposium on Computer Architecture (June).
Kontothanassis, L. I., and M. L. Scott. 1996. Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory. Proc. Second Symposium on High Performance Computer Architecture (February):166-177.
Konstantinidou, S., and L. Snyder. 1991. Chaos Router: Architecture and Performance. Proc. 18th Annual Symposium on Computer Architecture (May):212-221.
Krishnamurthy, A., K. E. Schauser, C. J. Scheiman, R. Y. Wang, D. E. Culler, and K. Yelick. 1996. Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines. ACM SIGPLAN Notices 31(9):37-48.
Krishnamurthy, A., and K. A. Yelick. 1994. Optimizing Parallel SPMD Programs. Seventh Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY (August).
Krishnamurthy, A., and K. A. Yelick. 1995. Optimizing Parallel Programs with Explicit Synchronization. Programming Language Design and Implementation, 196-204.
Krishnamurthy, A., and K. A. Yelick. 1996. Analyses and Optimizations for Shared Address Space Programs. JPDC 38(2):130-144.
Kroft, D. 1981. Lockup-Free Instruction Fetch/Prefetch Cache Organization. Proc. Eighth Int'l Symposium on Computer Architecture (May):81-87.
Kronenberg, N. R. H. Levy, and W. D. Strecker. 1986. Vax Clusters: A Closely-Coupled Distributed System. ACM Transactions on Computer Systems 4(2):130-146.
Kruskal, C. P., and M. Snir. 1983. The Performance of Multistage Interconnection Networks for Multiprocessors. IEEE Transactions on Computers C-32(12):1091-1098.
Kubiatowicz, J., and A. Agarwal. 1993. The Anatomy of a Message in the Alewife Multiprocessor. Proc. Int'l Conference on Supercomputing (July):195-206.
Kuehn, J. T., and B. J. Smith. 1988. The Horizon Supercomputing System: Architecture and Software. Proc. Supercomputing '88 (November):28-34.
Kumar, M. 1992. Unique Design Concepts in GF11 and Their Impact on Performance. IBM Journal of Research and Development 36(6):990-1000.
Kumar, V., A. Grama, A. Gupta, and G. Karypis. 1994. Introduction to Parallel Computing: Design and Analysis of Algorithms. Redwood City, CA: Benjamin/Cummings Publishing Company.
Kumar, V., and A. Gupta. 1991. Analysis of Scalability of Parallel Algorithms and Architectures: A Survey. Proc. Int'l Conference on Supercomputing (June):396-405.
Kung, H. T., R. Sansom, S. Schlick, P. A. Steenkiste, M. Arnould, F. J. Bitz, F. Christianson, E. C. Cooper, O. Menzilcioglu, D. Ombres, and B. Zill. 1989. Network-Based Multicomputers: An Emerging Parallel Architecture. Proc. Supercomputing '91 Conference (November):664-673.
Kurihara, K., D. Chaiken and A. Agarwal. 1991. Latency Tolerance through Multithreading in Large-Scale Multiprocessors. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):91-101.
Kuskin, J., D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. 1994. The Stanford FLASH Multiprocessor. Proc. 21st Int'l Symposium on Computer Architecture (April):302-313.
Lam, M. S., and R. P. Wilson. 1992. Limits on Control Flow on Parallelism. Proc. 19th Annual Int'l Symposium on Computer Architecture (May):46-57.
Lamport, L. 1979. How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers C-28(9):690-691.
Larus, J. R., B. Richards, and G. Viswanathan. 1996. Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language. In Parallel Programming Using C++. Edited by G. V. Wilson and P. Lu. Cambridge, MA: MIT Press.
Laudon, J. 1994. Architectural and Implementation Tradeoffs in Multiple-Context Processors. Ph.D. diss., Stanford University, Stanford, California. Also published as Tech. Report #CSL-TR-94-634, Computer Systems Laboratory, Stanford University (May).
Laudon, J., A. Gupta, and M. Horowitz. 1994. Architectural and Implementation Tradeoffs in the Design of Multiple-Context Processors. In Multithreaded Computer Architecture: A Summary of the State of the Art. Edited by R. A. Iannucci. Dordrecht, Netherlands; Norwell, MA: Kluwer Academic Publishers, 167-200.
Laudon, J. P. and D. Lenoski. 1997. The SGI Origin: A ccNUMA Highly Scalable Server. Proc. 24th Int'l Symposium on Computer Architecture.
Lawton, J. V., J. J. Brosnan, M. P. Doyle, S. D. O'Riordain and T. G. Reddin. 1996. Building a High-Performance Message-Passing System for MEMORY CHANNEL Clusters. Digital Technical Journal 8(2):96-116.
Lee, C. G. 1989. Multi-Step Gradual Rounding. IEEE Transactions on Computers 38(4):595-600.
Lee, R. L., A. Y. Kwok and F. A. Briggs. 1991. The Floating point Performance of a Superscalar SPARC Processor. Proc. 4th Symposium on Architectural Support for Programming Languages and Operating Systems (April):28-37.
Leighton, F. T. 1992. Introduction to Parallel Algorithms and Architectures. San Francisco: Morgan Kaufmann.
Leiserson, C. E. 1985. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers C-34(10):892-901.
Leiserson, C. E., Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, and R. Zak. 1996. The Network Architecture of the Connection Machine CM-5. Journal of Parallel and Distributed Computing 33(2):145-158. Also in Proc. Fourth Symposium on Parallel Algorithms and Architectures '92 (June):272-285.
Lenoski, D. 1992. The Stanford DASH Multiprocessor. Ph.D. diss., Computer Systems Laboratory, Stanford University.
Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta, and J. L. Hennessy. 1990. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proc. 17th Int'l Symposium on Computer Architecture (May):148-159.
Lenoski, D., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. 1992. The DASH Prototype: Implementation and Performance. Proc. 19th Int'l Symposium on Computer Architecture, Gold Coast, Australia (May):92-103.
Lenoski, D., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. L. Hennessy. 1993. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems 4(1):41-61.
Li, K., and P. Hudak. 1989. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems 7(4):321-359.
Li, S.-Y. 1988. Theory of Periodic Contention and Its Application to Packet Switching. Proc. INFOCOM '88 (March):320-325.
Lim, B.-H., and A. Agarwal. 1994. Reactive Synchronization Algorithms for Multiprocessors. Proc. Sixth Int'l Conference on Architectural Support for Programming Languages and Operating Systems, 25-35.
Linder, D., and J. Harden. 1991. An Adaptive Fault Tolerant Wormhole Strategy for k-ary n-cubes. IEEE Transactions on Computers C-40(1):2-12.
Lipton, R., and J. Sandberg. 1988. PRAM: A Scalable Shared Memory. Tech. Report #CS-TR-180-88, Computer Science Dept., Princeton University (September).
Litzkow, M., M. Livny, and M. W. Mutka. 1988. Condor--A Hunter of Idle Workstations. Proc. Eighth Int'l Conference of Distributed Computing Systems (June):104-111.
Lo, J. L., S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm and D. M. Tullsen. 1997. Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems (August).
Lonergan, W., and P. King. 1961. Design of the B 5000 System. Datamation 7(5):28-32.
Lovett, T. and R. Clapp. 1996. STiNG: A CC-NUMA Computer System for the Commercial Marketplace. Proc. 23rd Int'l Symposium on Computer Architecture (May):308-317.
Luk, C.-K., and T. C. Mowry. 1996. Compiler-Based Prefetching for Recursive Data Structures. Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (October):222-233.
Lukowsky, J., and S. Polit. 1997 (date accessed). IP Packet Switching on the GIGAswitch/FDDI System. http://www.networks.digital.com:80/dr/techart/gsfip-mn.hlml.
Mainwaring, A., B. Chun, S. Schleimer, and D. Wilkerson. 1997. System Area Network Mapping. Proc. Ninth Annual ACM Symposium on Parallel Algorithms and Architecture , Newport, RI (June):116-126. Google Scholar
Mainwaring, A., and D. E. Culler. 1996. Active Message Applications Programming Interface and Communication Subsystem Organization . Tech. Report CSD-96-918. University of California at Berkeley. Google Scholar
Martin, R. 1994. HPAM: An Active Message Layer of a Network of Workstations. Presented at Hot Interconnects II (August).Google Scholar
Massalin, H., and C. Pu. 1991. A Lock-Free Multiprocessor OS Kernel . Tech. Report CUCS-005-01, Columbia University, Computer Science Dept. (October).Google Scholar
Matelan, N. 1985. The FLEX/32 Multicomputer. Proc. 12th Annual Int'l Symposium on Computer Architecture , Boston, MA. (Flex) (June):209-213. Google Scholar
May, C., E. Silha, R. Simpson, and H. Warren, eds. 1994. The PowerPC Architecture: A Specification for a New Family of RISC Processors . San Francisco: Morgan Kaufmann. Google Scholar
McCreight, E. 1984. The Dragon Computer System: An Early Overview . Tech. Report, Xerox Corp. (September).Google Scholar
Mellor-Crummey, J., and M. Scott. 1991. Algorithms for Scalable Synchronization on Shared Memory Multiprocessors. ACM Transactions on Computer Systems 9(1):21-65.
Melvin, S., and Y. Patt. 1991. Exploiting Fine-Grained Parallelism through a Combination of Hardware and Software Techniques. Proc. Annual Int'l Symposium on Computer Architecture (ISCA) , 287-296. Google Scholar
Michael, M., and M. Scott. 1996. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proc. 15th Annual ACM Symposium on Principles of Distributed Computing , Philadelphia, PA (May): 267-276. Google Scholar
Minnich, R., D. Burns, and F. Hady. 1995. The Memory-Integrated Network Interface. IEEE Micro 15(1): 11-20. Google ScholarDigital Library
MIPS Technologies. 1991. MIPS R4000 Users Manual . Mountain View, CA: MIPS Technologies.Google Scholar
MIPS Technologies. 1996. R10000 Microprocessor User's Manual, Version 1.1 (January). Mountain View, CA: MIPS Technologies.
Miyoshi, H., M. Fukuda, T. Iwamiya, T. Nakamura, M. Tuchiya, M. Yoshida, K. Yamamoto, Y. Yamamoto, S. Ogawa, Y. Matsuo, T. Yamane, M. Takamura, M. Ikeda, S. Okada, Y. Sakamoto, T. Kitamura, H. Hatama, M. Kishimoto, M. Arnould, F. J. Bitz, E. C. Cooper, H. T. Kung, R. Sansom, S. Schlick, P. A. Steenkiste, and B. Zill. 1994. Development and Achievement of NAL Numerical Wind Tunnel (NWT) for CFD Computations. Proc. Supercomputing '94, Washington, DC (November):685-692.
Mowry, T. C. 1994. Tolerating Latency through Software-Controlled Data Prefetching. Ph.D. diss., Computer Systems Laboratory, Stanford University. Also published as Tech. Report #CSL-TR-94-628. Computer Systems Laboratory, Stanford University (June).
MPI Forum. 1993. Document for a Standard Message-Passing Interface . Tech. Report CS-93-214. University of Tennessee, Knoxville, Computer Science Dept. (November). Google Scholar
MPI Forum. 1994. MPI: A Message Passing Interface. Int'l Journal of Supercomputing Applications 8(3/4). Special Issue on MPI. (updated 5/95). Also published in Proc. Supercomputing '93 Conference (May). Los Alamitos, CA: IEEE Computer Society Press. 878-883. Updated spec at http://www.mcs.anl.gov/mpi/. Google Scholar
Mukherjee, S., and M. Hill. 1997. A Case for Making Network Interfaces Less Peripheral. Hot Interconnects (August).Google Scholar
NAS Parallel Benchmarks. 1998 (date accessed). http://science.nas.nasa.gov/Software/NPB/.Google Scholar
Nayfeh, B. A., L. Hammond, and K. Olukotun. 1996. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. Proc. 23rd Annual Int'l Symposium on Computer Architecture (May). New York: ACM Press, 67-77.
Nestle, E., and A. Inselberg. 1985. The Synapse N+1 System: Architectural Characteristics and Performance Data of a Tightly-Coupled Multiprocessor System. Proc. 12th Annual Int'l Symposium on Computer Architecture, Boston, MA (Synapse) (June):233-239.
Ngai, J., and C. Seitz. 1989. A Framework for Adaptive Routing in Multicomputer Networks. Proc. 1989 Symposium on Parallel Algorithms and Architectures (June):2-10. Google Scholar
Nickolls, J. R. 1990. The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer. COMPCON Spring '90, Digest of Papers , San Francisco, CA (February/March):25-28.Google Scholar
Nikhil, R. S., and Arvind. 1989. Can Dataflow Subsume von Neumann Computing? Proc. 16th Annual Int'l Symposium on Computer Architecture (May):262-72. Google Scholar
Nikhil, R., G. Papadopoulos, and Arvind. 1993. *T: A Multithreaded Massively Parallel Architecture. Proc. Annual Int'l Symposium on Computer Architecture (ISCA) '93 (May):156-167. Google Scholar
Noakes, M. D., D. A. Wallach and W.J. Dally. 1993. The J-Machine Multicomputer: An Architectural Evaluation. Proc. 20th Int'l Symposium on Computer Architecture (May):224-235. Google Scholar
Nuth, P., and W.J. Dally. 1992. The J-Machine Network. Proc. Int'l Conference on Computer Design: VLSI in Computers and Processors (October). Google Scholar
Nuth, P., and W.J. Dally. 1995. The Named-State Register File: Implementation and Performance. Proc. First Int'l Symposium on High-Performance Computer Architecture (January):4-13. Google Scholar
O'Krafka, B., and A. Newton. 1990. An Empirical Evaluation of Two Memory-Efficient Directory Methods. Proc. 17th Int'l Symposium on Computer Architecture (May):138-147. Google Scholar
Office of Science and Technology Policy. 1993. Grand Challenges 1993: High Performance Computing and Communications, A Report by the Committee on Physical, Mathematical, and Engineering Sciences . Washington, DC: Office of Science and Technology Policy.Google Scholar
Ohara, M. 1996. Producer-Oriented versus Consumer-Oriented Prefetching: A Comparison and Analysis of Parallel Application Programs . Ph.D. diss., Computer Systems Laboratory. Stanford University. Available as Tech. Report #CSL-TR-96-695, Stanford University (June). Google Scholar
Olukotun, K., B. A. Nayfeh, L. Hammond, K. Wilson and K. Chang. 1996. The Case for a Single-Chip Multiprocessor. Proc. ASPLOS (October):2-11. Google Scholar
Omondi, A. R. 1994. Ideas for the Design of Multithreaded Pipelines. In Multithreaded Computer Architecture: A Summary of the State of the Art . Edited by R. Iannucci. Dordrecht, Germany; Norwell, MA: Kluwer Academic Publishers, 1994. See also A. R. Omondi, Design of a High Performance Instruction Pipeline. Computer Systems Science and Engineering 6(1):13-29 (1991).Google Scholar
Pacheco, P. 1996. Parallel Programming with MPI . San Francisco: Morgan Kaufmann. Google Scholar
Padegs, A. 1981. System/360 and Beyond. IBM Journal of Research and Development 25(5):377-390. Google ScholarDigital Library
Pai, V. S., P. Ranganathan, S. V. Adve, and T. Harton. 1996. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. Proc. Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII) (October):12-23. Google Scholar
Papadimitriou, C. H. 1979. The Serializability of Concurrent Database Updates. Journal of the ACM 26(4):631-653. Google ScholarDigital Library
Papadopoulos, G. M., and D. E. Culler. 1990. Monsoon: An Explicit Token-Store Architecture. Proc. 17th Annual Int'l Symposium on Computer Architecture , Seattle, WA (May):82-91. Google Scholar
Papamarcos, M., and J. Patel. 1984. A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proc. 11th Annual Int'l Symposium on Computer Architecture (June):348-354. Google Scholar
PARKBENCH Committee. 1994. Public International Benchmarks for Parallel Computers. Scientific Programming 3(2). Also published as Tech. Report CS93-213, University of Tennessee, Knoxville, Dept. of Computer Science (November). Google Scholar
Patterson, D. A. 1995. Microprocessors in 2020. Scientific American (September).Google Scholar
Patterson, D. A., T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A Case for Intelligent RAM. IEEE Micro 17(2):34-44. Google ScholarDigital Library
Peterson, L., and B. Davie. 1996. Computer Networks . San Francisco: Morgan Kaufmann. Google Scholar
Pfeiffer, W., S. Hotovy, N. Nystrom, D. Rudy, T. Sterling, M. Straka. 1995 (date accessed). JNNIE: The Joint NSF-NASA Initiative on Evaluation, http://www.tc.cornell.edu/JNNIE/finrep/jnnie.html.Google Scholar
Pfister, G. F. 1995. In Search of Clusters--The Coming Battle for Lowly Parallel Computing. Englewood Cliffs, NJ: Prentice Hall.
Pfister, G. F., W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliff, E. A. Melton, V. A. Norton, and J. Weiss. 1985. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. Proc. Int'l Conference on Parallel Processing (August):764-771.
Pfister, G. F., and V. A. Norton. 1985. Hot Spot Contention and Combining Multistage Interconnection Networks. IEEE Transactions on Computers C-34(10).Google ScholarCross Ref
Pierce, P. 1988. The NX/2 Operating System. Proc. Third Conference on Hypercube Concurrent Computers and Applications (January):384-390.
Pierce, P., and G. Regnier. 1994. The Paragon Implementation of the NX Message Passing Interface. Proc. Scalable High-Performance Computing Conference (May):184-90.
Porter, R. E. 1960. Datamation 6(1):8-14.Google Scholar
Przybylski, S., M. Horowitz, J. L. Hennessy. 1988. Performance Tradeoffs in Cache Design. Proc. 15th Annual Symposium on Computer Architecture (May):290-298. Google Scholar
Ranganathan, P., V. S. Pai, H. Abdel-Shafi, and S. V. Adve. 1997. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. Proc. 24th Int'l Symposium on Computer Architecture (June).
Ratner, J. 1985. Concurrent Processing: A New Direction in Scientific Computing. Proc. 1985 National Computing Conference , 835.Google Scholar
Reddaway, S. F. 1973. DAP--A Distributed Array Processor. First Annual Int'l Symposium on Computer Architecture (December):61-65.
Reinhardt, S. K., M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood. 1993. The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers. Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May):48-60. Google Scholar
Reinhardt, S. K., J. R. Larus, and D. A. Wood. 1994. Tempest and Typhoon: User-Level Shared Memory. Proc. 21st Int'l Symposium on Computer Architecture (April):325-337. Google Scholar
Reinhardt, S. K., R. W. Pfile, and D. A. Wood. 1996. Decoupled Hardware Support for Distributed Shared Memory. Proc. 23rd Int'l Symposium on Computer Architecture (May):34-43. Google Scholar
Rettberg, R., W. Crowther, P. Carvey, and R. Tomlinson. 1990. The Monarch Parallel Processor Hardware Design. IEEE Computer (April):18-30. Google ScholarDigital Library
Rettberg, R., and R. Thomas. 1986. Contention is No Obstacle to Shared-Memory Multiprocessing. Communications of the ACM 29(12):1202-1212. Google ScholarDigital Library
Rinard, M. C., D. J. Scales, and M. S. Lam. 1993. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer 26(6).
Rodgers, D. 1985. Improvements on Multiprocessor System Design. Proc. 12th Annual Int'l Symposium on Computer Architecture , Boston, MA (Sequent B8000) (June):225-231. Google Scholar
Rosenblum, M., S. A. Herrod, E. Witchel, and A. Gupta. 1995. Complete Computer Simulation: The SimOS Approach. IEEE Parallel and Distributed Technology 3(4). Google Scholar
Rosenburg, B. 1989. Low-Synchronization Translation Lookaside Buffer Consistency in Large-Scale Shared-Memory Multiprocessors. Proc. Symposium on Operating Systems Principles (December).
Rothberg, E., J. P. Singh, and A. Gupta. 1993. Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors. Proc. 20th Int'l Symposium on Computer Architecture (May):14-25. Google Scholar
Russell, R. M. 1978. The CRAY-1 Computer System. Communications of the ACM 21(1):63-72.
Saavedra-Barrera, R. H., D. E. Culler, T. von Eicken. 1990. Analysis of Multithreaded Architectures for Parallel Computing. Second Annual ACM Symposium on Parallel Algorithms and Architectures (July): 169-178. Google Scholar
Saavedra, R. H., R. S. Gaines, and M.J. Carlton. 1993. Micro Benchmark Analysis of the KSR1. Proc. Supercomputing '93 , Portland, OR (November):202-213. Google Scholar
Saavedra, R. H., and A. J. Smith. 1996. Analysis of Benchmark Characteristics and Benchmark Performance Prediction. ACM Transactions on Computer Systems 14(4):344-384.
Sakai, S., Y. Kodama, and Y. Yamaguchi. 1991. Prototype Implementation of a Highly Parallel Dataflow Machine EM4. Proc. Fifth Int'l Parallel Processing Symposium, Anaheim, CA (April/May):278-286.
Salmon, J. 1990. Parallel Hierarchical N-body Methods . Ph.D. diss., California Institute of Technology. Google Scholar
Salmon, J. K., M. S. Warren, and G. S. Winckelmans. 1994. Fast Parallel Tree codes for Gravitational and Fluid Dynamical N-body Problems. Intl. Journal of Supercomputer Applications 8:129-142. Google ScholarDigital Library
Samanta, R., A. Bilas, L. Iftode, and J. P. Singh. 1998. Home-Based SVM Protocols for SMP Clusters: Design, Simulations, Implementation, and Performance. Proc. 23rd Annual Int'l Symposium on Computer Architecture (February).
Saulsbury, A., F. Pong, and A. Nowatzyk. 1996. Missing the Memory Wall: The Case for Processor/Memory Integration. Proc. 23rd Annual Int'l Symposium on Computer Architecture (May):90-101.
Saulsbury, A., T. Wilkinson, J. Carter, and A. Landin. 1995. An Argument for Simple COMA. Proc. First IEEE Symposium on High Performance Computer Architecture (January):276-285.
Savage, J. 1985. Parallel Processing as a Language Design Problem. Proc. 12th Annual Int'l Symposium on Computer Architecture , Boston, MA (Myrias 4000) (June):221-224. Google Scholar
Scales, D. J., K. Gharachorloo, and C. A. Thekkath. 1996. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):174-185.
Scales, D. J., and M. S. Lam. 1994. The Design and Evaluation of a Shared Object System for Distributed Memory Machines. Proc. First Symposium on Operating System Design and Implementation (November):101-114. Google Scholar
Schanin, D.J. 1986. The Design and Development of a Very High Speed System Bus--The Encore Multimax Nanobus. In Proc. Fall Joint Computer Conference (Encore) , Dallas, TX (November). Edited by H. S. Stone. Los Alamitos: IEEE Computer Society Press, 410-418. Google Scholar
Schauser, K. E., and C. J. Scheiman. 1995. Experience with Active Messages on the Meiko CS-2. Proc. Ninth Int'l Symposium on Parallel Processing (IPPS'95) (April):140-149. Google Scholar
Scheurich, C. and M. Dubois. 1987. Correct Memory Operation of Cache-Based Multiprocessors. Proc. 14th Int'l Symposium on Computer Architecture (June):234-243. Google Scholar
Schoinas, I., B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. 1994. Fine-Grain Access Control for Distributed Shared Memory. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):297-306. Google Scholar
Schroeder, M. D., A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker. 1991. Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links. IEEE Journal on Selected Areas in Communications 9(8):1318-1335.
Schwiebert, L., and D. N. Jayasimha. 1995. A Universal Proof Technique for Deadlock-Free Routing in Interconnection Networks. Symposium on Parallel Algorithms and Architecture (July):175-184. Google Scholar
Scott, S. 1991. A Cache-Coherence Mechanism for Scalable Shared-Memory Multiprocessors. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):49-59.Google Scholar
Scott, S. 1996. Synchronization and Communication in the T3E Multiprocessor. Proc. Seventh Int'l Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA (October):26-36.
Scott, S., and J. R. Goodman. 1993. Performance of Pruning Cache Directories for Large-Scale Multiprocessors. IEEE Transactions on Parallel and Distributed Systems 4(5):520-534. Google ScholarDigital Library
Scott, S. 1994. The Impact of Pipelined Channels on k-ary n-Cube Networks. IEEE Transactions on Parallel and Distributed Systems 5(1):2-16. Google ScholarDigital Library
Scott, S., M. Vernon, and J. R. Goodman. 1992. Performance of the SCI Ring. Proc. 19th Int'l Symposium on Computer Architecture (May):403-414. Google Scholar
Seitz, C. L. 1984. Concurrent VLSI Architectures. IEEE Transactions on Computers 33(12):1247-1265. Google ScholarDigital Library
Seitz, C. L. 1985. The Cosmic Cube. Communications of the ACM 28(1):22-33. Google ScholarDigital Library
Seitz, C. L., and W.-K. Su. 1993. A Family of Routing and Communication Chips Based on Mosaic. Proc. of Univ. of Washington Symposium on Integrated Systems . Cambridge, MA: MIT Press, 320-337. Google Scholar
Shah, G., J. Nieplocha, J. Mirza, C. Kim, R. Harrison, R. K. Govindaraju, K. Gildea, P. DiNicola, and C. Bender. 1998. Performance and Experience with LAPI--A New High-Performance Communication Library for the IBM RS/6000 SP. Twelfth Int'l Parallel Processing Symposium (March):260-266.
Shasha, D., and M. Snir. 1988. Efficient and Correct Execution of Parallel Programs that Share Memory. ACM Transactions on Programming Languages and Systems 10(2):282-312.
Shimada, T., K. Hiraki and K. Nishida. 1984. An Architecture of a Data Flow Machine and Its Evaluation. Proc. COMPCON '84 , 486-90.Google Scholar
Simoni, R., and M. Horowitz. 1991. Dynamic Pointer Allocation for Scalable Cache Coherence Directories. Proc. Int'l Symposium on Shared Memory Multiprocessing (April):72-81.Google Scholar
Sindhu, P., J.-M. Frailong, and M. Cekleov. 1991. Formal Specification of Memory Models. Tech. Report (PARC) CSL-91-11. Xerox Corp., Palo Alto Research Center, Palo Alto, CA.
Sindhu, P., et al. 1993. XDBus: A High-Performance, Consistent, Packet Switched VLSI Bus. Proc. COMPCON (Spring):338-344.Google Scholar
Singh, J. P. 1993. Parallel Hierarchical N-body Methods and Their Implications for Multiprocessors . Ph.D. diss., Tech. Report #CSL-TR-93-565. Stanford University (March). Google Scholar
Singh, J. P. 1998. Some Aspects of Controlling Scheduling in Hardware Control Prefetching. To be published as Tech. Report, Princeton University, Computer Science Dept.
Singh, J. P., A. Gupta, and M. Levoy. 1994. Parallel Visualization Algorithms: Performance and Architectural Implications. IEEE Computer 27(6). Google Scholar
Singh, J. P., J. L. Hennessy and A. Gupta. 1993. Scaling Parallel Programs for Multiprocessors: Methodology and Examples. IEEE Computer 26(7):42-50. Google ScholarDigital Library
Singh, J. P., J. L. Hennessy and A. Gupta. 1995. Implications of Parallel Hierarchical N-body Applications for Multiprocessors. ACM Transactions on Computer Systems (May). Google Scholar
Singh, J. P., C. Holt, T. Totsuka, A. Gupta, and J. L. Hennessy. 1995. Load Balancing and Data Locality in Hierarchical N-body Methods: Barnes-Hut, Fast Multipole and Radiosity. Journal of Parallel and Distributed Computing (June).
Singh, J. P., T. Joe, A. Gupta, and J. L. Hennessy. 1993. An Empirical Comparison of the KSR-1 and DASH Multiprocessors. Proc. Supercomputing '93 (November). Google Scholar
Singh, J. P., E. Rothberg, and A. Gupta. 1994. Modeling Communication in Parallel Algorithms: A Fruitful Interaction between Theory and Systems? Proc. 10th Annual ACM Symposium on Parallel Algorithms and Architectures . Google Scholar
Singh, J. P., W.-D. Weber, and A. Gupta. 1992. SPLASH: The Stanford Parallel Applications for SHared Memory. Computer Architecture News 20(1):5-44.
Sites, R. L., ed. 1992. Alpha Architecture Reference Manual. Hudson, MA: Digital Press, Digital Equipment Corp.
Slater, M. 1994. Intel Unveils Multiprocessor System Specification. Microprocessor Report (May):12-14.Google Scholar
Slotnick, D. L. 1967. Unconventional Systems. Proc. AFIPS Spring Joint Computer Conference 30:477-481. Google Scholar
Slotnick, D. L., W. C. Borck, and R. C. McReynolds. 1962. The Solomon Computer. Proc. AFIPS Fall Joint Computer Conference 22:97-107. Google Scholar
Smith, A. J. 1982. Cache Memories. ACM Computing Surveys 14(3):473-530. Google ScholarDigital Library
Smith, B. J. 1981. Architecture and Applications of the HEP Multiprocessor Computer System. Proc. SPIE: Real-Time Signal Processing IV 298(August):241-248.Google Scholar
Smith, B. J. 1985. The Architecture of HEP. In Parallel MIMD Computation: The HEP Supercomputer and Its Applications. Edited by J. S. Kowalik. Cambridge, MA: MIT Press, 41-55.
Smith, M. D., M. Johnson, and M. A. Horowitz. 1989. Limits on Multiple Instruction Issue. Proc. Third Int'l Conference on Architectural Support for Programming Languages and Operating Systems , 290-302, Apr. Google Scholar
Snir, M., S. Otto, S. H. Lederman, D. Walker, and J. Dongarra. 1995. MPI: The Complete Reference . Cambridge, MA: MIT Press. Google ScholarDigital Library
Sohi, G., S. Breach, and T. N. Vijaykumar. 1995. Multiscalar Processors. Proc. 22nd Annual Int'l Symposium on Computer Architecture (June):414-425.
SPEC (Standard Performance Evaluation Corporation). 1995 (date accessed). http://www.specbench.org/. (SPEC Benchmark Suite Release 1.0., 1989).Google Scholar
Spertus, E., S. C. Goldstein, K. E. Schauser, T. von Eicken, D. E. Culler, W. J. Dally. 1993. Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5. Proc. 20th Annual Symposium on Computer Architecture (May):302-313. Google Scholar
Stenstrom, P., T. Joe, and A. Gupta. 1992. Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures. Proc. 19th Int'l Symposium on Computer Architecture (May):80-91.
Stets, R., S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and M. Scott. 1997. Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network. Proc. 16th ACM Symposium on Operating Systems Principles (October).
Stone, H. S. 1970. A Logic-in-Memory Computer. IEEE Transactions on Computers C-19(1):73-78. Google ScholarDigital Library
Stunkel, C. B., D. G. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao. 1994. The SP-1 High Performance Switch. Proc. Scalable High Performance Computing Conference, Knoxville, TN (May):150-157.
Stunkel, C. B., et al. 1998 (date accessed). The SP2 Communication Subsystem . http://ibm.tc.cornell.edu/ibm/pps/doc/css/css.ps.Google Scholar
SUN Microsystems. 1991. The SPARC Architecture Manual . #800-199-12. Version 8 (January). Mountain View, CA: SUN Microsystems.Google Scholar
Sunderam, V. S. 1990. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience 2(4):315-339. Google ScholarDigital Library
Sunderam, V. S., J. Dongarra, A. Geist, and R. Manchek. 1994. The PVM Concurrent Computing System: Evolution, Experiences, and Trends. Parallel Computing 20(4):531-547. Google ScholarDigital Library
Swan, R.J., A. Bechtolsheim, K.-W. Lai, and J. K. Ousterhout. 1977. The Implementation of the CM* Multi-Microprocessor. Proc. AFIPS Conference/National Computer Conference (46):645-655. Google Scholar
Swan, R. J., S. H. Fuller, and D. P. Siewiorek. 1977. CM*--A Modular, Multi-Microprocessor. Proc. AFIPS Conference/National Computer Conference (46):637-44.
Sweazey, P., and A.J. Smith. 1986. A Class of Compatible Cache Consistency Protocols and Their Support by the IEEE Futurebus. Proc. 13th Int'l Symposium on Computer Architecture (May):414-423. Google Scholar
Tamir, Y., and G. L. Frazier. 1988. High-Performance Multi-Queue Buffers for VLSI Communication Switches. Proc. 15th Annual Int'l Symposium on Computer Architecture , 343-354. Google Scholar
Tanenbaum, A. S., and A. S. Woodhull. 1997. Operating System Design and Implementation 2nd ed. Englewood Cliffs, NJ: Prentice Hall. Google Scholar
Tang, C. 1976. Cache Design in a Tightly Coupled Multiprocessor System. Proc. AFIPS Conference (June):749-753. Google Scholar
Teller, P. 1990. Translation-Lookaside Buffer Consistency. IEEE Computer 23(6):26-36. Google ScholarDigital Library
Thacker, C., L. Stewart, and E. Satterthwaite, Jr. 1988. Firefly: A Multiprocessor Workstation. IEEE Transactions on Computers 37(8):909-20.
Thapar, M., and B. Delagi. 1990. Stanford Distributed-Directory Protocol. IEEE Computer 23(6):78-80. Google ScholarDigital Library
Thekkath, R., A. P. Singh, J. P. Singh, J. Hennessy and S. John. 1997. An Application-Driven Evaluation of the Convex Exemplar SP-1200. Proc. Int'l Parallel Processing Symposium (June).Google Scholar
Thompson, M., J. Barton, T. Jermoluk, and J. Wagner. 1988. Translation Lookaside Buffer Synchronization in a Multiprocessor System. Proc. USENIX Technical Conference (February).Google Scholar
Thornton, J. E. 1964. Parallel Operation in the Control Data 6600. AFIPS Proc. Fall Joint Computer Conference, Part 2 26:33-40. Reprinted in Siewiorek, Bell, and Newell. 1982. Computer Structures: Principles and Examples. New York: McGraw-Hill.
Tomasulo, R. M. 1967. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development 11(1):25-33. Google ScholarDigital Library
Torrellas, J., M. S. Lam, and J. L. Hennessy. 1994. False Sharing and Spatial Locality in Multiprocessor Caches. IEEE Transactions on Computers 43(6):651-663. Google ScholarDigital Library
Transaction Processing Council. 1998. http://www.tpc.org
Traw, C., and J. Smith. 1991. A High-Performance Host Interface for ATM Networks. Proc. ACM SIGCOMM Conference (September):317-325. Google Scholar
Traylor, R., and D. Dunning. 1992. Routing Chip Set for Intel Paragon Parallel Supercomputer. Proc. Hot Chips '92 Symposium (August).Google Scholar
Tucker, L. W., and A. Mainwaring. 1994. CMMD: Active Messages on the CM-5. Parallel Computing 20(4):481-496. Google ScholarDigital Library
Tucker, L. W., and G. G. Robertson. 1988. Architecture and Applications of the Connection Machine. IEEE Computer 21(8):26-38. Google ScholarDigital Library
Tucker, S. 1986. The IBM 3090 System: An Overview. IBM Systems Journal 25(1):4-19. Google ScholarDigital Library
Tullsen, D. M., and S.J. Eggers. 1993. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. Proc. 20th Annual Symposium on Computer Architecture (May):278-288. Google Scholar
Tullsen, D. M., S.J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. 1996. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. 23rd Int'l Symposium on Computer Architecture (May): 191-202. Google Scholar
Tullsen, D. M., S.J. Eggers, and H. M. Levy. 1995. Simultaneous Multithreading: Maximizing On-Chip Parallelism. Proc. 20th Annual Symposium on Computer Architecture (June):278-288.
Turner, J. S. 1988. Design of a Broadcast Packet Switching Network. IEEE Transactions on Communication 36(6):734-743.Google ScholarCross Ref
Valiant, L. G. 1990. A Bridging Model for Parallel Computation. Communications of the ACM 33(8):103-111. Google ScholarDigital Library
Valois, J. 1995. Lock-Free Linked Lists Using Compare-and-Swap. Proc. 14th Annual ACM Symposium on Principles of Distributed Computing , Ottawa, Canada (August):214-222. Google Scholar
Vick, C. R., and J. A. Cornell. 1978. PEPE Architecture--Present and Future. Proc. AFIPS Conference 47:981-1002.
von Eicken, T., A. Basu, and V. Buch. 1995. Low-Latency Communication Over ATM Using Active Messages. IEEE Micro 15(1):46-53.
von Eicken, T., D. E. Culler, S. C. Goldstein, and K. E. Schauser. 1992. Active Messages: A Mechanism for Integrated Communication and Computation. Proc. 19th Annual Int'l Symposium on Computer Architecture, Gold Coast, Australia (May):256-266.
Vranesic, Z., M. Stumm, D. Lewis, and R. White. 1991. Hector: A Hierarchically Structured Shared Memory Multiprocessor. IEEE Computer 24(1):72-78.
Wall, D. W. 1991. Limits of Instruction-Level Parallelism. ASPLOS IV (April):176-188. Google Scholar
Wallach, D. A. 1992. PHD: A Hierarchical Cache Coherence Protocol . S.M. thesis. Massachusetts Institute of Technology. Also available as Tech. Report #1389. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, MA (August).Google Scholar
Wang, W-H., J.-L. Baer and H. M. Levy. 1989. Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy. Proc. 16th Annual Int'l Symposium on Computer Architecture (June):140-148. Google Scholar
Warren, M. S., and J. K. Salmon. 1993. A Parallel Hashed Oct-Tree N-body Algorithm. Proc. Supercomputing '93 . Washington, DC: IEEE Computer Society, 12-21. Google Scholar
Weaver, D., and T. Germond, eds. 1994. The SPARC Architecture Manual . SPARC International, Version 9. Englewood Cliffs, NJ: Prentice Hall. Google Scholar
Weber, W.-D. 1993. Scalable Directories for Cache-Coherent Shared-Memory Multiprocessors. Ph.D. diss., Computer Systems Laboratory, Stanford University (January). Also available as Tech. Report #CSL-TR-93-557. Stanford University.
Weber, W.-D., S. Gold, P. Helland, T. Shimizu, T. Wicki and W. Wilcke. 1997. The Mercury Interconnect Architecture: A Cost-Effective Infrastructure for High-Performance Servers. Proc. 24th Int'l Symposium on Computer Architecture (June):98-107. Google Scholar
Weiss, S. and J. Smith. 1994. Power and PowerPC . San Francisco: Morgan Kaufmann. Google Scholar
Widdoes, L., Jr., and S. Correll. 1980. The S-1 Project: Developing High Performance Computers. Proc. COMPCON (Spring):282-291.Google Scholar
Wilson, A., Jr. 1987. Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors. Proc. 14th Int'l Symposium on Computer Architecture (June):244-252. Google Scholar
Wolf, M. E., and M. S. Lam. 1991. A Data Locality Optimizing Algorithm. Proc. ACM SIGPLAN'91 Conference on Programming Language Design and Implementation (June):30-44. Google Scholar
Wolfe, M. 1989. Optimizing Supercompilers for Supercomputers. Cambridge, MA: MIT Press.
Woo, S. C., J. P. Singh, and J. L. Hennessy. 1994. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. Proc. 6th Int'l Conference on Architectural Support for Programming Languages and Operating Systems (October):219-229, San Jose, CA. Google Scholar
Woo, S. C., M. Ohara, E. J. Torrie, J. P. Singh, and A. Gupta. 1995. The SPLASH-2 Programs: Characterization and Methodological Considerations. Proc. 22nd Annual Int'l Symposium on Computer Architecture (June):24-36.
Wood, D. A., S. Chandra, B. Falsafi, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, S. S. Mukherjee, S. Palacharla, and S. K. Reinhardt. 1993. Mechanisms for Cooperative Shared Memory. Proc. 20th Annual Symposium on Computer Architecture (May):156-167. Google Scholar
Wood, D. A., S. J. Eggers, G. Gibson, M. D. Hill, J. M. Pendleton, S. A. Ritchie, G. S. Taylor, R. H. Katz, and D. A. Patterson. 1986. An In-Cache Address Translation Mechanism. Proc. 13th Annual Symposium on Computer Architecture (June):358-365. Google Scholar
Wood, D. A., and M. D. Hill. 1995. Cost-Effective Parallel Computing. IEEE Computer 28(2):69-72. Google ScholarDigital Library
Woodbury, P., A. Wilson, B. Shein, I. Gertner, P. Y. Chen, J. Bartlett, and Z. Aral. 1989. Shared Memory Multiprocessors: The Right Approach to Parallel Processing. Proc. COMPCON (Spring):72-80.
Wulf, W., R. Levin, and C. Pierson. 1975. Overview of the Hydra Operating System Development. Proc. 5th Symposium on Operating Systems Principles (November):122-131.
Yamashita, N., T. Kimura, Y. Fujita, Y. Aimoto, T. Manaba, S. Okazaki, K. Nakamura, and M. Yamashina. 1994. A 3.84GIPS Integrated Memory Array Processor LSI with 64 Processing Elements and 2Mb SRAM. Int'l Solid-State Circuits Conference, San Francisco (February):260-261.
Zekauskas, M. J., W. A. Sawdon, and B. N. Bershad. 1994. Software Write Detection for a Distributed Shared Memory. Proc. Operating Systems Design and Implementation Symposium (November):87-100.
Zhang, Z., and J. Torrellas. 1995. Speeding Up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching. Proc. 22nd Annual Symposium on Computer Architecture (May):188-199.
Zhou, Y., L. Iftode, and K. Li. 1996. Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems. Proc. Operating Systems Design and Implementation Symposium (October).

Contributors

David E Culler
Google LLC
Jaswinder Pal Singh
Princeton University
Anoop Gupta
Microsoft Research

Recommendations

Architecture of the VPP500 parallel supercomputer
Supercomputing '94: Proceedings of the 1994 ACM/IEEE conference on Supercomputing

The VPP500 vector parallel processor is a highly parallel, distributed memory supercomputer that has a performance range of 6.4 to 355 gigaFLOPS and a main memory capacity from 1 to 222 gigabytes. The system scalably supports between 4 and 222 ...

The NYU Ultracomputer: Designing an MIMD Shared Memory Parallel Computer

We present the design for the NYU Ultracomputer, a shared-memory MIMD parallel machine composed of thousands of autonomous processing elements. This machine uses an enhanced message switching network with the geometry of an Omega-network to approximate ...

A universal parallel computer architecture

Advances in interconnection network performance and interprocessor interaction mechanisms enable the construction of fine-grain parallel computers in which the nodes are physically small and have a small amount of memory. This class of machines ...
