The effectiveness of multiple hardware contexts

Authors:

Radhika Thekkath,

Susan J. Eggers

ACM SIGOPS Operating Systems Review, Volume 28, Issue 5

Pages 328-337

https://doi.org/10.1145/381792.195583

Published: 01 November 1994

Abstract

Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time.

The usefulness of multiple hardware contexts depends on program data locality, cache organization, and degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped as the number of contexts increased, across widely varying architectures. With unoptimized applications, multiple contexts had limited value. The best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications changed more noticeably than that of the optimized programs with variations in cache associativity and cache hierarchy.

As a mechanism for exploiting program parallelism, an additional processor is clearly better than another context. However, there were many configurations for which the addition of a few hardware contexts brought as much or greater performance than a larger multiprocessor with fewer than the optimal number of contexts.
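
The trade-off described in the abstract (extra contexts hide memory latency, but their larger combined working set can add cache conflict misses) can be illustrated with a first-order utilization model of the kind used in the analytical multithreading literature. This is only a sketch under stated assumptions, not the simulation methodology of the paper: the names `run_length`, `latency`, and `switch_cost` (all in cycles) and the interference factor `alpha` are hypothetical, and utilization is used here as a rough proxy for the execution time the paper measures.

```python
def utilization(contexts, run_length, latency, switch_cost):
    """First-order processor utilization with `contexts` hardware contexts.

    Assumed model (not taken from the paper): each context runs for
    `run_length` cycles, then stalls for `latency` cycles on a miss;
    switching to another context costs `switch_cost` cycles.
    """
    if contexts < 1:
        raise ValueError("need at least one context")
    # Linear regime: too few contexts to cover the whole miss latency.
    linear = contexts * run_length / (run_length + switch_cost + latency)
    # Saturated regime: latency fully hidden, only switch overhead remains.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


def run_length_with_interference(base_run_length, contexts, alpha):
    """Hypothetical interference term: each extra context adds cache
    conflicts, so the run length between misses shrinks.

    `alpha` is an illustrative knob, not a measured quantity.
    """
    return base_run_length / (1.0 + alpha * (contexts - 1))


def utilization_with_interference(contexts, base_run_length, latency,
                                  switch_cost, alpha):
    """Combine latency tolerance with the assumed cache interference."""
    r = run_length_with_interference(base_run_length, contexts, alpha)
    return utilization(contexts, r, latency, switch_cost)


if __name__ == "__main__":
    # Compare no interference (alpha=0) with strong interference (alpha=0.5).
    for n in (1, 2, 4, 8):
        ideal = utilization_with_interference(n, 20, 80, 5, alpha=0.0)
        hurt = utilization_with_interference(n, 20, 80, 5, alpha=0.5)
        print(f"contexts={n}  no interference={ideal:.2f}  alpha=0.5: {hurt:.2f}")
```

With `alpha` near zero (loosely, a program with good data locality) utilization in this model rises with each added context until it saturates; with a larger `alpha` the shrinking run length offsets part of the gain. This is qualitatively the same tension the abstract reports, though the numbers here carry no empirical weight.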

Published In

ACM SIGOPS Operating Systems Review Volume 28, Issue 5

Dec. 1994

323 pages

ISSN:0163-5980

DOI:10.1145/381792

Chairman:
Henry M. Levy (Univ. of Washington, Seattle)

ASPLOS VI: Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
November 1994
341 pages
ISBN:0897916603
DOI:10.1145/195473
Chairmen:
Forest Baskett (Silicon Graphics), Douglas Clark (Princeton Univ.)

Copyright © 1994 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
