
Whole-program optimization for time and space efficient threads

Published: 01 September 1996

Abstract

Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and to simplify program structure. Multitasking operating systems use threads to mask communication latency, either with hardware devices or with users. Client-server applications typically use threads to simplify the complex control flow that arises when multiple clients are served. Recently, the scientific computing community has started using threads to mask network communication latency in massively parallel architectures, allowing computation and communication to be overlapped. Lastly, some architectures implement threads in hardware and use those threads to tolerate memory latency.

In general, it would be desirable if threaded programs could be written to expose the largest degree of parallelism possible, or simply to simplify the program design. However, threads incur time and space overheads, and programmers often compromise simple designs for performance. In this paper, we show how to reduce the time and space overhead of threads using control-flow and register-liveness information inferred after compilation. Our techniques work on binaries, are not specific to a particular compiler or thread library, and reduce the overall execution time of fine-grain threaded programs by ≈ 15-30%. We use execution-driven analysis and an instrumented operating system to show why the execution time is reduced and to indicate areas for future work.
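The reduction in thread overhead rests on knowing, at each point where a thread may be suspended, which registers are actually live. As a rough illustration only (not the paper's implementation, which operates directly on compiled binaries), the following Python sketch computes standard backward register liveness over a small, hypothetical control-flow graph; a context switch placed in the "switch" block would then only need to preserve the registers reported live across that point. The block names, register names, and use/def sets are illustrative assumptions.

    # Minimal sketch: iterative backward liveness dataflow over a CFG.
    # live_in[b]  = use[b] | (live_out[b] - defs[b])
    # live_out[b] = union of live_in[s] over successors s of b
    from collections import defaultdict

    def liveness(blocks, succs, use, defs):
        live_in = defaultdict(set)
        live_out = defaultdict(set)
        changed = True
        while changed:
            changed = False
            for b in blocks:
                out = set()
                for s in succs.get(b, []):
                    out |= live_in[s]
                new_in = use[b] | (out - defs[b])
                if out != live_out[b] or new_in != live_in[b]:
                    live_out[b], live_in[b] = out, new_in
                    changed = True
        return live_in, live_out

    # Hypothetical three-block thread body: values defined before the
    # suspension point ("switch") and used after it are live across it.
    blocks = ["entry", "switch", "exit"]
    succs  = {"entry": ["switch"], "switch": ["exit"], "exit": []}
    use    = {"entry": set(),          "switch": set(), "exit": {"r9", "r10"}}
    defs   = {"entry": {"r9", "r10"},  "switch": set(), "exit": set()}

    live_in, live_out = liveness(blocks, succs, use, defs)
    print(sorted(live_out["switch"]))   # ['r10', 'r9'] -> only these need saving

The fixed point converges in a few passes on an acyclic graph like this one; real binaries require iterating over loops until the sets stabilize, which is what makes post-compilation analysis of the whole program worthwhile.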




Published In

ACM SIGOPS Operating Systems Review, Volume 30, Issue 5
Dec. 1996, 273 pages
ISSN: 0163-5980
DOI: 10.1145/248208
  • ASPLOS VII: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
    October 1996, 290 pages
    ISBN: 0897917677
    DOI: 10.1145/237090
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 1996
Published in SIGOPS Volume 30, Issue 5


Qualifiers

  • Article


Article Metrics

  • Downloads (last 12 months): 88
  • Downloads (last 6 weeks): 27
Reflects downloads up to 13 Sep 2024


Cited By

  • (2013) Compiler support for lightweight context switching. ACM Transactions on Architecture and Code Optimization, 9(4), 1-25. DOI: 10.1145/2400682.2400695. Online publication date: 20-Jan-2013.
  • (2018) Semi-Extended Tasks: Efficient Stack Sharing Among Blocking Threads. 2018 IEEE Real-Time Systems Symposium (RTSS), 338-349. DOI: 10.1109/RTSS.2018.00049. Online publication date: Dec-2018.
  • (2009) Eliminating the call stack to save RAM. ACM SIGPLAN Notices, 44(7), 60-69. DOI: 10.1145/1543136.1542461. Online publication date: 19-Jun-2009.
  • (2009) Eliminating the call stack to save RAM. Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, 60-69. DOI: 10.1145/1542452.1542461. Online publication date: 19-Jun-2009.
  • (2008) MTSS. ACM Transactions on Embedded Computing Systems, 7(4), 1-37. DOI: 10.1145/1376804.1376814. Online publication date: 1-Aug-2008.
  • (2007) Offline compression for on-chip RAM. ACM SIGPLAN Notices, 42(6), 363-372. DOI: 10.1145/1273442.1250776. Online publication date: 10-Jun-2007.
  • (2007) Offline compression for on-chip RAM. Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, 363-372. DOI: 10.1145/1250734.1250776. Online publication date: 15-Jun-2007.
  • (2005) MTSS. Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, 191-201. DOI: 10.1145/1086297.1086323. Online publication date: 24-Sep-2005.
  • (2004) Balancing register allocation across threads for a multithreaded network processor. ACM SIGPLAN Notices, 39(6), 289-300. DOI: 10.1145/996893.996876. Online publication date: 9-Jun-2004.
  • (2004) Balancing register allocation across threads for a multithreaded network processor. Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, 289-300. DOI: 10.1145/996841.996876. Online publication date: 9-Jun-2004.
