Article

TAPE: a transactional application profiling environment

Authors:

Austen McDonald,

Brian D. Carlstrom,

JaeWoong Chung,

Christos Kozyrakis,

Kunle Olukotun

ICS '05: Proceedings of the 19th annual international conference on Supercomputing

Pages 199 - 208

https://doi.org/10.1145/1088149.1088176

Published: 20 June 2005

Abstract

Transactional Coherence and Consistency (TCC) provides a new parallel programming model that uses transactions as the basic unit of parallel work and communication. TCC simplifies the development of correct parallel code because hardware provides transaction atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still optimize the parallel code for performance.

This paper presents TAPE, a hardware and software infrastructure for profiling in TCC systems. TAPE extends the hardware for transactional execution to identify performance impediments such as dependence violations, buffer overflows, and work imbalance. It filters infrequent events to reduce resource requirements and allows the programmer to focus on the most important bottlenecks. We demonstrate that TAPE introduces minimal die area and performance overhead and can be used continuously, even for production runs. Moreover, we demonstrate how to leverage the profiling information to guide optimization for a set of parallel applications. TAPE accurately identifies the source code location and type of the most important bottlenecks, allowing a programmer to achieve maximum parallel speedup with a few profiling steps.
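The filtering idea in the abstract can be illustrated with a small software model. The sketch below is not TAPE's hardware mechanism, which the abstract does not detail; it only assumes that each profiling event carries a type, a source location, and a count of lost cycles. The `Event` record, the `top_bottlenecks` function, the `min_cycles` threshold, and the sample file names are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical profiling record; field names are assumptions."""
    kind: str          # "violation", "overflow", or "imbalance"
    location: str      # source location, e.g. "solver.c:88"
    cycles_lost: int   # cycles attributed to this event


def top_bottlenecks(events, min_cycles=100, limit=3):
    """Drop cheap events, then rank (kind, location) pairs by total lost cycles."""
    totals = Counter()
    for e in events:
        if e.cycles_lost < min_cycles:
            continue  # filter out events too small to matter
        totals[(e.kind, e.location)] += e.cycles_lost
    return totals.most_common(limit)


# Made-up sample data, for illustration only.
events = [
    Event("violation", "solver.c:88", 5000),
    Event("violation", "solver.c:88", 4000),
    Event("overflow", "init.c:12", 2500),
    Event("violation", "util.c:7", 20),
    Event("imbalance", "main.c:40", 1200),
]

for (kind, location), cycles in top_bottlenecks(events):
    print(f"{kind:10s} {location:14s} {cycles}")
```

Ranking by accumulated cost rather than by event count is one possible design choice: it keeps a rare but expensive violation visible while discarding frequent but cheap ones.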

Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing

June 2005

414 pages

ISBN:1595931678

DOI:10.1145/1088149

General Chair: Arvind (MIT)

Program Chair: Larry Rudolph (MIT)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS05: International Conference on Supercomputing 2005

June 20 - 22, 2005

Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
