TAPE: a transactional application profiling environment

Published: 20 June 2005

Abstract

    Transactional Coherence and Consistency (TCC) provides a new parallel programming model that uses transactions as the basic unit of parallel work and communication. TCC simplifies the development of correct parallel code because hardware provides transaction atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still optimize the parallel code for performance.

    This paper presents TAPE, a hardware and software infrastructure for profiling in TCC systems. TAPE extends the hardware for transactional execution to identify performance impediments such as dependence violations, buffer overflows, and work imbalance. It filters infrequent events to reduce resource requirements and allows the programmer to focus on the most important bottlenecks. We demonstrate that TAPE introduces minimal die area and performance overhead and can be used continuously, even for production runs. Moreover, we demonstrate how to leverage the profiling information to guide optimization for a set of parallel applications. TAPE accurately identifies the source code location and type of the most important bottlenecks, allowing a programmer to achieve maximum parallel speedup with a few profiling steps.
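
    The idea of filtering infrequent events so that only the most important bottlenecks remain can be illustrated in software. The sketch below is not TAPE's hardware mechanism, which the abstract does not detail; it is a minimal illustration under assumed names. The `Event` record, its fields, the `top_bottlenecks` function, and the `min_share` threshold are all hypothetical. It aggregates lost cycles per source location and event type, drops entries below a share of the total, and reports the largest few.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    """One profiling event. All field names are hypothetical."""
    location: str     # source code location, e.g. "solver.c:120"
    kind: str         # "violation", "overflow", or "imbalance"
    lost_cycles: int  # cycles attributed to this event


def top_bottlenecks(
    events: Iterable[Event], k: int = 3, min_share: float = 0.05
) -> List[Tuple[Tuple[str, str], int]]:
    """Sum lost cycles per (location, kind), filter out infrequent
    entries, and return at most k entries, largest first."""
    totals: Counter = Counter()
    for ev in events:
        totals[(ev.location, ev.kind)] += ev.lost_cycles

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    # most_common() yields entries sorted by descending cycle count.
    kept = [
        (key, cycles)
        for key, cycles in totals.most_common()
        if cycles / grand_total >= min_share
    ]
    return kept[:k]


if __name__ == "__main__":
    sample = [
        Event("a.c:10", "violation", 500),
        Event("a.c:10", "violation", 300),
        Event("b.c:20", "overflow", 150),
        Event("c.c:30", "imbalance", 10),
    ]
    for (location, kind), cycles in top_bottlenecks(sample):
        print(f"{location} {kind}: {cycles} cycles lost")
```

    With the sample data, the `c.c:30` entry accounts for about 1% of lost cycles and falls below the assumed 5% threshold, so only the two larger entries are reported.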



    Published In

    ICS '05: Proceedings of the 19th annual international conference on Supercomputing
    June 2005
    414 pages
    ISBN:1595931678
    DOI:10.1145/1088149
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Article

    Conference

    ICS05: International Conference on Supercomputing 2005
    June 20 - 22, 2005
    Cambridge, Massachusetts

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Cited By

    • (2013) But how do we really debug transactional memory programs? Proceedings of the 5th USENIX Conference on Hot Topics in Parallelism, pages 9-9. DOI: 10.5555/3241639.3241648. Online publication date: 24-Jun-2013
    • (2013) Evaluation of two formulations of the conjugate gradients method with transactional memory. Proceedings of the 19th international conference on Parallel Processing, pages 508-520. DOI: 10.1007/978-3-642-40047-6_52. Online publication date: 26-Aug-2013
    • (2012) Transactional event profiling in a best-effort hardware transactional memory system. Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pages 475-476. DOI: 10.1145/2370816.2370904. Online publication date: 19-Sep-2012
    • (2012) A Low-Overhead Profiling and Visualization Framework for Hybrid Transactional Memory. Proceedings of the 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pages 1-8. DOI: 10.1109/FCCM.2012.11. Online publication date: 29-Apr-2012
    • (2011) Profiling and Optimizing Transactional Memory Applications. International Journal of Parallel Programming 40:1, pages 25-56. DOI: 10.1007/s10766-011-0177-2. Online publication date: 28-Jul-2011
    • (2010) Transactional Memory, 2nd edition. Synthesis Lectures on Computer Architecture 5:1, pages 1-263. DOI: 10.2200/S00272ED1V01Y201006CAC011. Online publication date: 22-Dec-2010
    • (2010) Discovering and understanding performance bottlenecks in transactional applications. Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 285-294. DOI: 10.1145/1854273.1854311. Online publication date: 11-Sep-2010
    • (2010) Debugging programs that use atomic blocks and transactional memory. ACM SIGPLAN Notices 45:5, pages 57-66. DOI: 10.1145/1837853.1693463. Online publication date: 9-Jan-2010
    • (2010) Debugging programs that use atomic blocks and transactional memory. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 57-66. DOI: 10.1145/1693453.1693463. Online publication date: 9-Jan-2010
    • (2008) Dependence-aware transactional memory for increased concurrency. Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, pages 246-257. DOI: 10.5555/1521747.1521799. Online publication date: 8-Nov-2008
