research-article
Open access

Contech: Efficiently Generating Dynamic Task Graphs for Arbitrary Parallel Programs

Published: 08 July 2015
    Abstract

    Parallel programs can be characterized by task graphs encoding instructions, memory accesses, and the parallel work’s dependencies, while representing any threading library and architecture. This article presents Contech, a high-performance framework for generating dynamic task graphs from arbitrary parallel programs, and a novel representation enabling programmers and compiler optimizations to understand and exploit program aspects. The Contech framework supports a variety of languages (including C, C++, and Fortran), parallelization libraries, and ISAs (including x86 and ARM). Running natively for collection speed and minimizing program perturbation, the instrumentation shows 4× improvement over a Pin-based implementation on PARSEC and NAS benchmarks.

    Supplementary Material

    TACO1202-25 (taco1202-25.pdf)
    Slide deck associated with this paper

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 2
    July 2015
    410 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2775085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 July 2015
    Accepted: 01 May 2015
    Revised: 01 May 2015
    Received: 01 January 2015
    Published in TACO Volume 12, Issue 2

    Author Tags

    1. instrumentation
    2. parallel program modeling
    3. task graph

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) CADSS: Computer Architecture Design Simulator for Students. Proceedings of the Workshop on Computer Architecture Education, 34-40. DOI: 10.1145/3605507.3610626. Online publication date: 17-Jun-2023
    • (2021) Automatic Optimization and Allocation of Data Using Q-Learning Technique. Advances in Smart Grid and Renewable Energy, 643-649. DOI: 10.1007/978-981-15-7511-2_65. Online publication date: 5-Jan-2021
    • (2020) S4oC: A Self-Optimizing, Self-Adapting Secure System-on-Chip Design Framework to Tackle Unknown Threats: A Network Theoretic, Learning Approach. 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 1-8. DOI: 10.1109/ISCAS45731.2020.9180687. Online publication date: Oct-2020
    • (2020) A comparison of different metaheuristics for the quadratic assignment problem in accelerated systems. Applied Soft Computing, 106927. DOI: 10.1016/j.asoc.2020.106927. Online publication date: Nov-2020
    • (2019) Self-Optimizing and Self-Programming Computing Systems: A Combined Compiler, Complex Networks, and Machine Learning Approach. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1-12. DOI: 10.1109/TVLSI.2019.2897650. Online publication date: 2019
    • (2019) Fine-Grained Communication-Aware Task Scheduling Approach for Acyclic and Cyclic Applications on MPSoCs. IEEE Access 7, 54372-54389. DOI: 10.1109/ACCESS.2019.2911653. Online publication date: 2019
    • (2018) Prometheus: Processing-in-memory heterogeneous architecture design from a multi-layer network theoretic strategy. 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1387-1392. DOI: 10.23919/DATE.2018.8342229. Online publication date: Mar-2018
    • (2018) Fast and Accurate Performance Analysis of Synchronization. Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, 31-40. DOI: 10.1145/3178442.3178446. Online publication date: 24-Feb-2018
    • (2018) Towards Cross-Framework Workload Analysis via Flexible Event-Driven Interfaces. 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 169-178. DOI: 10.1109/ISPASS.2018.00030. Online publication date: Apr-2018
    • (2018) A high-level model for exploring multi-core architectures. Parallel Computing 80, 23-35. DOI: 10.1016/j.parco.2018.10.006. Online publication date: Dec-2018
