Kismet: parallel speedup estimates for serial programs

Published: 22 October 2011

Abstract

Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs from previous approaches in that it does not require any manual analysis or modification of the program. This difference allows quick analysis of many programs, avoiding wasted engineering effort on those that are fundamentally limited. To accomplish this task, Kismet builds upon the hierarchical critical path analysis (HCPA) technique, a recently developed dynamic analysis that localizes parallelism to each of the potentially nested regions in the target program. It then uses a parallel execution time model to compute an approximate upper bound for performance, modeling constraints that stem from both hardware parameters and internal program structure.
Our evaluation applies Kismet to eight high-parallelism NAS Parallel Benchmarks running on a 32-core AMD multicore system, five low-parallelism SpecInt benchmarks, and six medium-parallelism benchmarks running on the fine-grained MIT Raw processor. The results are compelling: Kismet significantly improves the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.
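
To make the idea of a region-based speedup bound concrete, the following is a minimal sketch and not Kismet's actual model. It assumes each profiled region reports a serial work figure and a critical path length, that the listed regions do not overlap, and that a region can run no faster than either the core count or its own parallelism (work divided by critical path) allows. The `Region` class, `region_time`, and `estimated_speedup` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """A profiled, non-overlapping program region (hypothetical structure)."""
    name: str
    work: float           # serial time spent inside the region
    critical_path: float  # longest dependence chain inside the region


def region_time(region: Region, cores: int) -> float:
    """Best-case time for one region: limited by cores and by its parallelism."""
    parallelism = region.work / region.critical_path
    return region.work / min(cores, parallelism)


def estimated_speedup(serial_time: float, regions: List[Region], cores: int) -> float:
    """Approximate upper bound on whole-program speedup (assumed model).

    Code outside the listed regions is treated as strictly serial.
    """
    serial_rest = serial_time - sum(r.work for r in regions)
    parallel_time = serial_rest + sum(region_time(r, cores) for r in regions)
    return serial_time / parallel_time


if __name__ == "__main__":
    regions = [
        Region("loop_a", work=600.0, critical_path=60.0),   # parallelism 10
        Region("loop_b", work=300.0, critical_path=150.0),  # parallelism 2
    ]
    for cores in (4, 32):
        print(cores, estimated_speedup(1000.0, regions, cores))
```

In this sketch, moving from 4 to 32 cores helps only `loop_a`, because `loop_b` is capped by its own critical path. That is the kind of program-structure limit a plain core-count scaling argument would miss.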

Published In

ACM SIGPLAN Notices, Volume 46, Issue 10
OOPSLA '11
October 2011
1063 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2076021
  • OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
    October 2011
    1104 pages
    ISBN:9781450309400
    DOI:10.1145/2048066
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. expressible self-parallelism
  2. hierarchical critical path analysis
  3. parallel software engineering
  4. performance estimation

Qualifiers

  • Research-article
