Article

wPerf: generic Off-CPU analysis to identify bottleneck waiting events

Authors:

Yang Wang

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation

Pages 527 - 543

Published: 08 October 2018

Abstract

This paper tries to identify waiting events that limit the maximum throughput of a multithreaded application. To achieve this goal, we need to understand not only an event's impact on the threads waiting for it (i.e., local impact), but also whether that impact can reach other threads that are involved in request processing (i.e., global impact).

To address these challenges, wPerf computes the local impact of a waiting event with a technique called cascaded re-distribution; more importantly, wPerf builds a wait-for graph to compute whether such impact can indirectly reach other threads. By combining these two techniques, wPerf essentially tries to identify events with large impacts on all threads.
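As a rough illustration of the wait-for graph idea only, the sketch below aggregates waiting events into weighted edges and then asks which threads wait, directly or transitively, for a given thread or resource. It is not wPerf's algorithm: cascaded re-distribution is not modelled, and the event format, the function names (`build_wait_for_graph`, `threads_affected_by`), the thread names, and the durations are all assumptions made for this example.

```python
from collections import defaultdict


def build_wait_for_graph(events):
    """Sum waiting time per (waiter -> waitee) edge."""
    graph = defaultdict(lambda: defaultdict(float))
    for waiter, waitee, seconds in events:
        graph[waiter][waitee] += seconds
    return graph


def threads_affected_by(graph, target):
    """Return all threads that wait for `target` directly or transitively."""
    # Reverse the edges so we can walk from a waitee back to its waiters.
    waiters_of = defaultdict(set)
    for waiter, edges in graph.items():
        for waitee in edges:
            waiters_of[waitee].add(waiter)

    affected, stack = set(), [target]
    while stack:
        node = stack.pop()
        for waiter in waiters_of.get(node, ()):
            if waiter not in affected:
                affected.add(waiter)
                stack.append(waiter)
    return affected


# Hypothetical trace; thread names and durations are invented.
events = [
    ("worker-1", "logger", 4.0),
    ("worker-2", "logger", 3.5),
    ("logger", "disk", 6.0),
]
graph = build_wait_for_graph(events)
print(sorted(threads_affected_by(graph, "disk")))
# prints ['logger', 'worker-1', 'worker-2']
```

In this toy trace the two workers never wait on the disk themselves, yet both are reached through the logger thread, which is the kind of indirect (global) impact the abstract refers to.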

We apply wPerf to a number of open-source multithreaded applications. By following wPerf's guidance, we are able to improve their throughput by up to 4.83×. The overhead of recording waiting events at runtime is about 5.1% on average.

Cited By

Inagaki T, Ueda Y, Nakaike T, Ohara M, Apte V, Di Marco A, Litoiu M, Merseguer J (2019). Profile-based Detection of Layered Bottlenecks. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, pp. 197-208. doi:10.1145/3297663.3310296. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297663.3310296

Published In

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation

October 2018

815 pages

ISBN:9781931971478

Program Chairs:
Andrea Arpaci-Dusseau, University of Wisconsin-Madison
Geoff Voelker, University of California, San Diego

Sponsors

NetApp
Google Inc.
NSF
Microsoft
Facebook

In-Cooperation

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

USENIX Association

United States
