DOI: 10.5555/3291168.3291207

wPerf: generic Off-CPU analysis to identify bottleneck waiting events

Published: 08 October 2018

Abstract

This paper aims to identify waiting events that limit the maximum throughput of a multithreaded application. To achieve this goal, we need to understand not only an event's impact on the threads waiting for it (i.e., local impact), but also whether that impact can reach other threads involved in request processing (i.e., global impact).
To address these challenges, wPerf computes the local impact of a waiting event with a technique called cascaded re-distribution; more importantly, wPerf builds a wait-for graph to compute whether such impact can indirectly reach other threads. By combining these two techniques, wPerf essentially tries to identify events with large impacts on all threads.
We apply wPerf to a number of open-source multithreaded applications. Following wPerf's guidance, we improve their throughput by up to 4.83×. Recording waiting events at runtime adds about 5.1% overhead on average.
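The idea of a wait-for graph can be illustrated with a small sketch. This is not wPerf's implementation and it does not reproduce cascaded re-distribution; it only shows, under simplifying assumptions, how recorded waiting events could be turned into a graph and how one might ask which threads a given thread or device can delay indirectly. The event tuples, the thread names, and the helper functions `build_wait_for_graph`, `reachable_from`, and `affected_by` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical trace records: (waiting_thread, waited_for, seconds_waited).
# Real tools would record these at runtime; the values here are made up.
events = [
    ("worker-1", "logger", 0.40),
    ("worker-2", "logger", 0.35),
    ("logger", "disk", 0.70),
    ("worker-1", "worker-2", 0.05),
]


def build_wait_for_graph(events):
    """Sum waiting time on each (waiter -> waited-for) edge."""
    graph = defaultdict(lambda: defaultdict(float))
    for waiter, target, seconds in events:
        graph[waiter][target] += seconds
    return graph


def reachable_from(graph, start):
    """Nodes that `start` waits for, directly or through other waiters."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, {}):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def affected_by(graph, target):
    """Nodes whose progress `target` can delay, directly or indirectly."""
    nodes = set(graph) | {t for edges in graph.values() for t in edges}
    return {n for n in nodes
            if n != target and target in reachable_from(graph, n)}


graph = build_wait_for_graph(events)
print(sorted(affected_by(graph, "disk")))      # ['logger', 'worker-1', 'worker-2']
print(sorted(affected_by(graph, "worker-2")))  # ['worker-1']
```

In this toy trace, waiting on `disk` reaches every thread through `logger`, while waiting on `worker-2` affects only `worker-1`. That difference is the kind of local-versus-global distinction the abstract describes.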

Cited By

  • (2019) Profile-based Detection of Layered Bottlenecks. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, pp. 197-208. DOI: 10.1145/3297663.3310296. Online publication date: 4-Apr-2019

    Information

    Published In

    OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation
    October 2018
    815 pages
    ISBN:9781931971478

    Sponsors

    • NetApp
    • Google Inc.
    • NSF
    • Microsoft
    • Facebook

    Publisher

    USENIX Association

    United States

    Qualifiers

    • Article
