DOI: 10.1145/3373376.3378455
research-article
Open access

FirePerf: FPGA-Accelerated Full-System Hardware/Software Performance Profiling and Co-Design

Published: 13 March 2020

Abstract

Achieving high performance when developing specialized hardware/software systems requires understanding and improving not only core compute kernels, but also intricate and elusive system-level bottlenecks. Profiling these bottlenecks requires both high-fidelity introspection and the ability to run sufficiently many cycles to execute complex software stacks, a challenging combination. In this work, we enable agile full-system performance optimization for hardware/software systems with FirePerf, a set of novel out-of-band system-level performance profiling capabilities integrated into the open-source FireSim FPGA-accelerated hardware simulation platform. Using out-of-band call stack reconstruction and automatic performance counter insertion, FirePerf enables introspecting into hardware and software at appropriate abstraction levels to rapidly identify opportunities for software optimization and hardware specialization, without disrupting end-to-end system behavior as traditional profiling tools do. We demonstrate the capabilities of FirePerf with a case study that optimizes the hardware/software stack of an open-source RISC-V SoC with an Ethernet NIC to achieve an 8x end-to-end improvement in achievable bandwidth for networking applications running on Linux. We also deploy a RISC-V Linux kernel optimization discovered with FirePerf on commercial RISC-V silicon, resulting in up to a 1.72x improvement in network performance.
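To make the idea of out-of-band call stack reconstruction more concrete, the following is a minimal Python sketch and not FirePerf's actual implementation. It assumes a hypothetical trace of `(cycle, pc, kind)` events captured outside the simulated system and a hypothetical symbol table (`SYMBOLS`); it rebuilds the call stack from call and return events and attributes elapsed cycles to each stack. The output uses the semicolon-separated folded-stack text format that flame graph tooling consumes.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [(0x1000, "main"), (0x2000, "net_rx"), (0x3000, "memcpy")]
STARTS = [addr for addr, _ in SYMBOLS]

# Hypothetical trace: (cycle, pc, kind). For a "call" event, pc is the
# first instruction of the callee; for "ret", pc is the return instruction.
TRACE = [
    (0, 0x1000, "call"),
    (10, 0x2000, "call"),
    (25, 0x3000, "call"),
    (40, 0x3010, "ret"),
    (45, 0x2020, "ret"),
    (50, 0x1030, "ret"),
]


def lookup(pc):
    """Map a program counter to the function whose start address precedes it."""
    i = bisect_right(STARTS, pc) - 1
    if i < 0:
        raise ValueError(f"no symbol covers pc {pc:#x}")
    return SYMBOLS[i][1]


def fold_stacks(trace, end_cycle):
    """Rebuild call stacks from the trace and count cycles spent in each one."""
    stack = []
    cycles = Counter()
    prev_cycle = None
    for cycle, pc, kind in trace:
        # Charge the cycles since the previous event to the stack that was live.
        if prev_cycle is not None and stack:
            cycles[";".join(stack)] += cycle - prev_cycle
        if kind == "call":
            stack.append(lookup(pc))
        elif kind == "ret" and stack:
            stack.pop()
        prev_cycle = cycle
    # Charge any remaining cycles to the stack still live at the end.
    if prev_cycle is not None and stack:
        cycles[";".join(stack)] += end_cycle - prev_cycle
    return cycles


def to_folded(cycles):
    """Render one 'frame;frame;frame count' line per unique stack."""
    return "\n".join(f"{stack} {count}" for stack, count in sorted(cycles.items()))


if __name__ == "__main__":
    print(to_folded(fold_stacks(TRACE, end_cycle=50)))
```

Because the trace is consumed entirely outside the target, nothing in this sketch runs on the profiled system, which is the property the abstract contrasts with traditional in-band profilers. A real trace would also need to handle interrupts, context switches, and tail calls, which this sketch ignores.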

Information & Contributors

Information

Published In

ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2020
1412 pages
ISBN:9781450371025
DOI:10.1145/3373376
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. agile hardware
  2. fpga-accelerated simulation
  3. hardware/software co-design
  4. network performance optimization
  5. performance profiling

Qualifiers

  • Research-article

Conference

ASPLOS '20

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 496
  • Downloads (Last 6 weeks): 82
Reflects downloads up to 09 Nov 2024

Citations

Cited By

  • (2024) CuMONITOR: Continuous Monitoring of Microarchitecture for Software Task Identification and Classification. Digital Threats: Research and Practice 5:3 (1-22). DOI: 10.1145/3652861. Online publication date: 28-Mar-2024
  • (2023) TEA: Time-Proportional Event Analysis. Proceedings of the 50th Annual International Symposium on Computer Architecture (1-13). DOI: 10.1145/3579371.3589058. Online publication date: 17-Jun-2023
  • (2023) Balancing Accuracy and Evaluation Overhead in Simulation Point Selection. 2023 IEEE International Symposium on Workload Characterization (IISWC) (43-53). DOI: 10.1109/IISWC59245.2023.00019. Online publication date: 1-Oct-2023
  • (2022) mu-grind. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (346-358). DOI: 10.1145/3559009.3569671. Online publication date: 8-Oct-2022
  • (2022) HeteroGen: transpiling C to heterogeneous HLS code with automated test generation and program repair. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (1017-1029). DOI: 10.1145/3503222.3507748. Online publication date: 28-Feb-2022
  • (2022) Debugging in the brave new world of reconfigurable hardware. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (946-962). DOI: 10.1145/3503222.3507701. Online publication date: 28-Feb-2022
  • (2022) HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 33:12 (4368-4382). DOI: 10.1109/TPDS.2022.3189390. Online publication date: 1-Dec-2022
  • (2021) TIP: Time-Proportional Instruction Profiling. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (15-27). DOI: 10.1145/3466752.3480058. Online publication date: 18-Oct-2021
  • (2021) A Non-Intrusive Tool Chain to Optimize MPSoC End-to-End Systems. ACM Transactions on Architecture and Code Optimization 18:2 (1-22). DOI: 10.1145/3445030. Online publication date: 9-Feb-2021
  • (2021) AIBench Scenario: Scenario-Distilling AI Benchmarking. 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT) (142-158). DOI: 10.1109/PACT52795.2021.00018. Online publication date: Sep-2021
