
An Analytical Cache Performance Evaluation Framework for Embedded Out-of-Order Processors Using Software Characteristics

Published: 09 August 2018

Abstract

Using analytical models to evaluate proposals or to guide high-level architecture decisions is becoming increasingly attractive. Over the last decade, a number of methods have emerged that model cache behavior and provide quantified insights, such as the stack distance theory and memory level parallelism (MLP) estimation. However, prior research has normally oversimplified the factors that need to be considered in out-of-order processors, such as the effects triggered by reordered memory instructions, multiple dependences among memory instructions, and accesses merged into the same MSHR entry. Ignoring these influences results in the low and unstable precision of recent analytical models.
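As background for the stack distance theory mentioned above, the following is a minimal sketch of how LRU stack distances can be turned into a miss count. It assumes a fully associative LRU cache and an in-order access trace, so it illustrates only the baseline theory and not the article's model, which additionally accounts for reordered memory instructions. The function names `stack_distances` and `lru_misses` are hypothetical.

```python
from collections import OrderedDict


def stack_distances(trace):
    """Return the LRU stack distance of each access in `trace`.

    The stack distance is the number of distinct blocks touched since the
    previous access to the same block. First-time (cold) accesses get None.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for block in trace:
        if block in stack:
            keys = list(stack.keys())
            # Blocks that sit above `block` in the LRU stack.
            distances.append(len(keys) - 1 - keys.index(block))
            stack.move_to_end(block)
        else:
            distances.append(None)
            stack[block] = True
    return distances


def lru_misses(trace, capacity):
    """Count misses of a fully associative LRU cache holding `capacity` blocks.

    An access misses if it is cold or if its stack distance is at least
    the cache capacity.
    """
    return sum(1 for d in stack_distances(trace) if d is None or d >= capacity)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "d", "a"]
    print(stack_distances(trace))
    print(lru_misses(trace, capacity=3))
```

A single pass over the trace yields the distance histogram, from which misses for any cache capacity can be derived without re-simulating; this is the property that makes stack distance attractive for fast design space exploration.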
By quantifying the aforementioned effects, this article proposes a cache performance evaluation framework equipped with three analytical models, which more accurately predict cache misses, MLPs, and the average cache miss service time, respectively. As in prior studies, these analytical models are all fed with profiled software characteristics, so the architecture evaluation process is significantly faster than cycle-accurate simulation.
We evaluate the accuracy of the proposed models against gem5 cycle-accurate simulations with 16 benchmarks chosen from Mobybench Suite 2.0, MiBench 1.0, and MediaBench II. The average root mean square errors for predicting cache misses, MLPs, and the average cache miss service time are around 4%, 5%, and 8%, respectively. Meanwhile, the average error of our framework in predicting the stall time due to cache misses is as low as 8%. The whole cache performance estimation is about 15 times faster than gem5 cycle-accurate simulations and about 4 times faster than recent studies. Furthermore, we use our models to show and study the relationships between different performance metrics and the reorder buffer size. As an application case, we also demonstrate how to combine our framework with McPAT to find Pareto optimal configurations in cache design space exploration.
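To make the notion of Pareto optimal configurations concrete, here is a minimal sketch of a Pareto filter over candidate cache configurations. It assumes each candidate is summarized by two objectives to be minimized, for example a predicted stall time and a power estimate; the tuple layout, the sample numbers, and the function name `pareto_front` are illustrative assumptions and do not come from the article.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is a (cost_a, cost_b) tuple and both costs are minimized.
    Point q dominates p if q is no worse in both costs and strictly
    better in at least one.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front


if __name__ == "__main__":
    # Hypothetical (stall_time, power) pairs for five cache configurations.
    candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
    print(pareto_front(candidates))
```

The pairwise comparison is quadratic in the number of candidates, which is usually acceptable when each candidate is evaluated by a fast analytical model rather than a full simulation.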



    Published In

    ACM Transactions on Embedded Computing Systems  Volume 17, Issue 4
    July 2018
    207 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/3236463
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 August 2018
    Accepted: 01 May 2018
    Revised: 01 March 2018
    Received: 01 August 2017
    Published in TECS Volume 17, Issue 4


    Author Tags

    1. Analytical models
    2. cache miss service time
    3. cache misses
    4. memory level parallelism
    5. software characteristics

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Chinese National Mega Project of Scientific Research
