Instruction fetching: coping with code bloat

Authors:

Stuart Sechrest,

Joel Emer

ACM SIGARCH Computer Architecture News, Volume 23, Issue 2

Pages 345 - 356

https://doi.org/10.1145/225830.224445

Published: 01 May 1995

Abstract

Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads.

This paper then reexamines a collection of previously-proposed hardware mechanisms for improving instruction-fetch performance in the context of the IBS workloads. We study the impact of cache organization, transfer bandwidth, prefetching, and pipelined memory systems on machines that rely on the use of relatively small primary instruction caches to facilitate increased clock rates. We find that, although of little use for SPEC, the right combination of these techniques substantially benefits IBS. Even so, under IBS, a stubborn lower bound on the instruction-fetch CPI remains as an obstacle to improving overall processor performance.
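As a rough illustration of the two quantities the abstract refers to, the sketch below runs a toy trace-driven simulation of a direct-mapped instruction cache and converts the resulting miss count into an instruction-fetch CPI contribution. It is not the authors' simulator: the function names, the cache and line sizes, the 10-cycle miss penalty, and the synthetic address traces are all assumptions chosen for the example, and the CPI formula is the simple "misses per instruction times miss penalty" approximation.

```python
def simulate_direct_mapped(trace, cache_bytes=8192, line_bytes=32):
    """Return (misses, references) for a direct-mapped cache over an address trace.

    All sizes are hypothetical defaults, not values taken from the paper.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines           # one stored tag per cache line
    misses = 0
    references = 0
    for addr in trace:
        block = addr // line_bytes      # memory block holding this instruction
        index = block % num_lines       # cache line the block maps to
        tag = block // num_lines        # identifies which block occupies the line
        references += 1
        if tags[index] != tag:          # miss: fetch the block and replace the line
            misses += 1
            tags[index] = tag
    return misses, references


def fetch_cpi(misses, instructions, miss_penalty_cycles):
    """Instruction-fetch stall cycles per instruction (simple approximation)."""
    return misses / instructions * miss_penalty_cycles


# Synthetic traces of 4-byte instructions, each loop executed 10 times.
small_loop = [4 * i for i in range(64)] * 10    # 256-byte footprint
large_loop = [4 * i for i in range(512)] * 10   # 2048-byte footprint

# With an assumed 1 KB cache, the small loop fits and the large loop does not.
small_result = simulate_direct_mapped(small_loop, cache_bytes=1024, line_bytes=32)
large_result = simulate_direct_mapped(large_loop, cache_bytes=1024, line_bytes=32)

print(small_result, fetch_cpi(*small_result, miss_penalty_cycles=10))  # (8, 640) 0.125
print(large_result, fetch_cpi(*large_result, miss_penalty_cycles=10))  # (640, 5120) 1.25
```

In this toy setup, the loop whose footprint exceeds the cache misses on every line on every pass, so its fetch CPI contribution is ten times that of the loop that fits. That is the general effect a larger code footprint has on a small primary instruction cache, shown here only for synthetic traces.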


Index Terms

Instruction fetching: coping with code bloat
1. Hardware
2. Software and its engineering
  1. Software creation and management
    1. Designing software


Published In

ACM SIGARCH Computer Architecture News Volume 23, Issue 2

Special Issue: Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95)

May 1995

412 pages

ISSN:0163-5964

DOI:10.1145/225830

Chairman:
David A. Patterson
Univ. of California, Berkeley

ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture
July 1995
426 pages
ISBN:0897916980
DOI:10.1145/223982
Chairman:
David A. Patterson
Univ. of California, Berkeley

Copyright © 1995 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

