Instruction fetching: coping with code bloat

Published: 01 May 1995
  • Abstract

    Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes full account of both the application and operating-system components of the workloads.

    This paper then reexamines a collection of previously proposed hardware mechanisms for improving instruction-fetch performance in the context of the IBS workloads. We study the impact of cache organization, transfer bandwidth, prefetching, and pipelined memory systems on machines that rely on relatively small primary instruction caches to facilitate increased clock rates. We find that, although of little use for SPEC, the right combination of these techniques substantially benefits IBS. Even so, under IBS, a stubborn lower bound on the instruction-fetch CPI remains as an obstacle to improving overall processor performance.
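
    As a rough illustration of the quantities the abstract refers to, the Python sketch below runs a trace-driven simulation of a direct-mapped instruction cache and converts the resulting misses into an instruction-fetch CPI contribution. It is not the authors' simulator: the cache geometry (8 KB, 32-byte lines), the 10-cycle miss penalty, the synthetic looping trace, and all function names are assumptions chosen for illustration, and the CPI model is the simplest one possible (misses times penalty, divided by instructions).

```python
def simulate_direct_mapped(trace, cache_bytes=8192, line_bytes=32):
    """Return (misses, accesses) for a direct-mapped instruction cache.

    `trace` is an iterable of instruction byte addresses. The default
    geometry is an illustrative assumption, not a figure from the paper.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one tag per cache line; None = invalid
    misses = 0
    accesses = 0
    for addr in trace:
        block = addr // line_bytes     # memory block holding this instruction
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        accesses += 1
        if tags[index] != tag:         # cold or conflict/capacity miss
            misses += 1
            tags[index] = tag
    return misses, accesses


def fetch_cpi(misses, instructions, miss_penalty_cycles=10):
    """Stall cycles per instruction caused by instruction-cache misses.

    The 10-cycle penalty is an assumed value for illustration only.
    """
    return misses * miss_penalty_cycles / instructions


def loop_trace(start, code_bytes, iterations, instr_bytes=4):
    """Synthetic trace: sweep sequentially over a code region, repeatedly."""
    for _ in range(iterations):
        for offset in range(0, code_bytes, instr_bytes):
            yield start + offset


if __name__ == "__main__":
    # A 4 KB loop fits in the assumed 8 KB cache; a 16 KB loop does not.
    for code_bytes in (4096, 16384):
        misses, accesses = simulate_direct_mapped(loop_trace(0, code_bytes, 10))
        print(code_bytes, misses / accesses, fetch_cpi(misses, accesses))
```

    In this toy setup the loop that fits in the cache takes only cold misses, while the loop twice the cache size misses on every line in every iteration. That is the kind of gap between small-footprint and large-footprint code that the abstract describes, though real workloads are far less regular than this synthetic trace.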

    Published In

    ACM SIGARCH Computer Architecture News, Volume 23, Issue 2
    Special Issue: Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95)
    May 1995
    412 pages
    ISSN: 0163-5964
    DOI: 10.1145/225830

    ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture
    July 1995
    426 pages
    ISBN: 0897916980
    DOI: 10.1145/223982
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States
