research-article

Analysis of micro-architecture resources interference on multicore NUMA systems

Authors:

Seong-je Cho

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

Pages 1871 - 1876

https://doi.org/10.1145/2851613.2851742

Published: 04 April 2016

Abstract

Modern computer systems employ multiple cores and several layers of memory, such as local and remote memory. Cores and memory modules are connected through diverse micro-architecture resources, including the LLC (Last Level Cache), IMC (Integrated Memory Controller), interconnect, and GQ (Global Queue). Because cores share these resources when accessing data, interference on them significantly affects system performance. To explore this effect, we observe how execution time changes when SPEC CPU2006 and PARSEC benchmarks run with different data and thread placements on our experimental system. We also use the PMU (Performance Monitoring Unit) to measure how the micro-architecture resources are used, and we analyze the correlation between execution time and PMU parameters. We find that two PMU parameters, IMC_reads and writes and QHL_requests, are good indicators: their differences match the performance differences well. This implies that improving these indicators by controlling data and/or thread placement can boost system performance.
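The correlation analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the analysis is a Pearson correlation between per-benchmark relative differences in execution time and in a PMU counter; the abstract does not state which correlation measure the authors used. The benchmark names, placement labels, and all numbers below are hypothetical placeholders, not measurements from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def relative_diff(baseline, other):
    """Relative change of `other` with respect to `baseline`."""
    return (other - baseline) / baseline


# Hypothetical placeholder data (NOT from the paper): for each benchmark,
# (execution time in seconds, PMU counter value) under two placements.
runs = {
    "bench_a": {"placement_1": (10.0, 1000.0), "placement_2": (12.0, 1300.0)},
    "bench_b": {"placement_1": (20.0, 5000.0), "placement_2": (30.0, 8000.0)},
    "bench_c": {"placement_1": (15.0, 2000.0), "placement_2": (15.5, 2100.0)},
}

# Per-benchmark relative differences between the two placements.
time_diffs = []
counter_diffs = []
for placements in runs.values():
    t1, c1 = placements["placement_1"]
    t2, c2 = placements["placement_2"]
    time_diffs.append(relative_diff(t1, t2))
    counter_diffs.append(relative_diff(c1, c2))

# A coefficient near +1 would suggest that the counter difference tracks
# the execution-time difference across benchmarks.
r = pearson(time_diffs, counter_diffs)
print(f"Pearson r between time and counter differences: {r:.3f}")
```

With real measurements, `runs` would hold one entry per benchmark and placement pair, and the same computation would be repeated for each PMU parameter of interest.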

Cited By

Li T, Ren Y, Yu D, Jin S (2017). Analysis of NUMA effects in modern multicore systems for the design of high-performance data transfer applications. Future Generation Computer Systems 74:C (41-50). DOI: 10.1016/j.future.2017.04.001. Online publication date: 1-Sep-2017
https://dl.acm.org/doi/10.1016/j.future.2017.04.001

Index Terms

Analysis of micro-architecture resources interference on multicore NUMA systems
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Process management
        Concurrency control

Published In

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing

April 2016

2360 pages

ISBN:9781450337397

DOI:10.1145/2851613

Conference Chair:
Sascha Ossowski
University Rey Juan Carlos, Spain

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SAC 2016

Sponsor:

SIGAPP

SAC 2016: Symposium on Applied Computing

April 4 - 8, 2016

Pisa, Italy

Acceptance Rates

SAC '16 Paper Acceptance Rate 252 of 1,047 submissions, 24%;

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Bibliometrics

Total Citations: 1
Total Downloads: 166
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Aug 2024
