- research-article, June 2020
SB-Fetch: synchronization aware hardware prefetching for chip multiprocessors
ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020, Article No.: 15, Pages 1–12, https://doi.org/10.1145/3392717.3392735
Shared-memory, multi-threaded applications often require programmers to insert thread synchronization primitives (i.e. locks, barriers, and condition variables) in critical sections to synchronize data access between processes. Scaling performance ...
- research-article, January 2019
Odd-even based adaptive two-way routing in mesh NoCs for hotspot mitigation
ICDCN '19: Proceedings of the 20th International Conference on Distributed Computing and Networking, January 2019, Pages 248–252, https://doi.org/10.1145/3288599.3288611
Network-on-Chip is adapted as a profitable framework for communication in on-chip multiprocessors. Congestion management using adaptive routing techniques has become the major research focus in recent days. Hotspots are congested cores in multi-core systems,...
- research-article, April 2016
Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis
ACM Transactions on Computer Systems (TOCS), Volume 34, Issue 1, Article No.: 3, Pages 1–30, https://doi.org/10.1145/2851503
To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. ...
- research-article, January 2016
Hierarchical Clustering for On-Chip Networks
AISTECS '16: Proceedings of the 1st International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, January 2016, Article No.: 2, Pages 1–6, https://doi.org/10.1145/2857058.2857064
Hierarchy and communication locality are a must for many-core systems. As systems scale to dozens or hundreds of cores, we simply cannot afford the power consumption and latency of random communication that spans the entire chip. Existing hierarchical ...
- research-article, December 2015
Sensible Energy Accounting with Abstract Metering for Multicore Systems
ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 4, Article No.: 60, Pages 1–26, https://doi.org/10.1145/2842616
Chip multicore processors (CMPs) are the preferred processing platform across different domains such as data centers, real-time systems, and mobile devices. In all those domains, energy is arguably the most expensive resource in a computing system. ...
- Article, December 2014
Cache Balancer: Access Rate and Pain Based Resource Management for Chip Multiprocessors
CANDAR '14: Proceedings of the 2014 Second International Symposium on Computing and Networking, December 2014, Pages 453–456, https://doi.org/10.1109/CANDAR.2014.81
This paper presents a runtime resource management scheme named Cache Balancer that improves the utilization of on-chip shared caches and reduces access latencies in chip multiprocessor systems. Cache Balancer incorporates an access rate based memory ...
- research-article, December 2013
Hardware support for accurate per-task energy metering in multicore systems
ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Issue 4, Article No.: 34, Pages 1–27, https://doi.org/10.1145/2541228.2555291
Accurately determining the energy consumed by each task in a system will become of prominent importance in future multicore-based systems because it offers several benefits, including (i) better application energy/performance optimizations, (ii) ...
- research-article, December 2013
Temporal-based multilevel correlating inclusive cache replacement
ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Issue 4, Article No.: 33, Pages 1–24, https://doi.org/10.1145/2541228.2555290
Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but also due ...
- research-article, November 2013
Load-balanced pipeline parallelism
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2013, Article No.: 14, Pages 1–12, https://doi.org/10.1145/2503210.2503295
Accelerating a single thread in current parallel systems remains a challenging problem, because sequential threads do not naturally take advantage of the additional cores. Recent work shows that automatic extraction of pipeline parallelism is an ...
- research-article, October 2013
A fast and scalable multidimensional multiple-choice knapsack heuristic
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 18, Issue 4, Article No.: 51, Pages 1–32, https://doi.org/10.1145/2541012.2541014
Many combinatorial optimization problems in the embedded systems and design automation domains involve decision making in multidimensional spaces. The multidimensional multiple-choice knapsack problem (MMKP) is among the most challenging of the ...
- research-article, October 2013
SMT-centric power-aware thread placement in chip multiprocessors
PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, October 2013, Pages 167–176
In Simultaneous Multi-Threading (SMT) chip multiprocessors (CMPs), thread placement is performed today in a largely power-unaware manner. For example, consolidation of active threads into fewer cores exposes opportunities for power savings that have not ...
- research-article, June 2013
Directory based cache coherence verification logic in CMPs cache system
MES '13: Proceedings of the First International Workshop on Many-core Embedded Systems, June 2013, Pages 33–40, https://doi.org/10.1145/2489068.2489073
This work reports a high speed protocol verification logic for Chip Multiprocessors (CMPs) realizing directory based cache coherence system. A special class of cellular automata (CA) referred to as single length cycle 2-attractor CA (TACA), has been ...
- research-article, February 2013
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
ACM Transactions on Computer Systems (TOCS), Volume 31, Issue 1, Article No.: 1, Pages 1–37, https://doi.org/10.1145/2427631.2427632
Reuse Distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, however, is that multicore RD analysis requires measuring Concurrent Reuse Distance (CRD) and ...
- research-article, September 2012
Coalition threading: combining traditional and non-traditional parallelism to maximize scalability
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, September 2012, Pages 273–282, https://doi.org/10.1145/2370816.2370857
Non-traditional parallelism provides parallel speedup for a single thread without the need to manually divide and coordinate computation. This paper describes coalition threading, a technique that seeks the ideal combination of traditional and non-...
- research-article, July 2012
Mitigating the Effects of Process Variation in Ultra-low Voltage Chip Multiprocessors using Dual Supply Voltages and Half-Speed Units
IEEE Computer Architecture Letters (ICAL), Volume 11, Issue 2, July 2012, Pages 45–48, https://doi.org/10.1109/L-CA.2011.36
Energy efficiency is a primary concern for microprocessor designers. One very effective approach to improving processor energy efficiency is to lower its supply voltage to very near to the transistor threshold voltage. This reduces power consumption ...
- research-article, June 2012
Locality & utility co-optimization for practical capacity management of shared last level caches
ICS '12: Proceedings of the 26th ACM international conference on Supercomputing, June 2012, Pages 279–290, https://doi.org/10.1145/2304576.2304615
Shared last-level caches (SLLCs) on chip-multiprocessors play an important role in bridging the performance gap between processing cores and main memory. Although there are already many proposals targeted at overcoming the weaknesses of the least-...
- Article, June 2012
Architectural Support for Exploiting Fine Grain Parallelism
HPCC '12: Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, June 2012, Pages 61–70, https://doi.org/10.1109/HPCC.2012.19
The advent of multi-core processors, particularly with projections that numbers of cores will continue to increase, has focused attention on parallel programming. It is widely recognized that current programming techniques, including those that are used ...
- research-article, June 2012
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
MSPC '12: Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, June 2012, Pages 2–11, https://doi.org/10.1145/2247684.2247687
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared ...
- Article, May 2012
Analytical Performance Modeling of Hierarchical Interconnect Fabrics
NOCS '12: Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, May 2012, Pages 107–114, https://doi.org/10.1109/NOCS.2012.20
The continuous scaling of nanoelectronics is increasing the complexity of chip multiprocessors (CMPs) and exacerbating the memory wall problem. As CMPs become more complex, the memory subsystem is organized into more hierarchical structures to better ...
- research-article, March 2012
Balancing Performance and Cost in CMP Interconnection Networks
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 23, Issue 3, March 2012, Pages 452–459, https://doi.org/10.1109/TPDS.2011.173
This paper presents an innovative router design, called Rotary Router, which successfully addresses CMP cost/performance constraints. The router structure is based on two independent rings, which force packets to circulate either clockwise or ...