- research-article, December 2023
Page Size Aware Cache Prefetching
MICRO '22: Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, Pages 956–974
https://doi.org/10.1109/MICRO56248.2022.00070
The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching ...
- research-article, December 2023
ASSASIN: Architecture Support for Stream Computing to Accelerate Computational Storage
MICRO '22: Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, Pages 354–368
https://doi.org/10.1109/MICRO56248.2022.00035
Computational storage adds computing to storage devices, providing potential benefits in offload, data-reduction, and lower energy. Successful computational SSD architectures should match growing flash bandwidth, which in turn requires high SSD DRAM ...
- research-article, November 2021
CAKE: matrix multiplication using constant-bandwidth blocks
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 85, Pages 1–14
https://doi.org/10.1145/3458817.3476166
We offer a novel approach to matrix-matrix multiplication computation on computing platforms with memory hierarchies. Constant-bandwidth (CB) blocks improve computation throughput for architectures limited by external memory bandwidth. Configuring the ...
- research-article, June 2021
Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors
GLSVLSI '21: Proceedings of the 2021 Great Lakes Symposium on VLSI, Pages 241–246
https://doi.org/10.1145/3453688.3461499
This paper presents a design strategy of chiplet-based processing-in-memory systems for deep neural network applications. Monolithic silicon chips are area and power limited, failing to catch the recent rapid growth of deep learning algorithms. The paper ...
- research-article, August 2020
FeFET-based low-power bitwise logic-in-memory with direct write-back and data-adaptive dynamic sensing interface
- Mingyen Lee,
- Wenjun Tang,
- Bowen Xue,
- Juejian Wu,
- Mingyuan Ma,
- Yu Wang,
- Yongpan Liu,
- Deliang Fan,
- Vijaykrishnan Narayanan,
- Huazhong Yang,
- Xueqing Li
ISLPED '20: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Pages 127–132
https://doi.org/10.1145/3370748.3406572
Compute-in-memory (CiM) is a promising method for mitigating the memory wall problem in data-intensive applications. The proposed bitwise logic-in-memory (BLiM) is targeted at data intensive applications, such as database, data encryption. This work ...
- research-article, October 2017
A bandwidth accurate, flexible and rapid simulating multi-HMC modeling tool
MEMSYS '17: Proceedings of the International Symposium on Memory Systems, Pages 71–82
https://doi.org/10.1145/3132402.3132403
Derived by the demand for ever increasing computing performance, a steadily widening performance gap between memory and processor architectures has emerged. While attempting to mitigate the effects for processing systems that already face the exascale ...
- article, June 2017
Core Module Optimizing PDE Sparse Matrix Models with HPCG Example
Supercomputing Frontiers and Innovations: an International Journal (SCFI), Volume 4, Issue 2, Pages 54–70
https://doi.org/10.14529/jsfi170205
This paper introduces a fundamentally new computer architecture for supercomputers. The core module is application compatible with an existing superscalar microprocessor, with minimized energy use, and is optimized for local sparse matrix operations. ...
- article, June 2017
The Simultaneous Transmit And Receive STAR Message Protocol
Supercomputing Frontiers and Innovations: an International Journal (SCFI), Volume 4, Issue 2, Pages 38–53
https://doi.org/10.14529/jsfi170204
The STAR protocol is proposed, which solves three inherent problems with MPI, a well known security problem caused by data memory access faults, and the following four exascale communication problems. Exascale systems must efficiently save the state of ...
- research-article, October 2016
Data-Centric Computing Frontiers: A Survey On Processing-In-Memory
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems, Pages 295–308
https://doi.org/10.1145/2989081.2989087
A major shift from compute-centric to data-centric computing systems can be perceived, as novel big data workloads like cognitive computing and machine learning strongly enforce embarrassingly parallel and highly efficient processor architectures. With ...
- research-article, December 2015
Filtered runahead execution with a runahead buffer
MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture, Pages 358–369
https://doi.org/10.1145/2830772.2830812
Runahead execution dynamically expands the instruction window of an out of order processor to generate memory level parallelism (MLP) while the core would otherwise be stalled. Unfortunately, runahead has the disadvantage of requiring the front-end to ...
- research-article, December 2015
Scalable Multicore k-NN Search via Subspace Clustering for Filtering
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 26, Issue 12, Pages 3449–3460
https://doi.org/10.1109/TPDS.2014.2372755
k Nearest Neighbors (k-NN) search is a widely used category of algorithms with applications in domains such as computer vision and machine learning. Despite the desire to process increasing amounts of high-dimensional data within these domains, k-NN ...
- research-article, November 2015
C2-bound: a capacity and concurrency driven analytical model for many-core design
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 48, Pages 1–11
https://doi.org/10.1145/2807591.2807641
In this paper, we propose C2-Bound, a data-driven analytical model, that incorporates both memory capacity and data access concurrency factors to optimize many-core design. C2-Bound is characterized by combining the newly proposed latency model, ...
- short-paper, June 2014
Data filtering for scalable high-dimensional k-NN search on multicore systems
HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing, Pages 305–310
https://doi.org/10.1145/2600212.2600710
K Nearest Neighbors (k-NN) search is a widely used category of algorithms with applications in domains such as computer vision and machine learning. With the rapidly increasing amount of data available, and their high dimensionality, k-NN algorithms ...
- Article, March 2013
Universal Numerical Encoder and Profiler Reduces Computing's Memory Wall with Software, FPGA, and SoC Implementations
DCC '13: Proceedings of the 2013 Data Compression Conference, Page 528
https://doi.org/10.1109/DCC.2013.107
Numerical computations have accelerated significantly since 2005 thanks to two complementary, silicon-enabled trends: multi-core processing and single instruction, multiple data (SIMD) accelerators. Unfortunately, due to fundamental limitations of ...
- keynote, June 2012
Blue Gene/Q: design for sustained multi-petaflop computing
ICS '12: Proceedings of the 26th ACM international conference on Supercomputing, Pages 245–246
https://doi.org/10.1145/2304576.2304609
The Blue Gene/Q system represents the third generation of optimized high-performance computing Blue Gene solution servers and provides a platform for continued growth in HPC performance and capability. Blue Gene/Q started with a new design of the ...
- Article, June 2012
A Study of the Memory Wall within the Jacobi Iteration Method
HPCC '12: Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, Pages 964–969
https://doi.org/10.1109/HPCC.2012.140
In recent years, a great number of applications have been implemented on the CMP and achieved good performance. The success of the parallelism of a great deal of applications on CMP shows a bright future of the development of the CMP. However, some ...
- Article, September 2011
Cache Accurate Time Skewing in Iterative Stencil Computations
ICPP '11: Proceedings of the 2011 International Conference on Parallel Processing, Pages 571–581
https://doi.org/10.1109/ICPP.2011.47
We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of $500^3$ doubles ...
- research-article, August 2011
Pinned to the walls: impact of packaging and application properties on the memory and power walls
This article presents a study of the impact of packaging on the memory and power walls, in the context of application properties. The analysis is supported by characterizations of 130 hardware designs spanning 30 years, along with both ...
- research-article, June 2011
Cache injection for parallel applications
HPDC '11: Proceedings of the 20th international symposium on High performance distributed computing, Pages 15–26
https://doi.org/10.1145/1996130.1996135
For two decades, the memory wall has affected many applications in their ability to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This ...
- Article, May 2011
PAC-PLRU: A Cache Replacement Policy to Salvage Discarded Predictions from Hardware Prefetchers
CCGRID '11: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Pages 265–274
https://doi.org/10.1109/CCGrid.2011.27
Cache replacement policy plays an important role in guaranteeing the availability of cache blocks, reducing miss rates, and improving applications' overall performance. However, recent research efforts on improving replacement policies require either ...