ACM Transactions on Design Automation of Electronic Systems
Non-volatile memory has been extensively researched as an alternative to DRAM-based systems; however, the traditional memory controller cannot efficiently track and schedule operations for all the memory devices in heterogeneous systems, due to the different timing requirements and complex architectural support of various memory technologies. To address this issue, we propose a hybrid memory architecture framework called POMI (POlling-based Memory Interface). It uses a small buffer chip inserted on each DIMM (Dual In-line Memory Module) to decouple operation scheduling from the controller, enabling support for diverse memory technologies in the system. Unlike a conventional DRAM-based system, POMI uses a polling-based memory bus protocol for communication and for resolving bus conflicts between memory modules. The buffer chip on each DIMM provides feedback information to the main memory controller so that the polling overhead is trivial. We propose two unique designs. The f...
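The abstract above is truncated, so the following is only a highly schematic sketch of the polling idea it describes: the controller hands requests to per-DIMM buffer chips and then polls each buffer for completion feedback, instead of tracking every device's timing itself. All class and method names here are assumptions for illustration; the paper's actual protocol and its two designs are not reproduced.

```python
from collections import deque

class BufferChip:
    """Per-DIMM buffer that owns the device-specific scheduling."""
    def __init__(self, latency_cycles: int):
        self.latency = latency_cycles      # stands in for one device's timing
        self.in_flight = deque()           # (request_id, completion_cycle)

    def issue(self, req_id: int, now: int) -> None:
        self.in_flight.append((req_id, now + self.latency))

    def poll(self, now: int) -> list:
        """Return IDs of requests this DIMM has completed (the feedback)."""
        done = [r for r, t in self.in_flight if t <= now]
        self.in_flight = deque((r, t) for r, t in self.in_flight if t > now)
        return done

# A DRAM-like fast DIMM and an NVM-like slow DIMM behind one controller:
# the controller need not know either device's timing, only poll results.
dram, nvm = BufferChip(latency_cycles=10), BufferChip(latency_cycles=50)
dram.issue(0, now=0)
nvm.issue(1, now=0)
assert dram.poll(now=10) == [0] and nvm.poll(now=10) == []
assert nvm.poll(now=50) == [1]
```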
As a negative by-product of the dedicated pursuit of high performance in general-purpose processors, the last decade has seen a dramatic increase in power consumption. To address this issue, researchers have aimed at reducing the processor's power dissipation with minimum effect on performance. One effective architecture-level approach, architecture adaptation, dynamically activates and deactivates hardware resources in accord with changes in a running program's execution behavior.1-5 The two key factors in architecture adaptation are when to trigger the adaptation and what adaptation techniques to apply.5 Our work focuses on the first issue.
…red data way.6 Way-prediction is another effective approach that speculatively selects a way to access before making a normal cache access. Figures 1b and 1c illustrate the access patterns for phased and way-prediction n-way set-associative caches. Compared with the conventional implementation, the phased cache probes only one data subarray instead of n data subarrays (each way comprises a tag subarray and a data subarray). However, the sequential accesses of tag and data increase the cache access latency. The way-prediction cache first accesses the tag and data subarrays of the predicted way. If the prediction is not correct, it then probes the rest of the tag and data subarrays simultaneously. An access in a phased cache consumes more energy and has longer latency than a correctly predicted access in a way-prediction cache, but consumes less energy than a mispredicted access. Hence, when the prediction accuracy is high, the way-prediction cache is more energy-efficient than the pha...
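The energy comparison in that passage can be made concrete by counting subarray probes, a rough energy proxy. This toy sketch (not from the paper; real energy also depends on subarray sizes) counts tag and data subarrays touched under each of the three access styles:

```python
def conventional_probes(n: int):
    """(tag, data) subarrays probed: all n ways accessed in parallel."""
    return n, n

def phased_probes(n: int):
    """Phased cache: all n tags first, then exactly one matching data subarray."""
    return n, 1

def way_prediction_probes(n: int, correct: bool):
    """Predicted way first; on a misprediction, probe the remaining ways too,
    so in total all n tag and data subarrays are touched."""
    return (1, 1) if correct else (n, n)

n = 4
# A correct prediction touches fewer subarrays than a phased access,
# while a misprediction touches more -- matching the text above.
assert sum(way_prediction_probes(n, True)) < sum(phased_probes(n))
assert sum(way_prediction_probes(n, False)) > sum(phased_probes(n))
```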
Thermal management of DRAM memory has become a critical issue for server systems. We have done, to the best of our knowledge, the first study of software thermal management for the memory subsystem on real machines. Two recently proposed DTM (Dynamic Thermal Management) policies have been improved, implemented in the Linux OS, and evaluated on two multicore servers: a Dell PowerEdge 1950 server and a customized Intel SR1500AL server testbed. The experimental results first confirm that a system-level memory DTM policy may significantly improve system performance and power efficiency compared with an existing memory bandwidth throttling scheme. A policy called DTM-ACG (Adaptive Core Gating) shows performance improvement comparable to that reported previously. The average performance improvements are 13.3% and 7.2% on the PowerEdge 1950 and the SR1500AL (vs. 16.3% from the previous simulation-based study), respectively. We also have surprising findings that reveal the weakness of the previous study: ...
The widespread use of multicore processors has dramatically increased the demands on memory systems for high bandwidth and large capacity. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus, and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, precisely the decoupling of the bandwidth match between the memory bus and a single rank of devices, can also be applied to other typ...
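A back-of-the-envelope calculation shows why the decoupling helps. With illustrative numbers (not from the paper), a bus clocked at twice the device data rate drains each burst in half the time, so bursts from two slow ranks can keep the fast bus busy if the revised scheduler interleaves them without rank conflicts:

```python
def burst_time_ns(burst_bytes: int, width_bytes: int, rate_mtps: int) -> float:
    """Time to move one burst over an interface at the given transfer rate
    (rate in mega-transfers per second)."""
    transfers = burst_bytes / width_bytes
    return transfers / (rate_mtps * 1e6) * 1e9

BURST = 64   # bytes per cache-line burst (assumption)
WIDTH = 8    # 64-bit channel (assumption)

device_ns = burst_time_ns(BURST, WIDTH, 800)    # e.g. DDR2-800 devices
bus_ns    = burst_time_ns(BURST, WIDTH, 1600)   # bus at 2x the device rate

# The bus occupies 5 ns per burst vs. 10 ns at the devices, leaving room
# for a second rank's burst in the gap.
assert abs(device_ns - 2 * bus_ns) < 1e-9
```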
As an important part of a modern computer system, the main memory is responsible for storing the programs and data structures needed to execute the programs. Performance, power consumption, and capacity are the three major factors of a memory system design. Among them, performance and power consumption can be improved by carefully reordering concurrent memory requests to reduce their average latency and by wisely utilizing the memory power-saving modes to reduce power consumption. We have implemented a "Thread-Fair" memory request scheduling policy on top of the USIMM [1] memory simulation infrastructure to enter the memory scheduling competition. The scheduler gives high priority to read requests blocking the heads of ROBs (reorder buffers) when choosing among multiple requests that want to perform a row access. In this way, the most critical request is serviced faster, and a non-memory-intensive thread is less likely to be interfered with by memory-intensive threads. Also, our scheduler uses a "Read Hit Queue" and a "Write Hit Queue" per memory bank to group requests hitting on the same row buffer together. By giving the "hitting group" higher priority over the "page missing" requests, the scheduler can maximize the row-buffer hit rate and hence reduce average memory access latency and power consumption. Using the provided single-program and multi-threaded workloads, the simulation results indicate that our scheme performs 8.7% better than the FCFS baseline. The fairness metric is improved by 9.1% and the energy-delay product (EDP) is reduced by 17.3% on average.
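The priority order that abstract describes can be sketched as a simple per-bank comparator (this is an illustration, not the authors' USIMM code; the field names are assumptions): ROB-head-blocking reads first, then row-buffer hits (the "Read/Write Hit Queue" grouping), then everything else, oldest first within each class.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int            # cycle the request entered the queue
    is_read: bool
    blocks_rob_head: bool   # this read stalls a core's reorder buffer head
    row_hit: bool           # targets the currently open row in its bank

def pick_next(queue):
    """Choose one bank's next request under the sketched Thread-Fair order."""
    def priority(r):
        return (not (r.is_read and r.blocks_rob_head),  # ROB-blocking reads first
                not r.row_hit,                          # then row-buffer hits
                r.arrival)                              # then oldest first
    return min(queue, key=priority)

q = [Request(10, True, False, True),    # older row-hit read
     Request(50, True, True, False),    # younger, but blocks an ROB head
     Request(5, False, False, False)]   # oldest, a row-miss write
assert pick_next(q).arrival == 50       # the ROB-blocking read wins
```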
Journal of Instruction-level Parallelism - JILP, 2001
DRAM row-buffers have become a critical level of cache in the memory hierarchy to exploit spatial locality in the cache miss stream. Row-buffer conflicts occur when a sequence of requests on different pages goes to the same memory bank, causing higher memory access latency than requests to the same row or to different banks. In this study, we first show that the address mapping symmetry between the cache and DRAM is the inherent source of row-buffer conflicts. Breaking the symmetry to reduce the conflicts and to retain the spatial locality, we propose and evaluate a permutation-based page interleaving scheme. We have also evaluated and compared two representative cache mapping schemes that break the symmetry at the cache level. We show that the proposed page interleaving scheme outperforms all other mapping schemes based on its overall performance and its implementation simplicity.
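A common way to realize such a permutation-based interleaving, sketched here with illustrative bit positions (the widths and shifts are assumptions, not the paper's exact layout), is to XOR a few cache-tag bits of the physical address into the bank-index bits. Addresses that conflict under the symmetric mapping then spread across banks, while the untouched page offset preserves spatial locality within each row:

```python
BANK_BITS = 3    # 8 banks (assumption)
PAGE_BITS = 12   # 4 KB DRAM page offset (assumption)
TAG_SHIFT = 20   # where the cache-tag bits start (assumption)
MASK = (1 << BANK_BITS) - 1

def conventional_bank(addr: int) -> int:
    """Bank index taken directly from the bits above the page offset."""
    return (addr >> PAGE_BITS) & MASK

def permuted_bank(addr: int) -> int:
    """XOR low-order cache-tag bits into the bank index (the permutation)."""
    tag_bits = (addr >> TAG_SHIFT) & MASK
    return conventional_bank(addr) ^ tag_bits

# Two addresses differing only in tag bits map to the same bank
# conventionally (a row-buffer conflict) but to different banks here.
a = 0x3000
b = a | (1 << TAG_SHIFT)
assert conventional_bank(a) == conventional_bank(b)
assert permuted_bank(a) != permuted_bank(b)
```

Because XOR with a fixed value is its own inverse, the mapping is a permutation: each bank still receives exactly the same number of pages.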
Papers by Zhichun Zhu