ACM Transactions on Design Automation of Electronic Systems
Non-volatile memory has been extensively researched as an alternative to DRAM-based systems; however, the traditional memory controller cannot efficiently track and schedule operations for all the memory devices in heterogeneous systems, due to the different timing requirements and complex architectural support of various memory technologies. To address this issue, we propose a hybrid memory architecture framework called POMI (POlling-based Memory Interface). It uses a small buffer chip inserted on each DIMM (Dual In-line Memory Module) to decouple operation scheduling from the controller, enabling support for diverse memory technologies in the system. Unlike a conventional DRAM-based system, POMI uses a polling-based memory bus protocol for communication and for resolving bus conflicts between memory modules. The buffer chip on each DIMM provides feedback information to the main memory controller so that the polling overhead is trivial. We propose two unique designs. The f...
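The abstract above is truncated, so the following is only a highly schematic sketch of the polling idea it describes: the controller hands requests to per-DIMM buffer chips and then polls each buffer for completion feedback, instead of tracking every device's timing itself. All class and method names here are assumptions for illustration; the paper's actual protocol and its two designs are not reproduced.

```python
from collections import deque

class BufferChip:
    """Per-DIMM buffer that owns the device-specific scheduling."""
    def __init__(self, latency_cycles: int):
        self.latency = latency_cycles      # stands in for one device's timing
        self.in_flight = deque()           # (request_id, completion_cycle)

    def issue(self, req_id: int, now: int) -> None:
        self.in_flight.append((req_id, now + self.latency))

    def poll(self, now: int) -> list:
        """Return IDs of requests this DIMM has completed (the feedback)."""
        done = [r for r, t in self.in_flight if t <= now]
        self.in_flight = deque((r, t) for r, t in self.in_flight if t > now)
        return done

# A DRAM-like fast DIMM and an NVM-like slow DIMM behind one controller:
# the controller need not know either device's timing, only poll results.
dram, nvm = BufferChip(latency_cycles=10), BufferChip(latency_cycles=50)
dram.issue(0, now=0)
nvm.issue(1, now=0)
assert dram.poll(now=10) == [0] and nvm.poll(now=10) == []
assert nvm.poll(now=50) == [1]
```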
As a negative by-product of the dedicated pursuit of high performance in general-purpose processors, the last decade has seen a dramatic increase in power consumption. To address this issue, researchers have aimed at reducing the processor's power dissipation with minimum effect on performance. One effective architecture-level approach, architecture adaptation, dynamically activates and deactivates hardware resources in accord with changes in a running program's execution behavior.1-5 The two key factors in architecture adaptation are when to trigger the adaptation and what adaptation techniques to apply.5 Our work focuses on the first issue.
…red data way.6 Way-prediction is another effective approach that speculatively selects a way to access before making a normal cache access. Figures 1b and 1c illustrate the access patterns for phased and way-prediction n-way set-associative caches. Compared with the conventional implementation, the phased cache probes only one data subarray instead of n data subarrays (each way comprises a tag subarray and a data subarray). However, the sequential accesses of tag and data increase the cache access latency. The way-prediction cache first accesses the tag and data subarrays of the predicted way. If the prediction is not correct, it then probes the rest of the tag and data subarrays simultaneously. An access in a phased cache consumes more energy and has longer latency than a correctly predicted access in a way-prediction cache, but consumes less energy than a mispredicted access. Hence, when the prediction accuracy is high, the way-prediction cache is more energy-efficient than the pha...
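The energy comparison in that passage can be made concrete by counting subarray probes, a rough energy proxy. This toy sketch (not from the paper; real energy also depends on subarray sizes) counts tag and data subarrays touched under each of the three access styles:

```python
def conventional_probes(n: int):
    """(tag, data) subarrays probed: all n ways accessed in parallel."""
    return n, n

def phased_probes(n: int):
    """Phased cache: all n tags first, then exactly one matching data subarray."""
    return n, 1

def way_prediction_probes(n: int, correct: bool):
    """Predicted way first; on a misprediction, probe the remaining ways too,
    so in total all n tag and data subarrays are touched."""
    return (1, 1) if correct else (n, n)

n = 4
# A correct prediction touches fewer subarrays than a phased access,
# while a misprediction touches more -- matching the text above.
assert sum(way_prediction_probes(n, True)) < sum(phased_probes(n))
assert sum(way_prediction_probes(n, False)) > sum(phased_probes(n))
```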
Thermal management of DRAM memory has become a critical issue for server systems. We have done, to the best of our knowledge, the first study of software thermal management for the memory subsystem on real machines. Two recently proposed DTM (Dynamic Thermal Management) policies have been improved, implemented in the Linux OS, and evaluated on two multicore servers: a Dell PowerEdge 1950 server and a customized Intel SR1500AL server testbed. The experimental results first confirm that a system-level memory DTM policy may significantly improve system performance and power efficiency compared with an existing memory bandwidth throttling scheme. A policy called DTM-ACG (Adaptive Core Gating) shows performance improvement comparable to that reported previously. The average performance improvements are 13.3% and 7.2% on the PowerEdge 1950 and the SR1500AL (vs. 16.3% from the previous simulation-based study), respectively. We also have surprising findings that reveal the weakness of the previous study: ...
The widespread use of multicore processors has dramatically increased the demands on memory systems for high bandwidth and large capacity. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus, and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, precisely the decoupling of the bandwidth match between the memory bus and a single rank of devices, can also be applied to other typ...
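A back-of-the-envelope calculation shows why the decoupling helps. With illustrative numbers (not from the paper), a bus clocked at twice the device data rate drains each burst in half the time, so bursts from two slow ranks can keep the fast bus busy if the revised scheduler interleaves them without rank conflicts:

```python
def burst_time_ns(burst_bytes: int, width_bytes: int, rate_mtps: int) -> float:
    """Time to move one burst over an interface at the given transfer rate
    (rate in mega-transfers per second)."""
    transfers = burst_bytes / width_bytes
    return transfers / (rate_mtps * 1e6) * 1e9

BURST = 64   # bytes per cache-line burst (assumption)
WIDTH = 8    # 64-bit channel (assumption)

device_ns = burst_time_ns(BURST, WIDTH, 800)    # e.g. DDR2-800 devices
bus_ns    = burst_time_ns(BURST, WIDTH, 1600)   # bus at 2x the device rate

# The bus occupies 5 ns per burst vs. 10 ns at the devices, leaving room
# for a second rank's burst in the gap.
assert abs(device_ns - 2 * bus_ns) < 1e-9
```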
As an important part of a modern computer system, the main memory is responsible for storing the programs and data structures needed to execute the programs. Performance, power consumption, and capacity are the three major factors of a memory system design. Among them, performance and power consumption can be improved by carefully reordering concurrent memory requests to reduce their average latency and by wisely utilizing the memory power-saving modes to reduce power consumption. We have implemented a "Thread-Fair" memory request scheduling policy on top of the USIMM [1] memory simulation infrastructure to enter the memory scheduling competition. The scheduler gives high priority to read requests blocking the heads of ROBs (reorder buffers) when choosing among multiple requests that want to perform a row access. In this way, the most critical request is serviced faster, and a non-memory-intensive thread is less likely to be interfered with by memory-intensive threads. Also, our scheduler uses a "Read Hit Queue" and a "Write Hit Queue" per memory bank to group requests hitting on the same row buffer together. By giving the "hitting group" higher priority over the "page missing" requests, the scheduler can maximize the row-buffer hit rate and hence reduce average memory access latency and power consumption. Using the provided single-program and multi-threaded workloads, the simulation results indicate that our scheme performs 8.7% better than the FCFS baseline. The fairness metric is improved by 9.1% and the energy-delay product (EDP) is reduced by 17.3% on average.
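The priority order that abstract describes can be sketched as a simple per-bank comparator (this is an illustration, not the authors' USIMM code; the field names are assumptions): ROB-head-blocking reads first, then row-buffer hits (the "Read/Write Hit Queue" grouping), then everything else, oldest first within each class.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int            # cycle the request entered the queue
    is_read: bool
    blocks_rob_head: bool   # this read stalls a core's reorder buffer head
    row_hit: bool           # targets the currently open row in its bank

def pick_next(queue):
    """Choose one bank's next request under the sketched Thread-Fair order."""
    def priority(r):
        return (not (r.is_read and r.blocks_rob_head),  # ROB-blocking reads first
                not r.row_hit,                          # then row-buffer hits
                r.arrival)                              # then oldest first
    return min(queue, key=priority)

q = [Request(10, True, False, True),    # older row-hit read
     Request(50, True, True, False),    # younger, but blocks an ROB head
     Request(5, False, False, False)]   # oldest, a row-miss write
assert pick_next(q).arrival == 50       # the ROB-blocking read wins
```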
Journal of Instruction-level Parallelism - JILP, 2001
DRAM row-buffers have become a critical level of cache in the memory hierarchy to exploit spatial locality in the cache miss stream. Row-buffer conflicts occur when a sequence of requests on different pages goes to the same memory bank, causing higher memory access latency than requests to the same row or to different banks. In this study, we first show that the address mapping symmetry between the cache and DRAM is the inherent source of row-buffer conflicts. Breaking the symmetry to reduce the conflicts and to retain the spatial locality, we propose and evaluate a permutation-based page interleaving scheme. We have also evaluated and compared two representative cache mapping schemes that break the symmetry at the cache level. We show that the proposed page interleaving scheme outperforms all other mapping schemes based on its overall performance and its implementation simplicity.
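A common way to realize such a permutation-based interleaving, sketched here with illustrative bit positions (the widths and shifts are assumptions, not the paper's exact layout), is to XOR a few cache-tag bits of the physical address into the bank-index bits. Addresses that conflict under the symmetric mapping then spread across banks, while the untouched page offset preserves spatial locality within each row:

```python
BANK_BITS = 3    # 8 banks (assumption)
PAGE_BITS = 12   # 4 KB DRAM page offset (assumption)
TAG_SHIFT = 20   # where the cache-tag bits start (assumption)
MASK = (1 << BANK_BITS) - 1

def conventional_bank(addr: int) -> int:
    """Bank index taken directly from the bits above the page offset."""
    return (addr >> PAGE_BITS) & MASK

def permuted_bank(addr: int) -> int:
    """XOR low-order cache-tag bits into the bank index (the permutation)."""
    tag_bits = (addr >> TAG_SHIFT) & MASK
    return conventional_bank(addr) ^ tag_bits

# Two addresses differing only in tag bits map to the same bank
# conventionally (a row-buffer conflict) but to different banks here.
a = 0x3000
b = a | (1 << TAG_SHIFT)
assert conventional_bank(a) == conventional_bank(b)
assert permuted_bank(a) != permuted_bank(b)
```

Because XOR with a fixed value is its own inverse, the mapping is a permutation: each bank still receives exactly the same number of pages.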
Papers by Zhichun Zhu