Fault Tolerance Technique Offlining Faulty Blocks by Heap Memory Management

Published: 05 June 2019

Abstract

As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they suffer more faults, making DRAM reliability a major concern in computer systems. Previous studies have proposed many techniques that preserve reliability in various system components, such as DRAM internals, the memory controller, caches, and the operating system. Reviewing these techniques, we identified two considerations. First, faults can be recovered with reasonable overhead at a high fault rate only if the recovery unit is fine-grained. Second, since hardware modification adds cost to the deployment of a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate high fault rates.
In this article, we propose a pure software-based recovery technique with fine granularity. Our key idea exploits the fact that heap segments are managed by the system library as variable-sized chunks to handle dynamic allocation in user applications. In our technique, faulty blocks within pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages remain usable space. Our technique is implemented by modifying the operating system and the system library. Since the implementation requires no hardware assistance, we evaluated our method on a real machine. Our evaluation shows that our technique incurs negligible performance overhead at a high bit error rate (BER) of 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. At the same BER, our method also provides 5.22× the usable space of page-offline, the state-of-the-art pure software-based technique.
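The chunk-marking idea in the abstract can be sketched as a toy model. The block size, names, and first-fit free-list policy below are illustrative assumptions, not the authors' implementation (which modifies the operating system and the system library): a faulty block is "offlined" by marking it as a permanently allocated chunk, so the allocator skips it while the rest of the page stays usable.

```python
# Toy model of block-level offlining inside a page (hypothetical names).
# A 4 KiB page is divided into fixed-size blocks tracked by a simple map:
#   None      -> free
#   "user"    -> allocated to the application
#   "offline" -> faulty block, marked as an allocated chunk forever
PAGE_SIZE = 4096
BLOCK_SIZE = 64
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE  # 64 blocks per page

class Page:
    def __init__(self):
        self.blocks = [None] * BLOCKS_PER_PAGE

    def offline_faulty(self, block_idx):
        """Mark a faulty block as a permanently allocated chunk."""
        self.blocks[block_idx] = "offline"

    def alloc_block(self):
        """First-fit allocation that transparently skips offlined blocks."""
        for i, state in enumerate(self.blocks):
            if state is None:
                self.blocks[i] = "user"
                return i
        return None  # page exhausted

    def free_block(self, i):
        assert self.blocks[i] == "user"
        self.blocks[i] = None

    def usable_blocks(self):
        """Blocks still available to the application (free or in use)."""
        return sum(1 for s in self.blocks if s != "offline")

page = Page()
page.offline_faulty(0)    # suppose block 0 contains a faulty cell
page.offline_faulty(17)   # and block 17 does too
assert page.alloc_block() == 1      # allocator skips faulty block 0
assert page.usable_blocks() == 62   # the clean remainder is still usable
```

Under coarse-grained page-offline, two faulty cells would retire the entire 4 KiB page; in this sketch only two 64-byte blocks are lost, which illustrates why a fine-grained recovery unit preserves far more usable space at high fault rates.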


Cited By

  • (2020) Generating Representative Test Sequences from Real Workload for Minimizing DRAM Verification Overhead. ACM Transactions on Design Automation of Electronic Systems 25, 4 (2020), 1--23. DOI: 10.1145/3391891. Online publication date: 27-May-2020.

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 24, Issue 4
July 2019, 258 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3326461
Editor: Naehyuck Chang

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 05 June 2019
Accepted: 01 April 2019
Revised: 01 April 2019
Received: 01 July 2018
Published in TODAES Volume 24, Issue 4


Author Tag

  1. DRAM fault recovery

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • IT R&D program of MOTIE/KEIT
  • Design technology development of ultra-low voltage operating circuit and IP for smart sensor SoC
