research-article

A Novel ReRAM-Based Processing-in-Memory Architecture for Graph Traversal

Published: 26 February 2018

Abstract

Graph algorithms such as graph traversal have become increasingly important in the era of big data. However, graph processing on traditional architectures issues many random and irregular memory accesses, which cause heavy data movement and very high energy consumption. To reduce this waste of memory bandwidth, we investigate processing-in-memory (PIM) combined with non-volatile metal-oxide resistive random access memory (ReRAM) to improve both computation and I/O performance.
We propose a new ReRAM-based processing-in-memory architecture called RPBFS, in which graph data can be persistently stored and processed in place. We study the problem of graph traversal and design an efficient graph traversal algorithm for RPBFS. Benefiting from low data movement overhead and high bank-level parallelism, RPBFS significantly outperforms both CPU-based and GPU-based BFS implementations. On a suite of real-world graphs, our architecture speeds up graph traversal by up to 33.8× and reduces energy consumption relative to conventional systems by up to 142.8×.
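
As a rough software illustration of the idea of bank-level parallel traversal, the sketch below runs a level-synchronous BFS in which each vertex's adjacency list is placed in one of several "banks" and each bank expands only the frontier vertices it stores. This is not the RPBFS algorithm or hardware design, which the abstract does not detail. The modulo placement rule, the function names `partition_by_bank` and `bfs_levels`, and the default of four banks are all assumptions made for this example, and the per-bank loop runs sequentially here where hardware banks would work concurrently.

```python
def partition_by_bank(adjacency, num_banks):
    """Place each vertex's adjacency list in a bank.

    Hypothetical placement rule: vertex id modulo the number of banks.
    Vertex ids are assumed to be non-negative integers.
    """
    banks = [dict() for _ in range(num_banks)]
    for vertex, neighbors in adjacency.items():
        banks[vertex % num_banks][vertex] = neighbors
    return banks


def bfs_levels(adjacency, source, num_banks=4):
    """Return a dict mapping each reachable vertex to its BFS level."""
    banks = partition_by_bank(adjacency, num_banks)
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        # Step 1: every bank expands the frontier vertices it owns.
        # The banks are independent in this step, so real hardware
        # could process them at the same time; here we simply loop.
        found_per_bank = []
        for bank_id, bank in enumerate(banks):
            found = set()
            for v in frontier:
                if v % num_banks == bank_id:
                    found.update(bank.get(v, ()))
            found_per_bank.append(found)
        # Step 2: merge the per-bank results and keep only vertices
        # that have not been visited yet as the next frontier.
        next_frontier = []
        for found in found_per_bank:
            for u in sorted(found):
                if u not in level:
                    level[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return level


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
    print(bfs_levels(graph, source=0))
```

Splitting each level into an independent per-bank expansion step followed by a merge step is what makes the parallelism visible: only the merge needs to see results from more than one bank.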

        Published In

        ACM Transactions on Storage, Volume 14, Issue 1
        Special Issue on NVM and Storage
        February 2018
        237 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/3190860
        Editor: Sam H. Noh
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 26 February 2018
        Accepted: 01 January 2018
        Received: 01 November 2017
        Published in TOS Volume 14, Issue 1

        Author Tags

        1. BFS
        2. ReRAM
        3. architecture
        4. processing-in-memory

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • National Natural Science Foundation of China
        • Chongqing High-Tech Research Program
        • Research Grants Council of the Hong Kong Special Administrative Region, China
