EMS-i: An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference

Published: 09 September 2023

Abstract

Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces a low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the aforementioned issues, we propose EMS-i, an efficient memory system design that integrates Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we carefully design the inference kernel and develop a customized mapping scheme for the SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and performance comparable to RecNMP with 72% energy savings. EMS-i also reduces memory cost by up to 8.7× and 6.6× relative to RecSSD and RecNMP, respectively.
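The SLS kernel at the center of this bottleneck is conceptually simple: for each query it gathers a variable-length set of embedding vectors from a large table and reduces them by summation, which is why its cost is dominated by irregular memory accesses rather than arithmetic. A minimal NumPy sketch of the operation (function name, shapes, and the toy data are illustrative, not taken from the paper):

```python
import numpy as np

def sparse_length_sum(table, indices, lengths):
    """Gather embedding rows by index and sum them per query segment.

    table   : (num_rows, dim) embedding table
    indices : flat array of row indices for all queries, concatenated
    lengths : number of indices belonging to each query
    """
    out = []
    start = 0
    for n in lengths:
        segment = indices[start:start + n]   # irregular gather: the expensive part
        out.append(table[segment].sum(axis=0))
        start += n
    return np.stack(out)

# Toy example: 4 embedding vectors of dimension 3.
table = np.arange(12, dtype=np.float32).reshape(4, 3)
# Two queries: the first sums rows {0, 2}, the second reads row {3}.
result = sparse_length_sum(table, np.array([0, 2, 3]), [2, 1])
# result[0] == table[0] + table[2]; result[1] == table[3]
```

Because each query touches a handful of scattered rows in a table that can span hundreds of gigabytes, the access pattern has little spatial locality, which motivates the specialized caching and prefetching described in the abstract.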


Cited By

  • PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 612–626. DOI: 10.1109/MICRO61859.2024.00052. Published 2-Nov-2024.
  • Accelerating Large-Scale DLRM Inference through Dynamic Hot Data Rearrangement. 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5. DOI: 10.1109/ISCAS58744.2024.10558132. Published 19-May-2024.
  • NDSEARCH: Accelerating Graph-Traversal-Based Approximate Nearest Neighbor Search through Near Data Processing. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 368–381. DOI: 10.1109/ISCA59077.2024.00035. Published 29-Jun-2024.
  • Evaluating the Efficiency of Caching Strategies in Reducing Application Latency. Journal of Science & Technology 4, 6, 83–98. DOI: 10.55662/JST.2023.4606. Published 6-Nov-2023.

      Published In

      ACM Transactions on Embedded Computing Systems  Volume 22, Issue 5s
      Special Issue ESWEEK 2023
      October 2023
      1394 pages
      ISSN:1539-9087
      EISSN:1558-3465
      DOI:10.1145/3614235
      • Editor:
      • Tulika Mitra

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 September 2023
      Accepted: 13 July 2023
      Revised: 02 June 2023
      Received: 23 March 2023
      Published in TECS Volume 22, Issue 5s


      Author Tags

      1. Recommendation system
      2. Compute Express Link

      Qualifiers

      • Research-article

      Funding Sources

      • National Science Foundation
      • National Science Foundation IUCRC memberships from Samsung and other companies


      Article Metrics

      • Downloads (last 12 months): 520
      • Downloads (last 6 weeks): 17
      Reflects downloads up to 11 Jan 2025
