- Research article, June 2023
MTIA: First Generation Silicon Targeting Meta's Recommendation Systems
- Amin Firoozshahian,
- Joel Coburn,
- Roman Levenstein,
- Rakesh Nattoji,
- Ashwin Kamath,
- Olivia Wu,
- Gurdeepak Grewal,
- Harish Aepala,
- Bhasker Jakka,
- Bob Dreyer,
- Adam Hutchin,
- Utku Diril,
- Krishnakumar Nair,
- Ehsan K. Aredestani,
- Martin Schatz,
- Yuchen Hao,
- Rakesh Komuravelli,
- Kunming Ho,
- Sameer Abu Asal,
- Joe Shajrawi,
- Kevin Quinn,
- Nagesh Sreedhara,
- Pankaj Kansal,
- Willie Wei,
- Dheepak Jayaraman,
- Linda Cheng,
- Pritam Chopda,
- Eric Wang,
- Ajay Bikumandla,
- Arun Karthik Sengottuvel,
- Krishna Thottempudi,
- Ashwin Narasimha,
- Brian Dodds,
- Cao Gao,
- Jiyuan Zhang,
- Mohammed Al-Sanabani,
- Ana Zehtabioskuie,
- Jordan Fix,
- Hangchen Yu,
- Richard Li,
- Kaustubh Gondkar,
- Jack Montgomery,
- Mike Tsai,
- Saritha Dwarakapuram,
- Sanjay Desai,
- Nili Avidan,
- Poorvaja Ramani,
- Karthik Narayanan,
- Ajit Mathews,
- Sethu Gopal,
- Maxim Naumov,
- Vijay Rao,
- Krishna Noru,
- Harikrishna Reddy,
- Prahlad Venkatapuram,
- Alexis Bjorlin
ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, Article No.: 80, Pages 1–13. https://doi.org/10.1145/3579371.3589348
Meta has traditionally relied on using CPU-based servers for running inference workloads, specifically Deep Learning Recommendation Models (DLRM), but the increasing compute and memory requirements of these models have pushed the company towards using ...
- Research article, June 2022
Software-hardware co-design for fast and scalable training of deep learning recommendation models
- Dheevatsa Mudigere,
- Yuchen Hao,
- Jianyu Huang,
- Zhihao Jia,
- Andrew Tulloch,
- Srinivas Sridharan,
- Xing Liu,
- Mustafa Ozdal,
- Jade Nie,
- Jongsoo Park,
- Liang Luo,
- Jie (Amy) Yang,
- Leon Gao,
- Dmytro Ivchenko,
- Aarti Basant,
- Yuxi Hu,
- Jiyan Yang,
- Ehsan K. Ardestani,
- Xiaodong Wang,
- Rakesh Komuravelli,
- Ching-Hsiang Chu,
- Serhat Yilmaz,
- Huayu Li,
- Jiyuan Qian,
- Zhuobo Feng,
- Yinbin Ma,
- Junjie Yang,
- Ellie Wen,
- Hong Li,
- Lin Yang,
- Chonglin Sun,
- Whitney Zhao,
- Dimitry Melts,
- Krishna Dhulipala,
- KR Kishore,
- Tyler Graf,
- Assaf Eisenman,
- Kiran Kumar Matam,
- Adi Gangidi,
- Guoqiang Jerry Chen,
- Manoj Krishnan,
- Avinash Nayak,
- Krishnakumar Nair,
- Bharath Muthiah,
- Mahmoud khorashadi,
- Pallab Bhattacharya,
- Petr Lapukhov,
- Maxim Naumov,
- Ajit Mathews,
- Lin Qiao,
- Mikhail Smelyanskiy,
- Bill Jia,
- Vijay Rao
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture, Pages 993–1011. https://doi.org/10.1145/3470496.3533727
Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we present Neo, a software-hardware ...
- Research article, June 2022
Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product
- Mark Zhao,
- Niket Agarwal,
- Aarti Basant,
- Buğra Gedik,
- Satadru Pan,
- Mustafa Ozdal,
- Rakesh Komuravelli,
- Jerry Pan,
- Tianshu Bao,
- Haowei Lu,
- Sundaram Narayanan,
- Jack Langman,
- Kevin Wilfong,
- Harsha Rastogi,
- Carole-Jean Wu,
- Christos Kozyrakis,
- Parik Pol
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture, Pages 1042–1057. https://doi.org/10.1145/3470496.3533044
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing ...
- Research article, February 2018
HPVM: heterogeneous parallel virtual machine
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Pages 68–80. https://doi.org/10.1145/3178487.3178493
We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our ...
Also Published in:
ACM SIGPLAN Notices: Volume 53, Issue 1
- Poster, September 2016
POSTER: hVISC: A Portable Abstraction for Heterogeneous Parallel Systems
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 443–445. https://doi.org/10.1145/2967938.2976039
Programming heterogeneous parallel systems can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. ...
- Research article, June 2015
Stash: have your scratchpad and cache it too
- Rakesh Komuravelli,
- Matthew D. Sinclair,
- Johnathan Alsop,
- Muhammad Huzaifa,
- Maria Kotsifakou,
- Prakalp Srivastava,
- Sarita V. Adve,
- Vikram S. Adve
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, Pages 707–719. https://doi.org/10.1145/2749469.2750374
Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. ...
Also Published in:
ACM SIGARCH Computer Architecture News: Volume 43, Issue 3S
- Research article, December 2014
Revisiting the Complexity of Hardware Cache Coherence and Some Implications
ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 4, Article No.: 37, Pages 1–22. https://doi.org/10.1145/2663345
Cache coherence is an integral part of shared-memory systems but is also widely considered to be one of the most complex parts of such systems. Much prior work has addressed this complexity and the verification techniques to prove the correctness of ...
- Research article, March 2013
DeNovoND: efficient hardware support for disciplined non-determinism
ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, Pages 13–26. https://doi.org/10.1145/2451116.2451119
Recent work has shown that disciplined shared-memory programming models that provide deterministic-by-default semantics can simplify both parallel software and hardware. Specifically, the DeNovo hardware system has shown that the software guarantees of ...
Also Published in:
ACM SIGARCH Computer Architecture News: Volume 41, Issue 1
ACM SIGPLAN Notices: Volume 48, Issue 4
- Article, October 2011
DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism
- Byn Choi,
- Rakesh Komuravelli,
- Hyojin Sung,
- Robert Smolinski,
- Nima Honarmand,
- Sarita V. Adve,
- Vikram S. Adve,
- Nicholas P. Carter,
- Ching-Tsun Chou
PACT '11: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, Pages 155–166. https://doi.org/10.1109/PACT.2011.21
For parallelism to become tractable for mass programmers, shared-memory languages and environments must evolve to enforce disciplined practices that ban "wild shared-memory behaviors"; e.g., unstructured parallelism, arbitrary data races, and ...
- Research article, June 2010
Parallel SAH k-D tree construction
The k-D tree is a well-studied acceleration data structure for ray tracing. It is used to organize primitives in a scene to allow efficient execution of intersection operations between rays and the primitives. The highest quality k-D tree can be ...
- Research article, October 2009
A type and effect system for deterministic parallel Java
- Robert L. Bocchino,
- Vikram S. Adve,
- Danny Dig,
- Sarita V. Adve,
- Stephen Heumann,
- Rakesh Komuravelli,
- Jeffrey Overbey,
- Patrick Simmons,
- Hyojin Sung,
- Mohsen Vakilian
OOPSLA '09: Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications, Pages 97–116. https://doi.org/10.1145/1640089.1640097
Today's shared-memory parallel programming models are complex and error-prone. While many parallel programs are intended to be deterministic, unanticipated thread interleavings can lead to subtle bugs and nondeterministic semantics. In this paper, we ...
Also Published in:
ACM SIGPLAN Notices: Volume 44, Issue 10
- Article, December 2007
A Prototype for Tiger Hash Primitive Hardware Architecture
ADCOM '07: Proceedings of the 15th International Conference on Advanced Computing and Communications, Pages 327–332. https://doi.org/10.1109/ADCOM.2007.25
With the increasing prominence of the Internet as a tool of commerce, security has become a tremendously important issue. One essential aspect for secure communication over networks is that of cryptography. The increasing prominence of mobile devices ...