DOI: 10.1145/3349567.3351726

DORY: Lightweight memory hierarchy management for deep NN inference on IoT endnodes: work-in-progress

Published: 13 October 2019

Abstract

IoT endnodes often couple a small, fast L1 scratchpad memory with a higher-capacity but slower, lower-bandwidth L2 background memory. The absence of a coherent hardware cache hierarchy saves energy but comes at the cost of labor-intensive explicit memory management, complicating the deployment of algorithms with a large data memory footprint, such as Deep Neural Network (DNN) inference. In this work, we present DORY, a lightweight software cache for DNN Deployment Oriented to memoRY. DORY leverages static data tiling and DMA-based double buffering to hide the complexity of manual L1-L2 memory traffic management. DORY enables storage of activations and weights in L2 with less than 4% performance overhead with respect to direct execution in L1. We show that a 142 kB DNN achieving 79.9% accuracy on CIFAR-10 runs 3.2× faster and consumes 1.9× less energy than when executed directly from L2 memory.
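The DMA-based double-buffering scheme the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not DORY's actual code: `dma_copy` is a hypothetical stand-in (modeled here with a plain `memcpy`) for the asynchronous cluster DMA of a platform such as GAP-8, and the per-tile "compute" is a trivial reduction standing in for a convolution kernel.

```c
#include <stdint.h>
#include <string.h>

#define L2_SIZE  1024           /* total elements resident in "L2"  */
#define TILE     128            /* elements per L1 tile             */
#define N_TILES  (L2_SIZE / TILE)

/* Stand-in for an asynchronous DMA transfer: on a real IoT endnode
   this would program the cluster DMA and return immediately; here a
   blocking memcpy models the completed copy. */
static void dma_copy(int8_t *dst, const int8_t *src, size_t n) {
    memcpy(dst, src, n);
}

int32_t run_layer(const int8_t *l2_data) {
    int8_t l1_buf[2][TILE];     /* ping-pong L1 buffers */
    int32_t acc = 0;

    /* Prologue: fetch the first tile before the loop starts. */
    dma_copy(l1_buf[0], l2_data, TILE);

    for (int t = 0; t < N_TILES; t++) {
        int cur = t & 1;

        /* Kick off the transfer of the NEXT tile into the other
           buffer; with a real asynchronous DMA this overlaps the
           compute below, hiding the L2 access latency. */
        if (t + 1 < N_TILES)
            dma_copy(l1_buf[!cur], l2_data + (t + 1) * TILE, TILE);

        /* Compute on the current tile while the next one streams in. */
        for (int i = 0; i < TILE; i++)
            acc += l1_buf[cur][i];

        /* With a real DMA, a wait-for-completion would go here,
           before the buffers swap roles on the next iteration. */
    }
    return acc;
}
```

Because compute and transfer overlap, the steady-state cost per tile is max(compute, DMA) rather than their sum, which is where the reported near-L1 performance comes from.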




Published In

CODES/ISSS '19: Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis Companion
October 2019
64 pages
ISBN:9781450369237
DOI:10.1145/3349567
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

  • Research-article

Conference

CODES/ISSS '19

Acceptance Rates

Overall acceptance rate: 280 of 864 submissions (32%)

Article Metrics

  • Downloads (last 12 months): 34
  • Downloads (last 6 weeks): 2

Reflects downloads up to 08 Feb 2025.

Cited By

  • TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 1033-1047 (Apr 2024). DOI: 10.1145/3620666.3651328
  • A Fully Integrated 5-mW, 0.8-Gbps Energy-Efficient Chip-to-Chip Data Link for Ultralow-Power IoT End-Nodes in 65-nm CMOS. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29(10), 1800-1811 (Oct 2021). DOI: 10.1109/TVLSI.2021.3108806
  • Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters. IEEE Transactions on Parallel and Distributed Systems 32(3), 633-648 (Mar 2021). DOI: 10.1109/TPDS.2020.3028691
  • ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Transactions on Computers 70(8), 1160-1174 (Aug 2021). DOI: 10.1109/TC.2021.3059962
  • Predicting Hard Disk Failures in Data Centers Using Temporal Convolutional Neural Networks. Euro-Par 2020: Parallel Processing Workshops, 277-289 (Mar 2021). DOI: 10.1007/978-3-030-71593-9_22
  • Memory-Latency-Accuracy Trade-Offs for Continual Learning on a RISC-V Extreme-Edge Node. 2020 IEEE Workshop on Signal Processing Systems (SiPS), 1-6 (Oct 2020). DOI: 10.1109/SiPS50750.2020.9195220
  • Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain-Machine Interfaces. 2020 IEEE International Conference on Smart Computing (SMARTCOMP), 284-289 (Sep 2020). DOI: 10.1109/SMARTCOMP50058.2020.00065
