
Domain-specific hardware accelerators

Published: 18 June 2020

Abstract

Domain-specific accelerators (DSAs) gain efficiency from specialization and performance from parallelism.
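
To make the abstract's two claims concrete, here is a toy back-of-envelope energy model (a minimal sketch in Python; every constant below is an illustrative assumption, not a figure from the article). A general-purpose core pays instruction overhead (fetch, decode, register access) on every multiply-accumulate (MAC), whereas a specialized datapath issues one instruction to drive a wide parallel MAC array, amortizing that overhead: the amortization is the efficiency of specialization, and the wide array is the performance of parallelism.

# Illustrative back-of-envelope model; all constants are assumptions,
# not figures from the article.
INSTRUCTION_OVERHEAD_PJ = 70.0   # assumed fetch/decode/register energy per instruction
MAC_ENERGY_PJ = 1.0              # assumed energy of the MAC arithmetic itself
MACS_PER_ACCEL_INSTR = 256       # assumed width of the specialized MAC array

def energy_per_mac_cpu():
    """One scalar instruction performs one MAC; overhead is paid on every MAC."""
    return INSTRUCTION_OVERHEAD_PJ + MAC_ENERGY_PJ

def energy_per_mac_dsa():
    """One accelerator instruction drives many MACs; overhead is amortized."""
    return INSTRUCTION_OVERHEAD_PJ / MACS_PER_ACCEL_INSTR + MAC_ENERGY_PJ

if __name__ == "__main__":
    cpu, dsa = energy_per_mac_cpu(), energy_per_mac_dsa()
    print(f"general-purpose core: {cpu:.2f} pJ/MAC")
    print(f"specialized datapath: {dsa:.2f} pJ/MAC ({cpu / dsa:.0f}x less energy per MAC)")

Under these assumed numbers the specialized datapath spends roughly 50-60x less energy per MAC; the actual ratio depends entirely on the overhead and datapath width one assumes.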

Published In

Communications of the ACM, Volume 63, Issue 7
July 2020
102 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3407166

Publisher

Association for Computing Machinery

New York, NY, United States
