
Domain-specific hardware accelerators

Published: 18 June 2020

Abstract

Domain-specific accelerators (DSAs) gain efficiency from specialization and performance from parallelism.
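
To make the abstract's two claims concrete, here is a toy back-of-envelope energy model (a minimal sketch in Python; every constant below is an illustrative assumption, not a figure from the article). A general-purpose core pays instruction overhead (fetch, decode, register access) on every multiply-accumulate (MAC), whereas a specialized datapath issues one instruction to drive a wide parallel MAC array, amortizing that overhead: the amortization is the efficiency of specialization, and the wide array is the performance of parallelism.

# Illustrative back-of-envelope model; all constants are assumptions,
# not figures from the article.
INSTRUCTION_OVERHEAD_PJ = 70.0   # assumed fetch/decode/register energy per instruction
MAC_ENERGY_PJ = 1.0              # assumed energy of the MAC arithmetic itself
MACS_PER_ACCEL_INSTR = 256       # assumed width of the specialized MAC array

def energy_per_mac_cpu():
    """One scalar instruction performs one MAC; overhead is paid on every MAC."""
    return INSTRUCTION_OVERHEAD_PJ + MAC_ENERGY_PJ

def energy_per_mac_dsa():
    """One accelerator instruction drives many MACs; overhead is amortized."""
    return INSTRUCTION_OVERHEAD_PJ / MACS_PER_ACCEL_INSTR + MAC_ENERGY_PJ

if __name__ == "__main__":
    cpu, dsa = energy_per_mac_cpu(), energy_per_mac_dsa()
    print(f"general-purpose core: {cpu:.2f} pJ/MAC")
    print(f"specialized datapath: {dsa:.2f} pJ/MAC ({cpu / dsa:.0f}x less energy per MAC)")

Under these assumed numbers the specialized datapath spends roughly 50-60x less energy per MAC; the actual ratio depends entirely on the overhead and datapath width one assumes.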

Published In

Communications of the ACM, Volume 63, Issue 7
July 2020
102 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3407166

Publisher

Association for Computing Machinery

New York, NY, United States
