An Out-of-Order Load-Store Queue for Spatial Computing

Published: 27 September 2017

Abstract

The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This necessitates memory interfaces that can correctly handle memory accesses that arrive in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately meet these requirements: a different allocation policy is needed to achieve out-of-order execution in spatial systems that naturally omit the notion of sequential program order, a fundamental piece of information for correct execution. We show a novel and practical way to organize the allocation for an out-of-order load-store queue for spatial computing. The main idea is to dynamically allocate groups of memory accesses (depending on the dynamic behavior of the application), where the access order within the group is statically predetermined (for instance by a high-level synthesis tool). We detail the construction of our load-store queue and demonstrate on a few practical cases its advantages over standard accelerator-memory interfaces.
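
The allocation idea in the abstract can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the paper's hardware design: `GROUP_ORDER`, `ToyLSQ`, and the disambiguation rules (a load waits on any earlier pending store whose address is unknown or matching; stores commit in allocation order) are all hypothetical names and simplifications introduced here. Groups are allocated dynamically, one per executed block, while the order of accesses inside each group comes from a fixed table.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical compile-time table: each group (e.g. a basic block) lists its
# accesses in a fixed order, "L" for a load and "S" for a store.
GROUP_ORDER = {"loop_body": ["L", "S"]}


@dataclass
class Entry:
    kind: str
    addr: Optional[int] = None
    data: Optional[int] = None
    done: bool = False


class ToyLSQ:
    def __init__(self, memory):
        self.memory = memory
        self.queue = []  # entries kept in the order they were allocated

    def allocate_group(self, group):
        """Reserve one entry per access of the group, in its static order."""
        start = len(self.queue)
        self.queue.extend(Entry(kind) for kind in GROUP_ORDER[group])
        return list(range(start, len(self.queue)))

    def supply(self, index, addr, data=None):
        """Record operands that arrive, possibly out of order."""
        entry = self.queue[index]
        entry.addr, entry.data = addr, data

    def step(self):
        """Execute every access that is currently safe; return how many ran."""
        executed = 0
        for i, entry in enumerate(self.queue):
            if entry.done or entry.addr is None:
                continue
            earlier = self.queue[:i]
            if entry.kind == "L":
                # Assumed rule: a load may bypass earlier pending stores only
                # if each of them has a known, different address.
                blocked = any(
                    p.kind == "S" and not p.done
                    and (p.addr is None or p.addr == entry.addr)
                    for p in earlier
                )
                if blocked:
                    continue
                entry.data = self.memory.get(entry.addr, 0)
            else:
                # Assumed rule: a store commits only after all earlier entries.
                if entry.data is None or not all(p.done for p in earlier):
                    continue
                self.memory[entry.addr] = entry.data
            entry.done = True
            executed += 1
        return executed


lsq = ToyLSQ({0: 10, 4: 20})
first = lsq.allocate_group("loop_body")   # entries 0 (load), 1 (store)
second = lsq.allocate_group("loop_body")  # entries 2 (load), 3 (store)

lsq.supply(second[0], addr=4)             # the later load's address arrives first
lsq.step()                                # nothing runs: earlier store address unknown
lsq.supply(first[1], addr=0, data=99)
lsq.step()                                # the later load runs: addresses differ
lsq.supply(first[0], addr=0)
lsq.step()                                # the earlier load, then the store, run
```

In this sketch the second group's load completes before anything in the first group, yet the first group's load still observes the value from before its store, because the queue position assigned at allocation time stands in for the program order that the spatial datapath itself does not carry.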

Published In

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 5s
Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
October 2017
1448 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3145508
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2017
Accepted: 01 July 2017
Revised: 01 May 2017
Received: 01 April 2017
Published in TECS Volume 16, Issue 5s

Author Tags

  1. Load-store queue
  2. allocation
  3. dynamic scheduling
  4. spatial computing

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 75
  • Downloads (last 6 weeks): 11
Reflects downloads up to 15 Feb 2025

Cited By

  • (2025) CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 249-263. DOI: 10.1145/3669940.3707273. Online publication date: 3-Feb-2025
  • (2024) Wavefront Threading Enables Effective High-Level Synthesis. Proceedings of the ACM on Programming Languages 8:PLDI, 1066-1090. DOI: 10.1145/3656420. Online publication date: 20-Jun-2024
  • (2024) Suppressing Spurious Dynamism of Dataflow Circuits via Latency and Occupancy Balancing. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 188-198. DOI: 10.1145/3626202.3637570. Online publication date: 1-Apr-2024
  • (2024) Survival of the Fastest: Enabling More Out-of-Order Execution in Dataflow Circuits. Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 44-54. DOI: 10.1145/3626202.3637556. Online publication date: 1-Apr-2024
  • (2024) HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 215-230. DOI: 10.1145/3617232.3624850. Online publication date: 27-Apr-2024
  • (2024) Fast Switching Activity Estimation for HLS-Produced Dataflow Circuits. 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), 118-125. DOI: 10.1109/FPL64840.2024.00025. Online publication date: 2-Sep-2024
  • (2024) Efficient Design Space Exploration for Dynamic & Speculative High-Level Synthesis. 2024 34th International Conference on Field-Programmable Logic and Applications (FPL), 109-117. DOI: 10.1109/FPL64840.2024.00024. Online publication date: 2-Sep-2024
  • (2024) High-Level Synthesis. FPGA EDA, 113-134. DOI: 10.1007/978-981-99-7755-0_8. Online publication date: 1-Feb-2024
  • (2023) Parallelising Control Flow in Dynamic-scheduling High-level Synthesis. ACM Transactions on Reconfigurable Technology and Systems 16:4, 1-32. DOI: 10.1145/3599973. Online publication date: 1-Sep-2023
  • (2023) Resource Sharing in Dataflow Circuits. ACM Transactions on Reconfigurable Technology and Systems 16:4, 1-27. DOI: 10.1145/3597614. Online publication date: 1-Sep-2023
