research-article

An Out-of-Order Load-Store Queue for Spatial Computing

Authors:

Lana Josipovic,

Paolo Ienne

ACM Transactions on Embedded Computing Systems (TECS), Volume 16, Issue 5s

Article No.: 125, Pages 1 - 19

https://doi.org/10.1145/3126525

Published: 27 September 2017

Abstract

The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This necessitates memory interfaces that can correctly handle memory accesses that arrive in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately meet these requirements: a different allocation policy is needed to achieve out-of-order execution in spatial systems that naturally omit the notion of sequential program order, a fundamental piece of information for correct execution. We show a novel and practical way to organize the allocation for an out-of-order load-store queue for spatial computing. The main idea is to dynamically allocate groups of memory accesses (depending on the dynamic behavior of the application), where the access order within the group is statically predetermined (for instance by a high-level synthesis tool). We detail the construction of our load-store queue and demonstrate on a few practical cases its advantages over standard accelerator-memory interfaces.
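The group-allocation idea in the abstract can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the hardware construction described in the article: the class `GroupLSQ`, the table `group_orders`, and the group name `loop_body` are hypothetical, stores are assumed to issue strictly in allocation order, and store-to-load forwarding is left out. It shows the two ingredients the abstract names: entries are reserved dynamically for a whole group at once, in an order fixed ahead of time, and addresses may then arrive in any order while dependencies are still respected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    kind: str                    # "load" or "store"
    addr: Optional[int] = None   # filled in when the circuit produces the address
    done: bool = False           # True once the access has been sent to memory


class GroupLSQ:
    """Toy model: entries are allocated per group, in a statically fixed order."""

    def __init__(self, group_orders):
        # Hypothetical compile-time table: group name -> access kinds in program order.
        self.group_orders = group_orders
        self.entries = []

    def allocate(self, group):
        """Reserve one entry per access of `group`; return the new entry indices."""
        start = len(self.entries)
        self.entries.extend(Entry(kind) for kind in self.group_orders[group])
        return list(range(start, len(self.entries)))

    def set_address(self, index, addr):
        """Addresses may arrive in any order, possibly long after allocation."""
        self.entries[index].addr = addr

    def can_issue(self, index):
        """An access may issue once no earlier, unfinished access can conflict with it."""
        entry = self.entries[index]
        if entry.addr is None or entry.done:
            return False
        for earlier in self.entries[:index]:
            if earlier.done:
                continue
            if entry.kind == "store":
                return False  # assumption: stores wait for every earlier access
            if earlier.kind == "store" and (earlier.addr is None or earlier.addr == entry.addr):
                return False  # a load waits for a possibly aliasing earlier store
        return True

    def issue(self, index):
        assert self.can_issue(index)
        self.entries[index].done = True


lsq = GroupLSQ({"loop_body": ["load", "store", "load"]})
ld0, st, ld1 = lsq.allocate("loop_body")
lsq.set_address(ld1, 8)       # the youngest load's address arrives first
print(lsq.can_issue(ld1))     # False: the earlier store's address is still unknown
lsq.set_address(st, 4)
print(lsq.can_issue(ld1))     # True: the earlier store targets a different address
```

In this model the second load can issue before the first one has even received its address, because the allocation order alone tells the queue which earlier stores it must check.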

Index Terms

An Out-of-Order Load-Store Queue for Spatial Computing
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Processors and memory architectures
2. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
  2. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Published In

ACM Transactions on Embedded Computing Systems Volume 16, Issue 5s

Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017

October 2017

1448 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3145508

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 27 September 2017

Accepted: 01 July 2017

Revised: 01 May 2017

Received: 01 April 2017

Published in TECS Volume 16, Issue 5s

Qualifiers

Research-article
Research
Refereed
