research-article
Open access

The semantics of shared memory in Intel CPU/FPGA systems

Published: 15 October 2021

Abstract

Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.

Supplementary Material

Auxiliary Presentation Video (oopsla21main-p99-p-video.mp4)
This is a presentation video for our OOPSLA 2021 paper "The Semantics of Shared Memory in Intel CPU/FPGA Systems". Dan Iorga is the presenter in the video.


Published In

Proceedings of the ACM on Programming Languages, Volume 5, Issue OOPSLA
October 2021
2001 pages
EISSN:2475-1421
DOI:10.1145/3492349
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CPU/FPGA
  2. Core Cache Interface (CCI-P)
  3. memory model

Article Metrics

  • Downloads (Last 12 months): 242
  • Downloads (Last 6 weeks): 22
Reflects downloads up to 01 Nov 2024

Cited By

  • (2023) Building GPU TEEs using CPU Secure Enclaves with GEVisor. Proceedings of the 2023 ACM Symposium on Cloud Computing (249-264). Online publication date: 30-Oct-2023. https://doi.org/10.1145/3620678.3624659
  • (2023) Compound Memory Models. Proceedings of the ACM on Programming Languages, 7:PLDI (1145-1168). Online publication date: 6-Jun-2023. https://doi.org/10.1145/3591267
  • (2023) Taking Back Control in an Intermediate Representation for GPU Computing. Proceedings of the ACM on Programming Languages, 7:POPL (1740-1769). Online publication date: 11-Jan-2023. https://doi.org/10.1145/3571253
  • (2023) Simulating Operational Memory Models Using Off-the-Shelf Program Analysis Tools. IEEE Transactions on Software Engineering, 49:12 (5084-5102). Online publication date: 24-Oct-2023. https://doi.org/10.1109/TSE.2023.3326056
  • (2022) Model Checking for AlphaCode-Generated Programs. 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP) (794-798). Online publication date: 15-Apr-2022. https://doi.org/10.1109/ICSP54964.2022.9778472
  • (2022) HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (756-771). Online publication date: Apr-2022. https://doi.org/10.1109/HPCA53966.2022.00061
