research-article

Optimal compilation for exposed datapath architectures with buffered processing units by SAT solvers

Authors:

Anoop Bhagyanath and

Klaus Schneider

MEMOCODE '16: Proceedings of the 14th ACM-IEEE International Conference on Formal Methods and Models for System Design

November 2016

Pages 143 - 152

Published: 18 November 2016

Abstract

Conventional processor architectures are limited in how much instruction-level parallelism (ILP) they can exploit because their instruction sets offer only a small number of registers. Recent processor architectures therefore expose their datapaths, so that the compiler not only schedules instructions to functional units but also moves values directly between functional units, avoiding the need for registers altogether. However, current compiler technology is still based on classic register architectures, where a nearly optimal register mapping is the key to the quality of the generated assembly code.

The Synchronous Control Asynchronous Dataflow (SCAD) architecture is a new exposed datapath architecture in which processing units (PUs) are equipped with first-in first-out (FIFO) buffers at their inputs and outputs. Code for SCAD machines can be generated as for classic queue machines, which completely eliminates the use of registers and improves the degree of exploited ILP. However, SCAD code generated this way is not optimal: unlike queue machines, SCAD machines can contain many PUs and buffers, which gives the compiler more freedom to reduce unnecessary computational overhead. In this paper, we map the SCAD code generation problem to a satisfiability problem and then use SAT solvers to generate code without overhead that works with the minimal number of PUs. The generated optimal code will serve as a reference for judging the quality of the heuristics that will ultimately be used in SCAD compilers.
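The abstract does not spell out the SAT encoding, so the following Python sketch only illustrates the general idea of asking a SAT solver for a minimal number of PUs. It rests on a deliberately simplified model that is an assumption made here, not the paper's formulation: each PU has a single output FIFO, every result is consumed exactly once, and the production and consumption times of all results are already fixed. Two results may then share a PU only if they are consumed in the order in which they were produced. The names `dpll`, `encode`, `fifo_conflicts`, and `min_pus` are hypothetical helpers, and the tiny DPLL procedure stands in for a real SAT solver.

```python
from itertools import combinations

def dpll(clauses, assignment=None):
    """Tiny DPLL solver; clauses are lists of non-zero ints (DIMACS style)."""
    assignment = dict(assignment or {})
    simplified = []
    for clause in clauses:
        remaining, satisfied = [], False
        for lit in clause:
            value = assignment.get(abs(lit))
            if value is None:
                remaining.append(lit)          # literal still undecided
            elif value == (lit > 0):
                satisfied = True               # clause already true
                break
        if satisfied:
            continue
        if not remaining:
            return None                        # every literal false: conflict
        simplified.append(remaining)
    if not simplified:
        return assignment                      # all clauses satisfied
    for clause in simplified:                  # unit propagation
        if len(clause) == 1:
            assignment[abs(clause[0])] = clause[0] > 0
            return dpll(simplified, assignment)
    variable = abs(simplified[0][0])           # branch on an open variable
    for value in (True, False):
        model = dpll(simplified, {**assignment, variable: value})
        if model is not None:
            return model
    return None

def var(op, pu, num_pus):
    """Index of the Boolean variable 'operation op runs on PU pu'."""
    return op * num_pus + pu + 1

def encode(num_ops, conflicts, num_pus):
    """CNF: each operation on exactly one PU, conflicting ones on different PUs."""
    clauses = []
    for op in range(num_ops):
        clauses.append([var(op, p, num_pus) for p in range(num_pus)])
        for p, q in combinations(range(num_pus), 2):
            clauses.append([-var(op, p, num_pus), -var(op, q, num_pus)])
    for a, b in conflicts:
        for p in range(num_pus):
            clauses.append([-var(a, p, num_pus), -var(b, p, num_pus)])
    return clauses

def fifo_conflicts(produced, consumed):
    """Pairs whose consumption order differs from their production order."""
    return [(a, b) for a, b in combinations(range(len(produced)), 2)
            if (produced[a] < produced[b]) != (consumed[a] < consumed[b])]

def min_pus(produced, consumed):
    """Smallest number of PUs (toy model) plus one valid operation-to-PU mapping."""
    n = len(produced)
    conflicts = fifo_conflicts(produced, consumed)
    for k in range(1, n + 1):                  # try 1, 2, ... PUs until SAT
        model = dpll(encode(n, conflicts, k))
        if model is not None:
            mapping = [next(p for p in range(k) if model.get(var(op, p, k)))
                       for op in range(n)]
            return k, mapping
    return 0, []                               # only reached when n == 0

# Result 0 is produced before result 1 but consumed after it,
# so the two cannot share one FIFO in this toy model.
k, mapping = min_pus([0, 1, 2, 3], [5, 4, 6, 7])
print(k)  # prints 2
```

Increasing the number of PUs until the formula becomes satisfiable is one simple way to obtain a minimum. In this toy model the question is equivalent to graph coloring, whereas the actual SCAD problem also has to account for instruction ordering and multiple buffers per PU, which this sketch leaves out.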


Cited By

Schneider K, Arun-Kumar S, Mery D, Saha I, Zhang L (2021). Translating structured sequential programs to dataflow graphs. Proceedings of the 19th ACM-IEEE International Conference on Formal Methods and Models for System Design, pp. 66-77. doi:10.1145/3487212.3487343. Online publication date: 20-Nov-2021.
https://dl.acm.org/doi/10.1145/3487212.3487343

Index Terms

1. Software and its engineering
  1. Software notations and tools


Published In

MEMOCODE '16: Proceedings of the 14th ACM-IEEE International Conference on Formal Methods and Models for System Design

November 2016

196 pages

ISBN: 9781509027910

General Chair:
Jean-Pierre Talpin
INRIA, France

Publisher

IEEE Press

Qualifiers

Research-article

Conference

MEMOCODE'16

MEMOCODE'16: 14th ACM-IEEE International Conference on Formal Methods and Models for System Design

November 18 - 20, 2016

Kanpur, India

Acceptance Rates

Overall Acceptance Rate 34 of 82 submissions, 41%

Bibliometrics

Total Citations: 1
Total Downloads: 11
Downloads (last 12 months): 2
Downloads (last 6 weeks): 0
