Reconfigurable Computing: Architectures and Design Methods
T.J. Todman, G.A. Constantinides, S.J.E. Wilton, O. Mencer, W. Luk and P.Y.K. Cheung
Abstract: Reconfigurable computing is becoming increasingly attractive for many applications. This survey covers two aspects of reconfigurable computing: architectures and design methods. The paper includes recent advances in reconfigurable architectures, such as the Altera Stratix II and Xilinx Virtex 4 FPGA devices. The authors identify major trends in general-purpose and special-purpose design methods. It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.
1 Introduction
Reconfigurable computing is rapidly establishing itself as a major discipline that covers various subjects of learning, including both computing science and electronic engineering. Reconfigurable computing involves the use of reconfigurable devices, such as field programmable gate arrays (FPGAs), for computing purposes. Reconfigurable computing is also known as configurable computing or custom computing, since many of the design techniques can be seen as customising a computational fabric for specific applications [1]. Reconfigurable computing systems often have impressive performance. Consider, as an example, the point multiplication operation in elliptic curve cryptography. For a key size of 270 bits, it has been reported [2] that a point multiplication can be computed in 0.36 ms with a reconfigurable computing design implemented in an XC2V6000 FPGA at 66 MHz. In contrast, an optimised software implementation requires 196.71 ms on a dual-Xeon computer at 2.6 GHz; so the reconfigurable computing design is more than 540 times faster, while its clock speed is almost 40 times slower than the Xeon processors. This example illustrates a hardware design implemented on a reconfigurable computing platform. We regard such implementations as a subset of reconfigurable computing, which in general can involve the use of runtime reconfiguration and soft processors. Is this speed advantage of reconfigurable computing over traditional microprocessors a one-off or a sustainable trend?
Recent research suggests that it is a trend rather than a one-off for a wide variety of applications: from image processing [3] to floating-point operations [4]. Sheer speed, while important, is not the only strength of reconfigurable computing. Another compelling advantage is reduced energy and power consumption. In a reconfigurable system, the circuitry is optimised for the application, such that the power consumption will tend to be much lower than that for a general-purpose processor. A recent study [5] reports that moving critical software loops to reconfigurable hardware results in average energy savings of 35% to 70% with an average speedup of 3 to 7 times, depending on the particular device used. Other advantages of reconfigurable computing include a reduction in size and component count (and hence cost), improved time-to-market, and improved flexibility and upgradability. These advantages are especially important for embedded applications. Indeed, there is evidence [6] that embedded systems developers show a growing interest in reconfigurable computing systems, especially with the introduction of soft cores which can contain one or more instruction processors [7–12].

In this paper, we present a survey of modern reconfigurable system architectures and design methods. Although we also provide background information on notable aspects of older technologies, our focus is on the most recent architectures and design methods, as well as the trends that will drive each of these areas in the near future.
In other words, we intend to complement other survey papers [13–17] by: (i) providing an up-to-date survey of material that appears after the publication of the papers mentioned above; (ii) identifying explicitly the main trends in architectures and design methods for reconfigurable computing; (iii) examining reconfigurable computing from a perspective different from existing surveys, for instance classifying design methods as special-purpose and general-purpose; (iv) offering various direct comparisons of technology options according to a selected set of metrics from different perspectives.

2 Background
© IEE, 2005
IEE Proceedings online no. 20045086
doi: 10.1049/ip-cdt:20045086
Paper first received 14th July and in revised form 9th November 2004
T.J. Todman, O. Mencer and W. Luk are with the Department of Computing, Imperial College London, 180 Queen's Gate, London SW7 2AZ, UK
G.A. Constantinides and P.Y.K. Cheung are with the Department of Electrical and Electronic Engineering, Imperial College London, Exhibition Rd, South Kensington, London SW7 2BT, UK
S.J.E. Wilton is with the Department of Electrical and Computer Engineering, University of British Columbia, 2356 Main Mall, Vancouver, British Columbia, Canada V6T 1Z4
E-mail: tjt97@doc.ic.ac.uk
IEE Proc.-Comput. Digit. Tech., Vol. 152, No. 2, March 2005
Many of today's computationally intensive applications require more processing power than ever before. Applications such as streaming video, image recognition and processing, and highly interactive services are placing new demands on the computation units that implement these applications. At the same time, the power consumption targets, the acceptable packaging and manufacturing costs, and the time-to-market requirements of these computation units are all decreasing rapidly, especially in the embedded hand-held devices market. Meeting these performance requirements under the power, cost and time-to-market constraints is becoming increasingly challenging. In the following, we describe three ways of supporting such processing requirements: high-performance microprocessors, application-specific integrated circuits and reconfigurable computing systems.

High-performance microprocessors provide an off-the-shelf means of addressing the processing requirements described earlier. Unfortunately, for many applications, a single processor, even an expensive state-of-the-art processor, is not fast enough. In addition, the power consumption (100 W or more) and cost (possibly thousands of dollars) of state-of-the-art processors place them out of reach for many embedded applications. Even if microprocessors continue to follow Moore's Law so that their density doubles every 18 months, they may still be unable to keep up with the requirements of some of the most aggressive embedded applications.

Application-specific integrated circuits (ASICs) provide another means of addressing these processing requirements. Unlike a software implementation, an ASIC implementation provides a natural mechanism for implementing the large amount of parallelism found in many of these applications. In addition, an ASIC does not need to suffer from the serial (and often slow and power-hungry) instruction fetch, decode and execute cycle that is at the heart of all microprocessors. Furthermore, ASICs consume less power than reconfigurable devices.
Finally, an ASIC can contain just the right mix of functional units for a particular application; in contrast, an off-the-shelf microprocessor contains a fixed set of functional units which must be selected to satisfy a wide variety of applications.

Despite the advantages of ASICs, they are often infeasible or uneconomical for many embedded systems. This is primarily due to two factors: the cost of producing an ASIC, often dominated by the mask cost (up to $1 million [18]), and the time to develop a custom integrated circuit can both be unacceptable. Only for the very highest-volume applications would the improved performance and lower per-unit price warrant the high nonrecurring engineering (NRE) cost of designing an ASIC.

A third means of providing this processing power is a reconfigurable computing system. A reconfigurable computing system typically contains one or more processors and a reconfigurable fabric upon which custom functional units can be built. The processor(s) executes sequential and noncritical code, while code that can be efficiently mapped to hardware can be executed by processing units that have been mapped to the reconfigurable fabric. Like a custom integrated circuit, the functions that have been mapped to the reconfigurable fabric can take advantage of the parallelism achievable in a hardware implementation. Also like an ASIC, the embedded system designer can produce the right mix of functional and storage units in the reconfigurable fabric, providing a computing structure that matches the application.

Unlike an ASIC, however, a new fabric need not be designed for each application. A given fabric can implement a wide variety of functional units. This means that a reconfigurable computing system can be built out of off-the-shelf components, significantly reducing the long design time inherent in an ASIC implementation. Also unlike an ASIC, the functional units implemented in the reconfigurable fabric can change over time. This means that as the environment or usage of the embedded system changes, the mix of functional units can adapt to better match the new environment. The reconfigurable fabric in a handheld device, for instance, might implement large matrix multiply operations when the device is used in one mode, and large signal processing functions when the device is used in another mode.

Typically, not all of the embedded system functionality needs to be implemented by the reconfigurable fabric. Only those parts of the computation that are time-critical and contain a high degree of parallelism need to be mapped to the reconfigurable fabric, while the remainder of the computation can be implemented by a standard instruction processor. The interface between the processor and the fabric, as well as the interface between the memory and the fabric, are therefore of the utmost importance. Modern reconfigurable devices are large enough to implement instruction processors within the programmable fabric itself: soft processors. These can be general purpose, or customised to a particular application; application-specific instruction processors and flexible instruction processors are two such approaches. Section 4.3.2 deals with soft processors in more detail.

Other devices show some of the flexibility of reconfigurable computers. Examples include graphics processor units and application-specific array processors. These devices perform well on their intended application, but cannot run more general computations, unlike reconfigurable computers and microprocessors.

Despite the compelling promise of reconfigurable computing, it has limitations of which designers should be aware. For instance, the flexible routing on the bit level tends to produce large silicon area and performance overhead when compared with ASIC technology.
Hence for large volume production of designs in applications without the need for field upgrade, ASIC technology or gate array technology can still deliver higher-performance designs at lower unit cost than reconfigurable computing technology. However, since FPGA technology tracks advances in memory technology and has demonstrated impressive advances in the last few years, many are confident that the current rapid progress in FPGA speed, capacity and capability will continue, together with the reduction in price.

It should be noted that the development of reconfigurable systems is still a maturing field. There are a number of challenges in developing a reconfigurable system. We describe two such challenges below. First, the structure of the reconfigurable fabric and the interfaces between the fabric, processor(s) and memory must be very efficient. Some reconfigurable computing systems use a standard field-programmable gate array [19–24] as a reconfigurable fabric, while others adopt custom-designed fabrics [25–36]. Another challenge is the development of computer-aided design and compilation tools that map an application to a reconfigurable computing system. This involves determining which parts of the application should be mapped to the fabric and which should be mapped to the processor, determining when and how often the reconfigurable fabric should be reconfigured, which changes the functional units implemented in the fabric, as well as the specification of algorithms for efficient mappings to the reconfigurable system.

In this paper, we provide a survey of reconfigurable computing, focusing our discussion on both the issues described above. In the following Section, we provide a survey of various architectures that are found useful for reconfigurable computing; material on design methods will follow.

3 Architectures
We shall first describe system-level architectures for reconfigurable computing. We then present various flavours of reconfigurable fabric. Finally we identify and summarise the main trends.
reconfigurable system. A summary of the main features of various architectures can be found in Table 2.
3.2.1 Reconfigurable functional units: Reconfigurable functional units can be classified as either coarse-grained or fine-grained. A fine-grained functional unit can typically implement a single function on a single bit (or a small number of bits). The most common kind of fine-grained functional units are the small lookup tables that are used to implement the bulk of the logic in a commercial field-programmable gate array. A coarse-grained functional unit, on the other hand, is typically much larger, and may consist of arithmetic and logic units (ALUs) and possibly even a significant amount of storage. In this Section, we describe the two types of functional units in more detail.

Many reconfigurable systems use commercial FPGAs as a reconfigurable fabric. These commercial FPGAs contain many three- to six-input lookup tables, each of which can be thought of as a very fine-grained functional unit. Figure 2a illustrates a lookup table; by shifting in the correct pattern of bits, this functional unit can implement any single function of up to three inputs; the extension to lookup tables with larger numbers of inputs is clear. Typically, lookup tables are combined into clusters, as shown in Fig. 2b. Figure 3 shows clusters in two popular FPGA families. Figure 3a shows a cluster in the Altera Stratix device; Altera calls these clusters logic array blocks [20]. Figure 3b shows a cluster in the Xilinx architecture [24]; Xilinx calls these clusters configurable logic blocks (CLBs). In the Altera diagram, each block labelled LE is a lookup table, while in the Xilinx diagram, each slice contains two lookup tables. Other commercial FPGAs are described in [19, 21–23].

Reconfigurable fabrics containing lookup tables are very flexible, and can be used to implement any digital circuit. However, compared to the coarse-grained structures in Section 3.2.2, these fine-grained structures have significantly more area, delay and power overhead. Recognising that these fabrics are often used for arithmetic purposes, FPGA companies have added features such as carry-chains and cascade-chains to reduce the overhead when implementing common arithmetic and logic functions.
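The behaviour of a lookup table as a fine-grained functional unit can be illustrated with a short software model. The following is a minimal sketch rather than a description of any vendor's circuit: the names `make_lut` and `program_from_function` are hypothetical, and the configuration bits are held in a Python list instead of being shifted into configuration memory.

```python
from itertools import product


def make_lut(config_bits):
    """Return a k-input lookup table programmed with 2**k configuration bits.

    config_bits[i] is the output for the input combination whose binary
    encoding (first input = most significant bit) equals i.
    """
    k = len(config_bits).bit_length() - 1
    if k < 0 or len(config_bits) != 1 << k:
        raise ValueError("number of configuration bits must be a power of two")

    def lut(*inputs):
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return config_bits[index]

    return lut


def program_from_function(func, k):
    """Derive the configuration bits that make a k-input LUT compute func."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]


# Program a 3-input LUT as the sum output of a full adder (a XOR b XOR cin).
sum_bits = program_from_function(lambda a, b, cin: a ^ b ^ cin, 3)
full_adder_sum = make_lut(sum_bits)

# The same structure becomes a majority (carry-out) function simply by
# loading a different bit pattern.
carry_bits = program_from_function(
    lambda a, b, cin: (a & b) | (a & cin) | (b & cin), 3)
full_adder_carry = make_lut(carry_bits)
```

Only the eight stored bits differ between the two functions, which is why a single lookup table can implement any function of up to three inputs.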
Figure 4 shows how the carry and cascade chains, as well as the ability to break a four-input lookup table into four two-input lookup tables, can be exploited to efficiently implement carry-select adders [20]. The multiplexers and the exclusive-or gate in Fig. 4 are included as part of each logic array block, and need not be implemented using other lookup tables.

The example in Fig. 4 shows how the efficiency of commercial FPGAs can be improved by adding architectural support for common functions. We can go much further than this, though, and embed significantly larger, but far less flexible, reconfigurable functional units. There are two kinds of devices that contain coarse-grained functional units; modern FPGAs, which are primarily composed of fine-grained functional units, are increasingly being enhanced by the inclusion of larger blocks. As an example, the Xilinx Virtex device contains embedded 18-bit by 18-bit multiplier units [24]. When implementing algorithms requiring a large amount of multiplication, these embedded blocks can significantly improve the density, speed and power of the device. On the other hand, for algorithms which do not perform multiplication, these blocks are rarely useful. The Altera Stratix devices contain a larger but more flexible embedded block, called a DSP block, shown in Fig. 5 [20]. Each of these blocks can perform accumulate functions as well as multiply operations. The comparison between the two devices clearly illustrates the tradeoff between flexibility and overhead: the Altera DSP block may be more flexible than the Xilinx multiplier; however, it consumes more chip area and runs somewhat slower.

The commercial FPGAs described above contain both fine-grained and coarse-grained blocks. There are also devices which contain only coarse-grained blocks [25, 26, 28, 30, 31, 35]. An example of a coarse-grained architecture is the ADRES architecture shown in Fig. 6 [31]. Each reconfigurable functional unit in this device contains a 32-bit ALU which can be configured to implement one of several functions including addition, multiplication and logic functions, with two small register files. Clearly, such a functional unit is far less flexible than the fine-grained functional units described earlier; however, if the application requires functions which match the capabilities of the ALU, these functions can be very efficiently implemented in this architecture.
3.2.2 Reconfigurable interconnects: Regardless of whether a device contains fine-grained functional units, coarse-grained functional units, or a mixture of the two, the functional units need to be connected in a flexible way. Again, there is a tradeoff between the flexibility of the interconnect (and hence the reconfigurable fabric) and the speed, area and power-efficiency of the architecture.

As before, reconfigurable interconnect architectures can be classified as fine-grained or coarse-grained. The distinction is based on the granularity with which wires are switched. This is illustrated in Fig. 7, which shows a flexible interconnect between two buses. In the fine-grained architecture in Fig. 7a, each wire can be switched independently, while in Fig. 7b the entire bus is switched as a unit. The fine-grained routing architecture in Fig. 7a is more flexible, since not every bit needs to be routed in the same way; however, the coarse-grained architecture in Fig. 7b contains far fewer programming bits, and hence suffers much less overhead.

Fine-grained routing architectures are usually found in commercial FPGAs. In these devices, the functional units are typically arranged in a grid pattern, and they are connected using horizontal and vertical channels. Significant research has been performed in the optimisation of the topology of this interconnect [49, 50]. Coarse-grained routing architectures are commonly used in devices containing coarse-grained functional units. Figure 8 shows two examples of coarse-grained routing architectures: (a) the Totem reconfigurable system [25]; (b) the Silicon Hive reconfigurable system [34], which is less flexible but faster and smaller.
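The difference in programming-bit count between the two interconnect styles can be made concrete with a small calculation. This is a rough sketch under a stated assumption: the fine-grained case is modelled as a full crossbar with one programming bit per switch point, which is a simplification chosen for illustration and not taken from Fig. 7. The function names are hypothetical.

```python
def fine_grained_bits(bus_width):
    """Programming bits when every wire of one bus can be switched
    independently onto any wire of the other bus.

    Assumption: a full crossbar with one programming bit per switch point.
    """
    return bus_width * bus_width


def coarse_grained_bits(bus_width):
    """Programming bits when the whole bus is switched as a unit:
    one bit controls all wires, regardless of the bus width."""
    return 1


for width in (8, 16, 32):
    print(width, fine_grained_bits(width), coarse_grained_bits(width))
```

Under this assumption the fine-grained overhead grows quadratically with bus width while the coarse-grained overhead stays constant, at the price of routing every bit of the bus in the same way.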
3.2.3 Emerging directions: Several emerging directions will be covered in the following. These directions include low-power techniques, asynchronous architectures and molecular microelectronics:

• Low-power techniques: Early work explores the use of low-swing circuit techniques to reduce the power consumption in a hierarchical interconnect for a low-energy FPGA [51]. Recent work involves: (a) activity reduction in power-aware design tools, with energy saving of 23% [52]; (b) leakage current reduction methods such as gate biasing and multiple supply-voltage integration, with up to two times leakage power reduction [53]; and (c) dual supply-voltage methods with the lower voltage assigned to noncritical paths, resulting in an average power reduction of 60% [54].
• Asynchronous architectures: There is an emerging interest in asynchronous FPGA architectures. An asynchronous version of PipeRench [28] is estimated to improve performance by 80%, at the expense of a significant increase in configurable storage and wire count [55]. Other efforts in this direction include fine-grained asynchronous pipelines [56], quasi delay-insensitive architectures [57], and globally asynchronous locally synchronous techniques [58].
• Molecular microelectronics: In the long term, molecular techniques offer a promising opportunity for increasing the capacity and performance of reconfigurable computing architectures [59]. Current work is focused on developing programmable logic arrays based on molecular-scale nanowires [60, 61].
multiply/accumulate functions. Again, we should expect to see a migration to more heterogeneous architectures in the near future.
3.3.3 Soft cores: The use of soft cores, particularly for instruction processors, is increasing. A soft core is one in which the vendor provides a synthesisable version of the function, and the user implements the function using the reconfigurable fabric. Although this is less area- and speed-efficient than a hard embedded core, the flexibility and the ease of integrating these soft cores make them attractive. The extra overhead becomes less of a hindrance as the number of transistors devoted to the reconfigurable fabric increases. Altera and Xilinx both provide numerous soft cores, including soft instruction processors such as NIOS [7] and Microblaze [12]. Soft instruction processors have also been developed by a number of researchers, ranging from customisable JVM and MIPS processors [10] to ones specialised for machine learning [8] and data encryption [9].

4 Design methods
Hardware compilers for high-level descriptions are increasingly recognised to be the key to reducing the productivity gap for advanced circuit development in general, and for reconfigurable designs in particular. This Section looks at high-level design methods from two perspectives: special-purpose design and general-purpose design. Low-level design methods and tools, covering topics such as technology mapping, floor-planning, and place and route, are beyond the scope of this paper; interested readers are referred to [14].
4.1.1 Annotation and constraint-driven approach: The systems mentioned below employ annotations in the source-code and constraint files to control the optimisation process. Their strength is that usually only minor changes are needed to produce a compilable program from a software description; no extensive restructuring is required. Five representative methods are SPC [62], Streams-C [63], Sea Cucumber [64], SPARK [65] and Catapult C [66].

SPC [62] combines vectorisation, loop transformations and retiming with automatic memory allocation to improve performance. SPC accelerates C loop nests with data dependency restrictions, compiling them into pipelines. Based on the SUIF framework [67], this approach uses loop transformations, and can take advantage of runtime reconfiguration and memory access optimisation. Similar methods have been advocated by other researchers [68, 69].

Streams-C [63] compiles a C program to synthesisable VHDL. Streams-C exploits coarse-grained parallelism in stream-based computations; low-level optimisations such as pipelining are performed automatically by the compiler.

Sea Cucumber [64] compiles Java programs to hardware using a similar scheme to Handel-C, which we detail in Section 4.1.2. Unlike Handel-C, no language extensions are needed; like Streams-C, users must call a library, in this case based on communicating sequential processes (CSP [70]). Multiple circuit implementations of the library primitives enable tradeoffs.

SPARK [65] is a high-level synthesis framework targeting multimedia and image processing. It compiles C code with the following steps: (a) list scheduling based on speculative code motions and loop transformations; (b) resource binding pass with minimisation of interconnect; (c) finite state machine controller generation for the scheduled datapath; (d) code generation producing synthesisable register-transfer level VHDL. Logic synthesis tools then synthesise the output.

Catapult C synthesises register transfer level (RTL) descriptions from unannotated C++, using characterisations of the target technology from RTL synthesis tools [66]. Users can set constraints to explore the design space, controlling loop pipelining and resource sharing.
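List scheduling, named as the first step of the SPARK flow, can be sketched in its basic textbook form. The code below is a simplified illustration, not SPARK's algorithm: it omits speculative code motion and loop transformations, assumes unit-latency operations and an acyclic dependency graph, and uses the hypothetical names `list_schedule`, `deps`, `op_type` and `resources`.

```python
def list_schedule(deps, op_type, resources):
    """Resource-constrained list scheduling of unit-latency operations.

    deps      : dict mapping each operation to the operations it depends on
    op_type   : dict mapping each operation to a resource type, e.g. "mul"
    resources : dict mapping each resource type to the number of units
    Returns a dict mapping each operation to its control step (from 0).
    """
    users = {op: [] for op in deps}
    for op, preds in deps.items():
        for pred in preds:
            users[pred].append(op)

    # Priority: length of the longest chain from an operation to a sink,
    # so operations on the critical path are considered first.
    priority = {}

    def height(op):
        if op not in priority:
            priority[op] = 1 + max((height(u) for u in users[op]), default=0)
        return priority[op]

    for op in deps:
        height(op)

    schedule = {}
    for step in range(len(deps)):
        if len(schedule) == len(deps):
            break
        free = dict(resources)
        # An operation is ready once all predecessors ran in earlier steps.
        ready = [op for op in deps
                 if op not in schedule
                 and all(schedule.get(p, step) < step for p in deps[op])]
        for op in sorted(ready, key=lambda o: (-priority[o], o)):
            if free.get(op_type[op], 0) > 0:
                free[op_type[op]] -= 1
                schedule[op] = step
    if len(schedule) != len(deps):
        raise ValueError("could not schedule: missing resource type?")
    return schedule


# Dataflow for (a*b + c*d) * e: two multiplies, an add, then a multiply.
deps = {"m1": [], "m2": [], "a1": ["m1", "m2"], "m3": ["a1"]}
op_type = {"m1": "mul", "m2": "mul", "a1": "add", "m3": "mul"}

one_multiplier = list_schedule(deps, op_type, {"mul": 1, "add": 1})
two_multipliers = list_schedule(deps, op_type, {"mul": 2, "add": 1})
```

With one multiplier the two independent multiplies are serialised and the design needs four control steps; with two multipliers it needs three. This is the kind of resource-sharing tradeoff that user constraints steer in the tools above.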
communication. The principal innovation of Haydn-C is a framework of optional annotations to enable users to describe design constraints, and to direct source-level transformations such as scheduling and resource allocation. There are automated transformations so that a single high-level design can be used to produce many implementations with different trade-offs. This approach has been evaluated using various case studies, including FIR filters, fractal generators and morphological operators. The fastest morphological erosion design is 129 times faster and 3.4 times larger than the smallest design.

Bach-C [77] is similar to Handel-C but has an untimed semantics, only synchronising between parallel threads on synchronous communications between them, possibly giving greater scope for optimisation. It also allows asynchronous communications but otherwise resembles Handel-C, using the same basic one-hot compilation scheme.

Table 3 summarises the various compilers discussed in this Section, showing their approach, source and target languages, target architecture and some example applications. Note that the compilers discussed are not necessarily restricted to the architectures reported; some can usually be ported to a different architecture by using a different library of hardware primitives.
accuracy, design size, speed and power consumption. The use of such custom data representation for optimising designs is one of the main strengths of reconfigurable computing. Given this flexibility, it is desirable to automate the process of finding a good custom data representation. The most important implementation decision to automate is the selection of an appropriate word-length and scaling for each signal [85] in a DSP system. Unlike microprocessor-based implementations, where the word-length is defined a priori by the hard-wired architecture of the processor, reconfigurable computing allows the word-length of each signal to be customised to produce the best tradeoffs in numerical accuracy, design size, speed, and power consumption. It has been argued that, often, the most efficient hardware implementation of an algorithm is one in which a wide variety of finite precision representations of different sizes are used for different internal variables [86]. The accuracy observable at the outputs of a DSP system is a function of the word-lengths used to represent all intermediate variables in the algorithm. However, accuracy is less sensitive to some variables than to others, as is implementation area. It is demonstrated in [85] that, by considering error and area information in a structured way using analytical and semi-analytical noise models, it is possible to achieve highly efficient DSP implementations. In [87] it has been demonstrated that the problem of word-length optimisation is NP-hard, even for systems with special mathematical properties that simplify the problem from a practical perspective [88]. There are, however, several published approaches to word-length optimisation.
These can be classified as heuristics offering an area/signal quality tradeoff [86, 89, 90], approaches that make some simplifying assumptions on error properties [89, 91], or optimal approaches that can be applied to algorithms with particular mathematical properties [92]. Some published approaches to the word-length optimisation problem use an analytic approach to scaling and/or error estimation [90, 93, 94], some use simulation [89, 91], and some use a hybrid of the two [95]. The advantage of analytical techniques is that they do not require representative simulation stimulus, and can be faster; however, they tend to be more pessimistic. There is little analytical work on supporting data-flow graphs containing cycles, although in [94] finite loop bounds are supported, while [88] supports cyclic data-flow when the nodes are of a restricted set of types, extended to a semi-analytic technique with fewer restrictions in [96]. Some published approaches use worst-case instantaneous error as a measure of signal quality [90, 91, 93], whereas some use signal-to-noise ratio [86, 89]. The remainder of this Section reviews in some detail particular research approaches in the field.

The Bitwise Project [94] proposes propagation of integer variable ranges backwards and forwards through data-flow graphs. The focus is on removing unwanted most-significant bits (MSBs). Results from integration in a synthesis flow indicate that area savings of between 15% and 86% combined with speed increases of up to 65% can be achieved compared to using 32-bit integers for all variables.

The MATCH Project [93] also uses range propagation through data-flow graphs, except that variables with a fractional component are allowed. All signals in the model of [93] must have equal fractional precision; the authors propose an analytic worst-case error model to estimate the required number of fractional bits. Area reductions of 80% combined with speed increases of 20% are reported when compared to a uniform 32-bit representation.

Wadekar and Parker [90] have also proposed a methodology for word-length optimisation. Like [93], this technique also allows controlled worst-case error at system outputs; however, each intermediate variable is allowed to take a word-length appropriate to the sensitivity of the output errors to quantisation errors on that particular variable. Results indicate area reductions of between 15% and 40% over the optimum uniform word-length implementation.

Kum and Sung [89] and Cantin et al. [91] have proposed several word-length optimisation techniques to trade off system area against system error. These techniques are heuristics based on bit-true simulation of the design under various internal word-lengths.

In Bitsize [97, 98], Abdul Gaffar et al. propose a hybrid method based on the mathematical technique known as automatic differentiation to perform bitwidth optimisation. In this technique, the gradients of outputs with respect to the internal variables are calculated and then used to determine the sensitivities of the outputs to the precision of the internal variables. The results show that it is possible to achieve an area reduction of 20% for floating-point designs, and 30% for fixed-point designs, when given an output error specification of 0.75% against a reference design.

A useful survey of algorithmic procedures for word-length determination has been provided by Cantin et al. [99]. In this work, existing heuristics are classified under various categories. However, the exhaustive and branch-and-bound procedures described in [99] do not necessarily capture the optimum solution to the word-length determination problem, due to nonconvexity in the constraint space: it is actually possible to have a lower error at a system output by reducing the word-length at an internal node [100]. Such an effect is modelled in the MILP approach proposed in [92].
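The idea of range propagation for removing unwanted MSBs can be sketched with simple interval arithmetic. The code below is an illustrative simplification and not the algorithm of [94]: it propagates ranges forwards only, handles just addition and multiplication, and assumes two's-complement integers. All function names are hypothetical.

```python
def add_range(a, b):
    """Range of x + y when x lies in a and y lies in b (inclusive bounds)."""
    return (a[0] + b[0], a[1] + b[1])


def mul_range(a, b):
    """Range of x * y: the extremes occur at combinations of the end points."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))


def bits_needed(value_range):
    """Smallest two's-complement word-length holding every value in range."""
    lo, hi = value_range
    width = 1
    while not (-(1 << (width - 1)) <= lo and hi <= (1 << (width - 1)) - 1):
        width += 1
    return width


# acc = pixel * coeff + pixel, with an 8-bit unsigned pixel and a small
# signed coefficient (both ranges are assumptions chosen for illustration).
pixel = (0, 255)
coeff = (-3, 3)
product_range = mul_range(pixel, coeff)
acc_range = add_range(product_range, pixel)
acc_bits = bits_needed(acc_range)
```

Here the accumulator provably fits in 11 bits, so the upper 21 bits of a default 32-bit integer would never carry information and need not be implemented.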
A comparative summary of existing optimisation systems is provided in Table 4. Each system is classified according to the several defining features described below.
• Is the word-length and scaling selection performed through analytical or simulation-based means?
• Can the system support algorithms exhibiting cyclic data flow (such as infinite impulse response filters)?
• What mechanisms are supported for most significant bit (MSB) optimisations (such as ignoring MSBs that are known to contain no useful information, a technique determined by the scaling approach used)?
• What mechanisms are supported for least significant bit (LSB) optimisations? These involve the monitoring of word-length growth. In addition, for those systems that support error-tradeoffs, further optimisations include the quantisation (truncation or rounding) of unwanted LSBs.
• Does the system allow the user to tradeoff numerical accuracy for a more efficient implementation?
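The two LSB quantisation options mentioned in the list above, truncation and rounding, differ in the error they introduce. The following minimal sketch models both on a fixed-point grid; the function name `quantise` and the choice of round-half-up are assumptions made for illustration rather than features of any surveyed system.

```python
import math


def quantise(value, frac_bits, mode="round"):
    """Quantise value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    scaled = value * scale
    if mode == "truncate":
        q = math.floor(scaled)          # simply drop the unwanted LSBs
    elif mode == "round":
        q = math.floor(scaled + 0.5)    # round to nearest, ties upwards
    else:
        raise ValueError("mode must be 'truncate' or 'round'")
    return q / scale


x = 0.3
truncated = quantise(x, 4, "truncate")
rounded = quantise(x, 4, "round")
truncation_error = abs(x - truncated)
rounding_error = abs(x - rounded)
```

With `frac_bits` fractional bits, truncation can lose almost one LSB (an error below 2^-frac_bits), whereas rounding loses at most half an LSB, at the cost of an extra addition in hardware.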
4.2.3 An example optimisation flow: One possible design flow for word-length optimisation, used in the Right-Size system [96], is illustrated in Fig. 9 for Xilinx FPGAs. The inputs to this system are a specification of the system behaviour (e.g. using Simulink), a specification of the acceptable signal-to-noise ratio at each output, and a set of representative input signals. From these inputs, the tool automatically generates a synthesisable structural description of the architecture and a bit-true behavioural VHDL testbench, together with a set of expected outputs for the provided set of representative inputs. Also generated is a makefile which can be used to automate the post-Right-Size synthesis process. Application of Right-Size to various adaptive filters implemented in a Xilinx Virtex FPGA has resulted in area reduction of up to 80%, power reduction of up to 98%, and speedup of up to 36% over common alternative design methods without word-length optimisation.
IEE Proc.-Comput. Digit. Tech., Vol. 152, No. 2, March 2005
4.2.4 Other design methods: Besides signal processing, video and image processing is another area that can benefit from special-purpose design methods. Three examples will be given to provide a flavour of this approach. First, the CHAMPION system [108] maps designs captured in the Cantata graphical programming environment to multiple reconfigurable computing platforms. Second, the IGOL framework [109] provides a layered architecture for facilitating hardware plug-ins to be incorporated in various applications in the Microsoft Windows operating system, such as Premiere, Winamp, VirtualDub and DirectShow. Third, the SA-C compiler [110] maps a high-level single-assignment language specialised for image processing description into hardware, using various optimisation methods including loop unrolling, array value propagation, loop-carried array elimination and multi-dimensional strip-mining. Recent work indicates that another application area that can benefit from special-purpose techniques is networking. Two examples will be given. First, a framework has been developed to compile designs described in the network policy language Ponder [111] into reconfigurable hardware implementations [112]. Second, it is shown [113] how descriptions in the Click networking language can produce efficient reconfigurable designs.
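The following sketch shows what strip-mining does to a simple loop, in one dimension only. It is written in Python purely for illustration and is not SA-C code or output of the SA-C compiler; the function names and the default strip size are assumptions.

```python
def sum_of_squares(data):
    """Original loop: one multiply-accumulate per iteration."""
    total = 0
    for x in data:
        total += x * x
    return total


def sum_of_squares_stripmined(data, strip=4):
    """Strip-mined loop: an outer loop over strips of `strip` elements and
    an inner loop with a small bounded trip count. A hardware compiler could
    fully unroll the inner loop into `strip` parallel multipliers."""
    total = 0
    n = len(data)
    for base in range(0, n, strip):
        partial = 0
        # The final strip may be shorter when n is not a multiple of strip.
        for i in range(base, min(base + strip, n)):
            partial += data[i] * data[i]
        total += partial
    return total
```

Both functions compute the same result; the transformation only changes the loop structure so that a fixed-size block of work is exposed for unrolling.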
Fig. 9
4.3.2 Soft instruction processors: FPGA technology can now support one or more soft instruction processors implemented using reconfigurable resources on a single chip; proprietary instruction processors, like MicroBlaze and Nios, are now available from FPGA vendors. Often such processors support customisation of resources and custom instructions. Custom instructions have two main benefits. First, they reduce the time for instruction fetch and decode, provided that each custom instruction replaces several regular instructions. Second, additional resources can be assigned to a custom instruction to improve performance. Bit-width optimisation, described in Section 4.2, can also be applied to customise instruction processors at compile time. A challenge of customising instruction processors is that the tools for producing and analysing instructions also need to be customised. For instance, the flexible instruction processor framework [10] has been developed to automate the steps in customising an instruction processor and the corresponding tools. Other researchers have proposed similar approaches [119]. Instruction processors can also run declarative languages. For instance, a scalable architecture [8], consisting of multiple processors based on the Warren Abstract Machine, has been developed to support the execution of the Progol system [120], based on the declarative language Prolog. Its effectiveness has been demonstrated using the mutagenesis data set containing 12 000 facts about chemical compounds.
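The first benefit of custom instructions can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not a model of MicroBlaze, Nios or the flexible instruction processor framework: it defines a hypothetical register machine with `mul` and `add` instructions plus a hypothetical custom `mac` instruction, and counts how many instructions are fetched for a dot product.

```python
def run(program, regs):
    """Execute a program on a toy register machine.

    Returns the final registers and the number of instructions fetched.
    """
    fetched = 0
    for op, dst, a, b in program:
        fetched += 1
        if op == "mul":
            regs[dst] = regs[a] * regs[b]
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mac":  # hypothetical custom instruction: dst += a * b
            regs[dst] = regs[dst] + regs[a] * regs[b]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs, fetched


def dot_regular(n):
    """Dot product using two regular instructions per element."""
    prog = []
    for i in range(n):
        prog.append(("mul", "t", f"x{i}", f"y{i}"))
        prog.append(("add", "acc", "acc", "t"))
    return prog


def dot_custom(n):
    """Dot product using one custom instruction per element."""
    return [("mac", "acc", f"x{i}", f"y{i}") for i in range(n)]


def make_regs(xs, ys):
    """Initial register file holding the two input vectors."""
    regs = {"acc": 0, "t": 0}
    for i, (x, y) in enumerate(zip(xs, ys)):
        regs[f"x{i}"] = x
        regs[f"y{i}"] = y
    return regs
```

In this model the custom instruction halves the number of fetches for the same result. Whether execution time actually improves on a real processor also depends on the resources assigned to the custom instruction, which is the second benefit described above.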
partitioned across multiple FPGAs. Both methods can split designs across several FPGAs, and are retargetable via hardware description libraries. Other C-like languages that have been developed include MoPL-3, a C extension supporting data procedural compilation for the Xputer architecture which comprises an array of reconfigurable ALUs [123], and spC, a systolic parallel C variant for the Enable board [124].
exploration, and reusable and extensible hardware optimisation. The framework compiles a parallel imperative language like Handel-C, and supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and metalanguage facilities. The approach has been used in producing designs for applications, such as signal and image processing, with different tradeoffs in performance and resource usage.
4.3.4 Hardware/software codesign: Several research groups have studied the problem of compiling C code to both hardware and software. The Garp compiler [125] is intended to accelerate plain C, with no annotations to help the compiler, making it more widely applicable. The work targets one architecture only: the Garp chip, which integrates a RISC core and reconfigurable logic. This compiler also uses the SUIF framework. The compiler uses a technique first developed for VLIW architectures called hyperblock scheduling, which optimises for instruction-level parallelism across several common paths, at the expense of rarer paths. Infeasible or rare paths are implemented on the processor, with the more common, easily parallelisable paths synthesised into logic for the reconfigurable resource. Similarly, the NAPA C compiler targets the NAPA architecture [126], which also integrates a RISC processor with reconfigurable logic. This compiler can also work on plain C code, but the programmer can add C pragmas to the code to indicate large-scale parallelism and the bitwidths of variables. The compiler can synthesise pipelines from loops.
4.5.1 Special-purpose design: As explained earlier, special-purpose design methods and tools enable both high-level design and domain-specific optimisation. Existing methods, such as those compiling MATLAB Simulink descriptions into reconfigurable computing implementations [81, 82, 93, 96, 97, 136], allow application developers without electronic design experience to produce efficient hardware implementations quickly and effectively. This area is likely to assume further importance in the future.
4.5.2
4.3.5 Annotation-free compilation: Some researchers aim to compile a sequential program, without any annotations, into an efficient hardware design. This requires analysis of the source program to extract parallelism for an efficient result, which is necessary if compilation from languages such as C is to compete with traditional methods for designing hardware. One example is the work of Babb et al. [127], targeting custom, fixed-logic implementation while also applicable to reconfigurable hardware. The compiler uses the SUIF infrastructure to do several analyses to find what computations affect exactly what data, as far as possible. A tiled architecture is synthesised, where all computation is kept as local as possible to one tile. More recently, Ziegler et al. [128] have used loop transformations in mapping loop nests onto a pipeline spanning several FPGAs. A further example is the Garp project [125].
Low-power design: Several hardware compilers aim to minimise the power consumption of their generated designs. Examples include special-purpose design methods such as Right-Size [96] and PyGen [136], and general-purpose methods that target loops for configurable hardware implementation [5]. These design methods, when combined with low-power architectures [54] and power-aware low-level tools [52], can provide significant reduction in power consumption.
High-level transformation: Many hardware design methods [62, 65, 110] involve high-level transformations: loop unrolling, loop restructuring and static single assignment are three examples. The development of powerful transformations for design optimisation will continue for both special-purpose and general-purpose designs.
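As a small illustration of one of the transformations named above, the sketch below renames a straight-line sequence of assignments into static single assignment form, so that every variable is assigned exactly once. It is a simplified assumption-based example: the statement format and the function name `to_ssa` are hypothetical, control flow (and hence phi functions) is not handled, and it assumes no input variable is already named like a versioned variable (for example `x1`).

```python
def to_ssa(statements):
    """Rename straight-line code into static single assignment form.

    Each statement is (target, op, operands). Every assignment to a
    variable creates a new numbered version, and later uses refer to
    the most recent version.
    """
    version = {}
    out = []

    def use(name):
        # Names never assigned so far are inputs and keep their name.
        if name in version:
            return f"{name}{version[name]}"
        return name

    for target, op, operands in statements:
        # Operands are renamed before the target gets its new version.
        new_operands = tuple(use(o) for o in operands)
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", op, new_operands))
    return out


code = [
    ("x", "add", ("a", "b")),
    ("x", "mul", ("x", "c")),
    ("y", "add", ("x", "x")),
]
ssa = to_ssa(code)
```

After renaming, the two assignments to `x` become distinct values, which makes the data dependences between operations explicit and is one reason this form is convenient when mapping computations to hardware.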
5 Summary
4.5.3
This paper surveys two aspects of reconfigurable computing: architectures and design methods. The main trends in architectures are coarse-grained fabrics, heterogeneous functions and soft cores. The main trends in design methods are special-purpose design methods, low-power techniques and high-level transformations. We wonder what a survey paper on reconfigurable computing, written in 2015, will cover.
6 Acknowledgments
Our thanks to Ray Cheung and Sherif Yusuf for their support in preparing this paper. The support of Celoxica, Xilinx and UK EPSRC (grant numbers GR/R 31409, GR/R 55931, GR/N 66599) is gratefully acknowledged.
7 References
4.4.2
Recent work [135] explains how customisable frameworks for hardware compilation can enable rapid design
1 Luk, W.: Customising processors: design-time and run-time opportunities, Lect. Notes Comput. Sci., 2004, 3133
2 Telle, N., Cheung, C.C., and Luk, W.: Customising hardware designs for elliptic curve cryptography, Lect. Notes Comput. Sci., 2004, 3133
3 Guo, Z., Najjar, W., Vahid, F., and Vissers, K.: A quantitative analysis of the speedup factors of FPGAs over processors. Proc. Int. Symp. on FPGAs (ACM Press, 2004)
4 Underwood, K.: FPGAs vs. CPUs: trends in peak floating-point performance. Proc. Int. Symp. on FPGAs (ACM Press, 2004)
5 Stitt, G., Vahid, F., and Nematbakhsh, S.: Energy savings and speedups from partitioning critical software loops to hardware in embedded systems, ACM Trans. Embedded Comput. Syst., 2004, 3, (1), pp. 218-232
6 Vereen, L.: Soft FPGA cores attract embedded developers, Embedded Syst. Program., 2004, 23 April 2004, http://www.embedded.com//showArticle.jhtml?articleID=19200183
7 Altera Corp., Nios II Processor Reference Handbook, May 2004
8 Fidjeland, A., Luk, W., and Muggleton, S.: Scalable acceleration of inductive logic programs. Proc. IEEE Int. Conf. on Field-Programmable Technology, 2002
9 Leong, P.H.W., and Leung, K.H.: A microcoded elliptic curve processor using FPGA technology, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2002, 10, (5), pp. 550-559
10 Seng, S.P., Luk, W., and Cheung, P.Y.K.: Flexible instruction processors. Proc. Int. Conf. on Compilers, Arch. and Syn. for Embedded Systems (ACM Press, 2000)
11 Seng, S.P., Luk, W., and Cheung, P.Y.K.: Run-time adaptive flexible instruction processors, Lect. Notes Comput. Sci., 2002, 2438
12 Xilinx, Inc., Microblaze Processor Reference Guide, June 2004
13 Bondalapati, K., and Prasanna, V.K.: Reconfigurable computing systems, Proc. IEEE, 2002, 90, (7), pp. 1201-1217
14 Compton, K., and Hauck, S.: Reconfigurable computing: a survey of systems and software, ACM Comput. Surv., 2002, 34, (2), pp. 171-210
15 Luk, W., Cheung, P.Y.K., and Shirazi, N.: Configurable computing, in Chen, W.K. (Ed.): Electrical engineers handbook (Academic Press, 2004)
16 Schaumont, P., Verbauwhede, I., Keutzer, K., and Sarrafzadeh, M.: A quick safari through the reconfiguration jungle. Proc. Design Automation Conf., ACM Press, 2001
17 Tessier, R., and Burleson, W.: Reconfigurable computing and digital signal processing: a survey, J. VLSI Signal Process., 2001, 28, pp. 7-27
18 Saxe, T., and Faith, B.: Less is more with FPGAs, EE Times, 13 September 2004, http://www.eetimes.com/showArticle.jhtml?articleID=47203801
19 Actel Corp., ProASIC Plus Family Flash FPGAs, v3.5, April 2004
20 Altera Corp., Stratix II Device Handbook, February 2004
21 Lattice Semiconductor Corp, ispXPGA Family, January 2004
22 Morris, K.: Virtex 4: Xilinx details its next generation, FPGA Program. Logic J., 2004, June
23 Quicklogic Corp., Eclipse-II Family Datasheet, January 2004
24 Xilinx, Inc., Virtex II Datasheet, June 2004
25 Compton, K., and Hauck, S.: Totem: Custom reconfigurable array generation. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2001)
26 Ebeling, C., Cronquist, D., and Franklin, P.: RaPiD - reconfigurable pipelined datapath, Lect. Notes Comput. Sci. Misc., 1996, 1142
27 Elixent Corporation, DFA 1000 Accelerator Datasheet, 2003
28 Goldstein, S.C., Schmit, H., Budiu, M., Cadambi, S., Moe, M., and Taylor, R.: PipeRench: a reconfigurable architecture and compiler, Computer, 2000, 33, (4), pp. 70-77
29 Hauser, J.R., and Wawrzynek, J.: Garp: a MIPS processor with a reconfigurable coprocessor. IEEE Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 1997)
30 Marshall, A., Stansfield, T., Kostarnov, I., Vuillemin, J., and Hutchings, B.: A reconfigurable arithmetic array for multimedia applications, ACM/SIGDA Int. Symp. on FPGAs, Feb 1999, pp. 135-143
31 Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R.: ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix, Lect. Notes Comput. Sci., 2003, 2778
32 Mirsky, E., and DeHon, A.: MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 1996)
33 Rupp, C.R., Landguth, M., Garverick, T., Gomersall, E., Holt, H., Arnold, J., and Gokhale, M.: The NAPA adaptive processing architecture. IEEE Symp. on Field-Programmable Custom Computing Machines, May 1998, pp. 28-37
34 Silicon Hive: Avispa Block Accelerator. Product Brief, 2003
35 Singh, H., Lee, M.-H., Lu, G., Kurdahi, F., Bagherzadeh, N., and Chaves, E.: MorphoSys: an integrated reconfigurable system for data-parallel and compute intensive applications, IEEE Trans. Comput., 2000, 49, (5), pp. 465-481
36 Taylor, M., et al.: The RAW microprocessor: a computational fabric for software circuits and general purpose programs, IEEE Micro, 2002, 22, (2), pp. 25-35
37 Cadence Design Systems Inc, Palladium Datasheet, 2004
38 Mentor Graphics, Vstation Pro: High Performance System Verification, 2003
39 Annapolis Microsystems, Inc., Wildfire Reference Manual, 1998
40 Laufer, R., Taylor, R., and Schmit, H.: PCI-PipeRench and the SwordAPI: a system for stream-based reconfigurable computing. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 1999)
41 Vuillemin, J., Bertin, P., Roncin, D., Shand, M., Touati, H., and Boucard, P.: Programmable active memories: reconfigurable systems come of age, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1996, 4, (1), pp. 56-69
42 Wittig, R.D., and Chow, P.: OneChip: an FPGA processor with reconfigurable logic. IEEE Symp. on FPGAs for Custom Computing Machines, 1996
43 Razdan, R., and Smith, M.D.: A high performance microarchitecture with hardware programmable functional units. Int. Symp. on Microarchitecture, 1994, pp. 172-180
44 Altera Corp., Excalibur Device Overview, May 2002
45 Xilinx, Inc., PowerPC 405 Processor Block Reference Guide, October 2003
46 Celoxica, RC2000 Development and evaluation board data sheet, version 1.1, 2004
47 Leong, P., Leong, M., Cheung, O., Tung, T., Kwok, C., Wong, M., and Lee, K.: Pilchard - a reconfigurable computing platform with memory slot interface. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2001)
48 Becker, J., and Glesner, M.: A parallel dynamically reconfigurable architecture designed for flexible application-tailored hardware/software systems in future mobile communication, J. Supercomput., 2001, 19, (1), pp. 105-127
49 Betz, V., Rose, J., and Marquardt, A.: Architecture and CAD for deep-submicron FPGAs (Kluwer Academic Publishers, February 1999)
50 Lemieux, G., and Lewis, D.: Design of interconnect networks for programmable logic (Kluwer Academic Publishers, 2004)
51 George, V., Zhang, H., and Rabaey, J.: The design of a low energy FPGA. Proc. Int. Symp. on Low Power Electronics and Design, 1999
52 Lamoureux, J., and Wilton, S.J.E.: On the interaction between power-aware FPGA CAD algorithms, IEEE Int. Conf. on Computer-Aided Design, 2003
53 Rahman, A., and Polavarapuv, V.: Evaluation of low-leakage design techniques for field programmable gate arrays. Proc. Int. Symp. on Field-Programmable Gate Arrays (ACM Press, 2004)
54 Gayasen, A., Lee, K., Vijaykrishnan, N., Kandemir, M., Irwin, M.J., and Tuan, T.: A dual-VDD low power FPGA architecture, Lect. Notes Comput. Sci., 2004, 3203
55 Kagotani, H., and Schmit, H.: Asynchronous PipeRench: architecture and performance evaluations. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2003)
56 Teifel, J., and Manohar, R.: Programmable asynchronous pipeline arrays, Lect. Notes Comput. Sci., 2003, 2778
57 Wong, C.G., Martin, A.J., and Thomas, P.: An architecture for asynchronous FPGAs. Proc. Int. IEEE Conf. on Field-Programmable Technology, 2003
58 Royal, A., and Cheung, P.Y.K.: Globally asynchronous locally synchronous FPGA architectures, Lect. Notes Comput. Sci., 2003, 2778
59 Butts, M., DeHon, A., and Goldstein, S.: Molecular electronics: devices, systems and tools for gigagate, gigabit chips, Proc. IEEE Int. Conf. on Computer-Aided Design, 2002
60 DeHon, A., and Wilson, M.J.: Nanowire-based sublithographic programmable logic arrays. Proc. Int. Symp. on FPGAs (ACM Press, 2004)
61 Williams, R.S., and Kuekes, P.J.: Molecular nanoelectronics. Proc. IEEE Int. Symp. on Circuits and Systems, 2000
62 Weinhardt, M., and Luk, W.: Pipeline vectorization, IEEE Trans. Comput.-Aided Des., 2001, 20, (2), pp. 234-248
63 Gokhale, M., Stone, J.M., Arnold, J., and Kalinowski, M.: Stream-oriented FPGA computing in the Streams-C high level language. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2000)
64 Jackson, P.A., Hutchings, B.L., and Tripp, J.L.: Simulation and synthesis of CSP-based interprocess communication. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2003)
65 Gupta, S., Dutt, N.D., Gupta, R.K., and Nicolau, A.: SPARK: a high-level synthesis framework for applying parallelizing compiler transformations. Proc. Int. Conf. on VLSI Design, January 2003
66 McCloud, S.: Catapult C Synthesis-based design flow: speeding implementation and increasing flexibility. White Paper, Mentor Graphics, 2004
67 Wilson, R.P., French, R.S., Wilson, C.S., Amarasinghe, S.P., Anderson, J.M., Tjiang, S.W.K., Liao, S.-W., Tseng, C.-W., Hall, M.W., Lam, M.S., and Hennessy, J.L.: SUIF: an infrastructure for research on parallelizing and optimizing compilers, SIGPLAN Not., 1994, 29, (12), pp. 31-37
68 Harriss, T., Walke, R., Kienhuis, B., and Deprettere, E.: Compilation from Matlab to process networks realized in FPGA, Des. Autom. Embedded Syst., 2002, 7, (4), pp. 385-403
69 Schreiber, R., et al.: PICO-NPA: high-level synthesis of nonprogrammable hardware accelerators, J. VLSI Signal Process. Syst., 2002, 31, (2), pp. 127-142
70 Hoare, C.A.R.: Communicating sequential processes (Prentice Hall, 1985)
71 Mencer, O., Pearce, D.J., Howes, L.W., and Luk, W.: Design space exploration with a stream compiler. Proc. IEEE Int. Conf. on Field Programmable Technology, 2003
72 Celoxica, Handel-C Language Reference Manual for DK2.0, Document RM-1003-4.0, 2003
73 De Figueiredo Coutinho, J.G., and Luk, W.: Source-directed transformations for hardware compilation. Proc. IEEE Int. Conf. on Field-Programmable Technology, 2003
74 Mencer, O.: PAM-Blox II: design and evaluation of C++ module generation for computing with FPGAs. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2002)
75 Liang, J., Tessier, R., and Mencer, O.: Floating point unit generation and evaluation for FPGAs. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2003)
76 Page, I., and Luk, W.: Compiling occam into FPGAs (Abingdon EE&CS Books, 1991)
77 Yamada, A., Nishida, K., Sakurai, R., Kay, A., Nomura, T., and Kambe, T.: Hardware synthesis with the Bach system. Proc. IEEE ISCAS, 1999
78 Frigo, J., Palmer, D., Gokhale, M., and Popkin-Paine, M.: Gamma-ray pulsar detection using reconfigurable computing hardware. Proc. Symp. on Field Programmable Custom Computing Machines (IEEE Computer Society Press, 2003)
79 Styles, H., and Luk, W.: Customising graphics applications: techniques and programming interface. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2000)
80 Simulink, http://www.mathworks.com
81 Hwang, J., Milne, B., Shirazi, N., and Stroomer, J.D.: System level tools for DSP in FPGAs, Lect. Notes Comput. Sci., 2001, 2147
82 Altera Corp., DSP Builder User Guide, Version 2.1.3 rev.1, July 2003
83 Lee, E.A., and Messerschmitt, D.G.: Static scheduling of synchronous data flow program for digital signal processing, IEEE Trans. Comput., 1987, 36, pp. 24-35
84 Constantinides, G.A., Cheung, P.Y.K., and Luk, W.: Optimum and heuristic synthesis of multiple wordlength architectures, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2003, 22, (10), pp. 1432-1442
85 Constantinides, G.A., Cheung, P.Y.K., and Luk, W.: Synthesis and optimization of DSP algorithms (Kluwer Academic, Dordrecht, 2004)
86 Constantinides, G.A., Cheung, P.Y.K., and Luk, W.: The multiple wordlength paradigm. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2001)
87 Constantinides, G.A., and Woeginger, G.J.: The complexity of multiple wordlength assignment, Appl. Math. Lett., 2002, 15, (2), pp. 137-140
88 Constantinides, G.A., Cheung, P.Y.K., and Luk, W.: Synthesis of saturation arithmetic architectures, ACM Trans. Des. Autom. Electron. Syst., 2003, 8, (3), pp. 334-354
89 Kum, K.-I., and Sung, W.: Combined word-length optimization and high-level synthesis of digital processing systems, IEEE Trans. Comput. Aided Des., 2001, 20, (8), pp. 921-930
90 Wadekar, S.A., and Parker, A.C.: Accuracy sensitive word-length selection for algorithm optimization. Proc. Int. Conf. on Computer Design, 1998
91 Cantin, M.-A., Savaria, Y., and Lavoie, P.: An automatic word length determination method. Proc. IEEE Int. Symp. on Circuits and Systems, 2001, pp. V-53 to V-56
92 Constantinides, G.A., Cheung, P.Y.K., and Luk, W.: Optimum wordlength allocation. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2002)
93 Nayak, A., Haldar, M., Choudhary, A., and Banerjee, P.: Precision and error analysis of MATLAB applications during automated hardware synthesis for FPGAs. Proc. Design Automation and Test in Europe, 2001
94 Stephenson, M., Babb, J., and Amarasinghe, S.: Bitwidth analysis with application to silicon compilation. Proc. SIGPLAN Programming Language Design and Implementation, June 2000
95 Cmar, R., Rijnders, L., Schaumont, P., Vernalde, S., and Bolsens, I.: A methodology and design environment for DSP ASIC fixed point refinement. Proc. Design Automation and Test in Europe, 1999
96 Constantinides, G.A.: Perturbation analysis for word-length optimization. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2003)
97 Abdul Gaffar, A., Mencer, O., Luk, W., Cheung, P.Y.K., and Shirazi, N.: Floating-point bitwidth analysis via automatic differentiation. Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002
98 Abdul Gaffar, A., Mencer, O., Luk, W., and Cheung, P.Y.K.: Unifying bit-width optimisation for fixed-point and floating-point designs. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2004)
99 Cantin, M.-A., Savaria, Y., and Lavoie, P.: A comparison of automatic word length optimization procedures. Proc. IEEE Int. Symp. on Circuits and Systems, 2002
100 Constantinides, G.A.: High level synthesis and word length optimization of digital signal processing systems. PhD thesis, Imperial College London, 2001
101 Benedetti, A., and Perona, B.: Bit-width optimization for configurable DSPs by multi-interval analysis. Proc. 34th Asilomar Conf. on Signals, Systems and Computers, 2000
102 Stephenson, M.W.: Bitwise: Optimizing bitwidths using data-range propagation. Master's Thesis, Massachusetts Institute of Technology, Dept. Electrical Engineering and Computer Science, May 2000
103 Keding, H., Willems, M., Coors, M., and Meyr, H.: FRIDGE: A fixed-point design and simulation environment. Proc. Design Automation and Test in Europe, 1998
104 Willems, M., Bürsgens, V., Keding, H., Grötker, T., and Meyer, M.: System-level fixed-point design based on an interpolative approach, Proc. 34th Design Automation Conf., June 1997
105 Kum, K., and Sung, W.: Word-length optimization for high-level synthesis of digital signal processing systems. Proc. IEEE Int. Workshop on Signal Processing Systems, 1998
106 Sung, W., and Kum, K.: Word-length determination and scaling software for a signal flow block diagram. Proc. IEEE Int. Conf. on Acoustics Speech and Signal Processing, 1994
107 Sung, W., and Kum, K.: Simulation-based word-length optimization method for fixed-point digital signal processing systems, IEEE Trans. Signal Process., 1995, 43, (12), pp. 3087-3090
108 Ong, S., Kerkiz, N., Srijanto, B., Tan, C., Langston, M., Newport, D., and Bouldin, D.: Automatic mapping of multiple applications to multiple adaptive computing systems. Proc. Int. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2001)
109 Thomas, D., and Luk, W.: A framework for development and distribution of hardware acceleration, Proc. SPIE - Int. Soc. Opt. Eng., 2002, 4867
110 Bohm, W., Hammes, J., Draper, B., Chawathe, M., Ross, C., Rinker, R., and Najjar, W.: Mapping a single assignment programming language to reconfigurable systems, J. Supercomput., 2002, 21, pp. 117-130
111 Damianou, N., Dulay, N., Lupu, E., and Sloman, M.: The Ponder policy specification language, Lect. Notes Comput. Sci., 2001, 1995
112 Lee, T.K., Yusuf, S., Luk, W., Sloman, M., Lupu, E., and Dulay, N.: Compiling policy descriptions into reconfigurable firewall processors. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2003)
113 Kulkarni, C., Brebner, G., and Schelle, G.: Mapping a domain specific language to a platform FPGA. Proc. Design Automation Conf., 2004
114 Lee, T.K., Derbyshire, A., Luk, W., and Cheung, P.Y.K.: High-level language extensions for run-time reconfigurable systems. Proc. IEEE Int. Conf. on Field-Programmable Technology, 2003
115 Shirazi, N., Luk, W., and Cheung, P.Y.K.: Framework and tools for run-time reconfigurable designs, IEE Proc., Comput. Digit. Tech., 2000, 147, pp. 147-152
116 Derbyshire, A., and Luk, W.: Compiling run-time parametrisable designs. Proc. IEEE Int. Conf. on Field-Programmable Technology, 2002
117 Clark, D., and Hutchings, B.: The DISC programming environment. Proc. Symp. on FPGAs for Custom Computing Machines (IEEE Computer Society Press, 1996)
118 Styles, H., and Luk, W.: Branch optimisation techniques for hardware compilation, Lect. Notes Comput. Sci., 2003, 2778
119 Kathail, V., Aditya, S., Schreiber, R., Ramakrishna Rau, B., Cronquist, D.C., and Sivaraman, M.: PICO: automatically designing custom computers, Computer, 2002, 35, (9), pp. 39-47
120 Muggleton, S.H.: Inverse entailment and Progol, New Gener. Comput., 1995, 13
121 Peterson, J., O'Connor, B., and Athanas, P.: Scheduling and partitioning ANSI-C programs onto multi-FPGA CCM architectures. Int. Symp. on FPGAs for Custom Computing Machines (IEEE Computer Society Press, 1996)
122 Duncan, A., Hendry, D., and Gray, P.: An overview of the COBRA-ABS high-level synthesis system for multi-FPGA systems. Proc. IEEE Symposium on FPGAs for Custom Computing Machines (IEEE Computer Society Press, 1998)
123 Ast, A., Becker, J., Hartenstein, R., Kress, R., Reinig, H., and Schmidt, K.: Data-procedural languages for FPL-based machines, Lect. Notes Comput. Sci., 1994, 849
124 Högl, H., Kugel, A., Ludvig, J., Männer, R., Noffz, K., and Zoz, R.: Enable++: a second-generation FPGA processor. IEEE Symp. on FPGAs for Custom Computing Machines (IEEE Computer Society Press, 1995)
125 Callahan, T., and Wawrzynek, J.: Instruction-level parallelism for reconfigurable computing, Lect. Notes Comput. Sci., 1998, 1482
126 Gokhale, M., and Stone, J.: NAPA C: compiling for a hybrid RISC/FPGA architecture. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 1998)
127 Babb, J., Reinard, M., Andras, Moritz, C., Lee, W., Frank, M., Barwa, S., and Amarasinghe, S.: Parallelizing applications into silicon. Proc. Symp. on FPGAs for Custom Computing Machines (IEEE Computer Society Press, 1999)
128 Ziegler, H., So, B., Hall, M., and Diniz, P.: Coarse-grain pipelining on multiple-FPGA architectures, IEEE Symp. on Field-Programmable Custom Computing Machines, 2002, pp. 77-88
129 Rissa, T., Luk, W., and Cheung, P.Y.K.: Automated combination of simulation and hardware prototyping. Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms (CSREA Press, 2004)
130 Bjesse, P., Claessen, K., Sheeran, M., and Singh, S.: Lava: hardware design in Haskell. Proc. ACM Int. Conf. on Functional Programming (ACM Press, 1998)
131 Singh, S., and Lillieroth, C.J.: Formal verification of reconfigurable cores. Proc. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 1999)
132 Guo, S., and Luk, W.: An integrated system for developing regular array design, J. Syst. Archit., 2001, 47, pp. 315-337
133 Luk, W., and McKeever, S.W.: Pebble: a language for parametrised and reconfigurable hardware design, Lect. Notes Comput. Sci., 1998, 1482
134 McKeever, S.W., Luk, W., and Derbyshire, A.: Compiling hardware descriptions with relative placement information for parametrised libraries, Lect. Notes Comput. Sci., 2002, 2517
135 Todman, T., Coutinho, J.G.F., and Luk, W.: Customisable hardware compilation. Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms (CSREA Press, 2004)
136 Ou, J., and Prasanna, V.: PyGen: a MATLAB/Simulink based tool for synthesizing parameterized and energy efficient designs using FPGAs. Proc. Int. Symp. on Field-Programmable Custom Computing Machines (IEEE Computer Society Press, 2004)