Clock power contributes a significant portion of chip power in modern IC design. Applying multi-bit flip-flops can effectively reduce clock power. State-of-the-art work performs multi-bit flip-flop clustering at the post-placement stage. However, the solution quality may be limited because the combinational gates are immovable during the clustering process. To overcome this deficiency, in this paper we propose multi-bit flip-flop bonding at placement. Inspired by ionic bonding in chemistry, we direct flip-flops to merging-friendly locations, thus facilitating flip-flop merging. Experimental results show that our algorithm, called FF-Bond, saves 27% clock power on average. Compared with state-of-the-art post-placement multi-bit flip-flop clustering, FF-Bond further reduces clock power by 14%.
Transposed convolution is a learnable up-sampling operator widely used in deep neural networks. It up-samples the input activations to generate useful information in applications such as style transfer and super-resolution. There is a rising demand for accelerating transposed convolution layers, since they occupy a large portion of the computation in GAN-like networks.
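As a minimal sketch of the up-sampling behavior described above (an illustration only, not the paper's accelerator design), a 1-D transposed convolution can be viewed as each input element scattering a scaled copy of the kernel into a larger output; the function name and example values below are hypothetical:

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """Naive 1-D transposed convolution: each input element scatters
    a scaled copy of the kernel w into the (larger) output."""
    out_len = (len(x) - 1) * stride + len(w)
    y = np.zeros(out_len)
    for i, xi in enumerate(x):
        y[i * stride : i * stride + len(w)] += xi * w
    return y

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0])
print(transposed_conv1d(x, w))  # length (3-1)*2 + 2 = 6
```

With stride 2, a length-3 input becomes a length-6 output, which is the up-sampling effect the abstract refers to.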
ACM Transactions on Design Automation of Electronic Systems, 2016
Circuit clustering is usually done through discrete optimization to enable circuit size reduction or design-specific cluster formation. In this article, we are interested in register clustering for clock-power reduction, leveraging new opportunities introduced by the multi-bit flip-flop (MBFF). Currently, INTEGRA is the only existing post-placement MBFF clustering optimizer with subquadratic time complexity. However, it severely degrades the wirelength, especially for realistic designs, which may nullify the benefits of MBFF clustering. In contrast, we formulate an analytical clustering score within a nonlinear programming framework, in which the wirelength objective can be seamlessly integrated and the solver has empirically subquadratic time complexity. With the MBFF library, our analytical clustering method achieves clock power comparable to state-of-the-art techniques while further reducing the wirelength by about 25%. Even without the MBFF library,...
Proceedings of the 2016 International Symposium on Low Power Electronics and Design, 2016
High performance and energy requirements can be a limiting factor for the application of convolutional neural networks (CNNs) in many areas. Recently, FPGA-based CNN accelerators have demonstrated superior energy efficiency compared to high-performance devices like GPGPUs. However, due to constrained on-chip resources and other factors, single-board FPGA designs may have difficulty achieving optimal energy efficiency. In this paper, we present a deeply pipelined multi-FPGA architecture that expands the design space for optimal performance and energy efficiency. A dynamic programming algorithm is proposed to map the CNN computing layers efficiently to different FPGA boards. To demonstrate the potential of the architecture, we built a prototype system with seven FPGA boards connected by high-speed serial links. Experimental results on AlexNet and VGG-16 show that the prototype can achieve up to 21× and 2× energy efficiency compared to optimized multi-core CPU and GPU implementations, respectively.
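The layer-to-board mapping step can be sketched as a classic linear-partition dynamic program: split a sequence of layers into contiguous groups, one per board, minimizing the pipeline bottleneck (the maximum per-board latency). This is an illustrative simplification under assumed per-layer costs, not the paper's exact formulation:

```python
# Hedged sketch: assign N sequential CNN layers to K FPGA boards.
# Groups must be contiguous (layers flow board to board in a pipeline),
# and the objective is to minimize the slowest board's total latency.
# Layer costs below are hypothetical.

def map_layers(costs, boards):
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    # dp[k][i]: best achievable bottleneck placing the first i layers
    # on k boards; dp[0][0] = 0 is the empty base case.
    dp = [[INF] * (n + 1) for _ in range(boards + 1)]
    dp[0][0] = 0.0
    for k in range(1, boards + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):  # last board takes layers j..i-1
                group = prefix[i] - prefix[j]
                dp[k][i] = min(dp[k][i], max(dp[k - 1][j], group))
    return dp[boards][n]

# e.g. six layer latencies (ms) split over 3 boards
print(map_layers([4, 2, 7, 3, 5, 1], 3))  # -> 9.0
```

Here the best split is {4,2}, {7}, {3,5,1}, giving a bottleneck of 9 ms; the O(K·N²) cost is negligible next to the hardware design space it helps explore.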
Papers by Guojie Luo