Clock power contributes a significant portion of chip power in modern IC design. Applying multi-bit flip-flops can effectively reduce clock power. State-of-the-art work performs multi-bit flip-flop clustering at the post-placement stage. However, the solution quality may be limited because the combinational gates are immovable during the clustering process. To overcome this deficiency, in this paper we propose multi-bit flip-flop bonding at placement. Inspired by ionic bonding in chemistry, we direct flip-flops to merging-friendly locations, thus facilitating flip-flop merging. Experimental results show that our algorithm, called FF-Bond, saves 27% clock power on average. Compared with state-of-the-art post-placement multi-bit flip-flop clustering, FF-Bond further reduces clock power by 14%.
Transposed convolution is a learnable up-sampling operator widely used in deep neural networks. It up-samples the input activations to generate useful information in applications such as style transfer and super-resolution. There is a rising demand for accelerating transposed convolution layers, since they occupy a large portion of the computation in GAN-like networks.
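As a minimal sketch of the up-sampling behavior described above (an illustration only, not the paper's accelerator design), a 1-D transposed convolution can be viewed as each input element scattering a scaled copy of the kernel into a larger output; the function name and example values below are hypothetical:

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """Naive 1-D transposed convolution: each input element scatters
    a scaled copy of the kernel w into the (larger) output."""
    out_len = (len(x) - 1) * stride + len(w)
    y = np.zeros(out_len)
    for i, xi in enumerate(x):
        y[i * stride : i * stride + len(w)] += xi * w
    return y

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0])
print(transposed_conv1d(x, w))  # length (3-1)*2 + 2 = 6
```

With stride 2, a length-3 input becomes a length-6 output, which is the up-sampling effect the abstract refers to.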
ACM Transactions on Design Automation of Electronic Systems, 2016
Circuit clustering is usually done through discrete optimization to enable circuit size reduction or design-specific cluster formation. In this article, we are interested in register clustering for clock-power reduction, leveraging new opportunities introduced by the multi-bit flip-flop (MBFF). Currently, INTEGRA is the only existing post-placement MBFF clustering optimizer with subquadratic time complexity. However, it severely degrades the wirelength, especially for realistic designs, which may nullify the benefits of MBFF clustering. In contrast, we formulate an analytical clustering score within a nonlinear programming framework, in which the wirelength objective can be seamlessly integrated and the solver has empirically subquadratic time complexity. With the MBFF library, our analytical clustering method achieves clock power comparable to state-of-the-art techniques while further reducing the wirelength by about 25%. Even without the MBFF library,...
Proceedings of the 2016 International Symposium on Low Power Electronics and Design, 2016
High performance and energy requirements can be a limiting factor for the application of convolutional neural networks (CNNs) in many areas. Recently, FPGA-based CNN accelerators have demonstrated superior energy efficiency compared to high-performance devices like GPGPUs. However, due to constrained on-chip resources and other factors, single-board FPGA designs may have difficulty achieving optimal energy efficiency. In this paper, we present a deeply pipelined multi-FPGA architecture that expands the design space for optimal performance and energy efficiency. A dynamic programming algorithm is proposed to map the CNN computing layers efficiently to different FPGA boards. To demonstrate the potential of the architecture, we built a prototype system with seven FPGA boards connected by high-speed serial links. Experimental results on AlexNet and VGG-16 show that the prototype can achieve up to 21× and 2× energy efficiency compared to optimized multi-core CPU and GPU implementations, respectively.
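The layer-to-board mapping step can be sketched as a classic linear-partition dynamic program: split a sequence of layers into contiguous groups, one per board, minimizing the pipeline bottleneck (the maximum per-board latency). This is an illustrative simplification under assumed per-layer costs, not the paper's exact formulation:

```python
# Hedged sketch: assign N sequential CNN layers to K FPGA boards.
# Groups must be contiguous (layers flow board to board in a pipeline),
# and the objective is to minimize the slowest board's total latency.
# Layer costs below are hypothetical.

def map_layers(costs, boards):
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    # dp[k][i]: best achievable bottleneck placing the first i layers
    # on k boards; dp[0][0] = 0 is the empty base case.
    dp = [[INF] * (n + 1) for _ in range(boards + 1)]
    dp[0][0] = 0.0
    for k in range(1, boards + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):  # last board takes layers j..i-1
                group = prefix[i] - prefix[j]
                dp[k][i] = min(dp[k][i], max(dp[k - 1][j], group))
    return dp[boards][n]

# e.g. six layer latencies (ms) split over 3 boards
print(map_layers([4, 2, 7, 3, 5, 1], 3))  # -> 9.0
```

Here the best split is {4,2}, {7}, {3,5,1}, giving a bottleneck of 9 ms; the O(K·N²) cost is negligible next to the hardware design space it helps explore.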
Papers by Guojie Luo