This paper proposes SSketch, a novel automated computing framework for FPGA-based online analysis of big data with dense (non-sparse) correlation matrices. SSketch targets streaming applications where each data sample can be processed only once and storage is severely limited. SSketch uses the stream of input data to adaptively learn and update a corresponding ensemble of lower-dimensional data structures, a.k.a. a sketch matrix. A new sketching methodology is introduced that tailors the problem of transforming big data with dense correlations to an ensemble of lower-dimensional subspaces so that it is suitable for hardware-based acceleration on reconfigurable hardware. The new method is scalable, significantly reduces costly memory interactions, and enhances matrix computation performance by leveraging the coarse-grained parallelism present in the dataset. To facilitate automation, SSketch takes advantage of a HW/SW co-design approach: it provides an Application Programming Interface (API) that can be customized for rapid prototyping of an arbitrary matrix-based data analysis algorithm. Proof-of-concept evaluations on a variety of visual datasets with more than 11 million non-zeros demonstrate up to a 200-fold speedup of our hardware-accelerated realization of SSketch over a software-based deployment on a general-purpose processor.

I. INTRODUCTION

Computers and sensors continually generate data at an unprecedented rate. Collections of massive data are often represented as large m × n matrices, where n is the number of samples and m is the corresponding number of features. The growth of "big data" is challenging traditional matrix analysis methods such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). Both SVD and PCA incur a large memory footprint and O(m²n) computational complexity, which limits their practicality in the big data regime. This disruption of convention changes the way we analyze modern datasets and makes designing scalable factorization methods, a.k.a. sketching algorithms, a necessity. With a properly designed sketch matrix, the intended computations can be performed on an ensemble of lower-dimensional structures rather than the original matrix without a significant loss. In the big data regime, there are at least two sets of challenges that should be addressed simultaneously to optimize performance. The first class of challenges is to minimize the resource requirements for obtaining the data sketch within an error threshold in a timely manner. This favors sketching methods whose computational complexity scales readily to large amounts of data. The second class of challenges has to do with mapping the computation to increasingly heterogeneous modern …
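SSketch's own streaming algorithm is not reproduced above, but the general idea it builds on — maintaining a small ensemble of lower-dimensional structures that approximates the correlation structure of a stream processed one sample at a time — can be illustrated with a standard technique. The Python sketch below uses the Frequent Directions algorithm (explicitly not SSketch's method) to keep an ell × m sketch B whose Gram matrix B.T @ B approximates the m × m correlation matrix of the streamed samples; all names and parameters are illustrative assumptions.

```python
import numpy as np

def frequent_directions(stream, m, ell):
    """Maintain an ell x m sketch B such that B.T @ B approximates the
    sum of outer products x x^T over all samples seen so far."""
    B = np.zeros((ell, m))
    for x in stream:                                   # x: length-m sample vector
        zero_rows = np.flatnonzero(~B.any(axis=1))     # empty rows of the sketch
        if zero_rows.size == 0:
            # sketch is full: shrink every direction by the median energy
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt                        # at least half the rows are now zero
            zero_rows = np.flatnonzero(~B.any(axis=1))
        B[zero_rows[0]] = x                            # absorb the new sample
    return B

# toy usage: 10,000 streamed samples of dimension 256, sketched into 32 rows
rng = np.random.default_rng(0)
B = frequent_directions((rng.standard_normal(256) for _ in range(10000)), m=256, ell=32)
```

Each sample is touched exactly once and only the ell × m sketch is kept resident, which is the property that makes this family of methods attractive for bandwidth-limited FPGA realizations.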
We develop Idetic, a set of mechanisms that enable long computations on ultra-low-power Application Specific Integrated Circuits (ASICs) powered by energy-harvesting sources. We address the power transiency and unpredictability problem by optimally inserting checkpoints. Idetic targets high-level synthesis designs and automatically locates and embeds the checkpoints at the register-transfer level. We define an objective function that seeks checkpoints incurring minimum overhead while minimizing recomputation energy cost, and we develop a dynamic programming technique to solve the resulting optimization problem. For real-time operation, Idetic adaptively adjusts the checkpointing rate based on the energy available in the system. Idetic is deployed and evaluated on cryptographic benchmark circuits. The test platform harvests RF power through an RFID reader and stores the energy in a 3.3 µF capacitor. For storage of checkpointed data, we evaluate and compare the effectiveness of various non-volatile memories, including NAND Flash, PCM, and STTM. Extensive evaluations show that Idetic reliably enables execution of long computations under different source power patterns with low overhead. Our benchmark evaluations demonstrate that the area and energy overheads of checkpointing are less than 5% and 11%, respectively.
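Idetic's actual objective function and dynamic program operate on the register-transfer-level design and are not spelled out above. As a rough illustration of the kind of dynamic program involved, the Python function below places checkpoints over a linear chain of computational steps to minimize checkpoint overhead plus expected recomputation energy, under a deliberately simplistic constant per-step power-loss probability; the cost model and all names are our assumptions, not the paper's.

```python
def place_checkpoints(step_energy, ckpt_cost, loss_prob):
    """Pick checkpoint boundaries for a linear chain of computational steps,
    minimising checkpoint overhead plus expected recomputation energy.
    step_energy[i]: energy of step i; ckpt_cost[i]: cost of a checkpoint taken
    right after step i; loss_prob: assumed per-step probability of power loss."""
    n = len(step_energy)
    # best[j]: minimum expected cost of executing steps 0..j-1 with the state
    # persisted right after step j-1 (best[0] is the initial, already-saved state)
    best = [0.0] + [float("inf")] * n
    prev = [-1] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            segment = sum(step_energy[i:j])
            # if power is lost anywhere in steps i..j-1, the whole segment reruns
            expected_redo = loss_prob * (j - i) * segment
            ckpt = ckpt_cost[j - 1] if j < n else 0.0   # no checkpoint after the final step
            cost = best[i] + expected_redo + ckpt
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # backtrack the chosen checkpoint positions
    checkpoints, j = [], n
    while j > 0:
        if j < n:
            checkpoints.append(j - 1)
        j = prev[j]
    return best[n], sorted(checkpoints)

# toy usage: 8 steps, uniform step energies, checkpoints cheaper later in the chain
cost, cps = place_checkpoints([1.0] * 8, [0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.2, 0.2], 0.05)
```

The O(n²) recurrence captures the trade-off stated in the abstract: denser checkpoints pay more overhead up front, sparser checkpoints risk larger recomputation after a power loss.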
We propose a framework that enables intensive computation on ultra-low-power devices with discontinuous energy-harvesting supplies. We devise an optimization algorithm that efficiently partitions applications into smaller computational steps during high-level synthesis. Our system finds low-overhead checkpoints that minimize the recomputation cost incurred by power losses, then inserts the checkpoints at the design's register-transfer level. The checkpointing rate is automatically adapted to the source's real-time behavior. We evaluate our mechanisms on a battery-less RF energy-harvesting platform. Extensive experiments targeting applications in medical implant devices demonstrate our approach's ability to successfully execute complex computations under various supply patterns with low time, energy, and area overheads.
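The adaptive piece mentioned above — tuning the checkpointing rate to the supply's real-time behavior — can be illustrated with a simple hypothetical policy: checkpoint more aggressively as the storage capacitor drains toward its brown-out threshold. The linear policy, thresholds, and names below are ours, not the paper's.

```python
def adapt_checkpoint_interval(v_cap, v_min, v_full, max_interval, min_interval):
    """Scale the checkpoint interval (e.g. in executed steps) with the usable
    energy left in the storage capacitor.  Stored energy scales with V^2, so
    the remaining-energy fraction is computed from squared voltages."""
    frac = (v_cap ** 2 - v_min ** 2) / (v_full ** 2 - v_min ** 2)
    frac = max(0.0, min(1.0, frac))          # clamp to [0, 1]
    return min_interval + frac * (max_interval - min_interval)

# example: 3.3 V when full, 1.8 V brown-out threshold, current reading 2.2 V
interval = adapt_checkpoint_interval(2.2, 1.8, 3.3, max_interval=100, min_interval=5)
```

A nearly full capacitor yields the maximum interval (few checkpoints, low overhead), while a nearly depleted one drops to the minimum interval so that only a small amount of work can be lost before the next power outage.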
This paper introduces AHEAD, a novel domain-specific framework for automated (hardware-based) acceleration of massive data analysis applications with a dense (non-sparse) correlation matrix. Because matrix inversion does not scale, iterative computation is often used to converge to a solution. AHEAD addresses two sets of domain-specific matrix computation challenges: first, the I/O and memory bandwidth constraints that limit the performance of hardware accelerators; second, the difficulty of handling large data due to the complexity of known matrix transformations and the inseparability of non-sparse correlations. The inseparability problem translates into an increased communication cost with the accelerators. To optimize performance within these limits, AHEAD learns the dependency structure of the domain data and suggests a scalable matrix transformation. The transformation minimizes the memory accesses required for matrix computation within an error threshold and thus optimizes the mapping of domain data to the available (bandwidth-constrained) accelerator resources. To facilitate automation, AHEAD also provides an Application Programming Interface (API) so users can customize the framework to an arbitrary iterative analysis algorithm and hardware mapping. A proof-of-concept implementation of AHEAD is performed on the widely used compressive sensing and general ℓ1-regularized least squares solvers. On a massive light field imaging dataset with 4.6B non-zeros, AHEAD attains up to 320x iteration speedup using reconfigurable hardware accelerators compared with the conventional solver, and about 4x improvement compared to our transformed-matrix solver on a general-purpose processor (without hardware acceleration).
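AHEAD's specific transformation and accelerator mapping are not detailed above, but the interplay between a factored, memory-friendly operator and an iterative ℓ1 solver can be illustrated generically. The Python sketch below runs plain ISTA (iterative shrinkage-thresholding) for ℓ1-regularized least squares while accessing the data only through matrix-vector products; given a learned factorization A ≈ D·V, those products are computed as D(Vx), so the full dense matrix never has to be formed or streamed. This is a generic illustration under our own assumptions, not AHEAD's algorithm.

```python
import numpy as np

def ista_l1(matvec, rmatvec, y, lam, L, iters=200):
    """ISTA for min_x 0.5*||A x - y||^2 + lam*||x||_1, where A is accessed
    only via matvec (x -> A x) and rmatvec (r -> A^T r).  L is an upper
    bound on the largest eigenvalue of A^T A (a Lipschitz constant)."""
    x = np.zeros(rmatvec(y).shape[0])
    for _ in range(iters):
        grad = rmatvec(matvec(x) - y)                   # gradient of the quadratic term
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x

# Illustration of a factored operator A ~= D @ V: products are computed as
# D @ (V @ x), so only the small factors D and V are ever touched.
rng = np.random.default_rng(0)
m, k, n = 2000, 50, 1000
D, V = rng.standard_normal((m, k)), rng.standard_normal((k, n))
y = rng.standard_normal(m)
L = (np.linalg.norm(D, 2) * np.linalg.norm(V, 2)) ** 2          # crude Lipschitz bound
x_hat = ista_l1(lambda x: D @ (V @ x), lambda r: V.T @ (D.T @ r), y, lam=0.1, L=L, iters=100)
```

Each iteration costs O(mk + kn) memory traffic instead of O(mn), which is the kind of reduction that makes a bandwidth-constrained accelerator mapping worthwhile.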