Abstract: Despite the promising performance improvements observed in emerging many-core architectures for high-performance processors, high power consumption prohibitively affects their use and marketability in low-energy sectors such as embedded processors, network processors, and application-specific instruction-set processors (ASIPs). While most chip architects design power-efficient processors by finding an optimal power-performance balance in their designs, some use sophisticated on-chip autonomous power management units, which dynamically reduce the voltage or frequency of idle cores and hence extend battery life and reduce operating costs. For large-scale designs of many-core processors, a holistic approach integrating both of these techniques at different levels of abstraction can potentially achieve maximal power savings. In this paper we present CASPER, a robust instruction-trace-driven, cycle-accurate, many-core, multi-threading micro-architecture simulation platform where we ha...
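To make the second technique concrete, below is a minimal sketch of the kind of per-core dynamic voltage/frequency scaling (DVFS) policy such an on-chip power management unit might apply. This is not CASPER's actual policy; the frequency ladder, utilization thresholds, and names are illustrative assumptions.

```python
# Hypothetical per-core DVFS heuristic; the P-state ladder and the
# 20%/80% utilization thresholds are assumptions for illustration.
FREQ_LEVELS_GHZ = [0.8, 1.2, 1.6, 2.0]

def select_frequency(utilization, current_level):
    """Step a core's frequency down when it idles and back up under load."""
    if utilization < 0.2 and current_level > 0:
        return current_level - 1      # idle core: lower voltage/frequency
    if utilization > 0.8 and current_level < len(FREQ_LEVELS_GHZ) - 1:
        return current_level + 1      # busy core: restore performance
    return current_level              # otherwise hold the current P-state

level = 3  # start at the highest frequency
for util in [0.9, 0.1, 0.05, 0.5, 0.95]:
    level = select_frequency(util, level)
    print(f"util={util:.2f} -> {FREQ_LEVELS_GHZ[level]} GHz")
```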
While medium- and large-sized computing centers have increasingly relied on clusters of commodity PC hardware to provide cost-effective capacity and capability, it is not clear that this technology will scale to the PetaFLOP range. Semiconductor technology is expected to continue its exponential advancement over the next fifteen years; however, new issues are rapidly emerging, and the relative importance of current performance metrics is shifting. Future PetaFLOP architectures will require system designers to solve computer architecture problems ranging from how to house, power, and cool the machine, all the while remaining sensitive to cost. The Reconfigurable Computing Cluster (RCC) project is a multi-institution, multi-disciplinary project investigating the use of FPGAs to build cost-effective petascale computers. This paper describes the nascent project's objectives and a 64-node prototype cluster. Specifically, the aim is to provide a detailed motivation for the pr...
In this work, we quantize a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while maintaining less than a 0.5% drop in accuracy. To the best of our knowledge, this is the first attempt in the industry to quantize the Transformer model. This has high impact, as it clearly demonstrates the various complexities of quantizing a language translation model. We present novel quantization techniques directly in TensorFlow to opportunistically replace 32-bit floating-point (FP32) computations with 8-bit integers (INT8) and transform the FP32 computational graph. We also present a bin-packing parallel batching technique to maximize CPU utilization. Overall, our optimizations with INT8/VNNI deliver a 1.5X improvement over the best FP32 performance. Furthermore, this work reveals the opportunities and challenges to boost the performance of quantized deep learni...
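As a rough illustration of the FP32-to-INT8 replacement the abstract describes (not the authors' actual TensorFlow graph transforms), the sketch below emulates symmetric post-training quantization of a matrix multiply in plain numpy. The function names, the max-abs scale selection, and the signed-by-signed operand choice are assumptions made for brevity.

```python
import numpy as np

def quantize_int8(x_fp32):
    """Symmetric linear quantization of an FP32 tensor to INT8.

    The scale maps the largest magnitude in the tensor to 127, so the
    dequantized values cover the tensor's original dynamic range.
    """
    scale = np.max(np.abs(x_fp32)) / 127.0
    q = np.clip(np.round(x_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_fp32, b_fp32):
    """Emulate an INT8 matmul and dequantize the result to FP32."""
    qa, sa = quantize_int8(a_fp32)
    qb, sb = quantize_int8(b_fp32)
    # Accumulate products in INT32, as VNNI does in hardware (real VNNI
    # uses u8 x s8 operands; signed x signed keeps this sketch simple).
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

# Quick check: the quantized result tracks the FP32 reference closely.
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))
```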
Abstract: By the middle of this decade, uniprocessor architecture performance had hit a roadblock due to a combination of factors, such as excessive power dissipation from high operating frequencies, growing memory access latencies, diminishing returns on deeper ...