Search | arXiv e-print repository

doi 10.24963/ijcai.2024/1031

Digital Avatars: Framework Development and Their Evaluation

Authors: Timothy Rupprecht, Sung-En Chang, Yushu Wu, Lei Lu, Enfu Nan, Chih-hsiang Li, Caiyue Lai, Zhimin Li, Zhijun Hu, Yumei He, David Kaeli, Yanzhi Wang

Abstract: We present a novel prompting strategy for artificial intelligence driven digital avatars. To better quantify how our prompting strategy affects anthropomorphic features like humor, authenticity, and favorability we present Crowd Vote - an adaptation of Crowd Score that allows for judges to elect a large language model (LLM) candidate over competitors answering the same or similar prompts. To visua… ▽ More We present a novel prompting strategy for artificial intelligence driven digital avatars. To better quantify how our prompting strategy affects anthropomorphic features like humor, authenticity, and favorability we present Crowd Vote - an adaptation of Crowd Score that allows for judges to elect a large language model (LLM) candidate over competitors answering the same or similar prompts. To visualize the responses of our LLM, and the effectiveness of our prompting strategy we propose an end-to-end framework for creating high-fidelity artificial intelligence (AI) driven digital avatars. This pipeline effectively captures an individual's essence for interaction and our streaming algorithm delivers a high-quality digital avatar with real-time audio-video streaming from server to mobile device. Both our visualization tool, and our Crowd Vote metrics demonstrate our AI driven digital avatars have state-of-the-art humor, authenticity, and favorability outperforming all competitors and baselines. In the case of our Donald Trump and Joe Biden avatars, their authenticity and favorability are rated higher than even their real-world equivalents. △ Less

Submitted 7 August, 2024; originally announced August 2024.

Comments: This work was presented during the IJCAI 2024 conference proceedings for demonstrations

MSC Class: 68 ACM Class: D.2.2; C.3

Journal ref: 2024 Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence Demo Track. Pages 8780-8783

arXiv:2404.15510 [pdf, other]

NeuraChip: Accelerating GNN Computations with a Hash-based Decoupled Spatial Accelerator

Authors: Kaustubh Shivdikar, Nicolas Bohm Agostini, Malith Jayaweera, Gilbert Jonatan, Jose L. Abellan, Ajay Joshi, John Kim, David Kaeli

Abstract: Graph Neural Networks (GNNs) are emerging as a formidable tool for processing non-euclidean data across various domains, ranging from social network analysis to bioinformatics. Despite their effectiveness, their adoption has not been pervasive because of scalability challenges associated with large-scale graph datasets, particularly when leveraging message passing. To tackle these challenges, we… ▽ More Graph Neural Networks (GNNs) are emerging as a formidable tool for processing non-euclidean data across various domains, ranging from social network analysis to bioinformatics. Despite their effectiveness, their adoption has not been pervasive because of scalability challenges associated with large-scale graph datasets, particularly when leveraging message passing. To tackle these challenges, we introduce NeuraChip, a novel GNN spatial accelerator based on Gustavson's algorithm. NeuraChip decouples the multiplication and addition computations in sparse matrix multiplication. This separation allows for independent exploitation of their unique data dependencies, facilitating efficient resource allocation. We introduce a rolling eviction strategy to mitigate data idling in on-chip memory as well as address the prevalent issue of memory bloat in sparse graph computations. Furthermore, the compute resource load balancing is achieved through a dynamic reseeding hash-based mapping, ensuring uniform utilization of computing resources agnostic of sparsity patterns. Finally, we present NeuraSim, an open-source, cycle-accurate, multi-threaded, modular simulator for comprehensive performance analysis. Overall, NeuraChip presents a significant improvement, yielding an average speedup of 22.1x over Intel's MKL, 17.1x over NVIDIA's cuSPARSE, 16.7x over AMD's hipSPARSE, and 1.5x over prior state-of-the-art SpGEMM accelerator and 1.3x over GNN accelerator. The source code for our open-sourced simulator and performance visualizer is publicly accessible on GitHub https://neurachip.us △ Less

Submitted 26 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

Comments: Visit https://neurachip.us for WebGUI based simulations

arXiv:2402.19184 [pdf, other]

Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR

Authors: Jude Haris, Nicolas Bohm Agostini, Antonino Tumeo, David Kaeli, José Cano

Abstract: As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage the capabilities of these accelerators. This approach saves time and reduces the likelihood of errors that can occur during manual implementation. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom ac… ▽ More As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage the capabilities of these accelerators. This approach saves time and reduces the likelihood of errors that can occur during manual implementation. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom accelerators for linear algebra problems. By leveraging specific compiler optimizations, we can further increase accelerator utilization. In this work we offer two key observations through a MatMul accelerator case study. First, the accelerator's compute core utilization is less than 10%, and second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers. We identify a set of missing host code optimizations to improve the under-utilization and the latency bottleneck. Therefore, we propose three key host-code data-movement-related optimizations, extending AXI4MLIR. The optimizations provide DMA-based data allocation, coalescing of DMA transfers, and pipelining of the accelerator's load, compute, and store stages. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2312.14821 [pdf, other]

doi 10.1109/CGO57630.2024.10444801

AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators

Authors: Nicolas Bohm Agostini, Jude Haris, Perry Gibson, Malith Jayaweera, Norm Rubin, Antonino Tumeo, José L. Abellán, José Cano, David Kaeli

Abstract: This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator in… ▽ More This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and exploit the host memory hierarchy. We demonstrate AXI4MLIR's versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65x speedup compared to manually optimized driver code implementations. AXI4MLIR implementation is open-source and available at: https://github.com/AXI4MLIR/axi4mlir. △ Less

Submitted 22 December, 2023; originally announced December 2023.

Comments: 13 pages, 17 figures, to appear in CGO2024

ACM Class: D.3.3

arXiv:2312.08656 [pdf, other]

MaxK-GNN: Extremely Fast GPU Kernel Design for Accelerating Graph Neural Networks Training

Authors: Hongwu Peng, Xi Xie, Kaustubh Shivdikar, MD Amit Hasan, Jiahui Zhao, Shaoyi Huang, Omer Khan, David Kaeli, Caiwen Ding

Abstract: In the acceleration of deep neural network training, the GPU has become the mainstream platform. GPUs face substantial challenges on GNNs, such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges but memory traffic is still significant. We argue t… ▽ More In the acceleration of deep neural network training, the GPU has become the mainstream platform. GPUs face substantial challenges on GNNs, such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges but memory traffic is still significant. We argue that drastic performance improvements can only be achieved by the vertical optimization of algorithm and system innovations, rather than treating the speedup optimization as an "after-thought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity and provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after nonlinearity; (ii) We design a coalescing enhanced forward computation with row-wise product-based SpGEMM Kernel using CBSR for input feature matrix fetching and strategic placement of a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with outer product-based and SSpMM Kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that MaxK-GNN system could approach the theoretical speedup limit according to Amdahl's law. We achieve comparable accuracy to SOTA GNNs, but at a significantly increased speed: 3.22/4.24 times speedup (vs. theoretical limits, 5.52/7.27 times) on Reddit compared to DGL and GNNAdvisor implementations. △ Less

Submitted 18 March, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: ASPLOS 2024 accepted publication

ACM Class: I.2; C.5

arXiv:2310.17746 [pdf]

Memory Efficient Multithreaded Incremental Segmented Sieve Algorithm

Authors: Evan Ning, David Kaeli

Abstract: Prime numbers are fundamental in number theory and play a significant role in various areas, from pure mathematics to practical applications, including cryptography. In this contribution, we introduce a multithreaded implementation of the Segmented Sieve algorithm. In our implementation, instead of handling large prime ranges in one iteration, the sieving process is broken down incrementally, whic… ▽ More Prime numbers are fundamental in number theory and play a significant role in various areas, from pure mathematics to practical applications, including cryptography. In this contribution, we introduce a multithreaded implementation of the Segmented Sieve algorithm. In our implementation, instead of handling large prime ranges in one iteration, the sieving process is broken down incrementally, which theoretically eliminates the challenges of working with large numbers, and can reduce memory usage, providing overall more efficient multi-core utilization over extended computations. △ Less

Submitted 26 October, 2023; originally announced October 2023.

Comments: 10 pages

arXiv:2309.11001 [pdf, other]

doi 10.1145/3613424.3614279

GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

Authors: Kaustubh Shivdikar, Yuhui Bao, Rashmi Agrawal, Michael Shen, Gilbert Jonatan, Evelio Mora, Alexander Ingare, Neal Livesay, José L. Abellán, John Kim, Ajay Joshi, David Kaeli

Abstract: Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computatio… ▽ More Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined $64$-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by $19\%$. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of $796\times$, $14.2\times$, and $2.3\times$ over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively. △ Less

Submitted 19 September, 2023; originally announced September 2023.

arXiv:2301.13101 [pdf, other]

doi 10.1145/3544548.3581570

Thought Bubbles: A Proxy into Players' Mental Model Development

Authors: Omid Mohaddesi, Noah Chicoine, Min Gong, Ozlem Ergun, Jacqueline Griffin, David Kaeli, Stacy Marsella, Casper Harteveld

Abstract: Studying mental models has recently received more attention, aiming to understand the cognitive aspects of human-computer interaction. However, there is not enough research on the elicitation of mental models in complex dynamic systems. We present Thought Bubbles as an approach for eliciting mental models and an avenue for understanding players' mental model development in interactive virtual envi… ▽ More Studying mental models has recently received more attention, aiming to understand the cognitive aspects of human-computer interaction. However, there is not enough research on the elicitation of mental models in complex dynamic systems. We present Thought Bubbles as an approach for eliciting mental models and an avenue for understanding players' mental model development in interactive virtual environments. We demonstrate the use of Thought Bubbles in two experimental studies involving 250 participants playing a supply chain game. In our analyses, we rely on Situation Awareness (SA) levels, including perception, comprehension, and projection, and show how experimental manipulations such as disruptions and information sharing shape players' mental models and drive their decisions depending on their behavioral profile. Our results provide evidence for the use of thought bubbles in uncovering cognitive aspects of behavior by indicating how disruption location and availability of information affect people's mental model development and influence their decisions. △ Less

Submitted 30 January, 2023; originally announced January 2023.

Comments: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), April 23--28, 2023, Hamburg, Germany

arXiv:2209.01290 [pdf, other]

Accelerating Polynomial Multiplication for Homomorphic Encryption on GPUs

Authors: Kaustubh Shivdikar, Gilbert Jonatan, Evelio Mora, Neal Livesay, Rashmi Agrawal, Ajay Joshi, Jose Abellan, John Kim, David Kaeli

Abstract: Homomorphic Encryption (HE) enables users to securely outsource both the storage and computation of sensitive data to untrusted servers. Not only does HE offer an attractive solution for security in cloud systems, but lattice-based HE systems are also believed to be resistant to attacks by quantum computers. However, current HE implementations suffer from prohibitively high latency. For lattice-ba… ▽ More Homomorphic Encryption (HE) enables users to securely outsource both the storage and computation of sensitive data to untrusted servers. Not only does HE offer an attractive solution for security in cloud systems, but lattice-based HE systems are also believed to be resistant to attacks by quantum computers. However, current HE implementations suffer from prohibitively high latency. For lattice-based HE to become viable for real-world systems, it is necessary for the key bottlenecks - particularly polynomial multiplication - to be highly efficient. In this paper, we present a characterization of GPU-based implementations of polynomial multiplication. We begin with a survey of modular reduction techniques and analyze several variants of the widely-used Barrett modular reduction algorithm. We then propose a modular reduction variant optimized for 64-bit integer words on the GPU, obtaining a 1.8x speedup over the existing comparable implementations. Next, we explore the following GPU-specific improvements for polynomial multiplication targeted at optimizing latency and throughput: 1) We present a 2D mixed-radix, multi-block implementation of NTT that results in a 1.85x average speedup over the previous state-of-the-art. 2) We explore shared memory optimizations aimed at reducing redundant memory accesses, further improving speedups by 1.2x. 3) Finally, we fuse the Hadamard product with neighboring stages of the NTT, reducing the twiddle factor memory footprint by 50%. By combining our NTT optimizations, we achieve an overall speedup of 123.13x and 2.37x over the previous state-of-the-art CPU and GPU implementations of NTT kernels, respectively. △ Less

Submitted 2 September, 2022; originally announced September 2022.

Comments: Accepted, to be pusblished at SEED 2022 conference (IEEE International Symposium on Secure and Private Execution Environment Design)

arXiv:2201.02694 [pdf, other]

doi 10.1145/3491102.3502089

To Trust or to Stockpile: Modeling Human-Simulation Interaction in Supply Chain Shortages

Authors: Omid Mohaddesi, Jacqueline Griffin, Ozlem Ergun, David Kaeli, Stacy Marsella, Casper Harteveld

Abstract: Understanding decision-making in dynamic and complex settings is a challenge yet essential for preventing, mitigating, and responding to adverse events (e.g., disasters, financial crises). Simulation games have shown promise to advance our understanding of decision-making in such settings. However, an open question remains on how we extract useful information from these games. We contribute an app… ▽ More Understanding decision-making in dynamic and complex settings is a challenge yet essential for preventing, mitigating, and responding to adverse events (e.g., disasters, financial crises). Simulation games have shown promise to advance our understanding of decision-making in such settings. However, an open question remains on how we extract useful information from these games. We contribute an approach to model human-simulation interaction by leveraging existing methods to characterize: (1) system states of dynamic simulation environments (with Principal Component Analysis), (2) behavioral responses from human interaction with simulation (with Hidden Markov Models), and (3) behavioral responses across system states (with Sequence Analysis). We demonstrate this approach with our game simulating drug shortages in a supply chain context. Results from our experimental study with 135 participants show different player types (hoarders, reactors, followers), how behavior changes in different system states, and how sharing information impacts behavior. We discuss how our findings challenge existing literature. △ Less

Submitted 7 January, 2022; originally announced January 2022.

arXiv:2110.00478 [pdf, other]

SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference

Authors: Jude Haris, Perry Gibson, José Cano, Nicolas Bohm Agostini, David Kaeli

Abstract: Edge computing devices inherently face tight resource constraints, which is especially apparent when deploying Deep Neural Networks (DNN) with high memory and compute demands. FPGAs are commonly available in edge devices. Since these reconfigurable circuits can achieve higher throughput and lower power consumption than general purpose processors, they are especially well-suited for DNN acceleratio… ▽ More Edge computing devices inherently face tight resource constraints, which is especially apparent when deploying Deep Neural Networks (DNN) with high memory and compute demands. FPGAs are commonly available in edge devices. Since these reconfigurable circuits can achieve higher throughput and lower power consumption than general purpose processors, they are especially well-suited for DNN acceleration. However, existing solutions for designing FPGA-based DNN accelerators for edge devices come with high development overheads, given the cost of repeated FPGA synthesis passes, reimplementation in a Hardware Description Language (HDL) of the simulated design, and accelerator system integration. In this paper we propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized DNN inference accelerators on edge devices with FPGAs. SECDA combines cost-effective SystemC simulation with hardware execution, streamlining design space exploration and the development process via reduced design evaluation time. As a case study, we use SECDA to efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a platform that includes an edge FPGA. We quickly and iteratively explore the system's hardware/software stack, while identifying and mitigating performance bottlenecks. We evaluate the two accelerator designs with four common DNN models, achieving an average performance speedup across models of up to 3.5$\times$ with a 2.9$\times$ reduction in energy consumption over CPU-only inference. Our code is available at https://github.com/gicLAB/SECDA △ Less

Submitted 1 October, 2021; originally announced October 2021.

Comments: This paper is accepted to SBAC-PAD 2021

arXiv:2108.08910 [pdf, other]

Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search

Authors: Zheng Zhan, Yifan Gong, Pu Zhao, Geng Yuan, Wei Niu, Yushu Wu, Tianyun Zhang, Malith Jayaweera, David Kaeli, Bin Ren, Xue Lin, Yanzhi Wang

Abstract: Though recent years have witnessed remarkable progress in single image super-resolution (SISR) tasks with the prosperous development of deep neural networks (DNNs), the deep learning methods are confronted with the computation and memory consumption issues in practice, especially for resource-limited platforms such as mobile devices. To overcome the challenge and facilitate the real-time deploymen… ▽ More Though recent years have witnessed remarkable progress in single image super-resolution (SISR) tasks with the prosperous development of deep neural networks (DNNs), the deep learning methods are confronted with the computation and memory consumption issues in practice, especially for resource-limited platforms such as mobile devices. To overcome the challenge and facilitate the real-time deployment of SISR tasks on mobile, we combine neural architecture search with pruning search and propose an automatic search framework that derives sparse super-resolution (SR) models with high image quality while satisfying the real-time inference requirement. To decrease the search cost, we leverage the weight sharing strategy by introducing a supernet and decouple the search problem into three stages, including supernet construction, compiler-aware architecture and pruning search, and compiler-aware pruning ratio search. With the proposed framework, we are the first to achieve real-time SR inference (with only tens of milliseconds per frame) for implementing 720p resolution with competitive image quality (in terms of PSNR and SSIM) on mobile platforms (Samsung Galaxy S20). △ Less

Submitted 14 February, 2023; v1 submitted 18 August, 2021; originally announced August 2021.

arXiv:2104.00828 [pdf, other]

Daisen: A Framework for Visualizing Detailed GPU Execution

Authors: Yifan Sun, Yixuan Zhang, Ali Mosallaei, Michael D. Shah, Cody Dunne, David Kaeli

Abstract: Graphics Processing Units (GPUs) have been widely used to accelerate artificial intelligence, physics simulation, medical imaging, and information visualization applications. To improve GPU performance, GPU hardware designers need to identify performance issues by inspecting a huge amount of simulator-generated traces. Visualizing the execution traces can reduce the cognitive burden of users and f… ▽ More Graphics Processing Units (GPUs) have been widely used to accelerate artificial intelligence, physics simulation, medical imaging, and information visualization applications. To improve GPU performance, GPU hardware designers need to identify performance issues by inspecting a huge amount of simulator-generated traces. Visualizing the execution traces can reduce the cognitive burden of users and facilitate making sense of behaviors of GPU hardware components. In this paper, we first formalize the process of GPU performance analysis and characterize the design requirements of visualizing execution traces based on a survey study and interviews with GPU hardware designers. We contribute data and task abstraction for GPU performance analysis. Based on our task analysis, we propose Daisen, a framework that supports data collection from GPU simulators and provides visualization of the simulator-generated GPU execution traces. Daisen features a data abstraction and trace format that can record simulator-generated GPU execution traces. Daisen also includes a web-based visualization tool that helps GPU hardware designers examine GPU execution traces, identify performance bottlenecks, and verify performance improvement. Our qualitative evaluation with GPU hardware designers demonstrates that the design of Daisen reflects the typical workflow of GPU hardware designers. Using Daisen, participants were able to effectively identify potential performance bottlenecks and opportunities for performance improvement. The open-sourced implementation of Daisen can be found at gitlab.com/akita/vis. Supplemental materials including a demo video, survey questions, evaluation study guide, and post-study evaluation survey are available at osf.io/j5ghq. △ Less

Submitted 1 April, 2021; originally announced April 2021.

Comments: EuroVis Camera Ready

arXiv:2008.02300 [pdf, other]

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

Authors: Saiful A. Mojumder, Yifan Sun, Leila Delshadtehrani, Yenai Ma, Trinayan Baruah, José L. Abellán, John Kim, David Kaeli, Ajay Joshi

Abstract: The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count because of the overhead of data movement across multiple GPUs. Moreover, a lack of hardware support for coherency exacerbates the problem because a prog… ▽ More The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count because of the overhead of data movement across multiple GPUs. Moreover, a lack of hardware support for coherency exacerbates the problem because a programmer must either replicate the data across GPUs or fetch the remote data using high-overhead off-chip links. To address these problems, we propose a multi-GPU system with truly shared memory (MGPU-TSM), where the main memory is physically shared across all the GPUs. We eliminate remote accesses and avoid data replication using an MGPU-TSM system, which simplifies the memory hierarchy. Our preliminary analysis shows that MGPU-TSM with 4 GPUs performs, on average, 3.9x? better than the current best performing multi-GPU configuration for standard application benchmarks. △ Less

Submitted 8 August, 2020; v1 submitted 5 August, 2020; originally announced August 2020.

Comments: 4 pages, 3 figures

arXiv:2007.16175 [pdf, other]

Hardware/Software Obfuscation against Timing Side-channel Attack on a GPU

Authors: Elmira Karimi, Yunsi Fei, David Kaeli

Abstract: GPUs are increasingly being used in security applications, especially for accelerating encryption/decryption. While GPUs are an attractive platform in terms of performance, the security of these devices raises a number of concerns. One vulnerability is the data-dependent timing information, which can be exploited by adversary to recover the encryption key. Memory system features are frequently exp… ▽ More GPUs are increasingly being used in security applications, especially for accelerating encryption/decryption. While GPUs are an attractive platform in terms of performance, the security of these devices raises a number of concerns. One vulnerability is the data-dependent timing information, which can be exploited by adversary to recover the encryption key. Memory system features are frequently exploited since they create detectable timing variations. In this paper, our attack model is a coalescing attack, which leverages a critical GPU microarchitectural feature -- the coalescing unit. As multiple concurrent GPU memory requests can refer to the same cache block, the coalescing unit collapses them into a single memory transaction. The access time of an encryption kernel is dependent on the number of transactions. Correlation between a guessed key value and the associated timing samples can be exploited to recover the secret key. In this paper, a series of hardware/software countermeasures are proposed to obfuscate the memory timing side channel, making the GPU more resilient without impacting performance. Our hardware-based approach attempts to randomize the width of the coalescing unit to lower the signal-to-noise ratio. We present a hierarchical Miss Status Holding Register (MSHR) design that can merge transactions across different warps. This feature boosts performance, while, at the same time, secures the execution. We also present a software-based approach to permute the organization of critical data structures, significantly changing the coalescing behavior and introducing a high degree of randomness. Equipped with our new protections, the effort to launch a successful attack is increased up to 1433X . 178X, while also improving encryption/decryption performance up to 7%. △ Less

Submitted 31 July, 2020; originally announced July 2020.

Comments: 2020 IEEE International Symposium on Hardware Oriented Security and Trust (HOST)

arXiv:2007.04292 [pdf, other]

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

Authors: Saiful A. Mojumder, Yifan Sun, Leila Delshadtehrani, Yenai Ma, Trinayan Baruah, José L. Abellán, John Kim, David Kaeli, Ajay Joshi

Abstract: While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs' memor… ▽ More While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs' memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestamp-based coherence protocol, HALCONE, for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HALCONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1\%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6$\times$ and 3$\times$ better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks. △ Less

Submitted 8 July, 2020; originally announced July 2020.

Comments: 13 pages, 9 figures

arXiv:2006.01402 [pdf, other]

A Smart Background Scheduler for Storage Systems

Authors: Maher Kachmar, David Kaeli

Abstract: In today's enterprise storage systems, supported data services such as snapshot delete or drive rebuild can cause tremendous performance interference if executed inline along with heavy foreground IO, often leading to missing SLOs (Service Level Objectives). Typical storage system applications such as web or VDI (Virtual Desktop Infrastructure) follow a repetitive high/low workload pattern that ca… ▽ More In today's enterprise storage systems, supported data services such as snapshot delete or drive rebuild can cause tremendous performance interference if executed inline along with heavy foreground IO, often leading to missing SLOs (Service Level Objectives). Typical storage system applications such as web or VDI (Virtual Desktop Infrastructure) follow a repetitive high/low workload pattern that can be learned and forecasted. We propose a priority-based background scheduler that learns this repetitive pattern and allows storage systems to maintain peak performance and in turn meet service level objectives (SLOs) while supporting a number of data services. When foreground IO demand intensifies, system resources are dedicated to service foreground IO requests and any background processing that can be deferred are recorded to be processed in future idle cycles as long as forecast shows that storage pool has remaining capacity. The smart background scheduler adopts a resource partitioning model that allows both foreground and background IO to execute together as long as foreground IOs are not impacted where the scheduler harness any free cycle to clear background debt. Using traces from VDI application, we show how our technique surpasses a method that statically limit the deferred background debt and improve SLO violations from 54.6% when using a fixed background debt watermark to merely a 6.2% if dynamically set by our smart background scheduler. △ Less

Submitted 2 June, 2020; originally announced June 2020.

arXiv:1911.11313 [pdf, other]

Summarizing CPU and GPU Design Trends with Product Data

Authors: Yifan Sun, Nicolas Bohm Agostini, Shi Dong, David Kaeli

Abstract: Moore's Law and Dennard Scaling have guided the semiconductor industry for the past few decades. Recently, both laws have faced validity challenges as transistor sizes approach the practical limits of physics. We are interested in testing the validity of these laws and reflect on the reasons responsible. In this work, we collect data of more than 4000 publicly-available CPU and GPU products. We fi… ▽ More Moore's Law and Dennard Scaling have guided the semiconductor industry for the past few decades. Recently, both laws have faced validity challenges as transistor sizes approach the practical limits of physics. We are interested in testing the validity of these laws and reflect on the reasons responsible. In this work, we collect data of more than 4000 publicly-available CPU and GPU products. We find that transistor scaling remains critical in keeping the laws valid. However, architectural solutions have become increasingly important and will play a larger role in the future. We observe that GPUs consistently deliver higher performance than CPUs. GPU performance continues to rise because of increases in GPU frequency, improvements in the thermal design power (TDP), and growth in die size. But we also see the ratio of GPU to CPU performance moving closer to parity, thanks to new SIMD extensions on CPUs and increased CPU core counts. △ Less

Submitted 13 July, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: Fix flops/watt error

arXiv:1909.03441 [pdf, other]

Iterative Spectral Method for Alternative Clustering

Authors: Chieh Wu, Stratis Ioannidis, Mario Sznaier, Xiangyu Li, David Kaeli, Jennifer G. Dy

Abstract: Given a dataset and an existing clustering as input, alternative clustering aims to find an alternative partition. One of the state-of-the-art approaches is Kernel Dimension Alternative Clustering (KDAC). We propose a novel Iterative Spectral Method (ISM) that greatly improves the scalability of KDAC. Our algorithm is intuitive, relies on easily implementable spectral decompositions, and comes wit… ▽ More Given a dataset and an existing clustering as input, alternative clustering aims to find an alternative partition. One of the state-of-the-art approaches is Kernel Dimension Alternative Clustering (KDAC). We propose a novel Iterative Spectral Method (ISM) that greatly improves the scalability of KDAC. Our algorithm is intuitive, relies on easily implementable spectral decompositions, and comes with theoretical guarantees. Its computation time improves upon existing implementations of KDAC by as much as 5 orders of magnitude. △ Less

Submitted 8 September, 2019; originally announced September 2019.

arXiv:1811.02884 [pdf, other]

MGSim + MGMark: A Framework for Multi-GPU System Research

Authors: Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Rafael Ubal, Xiang Gong, Shane Treadway, Yuhui Bao, Vincent Zhao, José L. Abellán, John Kim, Ajay Joshi, David Kaeli

Abstract: The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy the performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microar… ▽ More The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy the performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries and associated programming models. The research community currently lacks a publically available and comprehensive multi-GPU simulation framework and benchmark suite to evaluate multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and comes with in-built support for multi-threaded execution to enable fast and efficient simulations. In terms of performance accuracy, MGSim differs $5.5\%$ on average when compared against actual GPU hardware. We also achieve a $3.5\times$ and a $2.5\times$ average speedup in function emulation and architectural simulation with 4 CPU cores, while delivering the same accuracy as the serial simulation. We illustrate the novel simulation capabilities provided by our simulator through a case study exploring programming models based on a unified multi-GPU system (U-MGPU) and a discrete multi-GPU system (D-MGPU) that both utilize unified memory space and cross-GPU memory access. We evaluate the design implications from our case study, suggesting that D-MGPU is an attractive programming model for future multi-GPU systems. △ Less

Submitted 13 November, 2018; v1 submitted 15 October, 2018; originally announced November 2018.

Comments: Updated typo

arXiv:1809.05165 [pdf, other]

doi 10.1145/3240765.3264699

Defensive Dropout for Hardening Deep Neural Networks under Adversarial Attacks

Authors: Siyue Wang, Xiao Wang, Pu Zhao, Wujie Wen, David Kaeli, Peter Chin, Xue Lin

Abstract: Deep neural networks (DNNs) are known vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legal inputs, can mislead a DNN to classify them as any target labels. This work provides a solution to hardening DNNs under adversarial attacks through defensive dropout. Besides using dropout during training for the best test accu… ▽ More Deep neural networks (DNNs) are known vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legal inputs, can mislead a DNN to classify them as any target labels. This work provides a solution to hardening DNNs under adversarial attacks through defensive dropout. Besides using dropout during training for the best test accuracy, we propose to use dropout also at test time to achieve strong defense effects. We consider the problem of building robust DNNs as an attacker-defender two-player game, where the attacker and the defender know each others' strategies and try to optimize their own strategies towards an equilibrium. Based on the observations of the effect of test dropout rate on test accuracy and attack success rate, we propose a defensive dropout algorithm to determine an optimal test dropout rate given the neural network model and the attacker's strategy for generating adversarial examples.We also investigate the mechanism behind the outstanding defense effects achieved by the proposed defensive dropout. Comparing with stochastic activation pruning (SAP), another defense method through introducing randomness into the DNN model, we find that our defensive dropout achieves much larger variances of the gradients, which is the key for the improved defense effects (much lower attack success rate). For example, our defensive dropout can reduce the attack success rate from 100% to 13.89% under the currently strongest attack i.e., C&W attack on MNIST dataset. △ Less

Submitted 13 September, 2018; originally announced September 2018.

Comments: Accepted as conference paper on ICCAD 2018

arXiv:1711.03244 [pdf, other]

doi 10.1117/1.JBO.23.1.010504

Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms

Authors: Leiming Yu, Fanny Nina-Paravecino, David Kaeli, Qianqian Fang

Abstract: We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language (OpenCL) framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterog… ▽ More We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language (OpenCL) framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strat- egies are developed to obtain efficient simulations using multiple central processing units (CPUs) and GPUs. △ Less

Submitted 25 January, 2018; v1 submitted 8 November, 2017; originally announced November 2017.

Comments: Accepted for Publication in Journal of Biomedical Optics Letters on Jan 4, 2018, to appear in Volume 23, Issue 2

Journal ref: J. Biomed. Opt. 23(1), 010504 (2018)

arXiv:1609.06756 [pdf]

21st Century Computer Architecture

Authors: Mark D. Hill, Sarita Adve, Luis Ceze, Mary Jane Irwin, David Kaeli, Margaret Martonosi, Josep Torrellas, Thomas F. Wenisch, David Wood, Katherine Yelick

Abstract: Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to devel… ▽ More Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to develop by enabling higher-level programming abstractions (e.g., scripting languages and reusable components). Improvements in computer system cost-effectiveness enabled value creation that could never have been imagined by the field's founders (e.g., distributed web search sufficiently inexpensive so as to be covered by advertising links). The wide benefits of computer performance growth are clear. Recently, Danowitz et al. apportioned computer performance growth roughly equally between technology and architecture, with architecture credited with ~80x improvement since 1985. As semiconductor technology approaches its "end-of-the-road" (see below), computer architecture will need to play an increasing role in enabling future ICT innovation. But instead of asking, "How can I make my chip run faster?," architects must now ask, "How can I enable the 21st century infrastructure, from sensors to clouds, adding value from performance to privacy, but without the benefit of near-perfect technology scaling?". The challenges are many, but with appropriate investment, opportunities abound. Underlying these opportunities is a common theme that future architecture innovations will require the engagement of and investments from innovators in other ICT layers. △ Less

Submitted 21 September, 2016; originally announced September 2016.

Comments: A Computing Community Consortium (CCC) white paper, 16 pages

arXiv:0807.1765 [pdf]

doi 10.1007/978-3-642-03354-4_7

Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education

Authors: Renato Figueiredo, P. Oscar Boykin, Jose A. B. Fortes, Tao Li, Jie-Kwon Peir, David Wolinsky, Lizy John, David Kaeli, David Lilja, Sally McKee, Gokhan Memik, Alain Roy, Gary Tyson

Abstract: This paper introduces Archer, a community-based computing resource for computer architecture research and education. The Archer infrastructure integrates virtualization and batch scheduling middleware to deliver high-throughput computing resources aggregated from resources distributed across wide-area networks and owned by different participating entities in a seamless manner. The paper discusse… ▽ More This paper introduces Archer, a community-based computing resource for computer architecture research and education. The Archer infrastructure integrates virtualization and batch scheduling middleware to deliver high-throughput computing resources aggregated from resources distributed across wide-area networks and owned by different participating entities in a seamless manner. The paper discusses the motivations leading to the design of Archer, describes its core middleware components, and presents an analysis of the functionality and performance of a prototype wide-area deployment running a representative computer architecture simulation workload. △ Less

Submitted 10 July, 2008; originally announced July 2008.

Comments: 11 pages, 2 figures. Describes the Archer project, http://archer-project.org

ACM Class: C.0; I.6.3; C.2.4

Showing 1–24 of 24 results for author: Kaeli, D