- research-article, December 2024
openGauss: An Enterprise-Grade Open-Source Database System
Journal of Computer Science and Technology (JCST), Volume 39, Issue 5, Pages 1007–1028, https://doi.org/10.1007/s11390-024-4302-2
Abstract: We have built openGauss, an enterprise-grade open-source database system. openGauss has fulfilled its design goal of high performance, high availability, high security, and high intelligence. For high performance, it leverages NUMA (non-uniform ...
- research-article, August 2024
Accelerated Constrained Sparse Tensor Factorization on Massively Parallel Architectures
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, Pages 107–116, https://doi.org/10.1145/3673038.3673128
This study presents the first constrained sparse tensor factorization (cSTF) framework that optimizes and fully offloads computation to massively parallel GPU architectures, and the first performance characterization of cSTF on GPU architectures. In ...
- research-article, August 2024
Parallel spatiotemporally adaptive DEM-based snow simulation
Proceedings of the ACM on Computer Graphics and Interactive Techniques (PACMCGIT), Volume 7, Issue 3, Article No.: 50, Pages 1–20, https://doi.org/10.1145/3675374
This paper applies spatial and temporal adaptivity to an existing discrete element method (DEM) based snow simulation on the GPU. For spatial adaptivity, visually significant spatial regions are identified and simulated at varying resolutions. To this ...
- research-article, September 2024
Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning
PECS '24: Proceedings of the 4th Workshop on Performance and Energy Efficiency in Concurrent and Distributed Systems, Pages 1–8, https://doi.org/10.1145/3659997.3660032
This paper investigates the design of parallel general matrix multiplication (gemm) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to ...
- research-article, August 2024
ADTopk: All-Dimension Top-k Compression for High-Performance Data-Parallel DNN Training
HPDC '24: Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, Pages 135–147, https://doi.org/10.1145/3625549.3658678
Data-parallel deep neural networks (DNN) training systems deployed across nodes have been widely used in various domains, while the system performance is often bottlenecked by the communication overhead among workers for synchronizing gradients. Top-k ...
- research-article, July 2024
Striking Trade-off Between High Performance and Energy Efficiency in an Edge Computing Application for Detecting Floating Plastic Debris
FRAME '24: Proceedings of the 4th Workshop on Flexible Resource and Application Management on the Edge, Pages 51–58, https://doi.org/10.1145/3659994.3660309
The Edge Computing environments facilitate the creation of pervasive applications distributed across vast geographical regions, addressing particular challenges associated with centralized information processing, such as network bandwidth saturation and ...
Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM
- Guillermo Alaejos,
- Adrián Castelló,
- Pedro Alonso-Jordá,
- Francisco D. Igual,
- Héctor Martínez,
- Enrique S. Quintana-Ortí
ACM Transactions on Mathematical Software (TOMS), Volume 50, Issue 1, Article No.: 6, Pages 1–34, https://doi.org/10.1145/3638532
We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS, and OpenBLAS, to obtain high-performance ...
Tackling the Matrix Multiplication Micro-kernel Generation with Exo
CGO '24: Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, Pages 182–192, https://doi.org/10.1109/CGO57630.2024.10444883
The optimization of the matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a ...
- research-article, June 2024
Towards High-Performance Graph Processing: From a Hardware/Software Co-Design Perspective
- Xiao-Fei Liao,
- Wen-Ju Zhao,
- Hai Jin,
- Peng-Cheng Yao,
- Yu Huang,
- Qing-Gang Wang,
- Jin Zhao,
- Long Zheng,
- Yu Zhang,
- Zhi-Yuan Shao
Journal of Computer Science and Technology (JCST), Volume 39, Issue 2, Pages 245–266, https://doi.org/10.1007/s11390-024-4150-0
Abstract: Graph processing has been widely used in many scenarios, from scientific computing to artificial intelligence. Graph processing exhibits irregular computational parallelism and random memory accesses, unlike traditional workloads. Therefore, ...
- research-article, March 2024
Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors
- Rafael Rodríguez-Sánchez,
- Adrián Castelló,
- Sandra Catalán,
- Francisco D. Igual,
- Enrique S. Quintana-Ortí,
- Jesus Carretero Perez
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 38, Issue 2, Pages 55–68, https://doi.org/10.1177/10943420231157653
Malleability is defined as the ability to vary the degree of parallelism at runtime, and is regarded as a means to improve core occupation on state-of-the-art multicore processors that contain tens of computational cores per socket. This property is ...
- research-article, July 2023
Fast truncated SVD of sparse and dense matrices on graphics processors
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 37, Issue 3-4, Pages 380–393, https://doi.org/10.1177/10943420231179699
We investigate the solution of low-rank matrix approximation problems using the truncated singular value decomposition (SVD). For this purpose, we develop and optimize graphics processing unit (GPU) implementations for the randomized SVD and a blocked ...
- abstract, June 2023
Architectural Support for Efficient Data Movement in Fully Disaggregated Systems
- Christina Giannoula,
- Kailong Huang,
- Jonathan Tang,
- Nectarios Koziris,
- Georgios Goumas,
- Zeshan Chishti,
- Nandita Vijaykumar
SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Pages 5–6, https://doi.org/10.1145/3578338.3593533
Traditional data centers include monolithic servers that tightly integrate CPU, memory and disk (Figure 1a). Instead, Disaggregated Systems (DSs) [8, 13, 18, 27] organize multiple compute (CC), memory (MC) and storage devices as independent, failure-...
Also Published in:
ACM SIGMETRICS Performance Evaluation Review: Volume 51 Issue 1
- Article, December 2024
Performance Analysis of BERT on RISC-V Processors with SIMD Units
- Héctor Martínez,
- Sandra Catalán,
- Carlos García,
- Francisco D. Igual,
- Rafael Rodríguez-Sánchez,
- Adrián Castelló,
- Enrique S. Quintana-Ortí
High Performance Computing. ISC High Performance 2024 International Workshops, Pages 325–338, https://doi.org/10.1007/978-3-031-73716-9_23
Abstract: Following the recent advances in open hardware generally, and RISC-V architectures particularly, we analyse the performance of transformer encoder inference on three low-power platforms with this type of architecture. For this purpose, we conduct ...
- research-article, March 2023
A Survey of Approximate Computing: From Arithmetic Units Design to High-Level Applications
Journal of Computer Science and Technology (JCST), Volume 38, Issue 2, Pages 251–272, https://doi.org/10.1007/s11390-023-2537-y
Abstract: Realizing a high-performance and energy-efficient circuit system is one of the critical tasks for circuit designers. Conventional researchers always concentrated on the tradeoffs between the energy and the performance in circuit and system design ...
- research-article, March 2023
DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems
- Christina Giannoula,
- Kailong Huang,
- Jonathan Tang,
- Nectarios Koziris,
- Georgios Goumas,
- Zeshan Chishti,
- Nandita Vijaykumar
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 7, Issue 1, Article No.: 16, Pages 1–36, https://doi.org/10.1145/3579445
Resource disaggregation offers a cost effective solution to resource scaling, utilization, and failure-handling in data centers by physically separating hardware devices in a server. Servers are architected as pools of processor, memory, and storage ...
- research-article, January 2023
MPSoC design and implementation using microblaze soft core processor architecture for faster execution of arithmetic application
International Journal of High Performance Systems Architecture (IJHPSA), Volume 11, Issue 3, Pages 156–168, https://doi.org/10.1504/ijhpsa.2023.130214
The research paper presents the design methodology with novel task distribution technique on multi-processor system on chip (MPSoC) for speeding up the execution of arithmetic application. Utilisation of multiple soft core processors on field programmable ...
- research-article, October 2022
Efficient and stable quorum-based log replication and replay for modern cluster-databases
Frontiers of Computer Science: Selected Publications from Chinese Universities (FCS), Volume 16, Issue 5, https://doi.org/10.1007/s11704-020-0210-y
Abstract: The modern in-memory database (IMDB) can support highly concurrent on-line transaction processing (OLTP) workloads and generate massive transactional logs per second. Quorum-based replication protocols such as Paxos or Raft have been widely used ...
- research-article, July 2022
Scalable low-rank factorization using a task-based runtime system with distributed memory
PASC '22: Proceedings of the Platform for Advanced Scientific Computing Conference, Article No.: 8, Pages 1–11, https://doi.org/10.1145/3539781.3539791
We present a new parallel task-based algorithm for randomized low-rank factorizations of a matrix and its application to fast hierarchical solvers. TaskTorrent is a lightweight, distributed, task-based runtime in C++. We explain how the randomized ...
Algorithm 1022: Efficient Algorithms for Computing a Rank-Revealing UTV Factorization on Parallel Computing Architectures
ACM Transactions on Mathematical Software (TOMS), Volume 48, Issue 2, Article No.: 21, Pages 1–42, https://doi.org/10.1145/3507466
Randomized singular value decomposition (RSVD) is by now a well-established technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin RSVD, the recently proposed algorithm “randUTV” ...