Keyword: high performance : Search

research-article

openGauss: An Enterprise-Grade Open-Source Database System

Journal of Computer Science and Technology (JCST), Volume 39, Issue 5, Pages 1007–1028, https://doi.org/10.1007/s11390-024-4302-2

Abstract

We have built openGauss, an enterprise-grade open-source database system. openGauss has fulfilled its design goal of high performance, high availability, high security, and high intelligence. For high performance, it leverages NUMA (non-uniform ...

Article

Open Access

Inference with Transformer Encoders on ARM and RISC-V Multicore Processors

Euro-Par 2024: Parallel Processing, Pages 377–392, https://doi.org/10.1007/978-3-031-69766-1_26

Abstract

We delve into the performance of transformer encoder inference on low-power multi-core processors from two perspectives: First, we conduct a detailed profile of the inference process for two members of the BERT family on a modern multi-core ...

research-article

Open Access

Accelerated Constrained Sparse Tensor Factorization on Massively Parallel Architectures

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, Pages 107–116, https://doi.org/10.1145/3673038.3673128

This study presents the first constrained sparse tensor factorization (cSTF) framework that optimizes and fully offloads computation to massively parallel GPU architectures, and the first performance characterization of cSTF on GPU architectures. In ...

research-article

Open Access

Parallel spatiotemporally adaptive DEM-based snow simulation

Proceedings of the ACM on Computer Graphics and Interactive Techniques (PACMCGIT), Volume 7, Issue 3, Article No.: 50, Pages 1–20, https://doi.org/10.1145/3675374

This paper applies spatial and temporal adaptivity to an existing discrete element method (DEM) based snow simulation on the GPU. For spatial adaptivity, visually significant spatial regions are identified and simulated at varying resolutions. To this ...

research-article

Open Access

Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning

PECS '24: Proceedings of the 4th Workshop on Performance and Energy Efficiency in Concurrent and Distributed Systems, Pages 1–8, https://doi.org/10.1145/3659997.3660032

This paper investigates the design of parallel general matrix multiplication (gemm) for a Versal Adaptive Compute Accelerated Platform (ACAP) equipped with a VC1902 system-on-chip and multiple Artificial Intelligence Engines (AIEs). Our efforts aim to ...

research-article

ADTopk: All-Dimension Top-k Compression for High-Performance Data-Parallel DNN Training

HPDC '24: Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, Pages 135–147, https://doi.org/10.1145/3625549.3658678

Data-parallel deep neural networks (DNN) training systems deployed across nodes have been widely used in various domains, while the system performance is often bottlenecked by the communication overhead among workers for synchronizing gradients. Top-k ...

research-article

Open Access

Striking Trade-off Between High Performance and Energy Efficiency in an Edge Computing Application for Detecting Floating Plastic Debris

FRAME '24: Proceedings of the 4th Workshop on Flexible Resource and Application Management on the Edge, Pages 51–58, https://doi.org/10.1145/3659994.3660309

The Edge Computing environments facilitate the creation of pervasive applications distributed across vast geographical regions, addressing particular challenges associated with centralized information processing, such as network bandwidth saturation and ...

research-article

Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

ACM Transactions on Mathematical Software (TOMS), Volume 50, Issue 1, Article No.: 6, Pages 1–34, https://doi.org/10.1145/3638532

We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS, and OpenBLAS, to obtain high-performance ...

research-article

Tackling the Matrix Multiplication Micro-kernel Generation with Exo

CGO '24: Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, Pages 182–192, https://doi.org/10.1109/CGO57630.2024.10444883

The optimization of the matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its widespread use in a ...

research-article

Towards High-Performance Graph Processing: From a Hardware/Software Co-Design Perspective

Journal of Computer Science and Technology (JCST), Volume 39, Issue 2, Pages 245–266, https://doi.org/10.1007/s11390-024-4150-0

Abstract

Graph processing has been widely used in many scenarios, from scientific computing to artificial intelligence. Graph processing exhibits irregular computational parallelism and random memory accesses, unlike traditional workloads. Therefore, ...

research-article

Open Access

Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 38, Issue 2, Pages 55–68, https://doi.org/10.1177/10943420231157653

Malleability is defined as the ability to vary the degree of parallelism at runtime, and is regarded as a means to improve core occupation on state-of-the-art multicore processors that contain tens of computational cores per socket. This property is ...

research-article

Fast truncated SVD of sparse and dense matrices on graphics processors

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 37, Issue 3-4, Pages 380–393, https://doi.org/10.1177/10943420231179699

We investigate the solution of low-rank matrix approximation problems using the truncated singular value decomposition (SVD). For this purpose, we develop and optimize graphics processing unit (GPU) implementations for the randomized SVD and a blocked ...

abstract

Architectural Support for Efficient Data Movement in Fully Disaggregated Systems

SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Pages 5–6, https://doi.org/10.1145/3578338.3593533

Traditional data centers include monolithic servers that tightly integrate CPU, memory and disk (Figure 1a). Instead, Disaggregated Systems (DSs) [8, 13, 18, 27] organize multiple compute (CC), memory (MC) and storage devices as independent, failure-...

Also Published in:

ACM SIGMETRICS Performance Evaluation Review: Volume 51 Issue 1

Article

Performance Analysis of BERT on RISC-V Processors with SIMD Units

High Performance Computing. ISC High Performance 2024 International Workshops, Pages 325–338, https://doi.org/10.1007/978-3-031-73716-9_23

Abstract

Following the recent advances in open hardware generally, and RISC-V architectures particularly, we analyse the performance of transformer encoder inference on three low-power platforms with this type of architecture. For this purpose, we conduct ...

research-article

A Survey of Approximate Computing: From Arithmetic Units Design to High-Level Applications

Journal of Computer Science and Technology (JCST), Volume 38, Issue 2, Pages 251–272, https://doi.org/10.1007/s11390-023-2537-y

Abstract

Realizing a high-performance and energy-efficient circuit system is one of the critical tasks for circuit designers. Conventional researchers always concentrated on the tradeoffs between the energy and the performance in circuit and system design ...

research-article

DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 7, Issue 1, Article No.: 16, Pages 1–36, https://doi.org/10.1145/3579445

Resource disaggregation offers a cost effective solution to resource scaling, utilization, and failure-handling in data centers by physically separating hardware devices in a server. Servers are architected as pools of processor, memory, and storage ...

research-article

MPSoC design and implementation using microblaze soft core processor architecture for faster execution of arithmetic application

International Journal of High Performance Systems Architecture (IJHPSA), Volume 11, Issue 3, Pages 156–168, https://doi.org/10.1504/ijhpsa.2023.130214

The research paper presents the design methodology with novel task distribution technique on multi-processor system on chip (MPSoC) for speeding up the execution of arithmetic application. Utilisation of multiple soft core processors on field programmable ...

research-article

Efficient and stable quorum-based log replication and replay for modern cluster-databases

Frontiers of Computer Science: Selected Publications from Chinese Universities (FCS), Volume 16, Issue 5, https://doi.org/10.1007/s11704-020-0210-y

Abstract

The modern in-memory database (IMDB) can support highly concurrent on-line transaction processing (OLTP) workloads and generate massive transactional logs per second. Quorum-based replication protocols such as Paxos or Raft have been widely used ...

research-article

Scalable low-rank factorization using a task-based runtime system with distributed memory

PASC '22: Proceedings of the Platform for Advanced Scientific Computing Conference, Article No.: 8, Pages 1–11, https://doi.org/10.1145/3539781.3539791

We present a new parallel task-based algorithm for randomized low-rank factorizations of a matrix and its application to fast hierarchical solvers. TaskTorrent is a lightweight, distributed, task-based runtime in C++. We explain how the randomized ...

research-article

Public Access

Algorithm 1022: Efficient Algorithms for Computing a Rank-Revealing UTV Factorization on Parallel Computing Architectures

ACM Transactions on Mathematical Software (TOMS), Volume 48, Issue 2, Article No.: 21, Pages 1–42, https://doi.org/10.1145/3507466

Randomized singular value decomposition (RSVD) is by now a well-established technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin RSVD, the recently proposed algorithm “randUTV” ...
