QuMan: Profile-based Improvement of Cluster Utilization
Modern data centers consolidate workloads to increase server utilization, reduce total cost of ownership, and cope with scaling limitations. However, server resource sharing introduces performance interference across applications and, consequently, ...
LAPPS: Locality-Aware Productive Prefetching Support for PGAS
Prefetching is a well-known technique to mitigate scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied as either an automated compiler optimization or a manual programmer optimization. Using the PGAS locality ...
BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU
The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed to improve this kernel on recent GPU architectures. However, ...
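For context on what a "sparse format" means for the SpMV kernel, here is a minimal sketch of SpMV over CSR (Compressed Sparse Row), one common format among those a meta-format like BestSF would choose between; this is illustrative only, not the paper's implementation.

```python
# Illustrative SpMV kernel over the CSR format: values, column indices,
# and per-row offsets. Other formats (ELL, COO, HYB, ...) trade off
# storage and memory-access regularity differently on GPUs.

def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
values  = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```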
An Alternative TAGE-like Conditional Branch Predictor
TAGE is one of the most accurate conditional branch predictors known today. However, TAGE does not exploit its input information perfectly, as it is possible to obtain significant prediction accuracy improvements by complementing TAGE with a statistical ...
Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing
Convolutional neural networks (CNNs) are one of the most successful machine-learning techniques for image, voice, and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for ...
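To illustrate why weight-sharing can lower multiply-accumulate complexity: when weights are quantized to a small codebook, inputs that share a weight value can be summed first, leaving only one multiply per codebook entry. This is a hedged, hypothetical sketch of that general idea, not the hardware design proposed in the paper.

```python
# Accumulate-then-multiply dot product under weight-sharing.
# Function and variable names here are illustrative, not from the paper.

def shared_weight_mac(inputs, weight_ids, codebook):
    """Dot product of `inputs` with weights given by codebook[weight_ids[i]]."""
    sums = [0.0] * len(codebook)      # one accumulator per distinct weight
    for x, wid in zip(inputs, weight_ids):
        sums[wid] += x                # additions only, no multiply yet
    # One multiply per codebook entry instead of one per input.
    return sum(s * w for s, w in zip(sums, codebook))

codebook   = [-0.5, 0.25, 1.0]        # 3 shared weight values
weight_ids = [2, 0, 1, 2]             # each input's codebook index
inputs     = [1.0, 2.0, 4.0, 3.0]
# Reference: 1*1.0 + 2*(-0.5) + 4*0.25 + 3*1.0 = 4.0
print(shared_weight_mac(inputs, weight_ids, codebook))  # 4.0
```

With n inputs and a k-entry codebook, this needs n additions but only k multiplies, which is the arithmetic saving a weight-sharing MAC unit exploits in hardware.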
CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems
Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel Loh
To exploit parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques that have been used to hide memory latency and improve thread-level parallelism (TLP), memory ...
Global Dead-Block Management for Task-Parallel Programs
Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. ...
High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach
The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized ...
Cluster Programming using the OpenMP Accelerator Model
Computation offloading is a programming model in which program fragments (e.g., hot loops) are annotated so that their execution is performed in dedicated hardware or accelerator devices. Although offloading has been extensively used to move computation ...
Block Cooperation: Advancing Lifetime of Resistive Memories by Increasing Utilization of Error Correcting Codes
Block-level cooperation is an endurance management technique that operates on top of error correction mechanisms to extend memory lifetimes. Once an error recovery scheme fails to recover from faults in a data block, the entire physical page associated ...
Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures
Due to the popularity of Deep Neural Network (DNN) models, we have witnessed extreme-scale DNN models whose depth and width continue to grow. However, their extremely high memory requirements make it difficult to run ...
Software-Directed Techniques for Improved GPU Register File Utilization
Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming ...
On-GPU Thread-Data Remapping for Branch Divergence Reduction
General Purpose GPU computing (GPGPU) plays an increasingly vital role in high performance computing and other areas such as deep learning. However, the branch divergence issue, which arises from the SIMD execution model, lowers the efficiency of conditional ...
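The core idea behind thread-data remapping can be illustrated with a toy model: within each fixed-width "warp", a branch is divergent when lanes disagree on the predicate, and regrouping work items by branch outcome removes that divergence. This simulation is a sketch under assumed names (`WARP`, `divergent_warps`), not the paper's on-GPU mechanism.

```python
# Toy model: count warps whose lanes take different branch paths,
# before and after remapping (sorting) work items by branch outcome.

WARP = 4  # assumed warp width for the illustration

def divergent_warps(predicates):
    """Number of WARP-sized groups whose predicates are not uniform."""
    count = 0
    for i in range(0, len(predicates), WARP):
        warp = predicates[i:i + WARP]
        if len(set(warp)) > 1:        # lanes disagree -> both paths execute
            count += 1
    return count

work = [7, 1, 8, 2, 9, 3, 10, 4]      # predicate: item > 5
before = divergent_warps([x > 5 for x in work])
remapped = sorted(work, key=lambda x: x > 5)   # group same-outcome items
after = divergent_warps([x > 5 for x in remapped])
print(before, after)  # 2 0
```

In the unsorted order every warp mixes both outcomes, so all lanes serialize through both branch paths; after remapping, each warp is uniform and executes only one path.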