DOI: 10.1145/2628071.2628108

Versatile and scalable parallel histogram construction

Published: 24 August 2014

Abstract

Histograms are used in various fields to quickly profile the distribution of large amounts of data. However, it is challenging to efficiently utilize the abundant parallel resources of modern processors for histogram construction. To make matters worse, the most efficient implementation varies with input parameters (e.g., input distribution, number of bins, and data type) and with architecture parameters (e.g., cache capacity and SIMD width).
This paper presents versatile histogram methods that achieve competitive performance across a wide range of input types and target architectures. Our open-source implementations are highly optimized for various cases and scale with more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel Xeon Phi coprocessors for common input data sets, thanks to the compute power of their many cores and to instructions, such as gather and scatter, that enable efficient vectorization.
For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi™ 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histograms, or word count, the Xeon processor achieves 342.4 million words per second (MWPS), which is 4.12X and 3.46X faster than Phoenix and TBB, respectively. The Xeon Phi coprocessor achieves 401.4 MWPS, 1.17X faster than the Xeon processor. Since histogram construction captures the essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.

References

[1]
Intel® 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf.
[2]
Adaptive Histogram Template Library. https://github.com/pcjung/AHTL.
[3]
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual. http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf.
[4]
Wikipedia:Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download.
[5]
S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and It's Done: Interactive Queries on Very Large Data. In International Conference on Very Large Data Bases (VLDB), 2012.
[6]
J. H. Ahn, M. Erez, and W. J. Dally. Scatter-Add in Data Parallel Architectures. In International Symposium on High-Performance Computer Architecture (HPCA), 2005.
[7]
A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson/Addison Wesley, 2007.
[8]
G. A. Baxes. Digital Image Processing: Principles and Applications. Wiley, 1994.
[9]
L. D. Brown, T. T. Cai, and A. DasGupta. Interval estimation for a binomial proportion. Statistical Science, 16(2):101--133, 2001.
[10]
S. Brown and J. Snoeyink. Modestly faster histogram computations on GPUs. In Innovative Parallel Computing (InPar), 2012.
[11]
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-Record Synthetic Databases. In International Conference on Management of Data (SIGMOD), 1994.
[12]
C. Gregg and K. Hazelwood. Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2011.
[13]
B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce Framework on Graphics Processors. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 260--269, 2008.
[14]
A. Heinecke, K. Vaidyanathan, M. Smelyanskiy, A. Kobotov, R. Dubtsov, G. Henry, A. G. Shet, G. Chrysos, and P. Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi™ Coprocessor. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2013.
[15]
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of computational and graphical statistics, 5(3), 1996.
[16]
P. Kankowski. Hash functions: An empirical comparison. http://www.strchr.com/hash_functions.
[17]
C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In International Conference on Management of Data (SIGMOD), 2010.
[18]
C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. D. Blas, and P. Dubey. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. In International Conference on Very Large Data Bases (VLDB), 2009.
[19]
C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani. CloudRAMSort: Fast and Efficient Large-Scale Distributed RAM Sort on Shared-Nothing Cluster. In International Conference on Management of Data (SIGMOD), 2012.
[20]
S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic Vector Operations on Chip Multiprocessors. In International Symposium on Computer Architecture (ISCA), pages 441--452, 2008.
[21]
P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi. Scale-Out Processors. In International Symposium on Computer Architecture (ISCA), 2012.
[22]
J. Park, P. T. P. Tang, M. Smelyanskiy, D. Kim, and T. Benson. Efficient Backprojection-based Synthetic Aperture Radar Computation with Many-core Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[23]
V. Podlozhnyuk. Histogram calculation in CUDA. http://docs.nvidia.com/cuda/samples/3_Imaging/histogram/doc/histogram.pdf.
[24]
K. Pearson. Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London, 186, 1895.
[25]
V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. In International Conference on Management of Data (SIGMOD), 1996.
[26]
T. Rantalaiho. Generalized Histograms for CUDA-capable GPUs. https://github.com/trantalaiho/Cuda-Histogram.
[27]
L. Rauchwerger and D. A. Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. IEEE Transactions on Parallel and Distributed Systems, 10(2), 1999.
[28]
J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2007.
[29]
J. Reinders. Transactional Synchronization in Haswell. http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell.
[30]
J. W. Romein. An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs. In International Conference on Supercomputing (ICS), 2012.
[31]
N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In International Conference on Management of Data (SIGMOD), 2010.
[32]
R. Shams and R. A. Kennedy. Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices. In International Conference on Signal Processing and Communication Systems, 2007.
[33]
S. Taylor. Optimizing Applications for Multi-Core Processors, Using the Intel Integrated Performance Primitives. Intel Press, 2007.
[34]
W. Wang, H. Jiang, H. Lu, and J. X. Yu. Bloom Histogram: Path Selectivity Estimation for XML Data with Updates. In International Conference on Very Large Data Bases (VLDB), 2004.
[35]
R. M. Yoo, A. Romano, and C. Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System. In IEEE International Symposium on Workload Characterization (IISWC), pages 198--207, 2009.

Cited By

  • (2024) Accelerating Huffman Encoding Using 512-Bit SIMD Instructions. IEEE Transactions on Consumer Electronics, 70(1), 554-563. DOI: 10.1109/TCE.2023.3347229
  • (2019) Data-parallel flattening by expansion. Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, 14-24. DOI: 10.1145/3315454.3329955
  • (2018) Performance of Map-Reduce Using Java-8 Parallel Streams. Intelligent Computing, 723-736. DOI: 10.1007/978-3-030-01174-1_55
  • (2017) Fast segmented sort on GPUs. Proceedings of the International Conference on Supercomputing, 1-10. DOI: 10.1145/3079079.3079105
  • (2016) Optimizing Indirect Memory References with milk. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 299-312. DOI: 10.1145/2967938.2967948
  • (2016) Parallel Histogram Calculation for FPGA. 2016 IEEE 6th International Conference on Advanced Computing (IACC), 774-777. DOI: 10.1109/IACC.2016.148
  • (2015) Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. Proceedings of the 48th International Symposium on Microarchitecture, 13-25. DOI: 10.1145/2830772.2830774

Published In

PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation
August 2014
514 pages
ISBN: 9781450328098
DOI: 10.1145/2628071
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. algorithms
  2. histogram
  3. multi-core
  4. performance
  5. simd

Qualifiers

  • Research-article

Conference

PACT '14
Sponsors:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 Paper Acceptance Rate: 54 of 144 submissions, 38%
Overall Acceptance Rate: 121 of 471 submissions, 26%
