DOI: 10.1145/2628071.2628108

Versatile and scalable parallel histogram construction

Published: 24 August 2014

Abstract

Histograms are used in various fields to quickly profile the distribution of large amounts of data. However, it is challenging to efficiently utilize the abundant parallel resources of modern processors for histogram construction. To make matters worse, the most efficient implementation varies with input parameters (e.g., input distribution, number of bins, and data type) and with architecture parameters (e.g., cache capacity and SIMD width).
This paper presents versatile histogram methods that achieve competitive performance across a wide range of input types and target architectures. Our open-source implementations are highly optimized for various cases and scale with more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel Xeon Phi coprocessors for common input data sets, thanks to the compute power of their many cores and to instructions, such as gather and scatter, that enable efficient vectorization.
For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi™ 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histograms, or word count, the Xeon processor achieves 342.4 million words per second (MWPS), which is 4.12X and 3.46X faster than Phoenix and TBB, respectively. The Xeon Phi coprocessor achieves 401.4 MWPS, 1.17X faster than the Xeon processor. Since histogram construction captures the essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.

References

[1]
Intel® 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf.
[2]
Adaptive Histogram Template Library. https://github.com/pcjung/AHTL.
[3]
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual. http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf.
[4]
Wikipedia:Database download. http://en.wikipedia.org/wiki/Wikipedia:Database_download.
[5]
S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and It's Done: Interactive Queries on Very Large Data. In International Conference on Very Large Data Bases (VLDB), 2012.
[6]
J. H. Ahn, M. Erez, and W. J. Dally. Scatter-Add in Data Parallel Architectures. In International Symposium on High-Performance Computer Architecture (HPCA), 2005.
[7]
A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson/Addison Wesley, 2007.
[8]
G. A. Baxes. Digital Image Processing: Principles and Applications. Wiley, 1994.
[9]
L. D. Brown, T. T. Cai, and A. DasGupta. Interval estimation for a binomial proportion. Statistical Science, 16(2):101--133, 2001.
[10]
S. Brown and J. Snoeyink. Modestly faster histogram computations on GPUs. In Innovative Parallel Computing (InPar), 2012.
[11]
J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-Record Synthetic Databases. In International Conference on Management of Data (SIGMOD), 1994.
[12]
C. Gregg and K. Hazelwood. Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2011.
[13]
B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce Framework on Graphics Processors. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 260--269, 2008.
[14]
A. Heinecke, K. Vaidyanathan, M. Smelyanskiy, A. Kobotov, R. Dubtsov, G. Henry, A. G. Shet, G. Chrysos, and P. Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi™ Coprocessor. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2013.
[15]
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of computational and graphical statistics, 5(3), 1996.
[16]
P. Kankowski. Hash functions: An empirical comparison. http://www.strchr.com/hash_functions.
[17]
C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In International Conference on Management of Data (SIGMOD), 2010.
[18]
C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. D. Blas, and P. Dubey. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. In International Conference on Very Large Data Bases (VLDB), 2009.
[19]
C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani. CloudRAMSort: Fast and Efficient Large-Scale Distributed RAM Sort on Shared-Nothing Cluster. In International Conference on Management of Data (SIGMOD), 2012.
[20]
S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. Atomic Vector Operations on Chip Multiprocessors. In International Symposium on Computer Architecture (ISCA), pages 441--452, 2008.
[21]
P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi. Scale-Out Processors. In International Symposium on Computer Architecture (ISCA), 2012.
[22]
J. Park, P. T. P. Tang, M. Smelyanskiy, D. Kim, and T. Benson. Efficient Backprojection-based Synthetic Aperture Radar Computation with Many-core Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[23]
V. Podlozhnyuk. Histogram calculation in CUDA. http://docs.nvidia.com/cuda/samples/3_Imaging/histogram/doc/histogram.pdf.
[24]
K. Pearson. Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London, 186, 1895.
[25]
V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. In International Conference on Management of Data (SIGMOD), 1996.
[26]
T. Rantalaiho. Generalized Histograms for CUDA-capable GPUs. https://github.com/trantalaiho/Cuda-Histogram.
[27]
L. Rauchwerger and D. A. Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. IEEE Transactions on Parallel and Distributed Systems, 10(2), 1999.
[28]
J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2007.
[29]
J. Reinders. Transactional Synchronization in Haswell. http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell.
[30]
J. W. Romein. An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs. In International Conference on Supercomputing (ICS), 2012.
[31]
N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In International Conference on Management of Data (SIGMOD), 2010.
[32]
R. Shams and R. A. Kennedy. Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices. In International Conference on Signal Processing and Communication Systems, 2007.
[33]
S. Taylor. Optimizing Applications for Multi-Core Processors, Using the Intel Integrated Performance Primitives. Intel Press, 2007.
[34]
W. Wang, H. Jiang, H. Lu, and J. X. Yu. Bloom Histogram: Path Selectivity Estimation for XML Data with Updates. In International Conference on Very Large Data Bases (VLDB), 2004.
[35]
R. M. Yoo, A. Romano, and C. Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System. In IEEE International Symposium on Workload Characterization (IISWC), pages 198--207, 2009.

Cited By

  • (2024) Accelerating Huffman Encoding Using 512-Bit SIMD Instructions. IEEE Transactions on Consumer Electronics, 70(1), 554-563. DOI: 10.1109/TCE.2023.3347229
  • (2019) Data-parallel flattening by expansion. Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, 14-24. DOI: 10.1145/3315454.3329955
  • (2018) Performance of Map-Reduce Using Java-8 Parallel Streams. Intelligent Computing, 723-736. DOI: 10.1007/978-3-030-01174-1_55
  • (2017) Fast segmented sort on GPUs. Proceedings of the International Conference on Supercomputing, 1-10. DOI: 10.1145/3079079.3079105
  • (2016) Optimizing Indirect Memory References with milk. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 299-312. DOI: 10.1145/2967938.2967948
  • (2016) Parallel Histogram Calculation for FPGA. 2016 IEEE 6th International Conference on Advanced Computing (IACC), 774-777. DOI: 10.1109/IACC.2016.148
  • (2015) Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. Proceedings of the 48th International Symposium on Microarchitecture, 13-25. DOI: 10.1145/2830772.2830774

Published In

PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation
August 2014
514 pages
ISBN: 9781450328098
DOI: 10.1145/2628071
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. algorithms
  2. histogram
  3. multi-core
  4. performance
  5. simd

Qualifiers

  • Research-article

Conference

PACT '14
Sponsors:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 Paper Acceptance Rate: 54 of 144 submissions, 38%
Overall Acceptance Rate: 121 of 471 submissions, 26%
