research-article

Fundamental parallel algorithms for private-cache chip multiprocessors

Authors: Michael T. Goodrich, Michael Nelson, Nodari Sitchinava

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures

Pages 197 - 206

https://doi.org/10.1145/1378533.1378573

Published: 14 June 2008

Abstract

In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.
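As a rough illustration of one building block named in the abstract, the sketch below computes prefix sums in three phases over contiguous chunks, one chunk per simulated processor. It is a generic textbook scheme, not the algorithm from the paper; the function name `blocked_prefix_sums` and the parameter `num_procs` are assumptions made for this example, and the phases run sequentially here rather than on real cores.

```python
from itertools import accumulate


def blocked_prefix_sums(values, num_procs):
    """Inclusive prefix sums over contiguous chunks (illustrative sketch only).

    Each chunk stands in for the data one processor would scan through
    its own private cache; the phases are simulated sequentially.
    """
    n = len(values)
    if n == 0:
        return []
    chunk = -(-n // num_procs)  # ceil(n / num_procs)
    chunks = [values[i:i + chunk] for i in range(0, n, chunk)]

    # Phase 1: every simulated processor sums its own chunk.
    totals = [sum(c) for c in chunks]

    # Phase 2: exclusive prefix sums over the per-chunk totals.
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: every processor rescans its chunk, shifted by its offset.
    out = []
    for c, off in zip(chunks, offsets):
        out.extend(off + s for s in accumulate(c))
    return out


print(blocked_prefix_sums([1, 2, 3, 4, 5, 6, 7], 3))
```

Each element is read twice (phases 1 and 3), and both passes walk a chunk contiguously, which is the access pattern a block-transfer cost model tends to reward.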

Cited By

Sitchinava N, Svenning R, Agrawal K, Petrank E (2024). The All Nearest Smaller Values Problem Revisited in Practice, Parallel and External Memory. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 259-268. Online publication date: 17-Jun-2024.
https://dl.acm.org/doi/10.1145/3626183.3659979
DeLayo D, Zhang K, Agrawal K, Bender M, Berry J, Das R, Moseley B, Phillips C, Agrawal K, Lee I (2022). Automatic HBM Management. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 147-159. Online publication date: 11-Jul-2022.
https://dl.acm.org/doi/10.1145/3490148.3538570
Berney K, Casanova H, Karsin B, Sitchinava N (2022). Beyond Binary Search: Parallel In-Place Construction of Implicit Search Tree Layouts. IEEE Transactions on Computers, 71:5, pp. 1104-1116. Online publication date: 1-May-2022.
https://doi.org/10.1109/TC.2021.3075392
Index Terms

Fundamental parallel algorithms for private-cache chip multiprocessors
1. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching

Published In

SPAA '08: Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures

June 2008

380 pages

ISBN:9781595939739

DOI:10.1145/1378533

General Chair: Friedhelm Meyer auf der Heide (University of Paderborn, Germany)
Program Chair: Nir Shavit (Tel-Aviv University, Israel, and Sun Labs, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SPAA08

SPAA08: 20th ACM Symposium on Parallelism in Algorithms and Architectures

June 14 - 16, 2008

Munich, Germany

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Article Metrics

Total Citations: 78
Total Downloads: 840
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Nov 2024

Cited By

Panda M, Sajith G (2022). Efficient Parallel Cache-Oblivious Sorting Algorithms. 2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), pp. 1-7. Online publication date: 16-Nov-2022.
https://doi.org/10.1109/ICECCME55909.2022.9988160
Bender M, Conway A, Farach-Colton M, Jannen W, Jiao Y, Johnson R, Knorr E, Mcallister S, Mukherjee N, Pandey P, Porter D, Yuan J, Zhan Y (2021). External-memory Dictionaries in the Affine and PDAM Models. ACM Transactions on Parallel Computing, 8:3, pp. 1-20. Online publication date: 20-Sep-2021.
https://dl.acm.org/doi/10.1145/3470635
Das R, Agrawal K, Bender M, Berry J, Moseley B, Phillips C, Scheideler C, Spear M (2020). How to Manage High-Bandwidth Memory Automatically. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pp. 187-199. Online publication date: 6-Jul-2020.
https://dl.acm.org/doi/10.1145/3350755.3400233
Blelloch G, Gibbons P, Gu Y, McGuffey C, Shun J, Scheideler C, Fineman J (2018). The Parallel Persistent Memory Model. Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, pp. 247-258. Online publication date: 11-Jul-2018.
https://dl.acm.org/doi/10.1145/3210377.3210381
Afrati F, Hidders J, Koutris P, Sroka J, Ullman J (2018). Report from the Fourth Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR '17). ACM SIGMOD Record, 46:4, pp. 44-48. Online publication date: 22-Feb-2018.
https://dl.acm.org/doi/10.1145/3186549.3186561
Ma L, Chamberlain R, Agrawal K, Tian C, Hu Z (2018). Analysis of classic algorithms on highly-threaded many-core architectures. Future Generation Computer Systems, 82, pp. 528-543. Online publication date: May-2018.
https://doi.org/10.1016/j.future.2017.02.007
Dusefante M, Jacob R (2018). Cache Oblivious Sparse Matrix Multiplication. LATIN 2018: Theoretical Informatics, pp. 437-447. Online publication date: 13-Mar-2018.
https://doi.org/10.1007/978-3-319-77404-6_32