Merging, sorting and matrix operations on the SOME-bus multiprocessor architecture

Author:

Constantine Katsinis

Future Generation Computer Systems, Volume 20, Issue 4

Pages 643 - 661

https://doi.org/10.1016/S0167-739X(03)00129-8

Published: 01 May 2004

Abstract

Due to advances in fiber optics and VLSI technology, interconnection networks that allow multiple simultaneous broadcasts are becoming feasible. This paper presents the multiprocessor architecture of the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), and examines the performance of representative algorithms for matrix operations, merging and sorting, using the message-passing and distributed-shared-memory (DSM) paradigms. It shows that simple enhancements to the network interface and to the cache and directory controllers can result in a communication time of O(1) for the matrix-vector multiplication algorithm using DSM. The SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention, and can efficiently interconnect over 100 nodes. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. Each of the P nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or by contention for shared switching logic. The entire P-receiver array can be integrated on a single chip at comparatively minor cost, resulting in O(P) complexity. The SOME-Bus offers more functionality than a crossbar because it supports multiple simultaneous broadcasts of messages, which allows cache consistency protocols to complete much faster.
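The channel organization described in the abstract (one dedicated output channel per node, and one receiver per channel at every node) can be illustrated with a small toy model. The sketch below is an assumption-laden illustration, not the algorithm or the analysis from the paper: the function names `simultaneous_broadcast` and `matvec_on_broadcast_bus`, the row-block data distribution, and the step counter are all hypothetical. It only shows why a single simultaneous-broadcast step is enough for every node to obtain the whole input vector before computing its share of a matrix-vector product; it ignores message length, latency, and the DSM protocol enhancements the paper discusses.

```python
from typing import List, Tuple


def simultaneous_broadcast(outgoing: List[List[float]]) -> List[List[List[float]]]:
    """Toy model of one broadcast step (hypothetical helper).

    Node j drives only its own channel j, and every node has one receiver
    per channel, so after a single step each node holds every message.
    """
    p = len(outgoing)
    return [[list(outgoing[j]) for j in range(p)] for _ in range(p)]


def matvec_on_broadcast_bus(
    a: List[List[float]], x: List[float], p: int
) -> Tuple[List[float], int]:
    """Compute y = A x with p simulated nodes; return y and the number of
    broadcast steps used. Assumes len(x) is divisible by p."""
    n = len(x)
    assert n % p == 0, "this sketch assumes n is a multiple of p"
    block = n // p

    # Assumed distribution: node k owns rows and vector entries
    # [k*block, (k+1)*block).
    x_blocks = [x[k * block:(k + 1) * block] for k in range(p)]

    # All nodes transmit their vector slice at once on their own channels.
    received = simultaneous_broadcast(x_blocks)
    steps = 1

    y: List[float] = []
    for k in range(p):
        # Node k reassembles the full vector from its p receivers.
        full_x = [v for piece in received[k] for v in piece]
        # Node k computes only the rows it owns.
        for r in range(k * block, (k + 1) * block):
            y.append(sum(a[r][c] * full_x[c] for c in range(n)))
    return y, steps
```

In this model the number of broadcast steps stays at one regardless of how many nodes are simulated, because no node waits for another node's channel. On a network with a single shared channel, the same exchange would need the nodes to take turns.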

Published In

Future Generation Computer Systems Volume 20, Issue 4

Special issue: Advanced services for clusters and internet computing

May 2004

192 pages

ISSN:0167-739X

Publisher

Elsevier Science Publishers B. V.

Netherlands
