DOI: 10.1145/3295500.3356211

Distributed enhanced suffix arrays: efficient algorithms for construction and querying

Published: 17 November 2019

Abstract

Suffix arrays and suffix trees are fundamental string data structures that underlie many string algorithms, with important applications in computational biology, text processing, and information retrieval. Recent work enables the efficient parallel construction of suffix arrays and trees, requiring at most O(n/p) memory per process in distributed memory.
However, querying these indexes in distributed memory has not been studied extensively. Querying common string indexes such as suffix arrays, enhanced suffix arrays, and the FM-Index requires random accesses into O(n) memory, which becomes prohibitively expensive in distributed-memory settings.
In this paper, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA). We present efficient algorithms for the construction and querying of this distributed data structure, all while requiring only O(n/p) memory per process. We further provide a scalable parallel implementation and demonstrate its performance and scalability.
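
The access pattern behind the querying problem described in the abstract can be illustrated with a small sketch. It is not the DESA algorithm from the paper. It builds a plain suffix array with a naive sort and runs the textbook binary search over it, while recording which process would own each probed entry. The block distribution in `owner`, the process count `p`, and all function names are assumptions made for this illustration only.

```python
from typing import List, Set, Tuple


def build_suffix_array(text: str) -> List[int]:
    """Naive suffix array construction by sorting suffixes (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def owner(index: int, n: int, p: int) -> int:
    """Hypothetical block distribution: the rank holding global index `index`
    when n elements are split into p contiguous blocks of ceil(n/p) elements."""
    block = -(-n // p)  # ceil(n / p)
    return index // block


def search(text: str, sa: List[int], pattern: str, p: int) -> Tuple[List[int], Set[int]]:
    """Find all occurrences of `pattern` by binary search over the suffix array.

    Returns the sorted match positions and the set of ranks whose block would
    be read if `sa` and `text` were block-distributed over p processes.
    Only the first character of each compared suffix is attributed to an owner.
    """
    n, m = len(sa), len(pattern)
    touched: Set[int] = set()

    def bound(strict: bool) -> int:
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            touched.add(owner(mid, n, p))    # random access into SA
            start = sa[mid]
            touched.add(owner(start, n, p))  # random access into the text
            prefix = text[start:start + m]
            go_right = prefix < pattern if strict else prefix <= pattern
            if go_right:
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = bound(strict=True)   # first suffix whose prefix is >= pattern
    last = bound(strict=False)   # first suffix whose prefix is > pattern
    return sorted(sa[first:last]), touched


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    positions, ranks = search(text, sa, "ana", p=3)
    print("suffix array:", sa)
    print("match positions:", positions)
    print("ranks touched by one query:", sorted(ranks))
```

Even on this tiny input, a single query reads entries owned by more than one rank. In a real distributed-memory setting each such read would be a remote access, which is the cost the abstract refers to.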

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 112
  • Downloads (Last 6 weeks): 15
Reflects downloads up to 03 Oct 2024.

Cited By

  • (2023) Managing Healthcare Infodemic by deep learning in providing healthcare services. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3610290. Online publication date: 31-Jul-2023.
  • (2023) Scalable Text Index Construction. Algorithms for Big Data (252-284). DOI: 10.1007/978-3-031-21534-6_14. Online publication date: 18-Jan-2023.
  • (2021) Full-text search engine with suffix index for massive heterogeneous data. Information Systems (101893). DOI: 10.1016/j.is.2021.101893. Online publication date: Sep-2021.
  • (2020) The parallelism motifs of genomic data analysis. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378:2166 (20190394). DOI: 10.1098/rsta.2019.0394. Online publication date: 20-Jan-2020.
