Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

A hybrid MPI/OpenMP parallel implementation of NSGA-II for finding patterns in protein sequences

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

Since the late 1970s, when the first DNA-based genome was sequenced, the field of biology is experiencing a significant growth in the amount of data that needs to be processed. Long ago it became impractical to analyze all this information manually, resulting in a great need for new techniques, algorithms and strategies to facilitate this work. Within the vast world of bioinformatics, we will focus on proteomics, more specifically, on the discovery of small repeated common patterns on sets of protein sequences that may represent some biological functionality. When we analyze a large number of sequences, the problem shows non-deterministic polynomial times, it implies that we could benefit from the combination of high-performance computing and computational intelligence techniques. In this paper, we address the discovery of repeated common patterns as a multiobjective optimization problem by means of a hybrid MPI/OpenMP approach which parallelizes a well-known multiobjective metaheuristic, the fast non-dominated sorting genetic algorithm (NSGA-II). Our main objective is to combine the benefits of shared-memory and distributed-memory programming paradigms to discover patterns in an accurate and efficient manner. Experiments conducted on six different datasets, comparisons with other well-known biological tools, and the obtained speed-ups and efficiencies show that our approach is able to achieve a significant performance in terms of parallel and biological results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  1. Adhianto L, Chapman B (2007) Performance modeling of communication and computation in hybrid MPI and OpenMP applications. Simul Model Pract Theory 15(4):481–491

    Article  Google Scholar 

  2. Anderson NL, Anderson NG (1998) Proteome and proteomics: new technologies, new concepts, and new words. Electrophoresis 19(11):1853–1861

    Article  Google Scholar 

  3. Bailey TL, Bodn M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS (2009) MEME SUITE: tools for motif discovery and searching. Nucl Acids Res 37(2):W202–W208

    Article  Google Scholar 

  4. Bork P, Koonin EV (1996) Protein sequence motifs. Curr Opin Struct Biol 6(3):366–376

    Article  Google Scholar 

  5. Chan TK, Leung KS, Lee KH (2008) TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics 24(3):341–349

    Article  Google Scholar 

  6. Chan TK, Li G, Leung KS, Lee KH (2009) Discovering multiple realistic TFBS motifs based on a generalized model. BMC Bioinform 10:321

  7. Chapman B, Jost G, van der Pas R (2007) Using OpenMP: portable shared memory parallel programming. The MIT Press, Cambridge ISBN: 978-0262533027

    Google Scholar 

  8. Che D, Song Y, Rashedd K (2005) MDGA: Motif discovery using a genetic algorithm. In: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO’05), pp 447–452

  9. Chen C, Schmidt B, Weiguo L, Mller-Wittig W (2008) GPU-MEME: using graphics hardware to accelerate motif finding in DNA sequences. Pattern Recognit Bioinform LNCS 5265:448–459

    Article  Google Scholar 

  10. Coello Coello, CA, Lamont GB, Veldhuizen DA (2007) Evolutionary algorithms for solving multi-objective problems., 2nd edn. Springer-Verlag, New York ISBN: 978-0-387-33254-3

    MATH  Google Scholar 

  11. Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: a sequence logo generator. Genome Res 14:1188–1190

    Article  Google Scholar 

  12. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197

    Article  Google Scholar 

  13. Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from incomplete data via the EM algorithm (with Discussion). J R Stat Soc Ser B 39(1):1–38

    MATH  Google Scholar 

  14. Eskin E, Pevzner PA (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics 18(Suppl 1):S354–S363

    Article  Google Scholar 

  15. Favorov AV, Gelfand MS, Gerasimova AV, Ravcheev DA, Mironov AA, Makeev VJ (2005) A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics 21(10):2240–2245

    Article  Google Scholar 

  16. Fogel GB, Porto VW, Varga G, Dow ER, Crave AM, Powers DM, Harlow HB, Su EW, Onyia JE, Su C (2008) Evolutionary computation for discovery of composite transcription factor binding sites. Nucl Acids Res 36(21):e142, 1–14

  17. Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB, Onyia JE, Su C (2004) Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucl Acids Res 32(13):3826–3835

    Article  Google Scholar 

  18. Frith MC, Hansen U, Spouge JL, Weng Z (2004) Finding functional sequence elements by multiple local alignment. Nucl Acids Res 32(1):189–200

    Article  Google Scholar 

  19. Grama A, Gupta A, Karypis G, Kumar V (2003) Introduction to parallel computing, 2nd edn. Pearson Education Limited, Edinburgh

    MATH  Google Scholar 

  20. Gropp W, Lusk W, Skjellum A (1999) Using MPI: portable parallel programming with the message passing interface, 2nd edn. The MIT Press, Cambridge ISBN: 0-262-57132-3

    MATH  Google Scholar 

  21. Grundy WN, Bailey TL, Elkan CP (1996) ParaMEME: a parallel implementation and a web interface for a dna and protein motif discovery tool. Comput Appl Biosci 12(4):303–310

    Google Scholar 

  22. Hertz GZ, Stormo GD (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15(7–8):563–577

    Article  Google Scholar 

  23. Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor

    Google Scholar 

  24. Hughes JD, Estep PW, Tavazoie S, Church GM (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296(5):1205–1214

    Article  Google Scholar 

  25. James P (1997) Protein identification in the post-genome era: the rapid rise of proteomics. Q Rev Biophys 30(4):279–331

    Article  Google Scholar 

  26. Li M, Ma B, Wang L (2002) Finding similar regions in many sequences. J Comput Syst Sci 65(1):73–96

    Article  MathSciNet  MATH  Google Scholar 

  27. Liu FFM, Tsai JJP, Chen RM, Chen SN, Shih SH (2004) FMGA: finding motifs by genetic algorithm. In: Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE’04), pp 459–466

  28. Liu Y, Schmidt B, Liu W, Maskell DL (2010) CUDA-MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognit Lett 31(14):2170–2177

    Article  Google Scholar 

  29. Liu Y, Schmidt B, Maskell DL (2011) An ultrafast scalable many-core motif discovery algorithm for multiple GPUs. In: IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp 428–434

  30. Lones MA, Tyrrell AM (2007) Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE/ACM Trans Comput Biol Bioinform 4(3):403–414

    Article  Google Scholar 

  31. Notredame C, Higgins DG (1996) SAGA: sequence alignment by genetic algorithm. Nucl Acids Res 24(8):1515–1524

    Article  Google Scholar 

  32. Pavesi G, Mereghetti P, Zambelli F, Stefani M, Mauri G, Pesole G (2006) MoD Tools: regulatory motif discovery in nucleotide sequences from co-regulated or homologous genes. Nucl Acids Res 34:W566–W570

    Article  Google Scholar 

  33. Qin J, Pinkenburg S, Rosenstiel W (2005) Parallel motif search using ParSEQ. In: IASTED International Conference on Parallel and Distributed Computing and Networks, pp 601–607

  34. Regnier M, Denise A (2004) Rare events and conditional events on random strings. Discret Math Theoret Comput Sci 6(2):191–214

    MathSciNet  MATH  Google Scholar 

  35. Sandve GK, Nedland M, Syrstad OB, Eidsheim LA, Abul O, Drablos F (2006) Accelerating motif discovery: Motif matching on parallel hardware. Algorithms Bioinform LNCS 4175:197–206

    Article  Google Scholar 

  36. Schröder J, Wienbrandt L, Pfeiffer G, Schimmler M (2008) Massively parallelized DNA motif search on the reconfigurable hardware platform COPACOBANA. In: Proceedings of the Third IAPR International Conference on Pattern Recognition in Bioinformatics, pp 436–447

  37. Shao L, Chen Y (2009) Bacterial foraging optimization algorithm integrating tabu search for motif discovery. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM’09), pp 415–418

  38. Shao L, Chen Y, Abraham A (2009) Motif discovery using evolutionary algorithms. In: International Conference of Soft Computing and Pattern Recognition (SOCPAR’09), pp 420–425

  39. Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L., Xenarios I (2012) New and continuing developments at PROSITE. Nucl Acids Res 41(Database issue): D344–D347

  40. Sinha S, Tompa M (2003) YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucl Acids Res 31(13):3586–3588

    Article  Google Scholar 

  41. Stine M, Dasgupta D, Mukatira S (2003) Motif discovery in upstream sequences of coordinately expressed genes. 2003 Congress Evol Comput (CEC’03) 3:1596–1603

  42. Sutou T, Tamura K, Mori Y, Kitakami H (2003) Design and implementation of parallel modified prefixspan method. Int Sympos High Perform Comput 2858:412–422

    Article  Google Scholar 

  43. Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé P, Moreau Y (2001) A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17(12):1113–1122

    Article  Google Scholar 

  44. Thompson WA, Newberg LA, Conlan S, McCue LA, Lawrence CE (2007) The gibbs centroid sampler. Nucl Acids Res 35(Web Server issue):W232–W237

  45. van Helden J, Andre B, Collado-Vides J (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mole Biol 281(5):827–842

    Article  Google Scholar 

  46. Wei Z, Jensen S (2006) GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics 22(13):1577–1584

    Article  Google Scholar 

  47. Workman CT, Stormo GD (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In: Pacifc symposium on biocomputing, pp 467–478

  48. Yang JY, Yang MQ, Zhu M, Arabnia HR, Deng Y (2008) Promoting synergistic research and education in genomics and bioinformatics. BMC Genom 9(Suppl 1):I1

    Article  Google Scholar 

  49. Yang JY, Yang MQ, Arabnia HR, Deng Y (2008) Review: genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research. BMC Genom 9(Suppl 2):I1

    Article  Google Scholar 

  50. Yang MQ, Athey BD, Arabnia HR, Sung AH, Liu Q, Yang JY, Mao J, Deng Y (2009) High-throughput next-generation sequencing technologies foster new cutting-edge computing techniques in bioinformatics. BMC Genom 10(1)

  51. Yu L, Xu Y (2009) A parallel gibbs sampling algorithm for motif finding on gpu. In: IEEE International Symposium on Parallel and Distributed Processing with Applications, pp 555–558

  52. Zitzler E, Thiele L (1999) Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach. IEEE Trans Evol Comput 3(4):257–271

    Article  Google Scholar 

Download references

Acknowledgements

This work was partially funded by the Spanish Ministry of Economy and Competitiveness and the ERDF (European Regional Development Fund), under the contract TIN2016-76259-P (PROTEIN project). Álvaro Rubio-Largo is also supported by the post-doctoral fellowship SFRH/BPD/100872/2014 granted by Fundação para a Ciência e a Tecnologia (FCT), Portugal.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to David L. González-Álvarez.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

González-Álvarez, D.L., Vega-Rodríguez, M.A. & Rubio-Largo, Á. A hybrid MPI/OpenMP parallel implementation of NSGA-II for finding patterns in protein sequences. J Supercomput 73, 2285–2312 (2017). https://doi.org/10.1007/s11227-016-1916-3

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-016-1916-3

Keywords