Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis

  • Conference paper
Advanced Intelligent Systems for Sustainable Development (AI2SD’2019) (AI2SD 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1103)

Abstract

Bioinformatics pipelines are an integral part of next-generation sequencing. Despite the rapid development of open-source software for data analysis, assembling these tools into bioinformatics pipelines for sequencing analysis remains a challenging and time-consuming task for academic research institutions and clinical laboratories. It requires substantial bioinformatics expertise to select appropriate analytical software tools, big data storage solutions and cloud infrastructure to manage the large amount of biological data generated by experimental high-throughput technologies. We propose a bioinformatics pipeline framework for DNA sequencing analysis. The framework allows the workflow to be deployed rapidly and efficiently to institutions and laboratories, and it relies on virtual machine technologies to make results reproducible. It supports both reference-based and de novo (without a reference genome) assembly for disease studies. The pipeline is flexible and offers three approaches to DNA sequencing: whole-genome, whole-exome and targeted sequencing. It takes both whole-genome and whole-exome sequencing into account to yield significant analysis results while retaining high positive predictions. If the analysis fails, or if researchers face too many candidate results to interpret, the pipeline falls back on targeted resequencing. Three kinds of analysis are supported: functional, structural and statistical. Because of disparate data sources, storage requirements and the need for scalable analysis of biological data, the pipeline uses big data technologies for storage and management and can also be deployed on the cloud, allowing access without the overhead of investing in additional hardware.
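
The fallback between sequencing approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Approach`, `run_pipeline` and the `analyze` callback are hypothetical, and the convention that `analyze` returns `None` for a failed or inconclusive analysis is an assumption made only for this example.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Approach(Enum):
    """The three sequencing approaches named in the abstract."""
    WHOLE_GENOME = "whole genome"
    WHOLE_EXOME = "whole exome"
    TARGETED = "targeted resequencing"


def run_pipeline(
    analyze: Callable[[Approach], Optional[str]],
) -> Dict[Approach, str]:
    """Run whole-genome and whole-exome analysis, then fall back if needed.

    `analyze` is a hypothetical callback: it returns a result for the given
    approach, or None when the analysis fails or is inconclusive.
    """
    results: Dict[Approach, str] = {}

    # First pass: consider both whole-genome and whole-exome sequencing.
    for approach in (Approach.WHOLE_GENOME, Approach.WHOLE_EXOME):
        outcome = analyze(approach)
        if outcome is not None:
            results[approach] = outcome

    # Fallback: explore targeted resequencing only if nothing usable came back.
    if not results:
        outcome = analyze(Approach.TARGETED)
        if outcome is not None:
            results[Approach.TARGETED] = outcome

    return results
```

A caller would pass its own `analyze` function wrapping whatever alignment and variant-calling tools it uses; the sketch only captures the order in which the approaches are tried.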

Author information

Corresponding author

Correspondence to Razika Driouche.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Driouche, R. (2020). Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis. In: Ezziyyani, M. (eds) Advanced Intelligent Systems for Sustainable Development (AI2SD’2019). AI2SD 2019. Advances in Intelligent Systems and Computing, vol 1103. Springer, Cham. https://doi.org/10.1007/978-3-030-36664-3_43
