Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis

Driouche, Razika

doi:10.1007/978-3-030-36664-3_43

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1103)

Included in the following conference series:

International Conference on Advanced Intelligent Systems for Sustainable Development

Abstract

Bioinformatics pipelines are an integral part of next-generation sequencing. Despite the rapid development of open source software for data analysis, building bioinformatics pipelines from these tools for sequencing analysis remains a challenging and time-consuming task for academic research institutions and clinical laboratories. It requires substantial bioinformatics expertise to select appropriate analytical software tools, big data storage solutions and cloud infrastructure to manage the large amount of biological data generated by experimental high-throughput technologies. We propose a bioinformatics pipeline framework for DNA sequencing analysis. The pipeline can be deployed rapidly and efficiently to institutions and laboratories, and it yields reproducible results by relying on virtual machine technologies. It supports both reference-based assembly and de novo assembly (without a reference genome) for disease studies. The pipeline is flexible and offers three approaches to DNA sequencing: whole-genome, whole-exome and targeted sequencing. It takes into account both whole-genome and whole-exome sequencing to produce significant analysis results while retaining high positive predictions. If the analysis fails, or if researchers face too many candidate interpretations of the results, the pipeline falls back to targeted resequencing. The supported analyses are functional, structural and statistical. Because of disparate data sources, storage requirements and the need for scalable analysis of biological data, the pipeline uses big data technologies for storage and management. It can also be deployed on the cloud, which gives access without the overhead of investing in additional hardware.
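
The fallback between the three sequencing approaches described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the chapter's implementation: the names `Approach`, `AnalysisResult`, `Runner` and `run_pipeline` are hypothetical, and the rule "fall back to targeted resequencing when neither the whole-genome nor the whole-exome analysis is both successful and conclusive" is one assumed reading of the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Approach(Enum):
    """The three sequencing approaches named in the abstract."""
    WHOLE_GENOME = "whole genome"
    WHOLE_EXOME = "whole exome"
    TARGETED = "targeted resequencing"


@dataclass
class AnalysisResult:
    """Hypothetical outcome record for one analysis run."""
    approach: Approach
    succeeded: bool   # False when the analysis step itself failed
    conclusive: bool  # False when the results are too ambiguous to interpret


# Hypothetical runner type: takes a sample identifier, returns a result.
Runner = Callable[[str], AnalysisResult]


def run_pipeline(sample_id: str, runners: Dict[Approach, Runner]) -> List[AnalysisResult]:
    """Run the whole-genome and whole-exome analyses first, then fall back
    to targeted resequencing only if neither gave a usable result
    (assumed interpretation of the abstract)."""
    results = [
        runners[approach](sample_id)
        for approach in (Approach.WHOLE_GENOME, Approach.WHOLE_EXOME)
    ]
    usable = any(r.succeeded and r.conclusive for r in results)
    if not usable:
        results.append(runners[Approach.TARGETED](sample_id))
    return results
```

In a real deployment each runner would wrap an actual alignment and variant-calling workflow; here the runners are left abstract so that only the control flow is shown.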

Author information

Authors and Affiliations

National High College of Biotechnology, Taoufik Khaznadar, Constantine, Algeria
Razika Driouche

Corresponding author

Correspondence to Razika Driouche.

Editor information

Editors and Affiliations

Faculty of Sciences and Techniques of Tangier, Abdelmalek Essaâdi University, Tangier, Morocco
Mostafa Ezziyyani

Cite this paper

Driouche, R. (2020). Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis. In: Ezziyyani, M. (eds) Advanced Intelligent Systems for Sustainable Development (AI2SD’2019). AI2SD 2019. Advances in Intelligent Systems and Computing, vol 1103. Springer, Cham. https://doi.org/10.1007/978-3-030-36664-3_43

DOI: https://doi.org/10.1007/978-3-030-36664-3_43
Published: 06 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36663-6
Online ISBN: 978-3-030-36664-3
eBook Packages: Intelligent Technologies and Robotics (R0)