Abstract
Bioinformatics pipelines are an integral part of next-generation sequencing. Despite the rapid development of open source software for data analysis, assembling these tools into bioinformatics pipelines for sequencing analysis remains a challenging and time-consuming task for academic research institutions and clinical laboratories. Substantial bioinformatics expertise is required to select appropriate analytical software tools, big data storage solutions and cloud infrastructure to manage the large amount of biological data generated by experimental high-throughput technologies. We propose a bioinformatics pipeline framework for DNA sequencing analysis. The framework enables rapid and efficient deployment of the workflow to institutions and laboratories, and relies on virtual machine technologies to produce reproducible results. It supports both reference-based assembly and de novo assembly (without a reference genome) for disease studies. The pipeline is flexible and offers three DNA sequencing approaches: whole-genome, whole-exome and targeted sequencing. It considers both whole-genome and whole-exome sequencing to yield significant analysis results while retaining high positive predictions. If the analysis fails, or if the results admit too many interpretations for researchers to choose among, the pipeline falls back on targeted resequencing. It supports functional, structural and statistical analyses. Because of disparate data sources, storage requirements and the need for scalable analysis of biological data, the pipeline uses big data technologies for storage and management. It can also be deployed on the cloud, giving access without the overhead of investing in additional hardware.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Driouche, R. (2020). Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis. In: Ezziyyani, M. (eds) Advanced Intelligent Systems for Sustainable Development (AI2SD’2019). AI2SD 2019. Advances in Intelligent Systems and Computing, vol 1103. Springer, Cham. https://doi.org/10.1007/978-3-030-36664-3_43
DOI: https://doi.org/10.1007/978-3-030-36664-3_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36663-6
Online ISBN: 978-3-030-36664-3
eBook Packages: Intelligent Technologies and Robotics (R0)