Abstract
Scientists often need to run experiments that demand high-performance computing environments and parallel techniques. This is the case for many bioinformatics experiments modeled as scientific workflows, such as phylogenetic and phylogenomic analyses. To execute these experiments, scientists have adopted virtual machines (VMs) instantiated in clouds. Estimating the number of VMs to instantiate is a crucial task, because under- or overestimation harms both execution performance and financial cost. In previous work, the number of VMs needed to execute bioinformatics workflows was estimated by a GRASP heuristic coupled to a Cloud-based Parallel Scientific Workflow Management System. Although that work was a step forward, it provided only a static dimensioning. If the characteristics of the environment change (for example, processing capacity or network speed), a static dimensioning may no longer be suitable, so the dimensioning should be adjusted at runtime. To achieve this, we developed a novel framework for monitoring and dynamically dimensioning resources during the execution of parallel scientific workflows in clouds, called Dynamic Dimensioning of Cloud Computing Framework (DDC-F). We evaluated DDC-F in real executions of bioinformatics workflows. The experiments showed that DDC-F efficiently calculates the number of VMs needed to execute Comparative Genomics (CG) workflows and reduces financial costs compared with other approaches in the related literature.
Cite this article
Coutinho, R., Frota, Y., Ocaña, K. et al. A Dynamic Cloud Dimensioning Approach for Parallel Scientific Workflows: a Case Study in the Comparative Genomics Domain. J Grid Computing 14, 443–461 (2016). https://doi.org/10.1007/s10723-016-9367-x
DOI: https://doi.org/10.1007/s10723-016-9367-x