DOI: 10.1145/2822332.2822336

Contemporary challenges for data-intensive scientific workflow management systems

Published: 15 November 2015

Abstract

Data-intensive sciences now represent the forefront of scientific computing. To handle this 'Big Data' focus, scientists demand enabling technologies that can adapt to the increasingly distributed, collaborative, and exploratory scientific milieu. However, how these challenges have changed the design requirements of scientific workflow management systems (SWMSs) has not been assessed. First, we determined how scientists currently use SWMSs through a comprehensive usage survey of 1455 research publications from 2013 to July 31, 2015. Second, to understand how data-intensive scientists are producing impactful research, we examined usage of two major research clouds, the Open Science Data Cloud (OSDC) and Cornell's Red Cloud. Here, we present a road map for SWMS development for data-intensive sciences. SWMSs are now needed that interconnect diverse software packages while enabling data exploration and multi-user interaction across distributed software and hardware environments.

Published In

WORKS '15: Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science
November 2015
98 pages
ISBN:9781450339896
DOI:10.1145/2822332
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data
  2. cloud computing
  3. data-intensive science
  4. enabling technology
  5. scientific workflow management systems
  6. workflow

Qualifiers

  • Research-article

Conference

SC15
Acceptance Rates

WORKS '15 Paper Acceptance Rate 9 of 13 submissions, 69%;
Overall Acceptance Rate 30 of 54 submissions, 56%
