Abstract
Managing large datasets has become a major application of Grids. Life science applications usually manage large databases that must be replicated for the applications to scale. The growing number of users and the ease of access to Internet-based applications have put Grid middleware under stress. Such environments are thus asked to manage data and schedule computation tasks at the same time, and these two operations have to be tightly coupled. This paper presents an algorithm (Scheduling and Replication Algorithm, SRA) that combines data management and scheduling using a steady-state approach. Using a model of the platform, the number of requests and their distribution, and the number and size of the databases, we define a linear program that satisfies all the constraints at every level of the platform in steady state. The solution of this linear program gives a placement of the databases on the servers and, for each kind of job, the server on which it should be executed. Our theoretical results are validated using simulation and logs from a large life science application.
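To make the idea of a joint placement and scheduling program concrete, the block below sketches one plausible shape such a steady-state program could take. It is an illustration only and is not taken from the paper: every symbol is an assumption introduced here. Servers are indexed by i, databases by j, and job kinds by k; w_i is the computing capacity of server i, m_i its storage capacity, s_j the size of database j, a_k the computing cost of one job of kind k, d(k) the database that kind k reads, and f_k the fraction of requests that are of kind k. The unknowns are the placement indicators \delta_{ij} and the steady-state rates n_{ik} at which server i runs jobs of kind k.

```latex
\begin{align}
\text{maximize}\quad & TP = \sum_{i}\sum_{k} n_{ik} \\
\text{subject to}\quad
 & \sum_{j} \delta_{ij}\, s_j \le m_i
   && \forall i \quad \text{(storage on each server)} \\
 & \sum_{k} n_{ik}\, a_k \le w_i
   && \forall i \quad \text{(computing capacity of each server)} \\
 & n_{ik} \le \delta_{i\,d(k)}\, \frac{w_i}{a_k}
   && \forall i, k \quad \text{(a job runs only where its database is stored)} \\
 & \sum_{i} n_{ik} = f_k \cdot TP
   && \forall k \quad \text{(the request distribution is respected)} \\
 & \delta_{ij} \in \{0, 1\}, \quad n_{ik} \ge 0
   && \forall i, j, k
\end{align}
```

In a sketch like this, the \delta_{ij} values describe where each database is placed and the n_{ik} values describe which servers handle each kind of job. Because the placement indicators are binary, the program as written is a mixed-integer one; relaxing \delta_{ij} to the interval [0, 1] and rounding the result is one common way to obtain a pure linear program, again stated here as an assumption rather than as the authors' method.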
Additional information
This work was supported in part by the ACI GRID and Grid5000 projects of the French Department of Research.
About this article
Cite this article
Desprez, F., Vernois, A. Simultaneous Scheduling of Replication and Computation for Data-Intensive Applications on the Grid. J Grid Computing 4, 19–31 (2006). https://doi.org/10.1007/s10723-005-9016-2
DOI: https://doi.org/10.1007/s10723-005-9016-2