DOI: 10.1145/1217935.1217958

On the road to recovery: restoring data after disasters

Published: 18 April 2006

    Abstract

    Restoring data operations after a disaster is a daunting task: how should recovery be performed to minimize data loss and application downtime? Administrators are under considerable pressure to recover quickly, so they lack time to make good scheduling decisions. They schedule recovery based on rules of thumb, or on pre-determined orders that might not be best for the failure occurrence. With multiple workloads and recovery techniques, the number of possibilities is large, so the decision process is not trivial.

    This paper makes several contributions to the area of data recovery scheduling. First, we formalize the description of potential recovery processes by defining recovery graphs. Recovery graphs explicitly capture alternative approaches for recovering workloads, including their recovery tasks, operational states, timing information and precedence relationships. Second, we formulate the data recovery scheduling problem as an optimization problem, where the goal is to find the schedule that minimizes the financial penalties due to downtime, data loss and vulnerability to subsequent failures. Third, we present several methods for finding optimal or near-optimal solutions, including priority-based, randomized and genetic algorithm-guided ad hoc heuristics. We quantitatively evaluate these methods using realistic storage system designs and workloads, and compare the quality of the algorithms' solutions to optimal solutions provided by a math programming formulation and to the solutions from a simple heuristic that emulates the choices made by human administrators. We find that our heuristics' solutions improve on the administrator heuristic's solutions, often approaching or achieving optimality.
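
    To make the idea of a priority-based heuristic concrete, the sketch below shows one very simplified form it could take. It is not the paper's algorithm: the abstract does not specify the recovery graph encoding or the priority function, so the Task fields, the single shared recovery resource, the per-hour penalty rates and the priority rule (penalty rate divided by the workload's outstanding recovery hours) are all assumptions made for illustration. Only downtime penalties are modeled; data loss and vulnerability penalties are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # All fields are hypothetical; the paper's recovery graphs carry more detail.
    name: str          # recovery task, e.g. restoring a volume from backup
    workload: str      # workload this task helps bring back online
    hours: float       # assumed task duration (must be positive)
    preds: list = field(default_factory=list)  # names of tasks that must finish first


def priority_schedule(tasks, penalty_rate):
    """Greedy list scheduling on one shared recovery resource (an assumption)."""
    remaining = {}  # workload -> hours of recovery work still outstanding
    for t in tasks:
        remaining[t.workload] = remaining.get(t.workload, 0.0) + t.hours

    pending = {t.name: t for t in tasks}
    done, order, online_at, clock = set(), [], {}, 0.0
    while pending:
        # A task is ready once every one of its predecessors has completed.
        ready = [t for t in pending.values() if all(p in done for p in t.preds)]
        if not ready:
            raise ValueError("precedence relationships contain a cycle")
        # Assumed priority: penalty rate per outstanding hour of the workload.
        best = max(ready, key=lambda t: penalty_rate[t.workload] / remaining[t.workload])
        clock += best.hours
        remaining[best.workload] -= best.hours
        done.add(best.name)
        order.append(best.name)
        del pending[best.name]
        # A workload counts as back online when its last task finishes.
        online_at[best.workload] = clock

    total = sum(penalty_rate[w] * online_at[w] for w in online_at)
    return order, total


# Made-up example: two workloads with different downtime costs per hour.
tasks = [
    Task("restore_reports", "reports", 2.0),
    Task("restore_orders", "orders", 4.0),
    Task("replay_orders_log", "orders", 1.0, ["restore_orders"]),
]
penalty_rate = {"orders": 100.0, "reports": 10.0}
order, total = priority_schedule(tasks, penalty_rate)
print(order, total)
```

    In this made-up instance the expensive "orders" workload is recovered first, giving a downtime penalty of 570, whereas a fixed order that restores "reports" first would cost 720. A greedy rule like this carries no optimality guarantee, which is consistent with the abstract's comparison of heuristics against a math programming formulation.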

    Published In

    EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
    April 2006
    420 pages
    ISBN: 1595933220
    DOI: 10.1145/1217935

    ACM SIGOPS Operating Systems Review, Volume 40, Issue 4
    Proceedings of the 2006 EuroSys conference
    October 2006
    383 pages
    ISSN: 0163-5980
    DOI: 10.1145/1218063

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. backup/restore
    2. data storage
    3. disaster recovery
    4. genetic algorithms
    5. management
    6. math programming
    7. optimization
    8. scheduling

    Qualifiers

    • Article

    Conference

    EUROSYS06: Eurosys 2006 Conference
    April 18 - 21, 2006
    Leuven, Belgium

    Acceptance Rates

    Overall Acceptance Rate 241 of 1,308 submissions, 18%

    Cited By

    • (2021) Evaluation of Parallel Data Restoration for Metro Area Distributed Storage System with Assumption of Storage Node Restart Trend under Large-Scale Disaster. IEEJ Transactions on Electronics, Information and Systems 141:3 (483-493). DOI: 10.1541/ieejeiss.141.483. Online publication date: 1-Mar-2021
    • (2020) Carver. Proceedings of the 18th USENIX Conference on File and Storage Technologies (43-58). DOI: 10.5555/3386691.3386697. Online publication date: 24-Feb-2020
    • (2015) Priority with adoptive data migration in case of disaster using cloud computing use style. 2015 International Conference on Communication, Information & Computing Technology (ICCICT) (1-6). DOI: 10.1109/ICCICT.2015.7045697. Online publication date: Jan-2015
    • (2015) Experiences with Building Disaster Recovery for Enterprise-Class Clouds. Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (231-238). DOI: 10.1109/DSN.2015.53. Online publication date: 22-Jun-2015
    • (2014) Integrated Resiliency Planning in Storage Clouds. IEEE Transactions on Network and Service Management 11:1 (3-14). DOI: 10.1109/TNSM.2013.120713.120349. Online publication date: Mar-2014
    • (2014) BIRDS. IEEE Transactions on Computers 63:6 (1392-1407). DOI: 10.1109/TC.2013.19. Online publication date: 1-Jun-2014
    • (2014) Multi-site data distribution for disaster recovery-A planning framework. Future Generation Computer Systems 41:C (53-64). DOI: 10.1016/j.future.2014.07.007. Online publication date: 1-Dec-2014
    • (2013) Modeling Incast and its Empirical Validation. Analysis of TCP Performance in Data Center Networks (31-65). DOI: 10.1007/978-1-4614-7861-4_3. Online publication date: 5-Oct-2013
    • (2012) Towards a Holistic Approach to Fault Management. Dependability and Computer Engineering (1-10). DOI: 10.4018/978-1-60960-747-0.ch001. Online publication date: 2012
    • (2011) Planning for optimal multi-site data distribution for disaster recovery. Proceedings of the 8th international conference on Economics of Grids, Clouds, Systems, and Services (161-172). DOI: 10.1007/978-3-642-28675-9_12. Online publication date: 5-Dec-2011
