
Adaptive Message Logging for Incremental Program Replay

Published: 01 November 1993

Abstract

This article presents adaptive message logging, a technique that traces dependences between messages and checkpoints and logs messages selectively, so that users can replay specific portions of parallel programs accurately and efficiently. Traces are kept small by logging only those messages that cannot be quickly recomputed during replay. By restarting execution from the right set of checkpoints, many of the messages needed for a given replay can be recomputed during the replay itself.
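The core idea, logging a message only when regenerating it during replay would be too expensive, can be sketched as follows. This is a minimal illustration under assumptions of its own, not necessarily the article's algorithm: replay cost is measured as a count of re-executed events, each process's checkpoints are given as sorted event indices starting at 0, and the names `Message`, `choose_logged_messages`, and `max_replay_events` are hypothetical. A message's cost is the number of events its sender re-executes from its latest checkpoint, plus the cost of regenerating any unlogged messages the sender consumed in that interval, which is how dependences between messages and checkpoints enter the decision.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: str
    sender: str
    send_event: int  # index of the send event in the sender's local history
    receiver: str
    recv_event: int  # index of the receive event in the receiver's local history


def latest_checkpoint(checkpoints: list[int], event: int) -> int:
    """Return the last checkpoint taken at or before `event` (list is sorted, starts at 0)."""
    return checkpoints[bisect_right(checkpoints, event) - 1]


def choose_logged_messages(messages: list[Message],
                           checkpoints: dict[str, list[int]],
                           max_replay_events: int) -> set[str]:
    """Hypothetical policy: log a message iff recomputing it would exceed the bound."""
    # Group messages by receiver so we can find what a sender consumed before a send.
    incoming: dict[str, list[Message]] = {}
    for m in messages:
        incoming.setdefault(m.receiver, []).append(m)

    effective_cost: dict[str, int] = {}  # events needed to regenerate a message (0 if logged)
    logged: set[str] = set()

    def decide(m: Message) -> int:
        if m.msg_id in effective_cost:
            return effective_cost[m.msg_id]
        start = latest_checkpoint(checkpoints[m.sender], m.send_event)
        cost = m.send_event - start  # sender re-executes from its checkpoint to the send
        for dep in incoming.get(m.sender, []):
            # Unlogged inputs consumed in [start, send) must be regenerated as well.
            # Recursion terminates because a valid trace has no causal cycles.
            if start <= dep.recv_event < m.send_event:
                cost += decide(dep)
        if cost > max_replay_events:
            logged.add(m.msg_id)  # too expensive to recompute: log it instead
            cost = 0              # logged messages are free to supply during replay
        effective_cost[m.msg_id] = cost
        return cost

    for m in messages:
        decide(m)
    return logged
```

For example, with `checkpoints = {"P": [0, 10], "Q": [0]}` and messages `a` (P at event 12 to Q at event 3), `b` (Q at 20 to P at 25), and `c` (P at 30 to Q at 31), a bound of 25 logs only `c`: regenerating it requires P to replay 20 events and, transitively, Q to regenerate `b` (22 events). Summing dependency costs overestimates when several messages share an input, which keeps the sketch conservative; it also leaves aside the separate question of choosing which checkpoints to restart from.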




Published In

IEEE Parallel & Distributed Technology: Systems & Technology, Volume 1, Issue 4
November 1993
91 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

