Abstract

Rollback-One-Step Checkpointing and Reduced Message Logging for Debugging Message-Passing Programs
Nam Thoai - GUP Linz, Joh. Kepler University Linz
Dieter Kranzlmueller - GUP Linz, Joh. Kepler University Linz
Jens Volkert - GUP Linz, Joh. Kepler University Linz
In cyclic debugging, a program is executed over and over again to track
down and eliminate bugs. During re-execution, programmers may want to stop
at breakpoints or apply step-by-step execution for inspecting the program’s
state and detecting errors. For large-scale, long-running parallel programs,
the biggest drawback is the cost of restarting the program’s execution from
the beginning every time. A solution is offered by combining
checkpointing and debugging, which allows debugging to be initiated at any
intermediate checkpoint. A problem is the selection of an appropriate recovery
line for a given breakpoint. The temporal distance between these two points
can be rather long if recovery lines are only chosen at consistent global
checkpoints. The method described in this paper allows an arbitrary checkpoint
to be selected as the starting point for debugging, which shortens this
temporal distance. In addition, a mechanism for reducing the amount of trace
data (in terms of logged messages) is provided. The resulting technique is
able to reduce the waiting time and the costs of cyclic debugging.
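
The idea of restarting each process from its own nearest checkpoint can be illustrated with a small sketch. This is not the algorithm from the paper; it is a simplified model under stated assumptions: all events carry timestamps on a single global clock, each process has a sorted list of checkpoint times, and the names `Message`, `recovery_line`, and `messages_needing_log` are hypothetical. The sketch picks, for every process, the latest checkpoint at or before the breakpoint, and then lists the messages that cross this line (sent before the sender's restart point, received after the receiver's restart point). Such messages are not re-sent during re-execution, so in this model they are the ones that would have to be available from a log.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one point-to-point message (global timestamps assumed).
    sender: int
    send_time: float
    receiver: int
    recv_time: float


def latest_checkpoint_before(checkpoints, breakpoint_time):
    """Return the latest checkpoint time <= breakpoint_time (list must be sorted)."""
    idx = bisect_right(checkpoints, breakpoint_time)
    if idx == 0:
        raise ValueError("no checkpoint precedes the breakpoint")
    return checkpoints[idx - 1]


def recovery_line(checkpoints_by_proc, breakpoint_time):
    """Pick one restart checkpoint per process, independently of the others."""
    return {
        proc: latest_checkpoint_before(times, breakpoint_time)
        for proc, times in checkpoints_by_proc.items()
    }


def messages_needing_log(messages, line):
    """Messages sent before the sender's restart point and received after
    the receiver's restart point: the sender will not re-send them."""
    return [
        m for m in messages
        if m.send_time < line[m.sender] and m.recv_time > line[m.receiver]
    ]


if __name__ == "__main__":
    checkpoints = {0: [0, 10, 20], 1: [0, 5, 25]}
    msgs = [
        Message(sender=0, send_time=7, receiver=1, recv_time=8),
        Message(sender=1, send_time=12, receiver=0, recv_time=13),
        Message(sender=0, send_time=21, receiver=1, recv_time=22),
    ]
    line = recovery_line(checkpoints, breakpoint_time=22)
    print(line)
    print(messages_needing_log(msgs, line))
```

In this toy run, process 0 restarts at time 20 and process 1 at time 5, so only the first message crosses the line. A restriction to consistent global checkpoints would generally force a restart point further back in time, which is the temporal distance the abstract refers to.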
Last update: Wed Jun 12 14:26:53 2002 WEST