Rollback-One-Step Checkpointing and Reduced Message Logging for Debugging Message-Passing Programs |
---|
Nam Thoai - GUP Linz, Joh. Kepler University Linz; Dieter Kranzlmueller - GUP Linz, Joh. Kepler University Linz; Jens Volkert - GUP Linz, Joh. Kepler University Linz |
Cyclic debugging executes a program over and over again to track down and eliminate bugs. During re-execution, programmers may want to stop at breakpoints or step through the code to inspect the program’s state and detect errors. For large-scale, long-running parallel programs, the biggest drawback is the cost of restarting execution from the beginning every time. Combining checkpointing with debugging offers a solution, since debugging can then be initiated at any intermediate checkpoint. One remaining problem is the selection of an appropriate recovery line for a given breakpoint: the temporal distance between these two points can be rather long if recovery lines are chosen only at consistent global checkpoints. The method described in this paper allows an arbitrary checkpoint to be selected as the starting point for debugging, which shortens this temporal distance. In addition, it provides a mechanism for reducing the amount of trace data (in terms of logged messages). The resulting technique reduces the waiting time and the cost of cyclic debugging. |
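
To make the trade-off concrete, the following is a minimal sketch, not the algorithm of this paper, of why restarting from arbitrary checkpoints requires some messages to be logged. It assumes that events on each process are numbered locally, that a checkpoint at index `c` saves the state after event `c`, and that a breakpoint is given as one local event index per process. All names (`Message`, `restart_points`, `messages_needing_log`) are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: int
    send_event: int   # local event index of the send on the sender
    receiver: int
    recv_event: int   # local event index of the receive on the receiver


def latest_checkpoint_before(checkpoints, breakpoint_event):
    """Latest checkpoint index at or before the breakpoint (0 = program start)."""
    i = bisect_right(checkpoints, breakpoint_event)
    return checkpoints[i - 1] if i > 0 else 0


def restart_points(checkpoints_by_proc, breakpoint_by_proc):
    """Pick, per process, the checkpoint closest to (and not after) its breakpoint."""
    return {
        proc: latest_checkpoint_before(sorted(cps), breakpoint_by_proc[proc])
        for proc, cps in checkpoints_by_proc.items()
    }


def messages_needing_log(messages, restart):
    """Messages sent before the sender's restart point but received after the
    receiver's restart point are not re-sent during replay, so their contents
    must come from a log."""
    return [
        m for m in messages
        if m.send_event <= restart[m.sender] and m.recv_event > restart[m.receiver]
    ]


# Illustrative (made-up) data: two processes, their checkpoints, and a breakpoint.
checkpoints = {0: [10, 20, 30], 1: [5, 25]}
breakpoint_at = {0: 27, 1: 28}
messages = [
    Message(sender=0, send_event=18, receiver=1, recv_event=26),  # crosses the restart cut
    Message(sender=0, send_event=22, receiver=1, recv_event=27),  # re-sent during replay
    Message(sender=1, send_event=3, receiver=0, recv_event=8),    # fully before the cut
]

restart = restart_points(checkpoints, breakpoint_at)
logged = messages_needing_log(messages, restart)
```

In this made-up example, only the first message has to be taken from the log. The other two are either regenerated during replay or lie entirely before the restart points, which gives an intuition for why the number of logged messages depends on where checkpoints are placed.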