Virtual-machine-based heterogeneous checkpointing. (English) Zbl 1009.68926

Summary: Checkpointing an application is the act of saving the application's state to stable storage during its execution, so that if the application fails it can be restarted from the last saved state, thereby avoiding loss of the work that was already done. A heterogeneous checkpoint/restart mechanism allows one to restart an application on a hardware architecture and/or operating system that may differ from the one on which the application was saved. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application as maintained by a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which continues running from there. The heterogeneous checkpoint/restart mechanism reported here was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses some lessons and ideas for extending the work to native-code OCaml and Java.
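
The idea of checkpointing at the virtual machine level can be illustrated with a minimal sketch. It is not the paper's OCaml implementation: the `ToyVM` class, its two opcodes, the `PROGRAM` list, and the choice of JSON as the saved format are all assumptions made for this example. Only the state the virtual machine itself maintains (program, program counter, stack) is written out, in a text form that does not depend on word size or byte order, and a fresh copy of the machine resumes from it.

```python
import json


class ToyVM:
    """Hypothetical toy stack machine, for illustration only (not the OCaml VM)."""

    def __init__(self, program, pc=0, stack=None):
        self.program = program          # list of [opcode, operand] pairs
        self.pc = pc                    # index of the next instruction to run
        self.stack = list(stack or [])  # operand stack

    def step(self):
        """Execute one instruction and advance the program counter."""
        op, arg = self.program[self.pc]
        self.pc += 1
        if op == "push":
            self.stack.append(arg)
        elif op == "add":
            b = self.stack.pop()
            a = self.stack.pop()
            self.stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op}")

    def done(self):
        return self.pc >= len(self.program)

    def run(self):
        """Run to completion and return the value on top of the stack."""
        while not self.done():
            self.step()
        return self.stack[-1]

    def checkpoint(self):
        """Save VM-level state only, as architecture-neutral text (assumed format)."""
        return json.dumps(
            {"program": self.program, "pc": self.pc, "stack": self.stack}
        )

    @classmethod
    def restart(cls, saved):
        """Load a saved state into a new copy of the machine."""
        state = json.loads(saved)
        return cls(state["program"], state["pc"], state["stack"])


# Hypothetical program: computes (2 + 3) + 10.
PROGRAM = [["push", 2], ["push", 3], ["add", None], ["push", 10], ["add", None]]


def run_with_checkpoint(steps_before_failure):
    """Run a few steps, checkpoint, discard the machine, then resume elsewhere."""
    vm = ToyVM(PROGRAM)
    for _ in range(steps_before_failure):
        vm.step()
    saved = vm.checkpoint()
    del vm  # simulate losing the original process
    return ToyVM.restart(saved).run()
```

In this sketch, `run_with_checkpoint(3)` interrupts the program after the first addition and finishes it on a new `ToyVM` instance. A real virtual machine would also have to save heap objects, code pointers, and open channels, which this example leaves out.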

MSC:

68U99 Computing methodologies and applications
68N01 General topics in the theory of software
68M14 Distributed systems

Software:

OCaml; GC
