
Availability of a distributed computer system with failures. (English) Zbl 0584.68007

A model for distributed systems with failing components is presented. Each node may fail, and during its recovery the load is distributed to other nodes that are up. The model assumes periodic checkpointing for error recovery and periodic testing of the status of other nodes for the distribution of load. The performance measure is the availability of a node, defined as the proportion of time the node is available for processing. A methodology for optimizing the availability of a node with respect to the checkpointing and testing intervals is given. A decomposition approach that uses the steady-state flow balance condition to estimate the load at the node is proposed. Numerical examples are presented to demonstrate the usefulness of the technique. For the case in which all nodes are identical, closed-form solutions are obtained.
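To make the trade-off behind the checkpointing interval concrete, the following is a minimal sketch of a textbook first-order approximation for a single node. It is not the model of the paper: it ignores the testing interval and the redistribution of load, and the names `availability`, `optimal_interval`, `grid_search` and the parameters `lam` (failure rate), `C` (checkpoint cost) and `R` (restart cost) are assumptions introduced here for illustration. The approximation is only reasonable when `lam * T` is small.

```python
import math


def availability(T, lam, C, R):
    """First-order availability estimate for checkpoint interval T.

    Assumptions (not taken from the paper): failures arrive at rate lam,
    each checkpoint costs C time units, each restart costs R, and on
    average half an interval (T / 2) of work is redone after a failure.
    """
    checkpoint_overhead = C / T              # time spent checkpointing, per unit time
    failure_overhead = lam * (R + T / 2.0)   # restart plus rework, per unit time
    return 1.0 - checkpoint_overhead - failure_overhead


def optimal_interval(lam, C):
    """Interval maximizing the estimate: dA/dT = C / T**2 - lam / 2 = 0."""
    return math.sqrt(2.0 * C / lam)


def grid_search(lam, C, R, candidates):
    """Numerical cross-check: best interval among the given candidates."""
    return max(candidates, key=lambda T: availability(T, lam, C, R))


if __name__ == "__main__":
    lam, C, R = 0.01, 2.0, 5.0  # hypothetical values, arbitrary time units
    t_star = optimal_interval(lam, C)
    print("analytic optimum:", t_star)
    print("availability at optimum:", availability(t_star, lam, C, R))
    print("grid-search optimum:", grid_search(lam, C, R, range(1, 101)))
```

Checkpointing more often lowers the rework after a failure but raises the checkpointing overhead, so the estimate has an interior maximum. The grid search is included only as a sanity check on the analytic optimum.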

MSC:

68N99 Theory of software
