Abstract
Massively parallel systems pose a new challenge for fault tolerance. Designers of such systems cannot assume that no part of the system will fail: with the significant increase in complexity and in the number of components, the chance of single or multiple failures is no longer negligible. Redundancy, reconfigurability and diagnosis techniques must therefore be incorporated at the design stage itself, not added on afterwards. In this paper we discuss the fault tolerance techniques developed for MEMSY, a massively parallel architecture. These techniques can, in principle, be easily transferred to other distributed shared memory multiprocessors.
Guest researcher from TU Budapest, Dept. Measurement and Instrumentation Engineering
Copyright information
© 1993 Springer-Verlag
About this paper
Cite this paper
Dal Cin, M. et al. (1993). Fault tolerance in distributed shared memory multiprocessors. In: Bode, A., Dal Cin, M. (eds) Parallel Computer Architectures. Lecture Notes in Computer Science, vol 732. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57307-0_24
DOI: https://doi.org/10.1007/3-540-57307-0_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57307-4