Fault tolerant file models for parallel file systems: introducing distribution patterns for every file

Abstract

Parallelism in file systems is obtained by using several independent server nodes, each supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in a single node can stop the whole system. To avoid this problem, data must be stored redundantly, so that any data held on a faulty element can be recovered. Fault tolerance can be provided in I/O systems by using replication or RAID-based schemes. However, most current systems apply the same technique to all files in the system.
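
As a rough illustration of the RAID-based redundancy mentioned above, the following sketch stripes one block per server, adds a single XOR parity block, and rebuilds the block held by a failed server. It is not taken from the paper: the function names (`xor_blocks`, `stripe_with_parity`, `recover_block`) and the single-stripe layout are assumptions made for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_servers, block_size):
    """Split data into one block per server (zero-padded) plus a parity block.

    Hypothetical single-stripe layout: the data must fit in one stripe.
    """
    if len(data) > n_servers * block_size:
        raise ValueError("data does not fit in a single stripe")
    padded = data.ljust(n_servers * block_size, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(n_servers)]
    return blocks, xor_blocks(blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the one missing block from the surviving blocks and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Usage: three data servers, then the server holding block 1 fails.
blocks, parity = stripe_with_parity(b"parallel file data", 3, 6)
rebuilt = recover_block([blocks[0], blocks[2]], parity)
```

A single parity block tolerates the loss of only one server per stripe; replication or stronger codes would be needed to survive more failures.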

This paper describes the fault tolerance support provided by Expand, a parallel file system based on standard servers. This support can be applied to other parallel file systems and offers several benefits: fault tolerance at the file level, flexible definition of the fault tolerance scheme to be used, the possibility of changing the fault tolerance support used for a file, etc.
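
The idea of choosing, and later changing, a fault tolerance scheme per file can be pictured with a small policy table. The sketch below is purely hypothetical and does not reflect Expand's actual interface: `Scheme`, `FilePolicy`, `PolicyTable`, and `storage_factor` are names invented for this example, and the overhead formula assumes one parity block per stripe.

```python
from dataclasses import dataclass
from enum import Enum


class Scheme(Enum):
    NONE = "none"
    REPLICATION = "replication"
    PARITY = "parity"


@dataclass(frozen=True)
class FilePolicy:
    scheme: Scheme = Scheme.NONE
    copies: int = 1  # only meaningful for REPLICATION


class PolicyTable:
    """Hypothetical per-file registry of fault tolerance policies."""

    def __init__(self):
        self._policies = {}

    def set_policy(self, path, policy):
        # Setting a policy again models changing the scheme of an existing file.
        self._policies[path] = policy

    def get_policy(self, path):
        # Files without an explicit policy are stored without redundancy.
        return self._policies.get(path, FilePolicy())


def storage_factor(policy, data_servers):
    """Bytes stored per byte of user data under the given policy."""
    if policy.scheme is Scheme.REPLICATION:
        return float(policy.copies)
    if policy.scheme is Scheme.PARITY:
        # Assumes one parity block for every `data_servers` data blocks.
        return (data_servers + 1) / data_servers
    return 1.0


# Usage: replicate a checkpoint file, then switch it to parity.
table = PolicyTable()
table.set_policy("/checkpoint.dat", FilePolicy(Scheme.REPLICATION, copies=2))
before = storage_factor(table.get_policy("/checkpoint.dat"), data_servers=4)
table.set_policy("/checkpoint.dat", FilePolicy(Scheme.PARITY))
after = storage_factor(table.get_policy("/checkpoint.dat"), data_servers=4)
```

In a real system, changing the policy would also require redistributing the file's existing blocks; the sketch only tracks the metadata.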

Author information

Authors and Affiliations

Computer Architecture Group, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
A. Calderón, F. García-Carballeira, L. M. Sánchez, J. D. García & J. Fernandez

Corresponding author

Correspondence to A. Calderón.

About this article

Cite this article

Calderón, A., García-Carballeira, F., Sánchez, L.M. et al. Fault tolerant file models for parallel file systems: introducing distribution patterns for every file. J Supercomput 47, 312–334 (2009). https://doi.org/10.1007/s11227-008-0199-8

Received: 15 November 2007
Accepted: 14 March 2008
Published: 22 April 2008
Issue Date: March 2009
DOI: https://doi.org/10.1007/s11227-008-0199-8
