Ensimag Rubrique Formation 2022

Distributed Systems and Applications: Fault Tolerance - WMM53S3

  • Number of hours

    • Lectures 18.0
    • Laboratory works 12.0

  • ECTS 3.0

Goal(s)

Reliable distributed systems are built on several mechanisms, such as leader election, (ordered) broadcast, and consensus.
This course introduces the main algorithms used to implement these mechanisms, as well as the design techniques that limit
the impact of software or hardware failures. We present several algorithms and give some examples of basic correctness proofs.
We also study how the assumptions made about a system (synchrony, faults, etc.) affect the design of
distributed algorithms.

Contact Renaud LACHAIZE

Content(s)

The course is structured in two parts:
A - Distributed algorithms and agreement [7 lectures, Renaud Lachaize]
This part covers distributed algorithms and the engineering of distributed applications: the algorithms that are at the
basis of reliable distributed systems, and proofs that these algorithms are correct.
B - Fault-tolerance [3 lectures, Lorena Anghel]
This part focuses on the main design techniques to limit the impact of software or hardware failures: fault avoidance; robustness;
N-version programming; recovery block techniques; acceptance tests; retry; checkpoints and rollback.
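
As an illustration of how several of the techniques listed in part B fit together, here is a minimal sketch of a recovery block: a checkpoint is taken, each alternate runs on a copy of the checkpointed state (rollback), and an acceptance test decides whether the result is kept or the next alternate is tried. The names recovery_block, primary_sort, backup_sort and make_acceptance_test are hypothetical and chosen for this example only; it is not taken from the course material.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Run alternates in order until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)          # checkpoint taken before any attempt
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # rollback: each attempt restarts from the checkpoint
        try:
            result = alternate(candidate)
        except Exception:
            continue                           # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def primary_sort(values):
    """Deliberately faulty primary version: it loses one element."""
    values.pop()
    return sorted(values)


def backup_sort(values):
    """Simpler alternate version used when the primary is rejected."""
    return sorted(values)


def make_acceptance_test(original):
    """Accept a result only if it is ordered and has the original length."""
    def is_acceptable(result):
        ordered = all(a <= b for a, b in zip(result, result[1:]))
        return ordered and len(result) == len(original)
    return is_acceptable


data = [3, 1, 2]
result = recovery_block(data, [primary_sort, backup_sort], make_acceptance_test(data))
print(result)
```

In this sketch the primary version is rejected by the acceptance test, the state is restored from the checkpoint, and the backup version produces the accepted result; the caller's data is never modified by a failed attempt.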



Prerequisites

Centralized operating systems; networks; elements of probability.

Test

The exam is given in English only.

Exam + Practical activity.



S1 = 30% TP + 70% E1; S2 = 30% TP + 70% E2 (TP: practical activity; E1, E2: exams)

Additional Information

This course is given in English only.

Curriculum->M2 Sec, Crypt. and Coding of Info.->SCCI - Semester 3

Bibliography

1) D.P. Siewiorek, R.S. Swarz, Reliable Computer Systems: Design and Evaluation, second edition, 1992
2) D.K. Pradhan, Fault-Tolerant Computing: Theory and Techniques, Prentice Hall, 1986