Reliable distributed systems are built on several mechanisms, such as leader election, (ordered) broadcast, and consensus.
This course introduces the main algorithms used to implement these mechanisms, as well as the design techniques that limit
the impact of software or hardware failures. We present several algorithms and give some examples of basic correctness proofs.
Moreover, we study how the different assumptions that can be made about a system (synchrony, faults, etc.) impact the design of these algorithms.
The course is structured in two parts:
A - Distributed algorithms and agreement [7 lectures, Renaud Lachaize]
This part covers distributed algorithms and the engineering of distributed applications: study of the algorithms that are at the
basis of reliable distributed systems, and proofs that these algorithms are correct.
B - Fault-tolerance [3 lectures, Lorena Anghel]
This part focuses on the main design techniques to limit the impact of software or hardware failures: fault avoidance; robustness;
N-version programming; recovery block techniques; acceptance tests; retry; checkpoints and rollback.
Centralized operating systems; networks; elements of probability.
The exam is given in English only.
Exam + Practical activity.
This course is given in English only.
1) Siewiorek, Swarz, Reliable Computer Systems: Design and Evaluation, 2nd edition, 1992
2) D.K. Pradhan, Fault Tolerant Computing: Theory and Techniques, Prentice Hall, 1986