Ensimag Course Catalog 2022

Data management in large-scale distributed systems - WMM9MO08

  • Teaching hours

    • Lectures 18.0
    • Project -
    • Tutorials -
    • Internship -
    • Lab sessions -
    • Exam -

    ECTS credits 3.0

Objective(s)

At the end of the course, students will have an overview of the challenges associated with storing and processing data at large scale. They will know how to use Big Data software tools to efficiently store and process large amounts of data, including tools that can operate in real time.

Course lead(s)

Thomas ROPARS

Content(s)

The ability to process large amounts of data is key to both industry and research today. As computing systems grow larger, they generate more data that need to be analyzed to extract knowledge.

Data management infrastructures are growing fast, leading to the creation of large data centers and federations of data centers. Suitable software infrastructures are needed to store and process data in this context. Big Data software systems are built to take advantage of large sets of distributed resources to efficiently process massive amounts of data while coping with the failures that are frequent at such a scale.

In addition to the amount of data to be processed, the other main challenge that such Big Data systems need to deal with is time. For some use cases, the earlier the results of a data analysis are obtained, the more valuable they are. Some Big Data systems specifically target stream processing, where data are processed in real time.
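To make the idea of processing data as they arrive more concrete, here is a minimal sketch of a tumbling-window count in plain Python. It is an illustration only and is not taken from the course material: the function name `tumbling_window_counts`, the `(timestamp, key)` event format, and the assumption that events arrive in timestamp order are all hypothetical choices made for this example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) each time a window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed
    (for this sketch) to arrive in non-decreasing timestamp order.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        # Align the event to the start of its fixed-size window.
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete: emit its result right away.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Emit the last, possibly partial, window at end of input.
        yield current_start, counts
```

Results are emitted per window as soon as a later event shows that the window has closed, rather than after the whole input has been read. Real stream processing systems additionally have to handle out-of-order events, distribution, and failures, which this sketch ignores.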

Through lectures and practical sessions, this course provides an overview of the software systems that are used to store and process data at large scale. The following topics will be covered:

  • Map-Reduce programming model
  • In-memory data processing
  • Stream processing (data movement and processing)
  • Large-scale distributed data storage (distributed file systems, NoSQL databases)

Throughout the lectures, the challenges associated with performance and fault tolerance will also be discussed.
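As an illustration of the first topic, the Map-Reduce programming model can be sketched on a single machine with a word count. This is a minimal sketch, not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs sequentially in one process, whereas a real framework distributes each phase across many machines and handles failures.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

The map and reduce functions only see one record or one key at a time. That restriction is what lets a framework run many copies of them in parallel and re-execute a failed task without affecting the others.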

Prerequisites

Fundamentals of DBMS, concurrent programming (threads)

Assessment

P = presentation of research papers and/or graded lab assignments
E = written exam, session 1

N1 = 30% P + 70% E1
N2 = E2

The exam is available in English only.

Schedule

The course is scheduled in the following programs:

  • Engineering curriculum - Master 2 in Computer Science - Semester 9 (this course is taught in English only)
See the 2020/2021 timetable.

Additional information

Course code: WMM9MO08
Language(s) of instruction: FR


Bibliography

Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.

Zaharia, Matei, et al. "Apache Spark: a unified engine for big data processing." Communications of the ACM 59.11 (2016): 56-65.

Murray, Derek G., et al. "Naiad: a timely dataflow system." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.

Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.