Subject: Secure and Distributed Machine Learning in the Personal Cloud
This master internship opens the way for a PhD thesis.
Thanks to smart disclosure initiatives (e.g., Blue/Green Button, MesInfos, MiData) and new regulations (e.g., the European General Data Protection Regulation (GDPR)), we can access our personal data from the companies or government agencies that collected them. Concurrently, Personal Data Management System (PDMS) solutions are emerging at a rapid pace with the goal of offering a data platform (also called a Personal Cloud) that allows users to easily store any personal data in a single place: data directly generated by user devices (e.g., quantified-self data, smart home data, photos) and user interaction data (e.g., user preferences, social interaction data, health, bank). Users can then leverage the power of their PDMS to use their personal data for their own good and for the benefit of the community. Thus, the PDMS paradigm holds the promise of unlocking innovative new usages while preserving the current ones developed around personal data. A prominent example of novel usages is computation across a large number of PDMSs, e.g., automatic data classification, recommendations, participative studies, collective decisions. Such examples often require training an Artificial Intelligence (AI) model based on a large volume of user data.
The current model requires collecting the training data on a centralized server, where the AI model is built. However, this raises several problems. First, training an AI model requires a large volume of good-quality data. Hence, centralizing all users’ data on powerful servers is risky since these data servers become highly desirable targets for attackers: huge amounts of personal data belonging to millions of individuals could be leaked, as illustrated by recent massive attacks. Second, under the current legislation, which requires the user’s consent to collect the data, it becomes increasingly difficult to build a training database within a reasonable amount of time and with reasonable resources. Third, anonymizing the data by reducing its sensitivity is not an option either, since a good-quality AI model requires very accurate data.
In the Petrus team (@Inria/UVSQ), we take a different approach to overcome the limitations of centralized solutions. The idea is to leverage the PDMS paradigm, in which data is naturally distributed at the user side, and to organize a privacy-preserving computation between PDMS nodes to build an AI model securely and in a distributed manner. This distributed approach has several advantages. It can offer a high level of data confidentiality to the participants, encouraging them to take part in such collective computations. It lets users easily choose which kinds of computations/models they are willing to participate in. There is no need to degrade the data quality since data anonymization is no longer required (i.e., the raw user data are used only during the privacy-preserving AI model training and are thus not available afterwards).
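To give a concrete feel for what a privacy-preserving computation between nodes can look like, here is a minimal sketch of one generic, well-known technique: summing local model updates under pairwise additive masks that cancel in the total. It is an illustration only and not the protocol proposed by the team. All names (`pairwise_masks`, `mask_update`, `aggregate`, `MODULUS`) are hypothetical, updates are assumed to be already quantized to non-negative integers, and a single seeded generator stands in for the per-pair shared secrets that real nodes would have to agree on.

```python
import random

MODULUS = 2**32  # assumption: all arithmetic is done modulo this value


def pairwise_masks(node_ids, dim, seed=0):
    """Build one mask vector per node such that all masks sum to zero (mod MODULUS).

    Simplification: one seeded RNG simulates the secret each pair of nodes
    would normally derive between themselves.
    """
    rng = random.Random(seed)
    ids = sorted(node_ids)
    masks = {i: [0] * dim for i in ids}
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                # Node a adds the shared value, node b subtracts it.
                masks[a][k] = (masks[a][k] + shared[k]) % MODULUS
                masks[b][k] = (masks[b][k] - shared[k]) % MODULUS
    return masks


def mask_update(update, mask):
    """Hide a local (integer-quantized) update behind the node's mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Sum masked updates; the pairwise masks cancel, leaving the true sum."""
    dim = len(masked_updates[0])
    total = [0] * dim
    for vec in masked_updates:
        for k in range(dim):
            total[k] = (total[k] + vec[k]) % MODULUS
    return total


if __name__ == "__main__":
    # Hypothetical local updates held by three PDMS nodes.
    updates = {1: [1, 2, 3], 2: [10, 20, 30], 3: [100, 200, 300]}
    masks = pairwise_masks(updates.keys(), dim=3)
    masked = [mask_update(updates[i], masks[i]) for i in updates]
    print(aggregate(masked))
```

In this sketch the aggregator only ever sees masked vectors, yet the sum it computes equals the sum of the raw updates. The sketch does not handle node dropouts or colluding corrupted nodes.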
However, this approach also raises some important challenges. Organizing a secure and efficient distributed computation between PDMS nodes can be a difficult task, especially in the presence of a potentially significant number of corrupted nodes. Recently, we proposed in  a secure and efficient protocol for distributed query computation, which minimizes private data leakage in a P2P system even in the presence of a large number of colluding corrupted nodes. Nevertheless, training an AI model in the same context is still an open and challenging issue.
The goal of this master internship is to provide a first study of this important problem, i.e., how to efficiently train an AI model (e.g., a Deep Neural Network) in a pure P2P system while providing some security guarantees to the participating nodes. The intern will first study the recent literature (e.g., see ) presenting the existing, well-established AI models as well as the existing methods to secure a P2P computation (e.g., see ). Then, the student will build on  to design a new fully distributed protocol adapted to the specifics of AI model training and validate it through implementations/simulations.
This internship is a first step towards a PhD thesis whose objective is to develop this topic in depth. Depending on the candidate’s profile and preferences, the PhD can be envisioned either in a classical academic environment (i.e., in the context of an ANR project) or in collaboration with industry (i.e., as a CIFRE thesis). For the latter, Petrus closely collaborates with Cozy Cloud (cozy.io), one of the leading French startups working on the development of a privacy-preserving Personal Cloud platform, and regularly hosts CIFRE theses with this company.
Required skills: Good programming skills and general knowledge of distributed (peer-to-peer) systems; knowledge of either security issues or machine learning/data mining techniques is a plus.
Dates & Duration: Generally, from April 2019 to September 2019, but can be adapted.
[1] Nicolas Anciaux, Philippe Bonnet, Luc Bouganim, Benjamin Nguyen, Philippe Pucheral, Iulian Sandu Popa, and Guillaume Scerri. 2019. Personal Data Management Systems: The security and functionality standpoint. Information Systems, 80 (2019). http://petrus.inria.fr/~bouganim/TMP/Stage/P1.pdf
[2] Julien Loudet, Iulian Sandu-Popa, and Luc Bouganim. 2019. SEP2P: Secure and Efficient P2P Personal Data Processing. In International Conference on Extending Database Technology, EDBT 2019. http://petrus.inria.fr/~bouganim/TMP/Stage/P2.pdf
[3] T. Ben-Nun and T. Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. http://petrus.inria.fr/~bouganim/TMP/Stage/P3.pdf