Research

Context and Team objectives

In an increasingly interconnected world, privacy protection and personal data management are paramount. How can remote workers share information with their employers without revealing private details? How can a student witnessing school bullying report it anonymously? How can public forms collect less personal information from millions of citizens each year? New privacy rights are emerging in regulations.

Applications that model and enforce them, called Privacy-Enhancing Technologies (PETs), are essential to exercising these rights. However, practical adoption faces obstacles, including the need for better modeling of these rights for greater clarity, understanding, and real-world application. Secure design and implementation are also essential for the adoption and deployment of such proposals.

PETSCRAFT focuses primarily on modeling privacy protection concepts and on the design, optimization, security enforcement, testing and deployment of explicable and efficient PETs based on these principles. These concepts can stem from legal requirements (e.g., GDPR concepts) or from guidelines addressing societal and ethical issues (e.g., helping harassment whistle-blowers). Recognizing the paramount importance of explicability, the project aims for a better definition of these concepts’ requirements and for a balance between privacy and legitimate uses, especially in the expanding landscape of digital surveillance 47, while providing efficiency through, e.g., advanced data management techniques.

Our initial goal is thus to create PETs that would be adopted by the general public, industry, or institutions.

Our ultimate goal is to propose and validate both a method and a “cyber-fablab” for crafting PETs.


Last activity report: 2024

Results

New results

The research methodology of the Petscraft team draws inspiration from our recent work on data minimization 1, 28, obtained during the team’s incubation period and conducted collaboratively between Inria Saclay and INSA CVL. Our goal in this work is to define a new privacy model (Axis 1) to apply GDPR principles to data collection via forms, develop an explainability model (Axis 2) to help citizens understand the minimization choices (see Section 8.1), and propose a secure implementation (Axes 3 and 4) to generalize this approach to all types of French administrative forms (still ongoing). A subsequent step involves engaging stakeholders, such as Mesalloc (which manages data collection for administrative forms in France) and the CNIL, to explore the adoption and implementation of these techniques.

Since Petscraft’s official creation on June 1, 2024, we have achieved the following results. First, building on this methodology, progress has been made on new research initiatives launched in collaboration between Inria and INSA CVL, including privacy control in new data structures like Matrix Profile (Section 8.3.1), the development of text reformulation tools that preserve anonymity using Large Language Models (Section 8.3.2), and, in partnership with legal experts, an exploration of copyright and privacy issues posed by the rise of LLMs (Section 8.2.1). Second, significant results have also been achieved on Petscraft’s core research axes, including cohesive database neighborhoods for differential privacy (Section 8.3.3), anonymous signatures (Sections 8.4.2 and 8.2.2), and data management using Trusted Execution Environments (Section 8.5.1).

Results obtained during team incubation: Data Minimization Model (Axes 1 & 2)

Privacy laws and principles such as data minimization and informed consent are supposed to protect citizens from the over-collection of personal data. Nevertheless, current processes, mainly based on filling out forms, still rely on practices that lead to over-collection. Indeed, any citizen wishing to apply for a benefit (or service) will transmit all of their personal data involved in the evaluation of the eligibility criteria. The resulting problem of over-collection affects millions of individuals, with considerable volumes of information collected. This compliance problem concerns both public and private organizations (e.g., social services, banks, insurance companies) because it raises non-trivial issues that hinder the implementation of data minimization by developers. At EDBT’24 18, we propose a new modeling approach that enables data minimization and informed choices for users, for any decision problem modeled using classical logic, which covers a wide range of practical cases. Our data minimization solution uses game-theoretic notions to explain and quantify the privacy payoff for the user. We show in 1 how our algorithms can be applied to practical case studies as a new PET for minimal, fully accurate (all due services must be preserved) and informed data collection. The system was also demonstrated at CCS’23 (anciaux:hal-04240432).

Figure 2: A PET for Informed Data Minimization in forms.
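
To make the idea concrete, here is a minimal sketch, assuming a toy eligibility rule expressed as a propositional formula over three hypothetical attributes: a disclosed subset of attributes is sufficient when it already determines the decision whatever the values of the undisclosed ones. The brute-force search below is illustrative only and is not the EDBT’24 algorithm.

```python
from itertools import combinations, product

# Toy eligibility rule (hypothetical): benefit granted if the applicant is
# either low-income, or both unemployed and a parent.
ATTRS = ["low_income", "unemployed", "parent"]

def eligible(v):
    return v["low_income"] or (v["unemployed"] and v["parent"])

def determines(subset, values):
    """True if fixing `subset` to the applicant's values already fixes the decision."""
    free = [a for a in ATTRS if a not in subset]
    outcomes = set()
    for combo in product([False, True], repeat=len(free)):
        v = dict(values)            # start from the applicant's real values
        v.update(zip(free, combo))  # override the undisclosed attributes
        outcomes.add(eligible(v))
    return len(outcomes) == 1

def minimal_disclosure(values):
    """Smallest attribute subset whose disclosure still certifies the decision."""
    for size in range(len(ATTRS) + 1):
        for subset in combinations(ATTRS, size):
            if determines(subset, values):
                return subset
    return tuple(ATTRS)

applicant = {"low_income": True, "unemployed": False, "parent": True}
print(minimal_disclosure(applicant))  # ('low_income',): income alone decides
```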

New results for Axis 1

Compliance and Large Language Models (COMPLY-LLM): Detecting Privacy and Copyright Violations (Axis 1)

The rise of Large Language Models (LLMs) has triggered legal and ethical concerns, especially regarding the unauthorized use of copyrighted materials in their training datasets. This has led to lawsuits against tech companies accused of using protected content without permission. Membership Inference Attacks (MIAs) aim to detect whether specific documents were used in a given LLM’s pretraining, but their validation is undermined by biases (e.g., due to time shifts, n-gram distributions, …) between the presumed member and non-member datasets used for MIA assessments.

In 6, we address the evaluation of MIAs on LLMs with partially inferable training sets, under the ex-post hypothesis, which acknowledges inherent distributional biases between member and non-member datasets. We propose and validate algorithms to create “non-biased” and “non-classifiable” datasets for fairer MIA assessment. The internship topic of Yanming Li further explores this methodology. This project is conducted in partnership with Alexandra Bensamoun, Professor of Law at University Paris-Saclay.
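
As a rough illustration of the bias problem (not the algorithms of 6), here is a minimal “blind baseline” check, assuming shallow surface features and a hypothetical member/non-member split: if a classifier that never queries the model can already separate the two sets, MIA scores on that split mostly measure distribution shift rather than membership.

```python
# Hedged sketch: a simple "blind baseline" check for MIA benchmark bias.
# If a classifier using only shallow text statistics separates presumed members
# from non-members, MIA results on this split mostly reflect distribution shift.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def shallow_features(text):
    toks = text.split()
    return [len(toks),
            np.mean([len(t) for t in toks]) if toks else 0.0,
            sum(t.isdigit() for t in toks)]

def bias_auc(member_texts, non_member_texts):
    X = np.array([shallow_features(t) for t in member_texts + non_member_texts])
    y = np.array([1] * len(member_texts) + [0] * len(non_member_texts))
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    return auc  # ~0.5 suggests a "non-classifiable" split; >>0.5 signals bias
```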

Anonymity of Linkable Ring Signatures (Axis 1)

Security models provide a way of formalising security properties in a rigorous way, but it is sometimes difficult to ensure that the model really fits the concept that we are trying to formalise. In 4, we illustrate this fact by showing the discrepancies between the security model of anonymity in linkable ring signatures and the security that is actually expected for this kind of signature. These signatures allow a user to sign anonymously within an ad hoc group generated from the public keys of the group members, but all their signatures can be linked together. Reading the related literature, it seems obvious that users’ identities must remain hidden even when their signatures are linked, but we show that, surprisingly, almost none of the anonymity models guarantee this property. We illustrate this by presenting two counter-examples which are secure in most anonymity models of linkable ring signatures, but which trivially leak a signer’s identity after only two signatures. A natural fix to the model, already introduced in some previous work, is proposed together with a corruption model in which the attacker can generate the keys of certain users themselves, which seems much more coherent in a context where the group of users can be constructed in an ad hoc way at the time of signing. We believe that these two changes make the security model more realistic. Indeed, within the framework of this model, our counter-examples become insecure. Furthermore, we show that most of the schemes we surveyed in the literature appear to have been designed to achieve the security guaranteed by this latest model, which reinforces the idea that the model is closer to the informal intuition of what anonymity should be in linkable ring signatures.
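
To convey the kind of leakage at stake, here is a deliberately artificial toy (not a ring signature and not one of the counter-examples of 4): each “signature” alone reveals nothing about the signer, yet any two signatures by the same signer combine to expose the identity, which is exactly the behaviour a faithful anonymity model should rule out.

```python
# Toy analogy (not a real ring signature, not the paper's counter-examples):
# each "signature" alone hides the signer, but any two signatures by the same
# signer can be combined to recover the identity -- the leakage pattern that
# the revised anonymity model is meant to exclude.
import secrets

def sign(identity: int, state: dict):
    if "pad" not in state:                   # first signature: publish a random pad
        state["pad"] = secrets.randbits(32)
        return {"tag": state["pad"]}
    return {"tag": state["pad"] ^ identity}  # second signature: pad XOR identity

def link_and_deanonymize(sig1, sig2):
    return sig1["tag"] ^ sig2["tag"]         # XOR of the two tags = identity

alice_state, alice_id = {}, 0x1234
s1, s2 = sign(alice_id, alice_state), sign(alice_id, alice_state)
assert link_and_deanonymize(s1, s2) == alice_id
```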

New results for Axis 2

Privacy control in new data structures like Matrix Profile (Axis 2)

Matrix Profile (MP) enables privacy-preserving solutions for sensitive contexts like Continuous Authentication (CA) and teleworking. In 15, we propose a CA method combining incremental MP and deep learning on accelerometer data, achieving good accuracy for single-user authentication while keeping the data used during authentication stored locally (see Fig. 3). For teleworking, we initiated a study 21, 22, called TELESAFE, using matrix profiles in a telework context. This study emphasizes the right to disconnect and the self-regulation of work and personal time by detecting boundary crossings between private and work activities using electric consumption data, without requiring training or intrusive monitoring. The proposal achieves an excellent F-score, comparable to machine learning approaches, while ensuring a higher level of privacy and suitability.

Figure 3: Continuous authentication using Matrix Profile in a home care scenario.
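
For readers unfamiliar with Matrix Profiles, here is a minimal batch sketch using the stumpy library; the approaches above are incremental and combined with deep learning, and the signal, window length and discord criterion used here are illustrative assumptions only.

```python
# Hedged sketch: spotting unusual windows with a Matrix Profile, in the spirit
# of the approaches above (the team's versions are incremental / deep-learning
# based; this is only the basic batch computation with the stumpy library).
import numpy as np
import stumpy

def unusual_windows(signal: np.ndarray, m: int, top_k: int = 3):
    """Return start indices of the top_k length-m subsequences whose nearest
    neighbour is farthest away (Matrix Profile discords)."""
    mp = stumpy.stump(signal, m)        # column 0 holds the profile distances
    profile = mp[:, 0].astype(float)
    return np.argsort(profile)[-top_k:][::-1]

# e.g. on a household power trace sampled every minute, with m = 60 (one hour),
# the reported windows are candidates for work/personal boundary crossings.
```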

reteLLMe: Preserve Privacy using Large Language Models (Axis 2)

The advanced inference capabilities of Large Language Models (LLMs) pose a significant threat to the privacy of individuals by enabling third parties to accurately infer certain personal attributes (such as gender, age, location, religion, and political opinions) from their writings. Paradoxically, LLMs can also be used to protect individuals by helping them modify their textual output to block certain unwanted inferences, opening the way to new tools. Examples include sanitising online reviews (e.g., of hotels or movies), or sanitising CVs and cover letters. However, how can we avoid misestimating the risks of inference for LLM-based text sanitisers? Can the protection offered be overestimated? Is the original purpose of the produced text preserved? To the best of our knowledge, no previous work has tackled these questions.

Thus, in 2, four design rules (collectively referred to as reteLLMe) are proposed to minimise these potential issues. The main idea is to use LLMs as both an attacker and a defender (see Fig. 4). We validate these rules and quantify the benefits obtained in a given use case, sanitising hotel reviews. We show that up to 76% of at-risk texts are not flagged as such without fine-tuning. Moreover, classic techniques such as BLEU and ROUGE are shown to be incapable of assessing the amount of purposeful information in a text. Finally, a sanitisation tool based on reteLLMe demonstrates superior performance to a state-of-the-art sanitiser, with better results on up to 90% of texts. The PhD thesis of Lucas Biéchy further explores the use of LLMs in the context of privacy protection.

Figure 4: Main building blocks of a text sanitisation process using LLMs.
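
Below is a minimal sketch of the attacker/defender loop of Fig. 4, assuming a hypothetical call_llm helper in place of a concrete LLM client; the prompts, attribute list and loop bound are illustrative and do not reproduce the four reteLLMe rules.

```python
# Hedged sketch of the attacker/defender loop in Fig. 4. `call_llm` is a
# hypothetical helper standing in for any chat-completion client; prompts,
# attribute list and loop bound are illustrative, not the reteLLMe rules.
ATTRIBUTES = ["gender", "age", "location"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def sanitize(text: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        # 1. LLM as attacker: try to infer personal attributes from the text.
        inferred = call_llm(
            f"Infer {', '.join(ATTRIBUTES)} of the author, or answer 'none':\n{text}")
        if "none" in inferred.lower():
            return text                      # nothing inferable: stop rewriting
        # 2. LLM as defender: rewrite to block the inference, keep the purpose.
        text = call_llm(
            "Rewrite the text so the following can no longer be inferred, "
            f"while preserving its purpose.\nInferences: {inferred}\nText: {text}")
    return text
```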

Cohesive database neighborhoods for differential privacy (Axis 2)

The Semantic Web represents an extension of the current web, offering a metadata-rich environment based on the Resource Description Framework (RDF) which supports advanced querying and inference. However, relational database (RDB) management systems remain the most widespread systems for (Web) data storage. Consequently, the key to populating the Semantic Web is the mapping of RDB to RDF, supported by standardized mechanisms. Confidentiality and privacy represent significant barriers for data owners when considering the translation and subsequent utilization of their data. In order to facilitate acceptance, it is essential to build privacy models that are equivalent and explainable within both data formats. Differential Privacy (DP) has emerged as the flagship of data privacy when sharing or exploiting data. Recent works have proposed DP models tailored to either multi-relational databases or RDF.

In 7, 25, 8, we leverage this line of work to study how privacy guarantees on RDBs with foreign key constraints can be transposed to RDF databases and vice versa. We consider a promising DP model for RDBs related to cascade deletion and demonstrate that it is sometimes similar to an existing DP graph privacy model, but not consistently so. Consequently, we tweak this model in the relational world and propose a new model called restrict deletion. We show that it is equivalent to an existing DP graph privacy model, facilitating the comprehension, design and implementation of DP mechanisms in the context of the mapping of RDB to RDF. Building on this study of how database constraints impact differential privacy, we present in 23 a preliminary study on data privacy for knowledge graphs, in the context of the PhD of Yasmine Hayder.
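
To give intuition, here is a toy rendering of the two relational neighbourhood notions, based on the standard SQL ON DELETE CASCADE / RESTRICT semantics that the names evoke; the schema and data are made up, and this is not the formalisation of 7, 25, 8.

```python
# Hedged toy of two relational neighbourhood notions for differential privacy
# (schema and data are invented; not the paper's formal definitions).
# Tables: person(id), purchase(id, person_id) with a foreign key on person_id.
person = {1, 2}
purchase = {(10, 1), (11, 1), (12, 2)}

def cascade_delete(pid):
    """Neighbour under cascade deletion: drop a person and every referencing tuple."""
    return person - {pid}, {(i, p) for (i, p) in purchase if p != pid}

def restrict_delete(pid):
    """Neighbour under restrict deletion: only unreferenced persons may be dropped."""
    if any(p == pid for (_, p) in purchase):
        raise ValueError("deletion restricted: tuple is still referenced")
    return person - {pid}, purchase

# Cascade: removing person 1 also removes two purchases, so neighbours can differ
# by many tuples, which complicates the calibration of differentially private noise.
print(cascade_delete(1))
```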

New results for Axis 3

Dissemination on the use of Cryptographic Protocol Proofs, applied to board games (Axis 3)

One of our original results is a dissemination article that uses the example of a board game to show how privacy properties and their proofs can improve the quality of the game by reducing the possibility of cheating.

Cryptid is a board game in which the goal is to be the first player to locate the cryptid, a legendary creature, on a map. Each player knows a secret clue as to which cell on the map contains the cryptid. Players take turns asking each other whether the cryptid could be on a given cell according to their clue, until one of them guesses the cryptid's cell. This game is great fun, but completely loses its interest if one of the players cheats by answering the questions incorrectly. For example, if a player answers negatively for the cryptid's cell, the game drags on until all the cells have been tested, and ends without a winner.

In 3, we show how to provide cryptographic protocols to prevent cheating in Cryptid. The main idea is to use encryption to commit to the players' clues, enabling them to show, using zero-knowledge proofs, that they are answering correctly in accordance with their clue. We give a security model which captures soundness (a player cannot cheat) and confidentiality (the protocol leaks no more information about the clues than the players' answers do), and prove the security of our protocols in this model. We also analyze the practical efficiency of our protocols, based on an implementation of the main algorithms in Rust. Finally, we extend our protocols to ensure that the game designer has correctly constructed the Cryptid games, i.e., that the clues are well formed and converge on at least one cell.
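
For illustration, here is a minimal sketch of the commit-then-verify skeleton underlying these protocols, assuming a hypothetical clue representation; cheating is detected here by revealing clues after the game, whereas the protocols of 3 use encryption and zero-knowledge proofs so that clues are never revealed.

```python
# Hedged sketch: the commit-then-verify skeleton behind the Cryptid protocols.
# Here clues are checked by revealing them after the game; the paper instead
# relies on encryption and zero-knowledge proofs so clues stay secret.
import hashlib, secrets

def commit(clue: str):
    nonce = secrets.token_hex(16)
    return hashlib.sha256((nonce + clue).encode()).hexdigest(), nonce

def verify_player(commitment, nonce, clue, answers, clue_holds):
    """`answers` maps each asked cell to the yes/no answer the player gave;
    `clue_holds(clue, cell)` recomputes what the honest answer should have been."""
    if hashlib.sha256((nonce + clue).encode()).hexdigest() != commitment:
        return False                                # clue was changed mid-game
    return all(ans == clue_holds(clue, cell) for cell, ans in answers.items())

# Example (hypothetical clue language): the clue "forest|desert" means the
# cryptid's cell is forest or desert; answers are re-checked against the map.
```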

Results of this article were presented to high-school students and teachers as a simple example of the use of cryptographic protocol proofs.

The PhD thesis of Khouredia Cissé will further explore the use of proofs for privacy and security protocols used “in the real world”.

Proofs for delegations in anonymous signatures (Axis 3)

Fully traceable k-times anonymity is a security property of anonymous signatures: if a user produces more than k anonymous signatures, their identity is disclosed and all their previous signatures can be identified. In 13, we show how this property can be achieved for delegation-supported signature schemes, especially proxy signatures, where the signer allows a delegate to sign on their behalf, and sanitizable signatures, where a signer allows a delegate to modify certain parts of the signed messages. In both cases, we formalize the primitive, give a suitable security model, provide a scheme and then prove its security under the DDH assumption. The size of the keys/signatures is logarithmic in k in our two schemes, making them suitable for practical applications, even for large k.
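
As a textbook-style illustration of k-times traceability only (not the DDH-based constructions of 13): if each signature releases one evaluation of a secret degree-k polynomial whose constant term encodes the signer's identity, then up to k signatures reveal nothing, while k+1 signatures allow the identity to be interpolated. The prime modulus and parameters below are arbitrary.

```python
# Toy illustration of k-times traceability (not the schemes of the paper):
# each signature releases one point of a secret degree-k polynomial whose
# constant term is the signer's identity; k+1 points allow interpolation.
import random

P = 2**61 - 1  # a Mersenne prime used as a toy modulus (hypothetical parameter)

def keygen(identity: int, k: int):
    return [identity] + [random.randrange(P) for _ in range(k)]  # coefficients

def sign_tag(coeffs, x):
    """One tracing tag per signature: the point (x, f(x)) on the secret polynomial."""
    return x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P

def trace(points):
    """Lagrange interpolation at 0 over k+1 distinct points recovers the identity."""
    ident = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        ident = (ident + yj * num * pow(den, P - 2, P)) % P
    return ident

k, alice = 2, 424242
coeffs = keygen(alice, k)
tags = [sign_tag(coeffs, x) for x in (1, 2, 3)]  # k + 1 = 3 signatures
assert trace(tags) == alice                      # identity disclosed
```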

New results for Axis 4

Personal data management using Trusted Execution Environments (Axis 4)

In a rapidly evolving landscape, systems for managing personal data empower individuals with tools to collect, manage, and share their data. Simultaneously, the emergence of Trusted Execution Environments (TEEs) addresses the critical challenge of securing user data while enabling a robust ecosystem of data-driven applications.

In 5, we propose an architecture that leverages TEEs as a foundational security mechanism (Fig. 5). Unlike conventional approaches, our design supports extensible data processing by integrating user-defined functions (UDFs), even from untrusted sources. Our focus is on UDFs that involve potentially large sets of personal data objects, introducing a novel approach to mitigate the risk of data leakage. We present security building blocks that enforce an upper bound on data exposure and evaluate the efficiency of various execution strategies under scenarios relevant to personal data management. In 24, we initiate a new study in the specific context of Storage-as-a-Service (STaaS), leveraging TEEs to protect data even when the processing code is considered vulnerable, using compartmentalization. The proposed solutions are validated through an implementation using Intel SGX on real datasets, demonstrating their effectiveness in achieving secure and efficient computations across diverse environments.

Figure 5: Secure energy consumption computation scenario based on TEEs.
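
Below is a plain-Python analogue of the exposure bound placed around untrusted UDFs, with hypothetical class and method names; in the architecture of 5, the corresponding checks are security building blocks running inside the SGX enclave.

```python
# Hedged, enclave-free analogue of the exposure bound enforced around untrusted
# UDFs: the UDF receives objects only through a guarded accessor that stops
# serving data once a fixed budget of personal data objects has been consumed.
class ExposureBudgetExceeded(Exception):
    pass

class GuardedStore:
    def __init__(self, objects, max_exposed):
        self._objects = list(objects)
        self._budget = max_exposed          # upper bound on objects handed out

    def fetch(self, index):
        if self._budget <= 0:
            raise ExposureBudgetExceeded("UDF exceeded its data exposure bound")
        self._budget -= 1
        return self._objects[index]

def run_udf(udf, store):
    try:
        return udf(store)                   # untrusted code only sees `fetch`
    except ExposureBudgetExceeded:
        return None                         # result withheld, leakage bounded

# e.g. an aggregate UDF over energy readings may fetch at most `max_exposed`
# records; in the real architecture this check sits inside the enclave.
```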
