Data Anonymization Exercises

SKEMA ANONYMIZATION EXERCISES (Feb. 2021):

Instructor: Claude Castelluccia (Inria Privatics), claude.castelluccia@inria.fr

A subset of the UCI Adult dataset was anonymized with k-anonymity using the ARX anonymization tool and different values of k (k = 5, 10, 20, 50, 100, 500). All the files (README, original and anonymized datasets) are available here.

The goal of these exercises is to manipulate anonymized datasets, understand some of their limitations, and practice Python coding.

  1. Compute the unicity level of each record (i.e. how many records are unique, how many appear 2 times, …, how many appear k times) in the original dataset. Display the results as a histogram. What do you conclude?
  2. Compute the unicity level of each record in the anonymized datasets. Display the results as histograms (one per anonymized dataset). What do you conclude about the quality of the anonymization process?
  3. Predict the salary_class of the people matching each of the following queries, using the different anonymized datasets. The prediction is performed by counting the records that match the query in each of the two salary_class classes (<=50K and >50K). Compute the probability of each class for each query.
    What do you conclude about the accuracy of the results?

    1. Query1: “*;Private;9th;Never-married;Other-service;White;Male;United-States;”
    2. Query2: “*;Federal-gov;Masters;Never-married;*;White;Male;United-States;”
    3. Query3: “*;Private;*;Divorced;*;White;Male;United-States;”
    4. Query4: “*;Self-emp-not-inc;HS-grad;Married-civ-spouse;Sales;White;Female;United-States;”
    5. Query5: “*;Private;5th-6th;Married-civ-spouse;*;White;Male;Mexico;”
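
As a starting point for exercises 1 and 2, the unicity computation could be sketched as follows. This is a minimal sketch using only the standard library, not a reference solution. The helper names (`load_records`, `unicity_levels`), the `;` delimiter, the header line, and the position of the 8 quasi-identifier columns are all assumptions; check the README for the actual file layout.

```python
import csv
from collections import Counter


def load_records(path, n_quasi_identifiers=8, delimiter=";"):
    """Read a CSV file and keep the quasi-identifier columns of each row.

    Assumptions (not confirmed by the exercise sheet): the file has a header
    line, uses ';' as delimiter, and the quasi-identifiers come first.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        next(reader, None)  # skip the assumed header line
        return [tuple(row[:n_quasi_identifiers]) for row in reader]


def unicity_levels(records):
    """Map each group size k to the number of records that appear k times."""
    # How many times each combination of values occurs in the dataset
    occurrences = Counter(tuple(r) for r in records)
    levels = Counter()
    for size in occurrences.values():
        # A combination seen `size` times contributes `size` records to level `size`
        levels[size] += size
    return dict(sorted(levels.items()))


# Toy data with two columns, only to illustrate the computation
toy = [
    ("[20, 40[", "Male"),
    ("[20, 40[", "Male"),
    ("[40, 60[", "Female"),
    ("[20, 40[", "Female"),
]
levels = unicity_levels(toy)
print(levels)  # {1: 2, 2: 2}: two unique records, two records that appear twice
```

The returned dictionary can be plotted directly as a bar chart, for example with matplotlib's `plt.bar(list(levels), list(levels.values()))`. For a dataset that is k-anonymous on these columns, the smallest key should be at least k.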

 

Note: the format of each query is “age;work_class;education;marital-status;occupation;race;sex;native-country”, and “*” means “any value”.
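
The query format described in this note can be turned into a simple counting procedure. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, each row is assumed to hold the 8 query fields in the order given above followed by salary_class as the last value, and matching is done by exact string comparison. Generalized values in the anonymized datasets (for example age intervals) will therefore not match a specific query value; deciding how to handle them is part of the exercise.

```python
from collections import Counter


def matches(record, query):
    """True if every query field is '*' or equals the corresponding record field."""
    return all(q == "*" or q == value for q, value in zip(query, record))


def salary_probabilities(rows, query_string):
    """Estimate the probability of each salary_class among rows matching the query.

    Assumption: the last value of each row is its salary_class.
    """
    query = query_string.rstrip(";").split(";")
    counts = Counter(
        row[-1] for row in rows if matches(row[:len(query)], query)
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no matching record: no prediction is possible
    return {label: n / total for label, n in counts.items()}


# Toy rows (invented values), only to illustrate the computation
rows = [
    ["39", "Private", "9th", "Never-married", "Other-service", "White", "Male", "United-States", "<=50K"],
    ["25", "Private", "9th", "Never-married", "Other-service", "White", "Male", "United-States", "<=50K"],
    ["41", "Private", "9th", "Never-married", "Other-service", "White", "Male", "United-States", ">50K"],
    ["30", "Federal-gov", "Masters", "Never-married", "Sales", "White", "Male", "United-States", ">50K"],
]
query1 = "*;Private;9th;Never-married;Other-service;White;Male;United-States;"
probs = salary_probabilities(rows, query1)
print(probs)  # two of the three matching rows are in class <=50K
```

Returning an empty dictionary when no record matches keeps the "no data" case distinct from a probability of zero, which matters when comparing results across the different values of k.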

Send me your results, i.e. plots and Python code (use a Jupyter Notebook if possible), before the next session in March (March 23rd, 2021).

Resources:

  • Some examples of Python code for the UCI Adult dataset.