INEP 1 (syntactic) Anonymization
Code and attribute hierarchies used to anonymize the INEP datasets with the ARX Deidentifier tool.
DOI: 10.5281/zenodo.6533684.
The resulting datasets were used for vulnerability assessment using the BVM library (10.5281/zenodo.6533704). The assessment results were published in: Mário S. Alvim, Natasha Fernandes, Annabelle McIver, Carroll Morgan, Gabriel H. Nunes - Flexible and scalable privacy assessment for very large datasets, with an application to official governmental microdata (2022, 10.48550/arXiv.2204.13734).
In each dataset, we randomly selected exactly one record for each student, where records sharing the same unique pseudonymization code (ID_ALUNO) belong to the same student. The enrollment code (ID_MATRICULA) of each selected record is available in 10.5281/zenodo.6533675 (gitlab.com/nunesgh/inep-enrollment-codes).
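The selection step described above can be sketched as follows. This is only an illustration, not the code actually used to produce the published enrollment codes: the function name pick_one_per_student, the seed parameter, and the toy records are assumptions, and only the column names ID_ALUNO and ID_MATRICULA come from the datasets.

```python
import random
from collections import defaultdict


def pick_one_per_student(records, seed=None):
    """Return one uniformly random record per distinct ID_ALUNO value.

    `records` is an iterable of dicts that each contain an "ID_ALUNO" key.
    The name and signature of this helper are hypothetical.
    """
    rng = random.Random(seed)  # local generator, so the choice is reproducible
    by_student = defaultdict(list)
    for rec in records:
        by_student[rec["ID_ALUNO"]].append(rec)
    # One record is drawn uniformly at random from each student's group.
    return {student: rng.choice(recs) for student, recs in by_student.items()}


# Toy data (made up): student "A" has two enrollments, student "B" has one.
records = [
    {"ID_ALUNO": "A", "ID_MATRICULA": 1},
    {"ID_ALUNO": "A", "ID_MATRICULA": 2},
    {"ID_ALUNO": "B", "ID_MATRICULA": 3},
]
selected = pick_one_per_student(records, seed=0)
# Enrollment codes of the kept records, one per student.
kept_enrollments = sorted(rec["ID_MATRICULA"] for rec in selected.values())
```

Grouping all records in memory is the simplest approach. For datasets too large to hold at once, a streaming variant (for example, per-student reservoir sampling) would be a reasonable alternative.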
The jar files in arx/jars/ were compiled from the ARX fork made by @ramongonze, based on commit 8a936d3 and using the command ant -buildfile build.xml.
This fork allows for the creation of matrices with up to (2^31-1)^2 cells, instead of the original limit of 2^31-1 cells. Because the new feature causes some GUI errors, ARX must be run via the CLI. For more information, see this issue.
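To make the two limits concrete, the short sketch below computes them and checks whether a table of a given shape would fit under each. The helper name fits and the example shape (60 million rows by 40 columns) are hypothetical and chosen only for illustration. They are not taken from the INEP datasets or from ARX itself.

```python
ORIGINAL_LIMIT = 2**31 - 1        # 2,147,483,647 cells
FORK_LIMIT = (2**31 - 1) ** 2     # 4,611,686,014,132,420,609 cells


def fits(rows, cols, limit):
    """Return True if a rows-by-cols matrix has at most `limit` cells."""
    return rows * cols <= limit


# Hypothetical table shape, for illustration only.
rows, cols = 60_000_000, 40       # 2.4 billion cells
fits_original = fits(rows, cols, ORIGINAL_LIMIT)  # False: exceeds 2^31-1
fits_fork = fits(rows, cols, FORK_LIMIT)          # True: far below (2^31-1)^2
```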