Design and Implementation of a privacy-preserving framework for Machine Learning

Contributors: Giovanni Camarda, Marco Mellia, Martino Trevisan, Nikhil Jha
Year: 2021
Venue:
Abstract:
During the last decade, a myriad of new technologies has changed the way society perceives everyday life, embodying the peculiarities of the Big Data era. Almost every technological scenario produces an enormous amount of data, from disparate physical sources and at very different generation rates, creating an interconnected and interdependent network of people and data. For this reason, data has become a strategic asset for companies and organizations: it drives business, tailors user-specific services, and secures a stronger position in data markets. More and more companies collect and process customers’ personal data, requiring it in exchange for services and forcing users to accept a transaction in which power is unbalanced. To tackle this situation, regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) took effect in 2018 and 2020, enforcing data protection in the European Union and the State of California, respectively. Their primary goal is to support the free flow of data, build conditions of trust, and rebalance power in the relationship between companies and customers.
In this context, legal frameworks are necessary but not sufficient, since the absence of an international standard for technically implementing data protection in data processing activities is a serious obstacle for companies. The European project PIMCity aims to narrow the gap between regulations and practical privacy-preserving solutions by providing a modular framework with which companies can implement ad hoc instruments. The PIMCity project provides several interoperating components: the Personal Data Safe (P-DS), which stores data from various sources; the Personal Privacy-Preserving Analytics (P-PPA), which makes it possible to extract information while preserving privacy; the Personal Consent Manager (P-CM), which models the notion of user consent; and the Personal Privacy Metrics (P-PM), which enhance users’ awareness of their shared data.
This thesis presents a generic, fully fledged P-PPA module whose input data can be in either structured or unstructured format. The project pipeline was developed in Python, exposes a REST API for interaction, and exploits privacy properties such as k-anonymity and differential privacy. This module is used as the starting point to define a Machine Learning framework that analyzes the amount of information that can be gathered from anonymized data. We further propose a deeper inquiry into the correlation between increasingly strict privacy constraints and the residual information level in the output of Machine Learning algorithms.
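The two privacy properties named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's P-PPA implementation: the function names (`is_k_anonymous`, `dp_count`), the toy records, and the choice of the Laplace mechanism for a counting query are all assumptions made for illustration only.

```python
import random
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that appears in `records` is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two
    independent Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical, already-generalized toy data (not from the thesis).
records = [
    {"age": "30-39", "zip": "101**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "101**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "102**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "102**", "diagnosis": "flu"},
]

rng = random.Random(0)
print(is_k_anonymous(records, ["age", "zip"], k=2))  # True
print(is_k_anonymous(records, ["age", "zip"], k=3))  # False
print(dp_count(records, lambda r: r["diagnosis"] == "flu", epsilon=1.0, rng=rng))
```

A smaller `epsilon` means a larger noise scale, so stronger privacy comes at the cost of less accurate answers. This is the kind of trade-off between privacy constraints and residual information that the abstract sets out to investigate.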
Repository link: https://webthesis.biblio.polito.it/18105/1/tesi.pdf