Machine Learning for automatic assessment of the risk related to web tracking

  • Contributors: Marzia Maffei
  • Year: 2020
  • Venue: Master Thesis
  • Abstract:

    This work aims to understand today’s tracking ecosystem and to use machine learning to automatically assess the risk posed by web trackers, assigning each website a risk indicator score. The web is a highly dynamic ecosystem, and each user browses dozens of websites every day, encountering a large number of trackers. Trackers serve different purposes: some help improve the user’s experience on a website, while others are more or less malicious, collecting different kinds and amounts of data to build user profiles. Users are often unaware of their presence. Assigning a risk indicator to websites would make users more aware of the web ecosystem as a whole and would be a first step toward better protection of their data.
    In this thesis, machine learning algorithms are used to classify third-party domains as tracking or non-tracking, based on features extracted from HTTP requests. A risk indicator score is then assigned to each first-party website according to the number of trackers it contacts and the pervasiveness of those trackers. Trackers that appear on many websites and collect large amounts of user data are considered more dangerous to users’ privacy.
    The classification performs well enough to show that machine learning algorithms are a viable option for detecting trackers on the web. Estimating the tracking risk associated with a first-party website is a first step toward a more detailed labelling that should help users become more aware of tracking practices and of how heavily they are used on the websites they wish to visit.
    The results of this work, from both the classification and the risk indicator score assignment, also give a picture of the web itself and of its tracking ecosystem, showing how widespread trackers are, even though they often go unnoticed by users in everyday activities.

  • Repository link:
  • Download: PDF file
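
Illustrative sketch (not taken from the thesis): the abstract says the risk indicator depends on how many trackers a first-party website contacts and on how pervasive those trackers are, but it does not give a formula. The Python below is one possible reading under stated assumptions: pervasiveness is taken to be the fraction of crawled sites on which a tracker appears, and a site's score is the sum of the pervasiveness of its trackers. The function names, the `.example` domains, and the scoring rule are all hypothetical.

```python
from collections import defaultdict


def pervasiveness(site_to_trackers):
    """Fraction of crawled first-party sites on which each tracker appears.

    Assumption: this definition of pervasiveness is illustrative only.
    """
    n_sites = len(site_to_trackers)
    counts = defaultdict(int)
    for trackers in site_to_trackers.values():
        for tracker in set(trackers):  # count each tracker once per site
            counts[tracker] += 1
    return {tracker: count / n_sites for tracker, count in counts.items()}


def risk_score(site, site_to_trackers, perv):
    """Hypothetical score: sum of pervasiveness over the trackers a site contacts.

    More trackers and more widespread trackers both raise the score.
    """
    return sum(perv[tracker] for tracker in set(site_to_trackers[site]))


# Made-up crawl: first-party site -> third-party domains already labelled as trackers.
crawl = {
    "news.example": ["ads.example", "metrics.example"],
    "shop.example": ["ads.example"],
    "blog.example": [],
    "wiki.example": ["ads.example", "metrics.example"],
}

perv = pervasiveness(crawl)
scores = {site: risk_score(site, crawl, perv) for site in crawl}
```

In this sketch the tracker labels are assumed to come from the classification step; a site contacting no trackers scores zero, and the amount of data each tracker collects, which the abstract also mentions, is not modelled.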