Stop tracking me Bro! Differential Tracking of User Demographics on Hyper-Partisan Websites


Abstract

Websites with hyper-partisan, left or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it is still unknown how it associates with the online tracking experienced by browsing users, especially when they exhibit certain demographic characteristics. For example, it is unclear how such websites enable the ad-ecosystem to track users based on their gender or age. In this paper, we take a first step to shed light and measure such potential differences in tracking imposed on users when visiting specific party-line's websites. For this, we design and deploy a methodology to systematically probe such websites and measure differences in user tracking. This methodology allows us to create user personas with specific attributes like gender and age and automate their browsing behavior in a consistent and repeatable manner. Thus, we systematically study how personas are being tracked by these websites and their third parties, especially if they exhibit particular demographic properties. Overall, we test 9 personas on 556 hyper-partisan websites and find that right-leaning websites tend to track users more intensely than left-leaning, depending on user demographics, using both cookies and cookie synchronization methods and leading to more costly delivered ads.

Dataset and Code

Anonymized versions of the dataset and code used in our paper are available to the research community.

  1. Methodology Dataset: This includes the lists of websites used in our methodology: (a) websites from Alexa.com (FileUrlLabels.csv), (b) the hyper-partisan websites (HPWs) list by BuzzFeed (FileHPWsAlexaRankCookies.csv), (c) the Disconnect list (FileServicesDisconnectP.csv) and (d) the WhoTracks.me list (FileWhoTracksMeDomains.csv). For more details, see the GitHub page.

  2. Personas (trained) Dataset: Stable personas (.tar) trained for different demographics by consecutively visiting top-ranked Alexa websites (listed in FileUrlLabels.csv). We include single-feature personas such as Men, Women, Senior and Youth.

  3. Cookies and HTTP Logs Dataset: This is an extraction in CSV format that stores the cookies and HTTP logs recorded when personas are loaded on the various HPWs (FileHPWsAlexaRankCookies.csv).

  4. Code: The code and additional information about the above-mentioned files are available on GitHub.

You can find the format of the dataset here.
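
Once the files have been downloaded, a first look at them might go as in the sketch below. It is only an illustration, not part of the released code: the helper names summarize_csv and list_persona_archive are hypothetical, the local path is a placeholder, and the sketch assumes that each CSV file starts with a header row, which should be checked against the dataset format description.

```python
import csv
import tarfile
from pathlib import Path


def summarize_csv(path):
    """Return (header, row_count) for a CSV file.

    Assumption: the first row is a header. No column names are assumed.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file is empty
        row_count = sum(1 for _ in reader)
    return header, row_count


def list_persona_archive(path):
    """Return the member names stored in a persona .tar archive."""
    with tarfile.open(path) as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Placeholder path: adjust to wherever the download was saved.
    csv_path = Path("FileHPWsAlexaRankCookies.csv")
    if csv_path.exists():
        header, rows = summarize_csv(csv_path)
        print(f"{csv_path.name}: {rows} rows, columns = {header}")
```

Printing the header first avoids guessing at column names before consulting the format description, and listing the contents of a persona archive before extracting it shows what it will write to disk.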


Contact Us


If you are interested in using this data, please fill in the form to request specific data and get the link where you can download it.

We are sharing the dataset under the terms and conditions specified below. Please note that submitting the form indicates that you accept these terms and conditions. In the form, please indicate which part of the dataset you need. If you do not receive an email notification for your logged request within 24 hours, please e-mail us at netsys.noreply[at]gmail.com.

Dataset Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive or otherwise re-identify anonymized information.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our datasets, please cite our paper as follows.


@inproceedings{agarwal2020stop,
  title={Stop Tracking Me Bro! Differential Tracking Of User Demographics On Hyper-partisan Websites},
  author={Agarwal, Pushkal and Joglekar, Sagar and Papadopoulos, Panagiotis and Sastry, Nishanth and Kourtellis, Nicolas},
  booktitle={Proceedings of the 2020 world wide web conference},
  year={2020}
}