Differential Tracking Across Topical Webpages of Indian News Media


Abstract

Online user privacy and tracking have been extensively studied in recent years, especially due to privacy and personal data-related legislation in the EU and the USA, such as the General Data Protection Regulation, ePrivacy Regulation, and California Consumer Privacy Act. Research has revealed novel tracking and personally identifiable information leakage methods that first and third parties employ on websites around the world, as well as the intensity of tracking performed on such websites. However, for the sake of scaling to cover a large portion of the Web, most past studies focused on homepages of websites, and did not look deeper into the tracking practices on their topical subpages. The majority of studies also focused on Global North markets such as the EU and the USA. Large markets such as India, which covers 20% of the world population and has no explicit privacy laws, have not been studied in this regard. We aim to address these gaps and focus on the following research questions: Is tracking on topical subpages of Indian news websites different from their homepage? Do third-party trackers prefer to track specific topics? How does this preference compare to the similarity of content shown on these topical subpages? To answer these questions, we propose a novel method for automatic extraction and categorization of Indian news topical subpages based on the details in their URLs. We study the identified topical subpages and compare them with their homepages with respect to the intensity of cookie injection and third-party embeddedness and type. We find differential user tracking among subpages, and between subpages and homepages. We also find a preferential attachment of third-party trackers to specific topics. In addition, embedded third parties tend to track specific subpages simultaneously, revealing possible user profiling in action.
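
As a rough illustration of what URL-based categorization of topical subpages could look like, the sketch below labels a URL by matching its path segments against a small keyword table. It is not the method from the paper: the `TOPIC_KEYWORDS` table, the `topic_from_url` helper, and the example domain are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical topic keywords; the label set used in the paper is not
# reproduced here, so treat these entries as placeholders.
TOPIC_KEYWORDS = {
    "sports": {"sports", "sport", "cricket"},
    "business": {"business", "economy", "markets"},
    "entertainment": {"entertainment", "bollywood", "movies"},
}


def topic_from_url(url):
    """Return a topic label for a URL.

    Returns "homepage" when the URL has no path segments, the first
    matching topic otherwise, and None when no segment matches.
    """
    path = urlparse(url).path
    # Split the path into lowercase, non-empty segments.
    segments = [s.lower() for s in path.split("/") if s]
    if not segments:
        return "homepage"
    # Scan segments left to right so the outermost section wins.
    for segment in segments:
        for topic, keywords in TOPIC_KEYWORDS.items():
            if segment in keywords:
                return topic
    return None
```

For example, under these assumptions `topic_from_url("https://news.example.in/sports/cricket/")` yields `"sports"`, while a bare domain is treated as the homepage.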

Dataset and Code

An anonymized version of the dataset and code used in our paper is available to the research community.

  1. Methodology Dataset: This includes the various lists of websites used in our methodology, such as the list of websites (including subpages) with topical labels. For more details, see the GitHub page.

  2. Cookies and HTTP Logs Dataset: SQLite dumps of our stateless crawls, collected using OpenWPM.

  3. Code: The code and additional information about the files mentioned above are available on GitHub.
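
A minimal sketch of how the Cookies and HTTP Logs dataset might be inspected once downloaded, assuming the dump is a regular SQLite database file. `summarize_tables` makes no assumption about the schema. `cookies_per_host` assumes a `javascript_cookies` table with a `host` column, as in OpenWPM's default schema; the names in the released dump may differ, so both are parameters.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every table in the database."""
    cur = conn.cursor()
    cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    names = [row[0] for row in cur.fetchall()]
    counts = {}
    for name in names:
        # Names come from sqlite_master itself, so quoting them is enough.
        cur.execute(f'SELECT COUNT(*) FROM "{name}"')
        counts[name] = cur.fetchone()[0]
    return counts


def cookies_per_host(conn, table="javascript_cookies", host_col="host"):
    """Return (host, cookie_count) pairs, most frequent host first.

    The default table and column names are assumptions about the dump.
    """
    query = (
        f'SELECT "{host_col}", COUNT(*) AS n FROM "{table}" '
        f'GROUP BY "{host_col}" ORDER BY n DESC, "{host_col}" ASC'
    )
    return conn.execute(query).fetchall()
```

To use it, open the downloaded file with `sqlite3.connect(path)` and call `summarize_tables(conn)` first to check which tables are actually present before relying on the assumed names.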

The format of the dataset is described here.


Contact Us


If you are interested in using this data, please fill in the form to request specific data and get the link where you can download it.

We are sharing the dataset under the terms and conditions specified below. Please note that submitting the form indicates that you accept these terms and conditions. In the form, please indicate which part of the dataset you need. If you do not receive an email notification for your request within 24 hours, please e-mail us at netsys.noreply[at]gmail.com.

Dataset Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive or otherwise re-identify anonymized information.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our datasets, please cite our papers as follows.

@inproceedings{vekaria2021differential,
  title={Differential Tracking Across Topical Webpages of Indian News Media},
  author={Vekaria, Yash and Agarwal, Vibhor and Agarwal, Pushkal and Mahapatra, Sangeeta and Muthiah, Sakthi Balan and Sastry, Nishanth and Kourtellis, Nicolas},
  booktitle={Proceedings of the 13th ACM Web Science Conference},
  year={2021}
}
@inproceedings{agarwal2021under,
  title={Under the Spotlight: Web Tracking in Indian Partisan News Websites},
  author={Agarwal, Vibhor and Vekaria, Yash and Agarwal, Pushkal and Mahapatra, Sangeeta and Set, Shounak and Muthiah, Sakthi Balan and Sastry, Nishanth and Kourtellis, Nicolas},
  booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
  year={2021}
}