Dataset Format
====================
====================
1. Methodology Dataset: This includes various lists of websites used in the machinery.
-----------
a) Labelled dataset of top Indian News Websites (Indian_News_Websites.csv):
Columns are-
- Website_Name: Name of the Indian news website
- Political_Leaning: Labelled political leaning (LEFT_TO_LEFTCENTRE, RIGHT_TO_RIGHTCENTRE, or CENTRIST_AND_LEASTBIASED). These are referred to as Left, Right, and Centre respectively in this study.
- Website_URL: URL of the news website
- Language: Language of the news website, e.g. English, Hindi, etc.
- MobileSite: Is there a separate mobile website? If yes, this column holds the mobile website URL.
- PrintMedia: Is the news website also available as print media?
- TVMedia: Is the news website also available as TV media?
- OnlineMedia: Is the news website only available as online media?
- FacebookReach: Facebook reach of the website
- TwitterReach: Twitter reach of the website
- InstagramReach: Instagram reach of the website
- AlexaGlobalRank: Global Alexa rank of the website
- AlexaIndiaRank: Indian Alexa rank of the website
- Parent_Company: Parent company of the website
- Website_Registrar: Website registrar
- Registered_On: When was the website first registered?
- Expires_On: When will the website registration expire?
- TrafficContributingSubDomains: Popular sub-domains of the website contributing to the majority of the traffic.
-----------
b) Disconnect List (DisconnectList_Cookie_Categorisation.csv):
We use a list from https://github.com/disconnectme/disconnect-tracking-protection/blob/master/services.json (Jan 2021) to understand the type of cookies set for users. We break down the cookies into six types: Advertising, Analytics, Content & Social, Fingerprinting, Other, and Unknown. Other includes the 'Cryptomining' and 'Disconnect' categories.
Columns are-
- TP Domain: Third-party domain, e.g. doubleclick.net
- Raw Category: Cookie category extracted from the Disconnect List
- Organisation: Organisation that owns this TP domain
- Cookie Category: Cookie category used in our study (6 categories)
- # Left FP: No. of Left news websites having this TP domain
- # Right FP: No. of Right news websites having this TP domain
- # Centre FP: No. of Centre news websites having this TP domain
- # Total FP: Total no. of news websites having this TP domain
-----------
====================
====================
2. Cookies and HTTP Logs Dataset: SQLite data dumps from crawls performed with OpenWPM.
-----------
a) Stateful_crawls.zip : Five stateful crawls of Indian news websites using OpenWPM.
- For more details about the tables in the SQLite database, please refer to https://github.com/mozilla/OpenWPM/blob/master/README.md (Jan 2021).
-----------
b) Stateless_crawls.zip : Five stateless crawls of Indian news websites using OpenWPM.
- For more details about the tables in the SQLite database, please refer to https://github.com/mozilla/OpenWPM/blob/master/README.md (Jan 2021).
====================
====================
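
Example: reading the files
-----------
The files described above can be inspected with the Python standard library alone. The following is a minimal sketch, not part of the dataset itself. It assumes that the CSV header names match the column names listed in this document exactly, and that each crawl extracted from the zip archives is an ordinary SQLite file. The function names are illustrative only.

```python
import csv
import sqlite3
from collections import Counter


def count_by_leaning(csv_lines):
    """Count news websites per Political_Leaning label.

    `csv_lines` is any iterable of CSV text lines, for example an open
    handle to Indian_News_Websites.csv (header names assumed as documented).
    """
    reader = csv.DictReader(csv_lines)
    return Counter(row["Political_Leaning"].strip() for row in reader)


def load_cookie_categories(csv_lines):
    """Map each third-party domain to the six-way cookie category.

    Expects the 'TP Domain' and 'Cookie Category' columns of
    DisconnectList_Cookie_Categorisation.csv (header names assumed).
    """
    reader = csv.DictReader(csv_lines)
    return {
        row["TP Domain"].strip(): row["Cookie Category"].strip()
        for row in reader
    }


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection.

    Table names are discovered at run time instead of being hard-coded,
    because the OpenWPM schema can differ between versions.
    """
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

To use it, open a CSV with `open(path, newline="", encoding="utf-8")` and pass the handle to `count_by_leaning` or `load_cookie_categories`. For a crawl, pass `sqlite3.connect(path_to_extracted_file)` to `list_tables`. The path of the SQLite file inside each archive is not stated in this document, so it has to be taken from the extracted contents.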