Format
====================
====================
1. Methodology: This includes code to fetch website lists, set up the OpenWPM tool, and some shell scripts to orchestrate the machinery for training personas.
-----------
a) Websites from Alexa.com (FileUrlLabels.csv): Columns are-
- Category: Top-level category such as 'Generations_and_Age_Groups'.
- Type: Sub-category such as 'Youth', which is used as a training parameter for personas.
- Url: Website for a given category and type.
-----------
b) Hyper-partisan websites (HPWs) (FileHPWsAlexaRankCookies.csv): Columns are-
- visitID: Sequence in which the HPWs are visited during the crawl.
- site: Name (URL) of the HPW.
- rankGlobal: Alexa.com rank of the HPW in 2019.
- Cookies: Number of cookies set when no history is loaded.
- party: Label of the HPW's party, right or left.
-----------
c) Disconnect List (FileServicesDisconnectP.csv): We use a list from https://github.com/disconnectme/disconnect-tracking-protection/blob/master/services.json (Feb 2019) to understand the type of cookies set for users. We break the cookies down into six types: first-party, advertising, analytics, content, social, and other. Columns are-
- type: Type of domain, such as 'Advertising'.
- topDomain: URL of top domain, e.g. http://www.google.com/
- topURL: Name of top domain, e.g. Google
- url: Raw host URL, e.g. doubleclick.net
-----------
d) WhoTracks.me List (FileWhoTracksMeDomains.csv): Columns are-
- host_tld: URL of raw host domain.
- others: As defined by https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data/assets
-----------
====================
====================
2. Machinery
-----------
a) Build persona: The trained personas (.tar), which can be downloaded directly and loaded in OpenWPM.
b) Crawl logs: Cookies.zip and httpRequests
- visit_id: ID of the HPW, as described above in 1.b (visitID).
- docUrl: URL of raw host domain.
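
The cookie breakdown described in 1.c (first-party, advertising, analytics, content, social, other) could be reproduced roughly as in the sketch below. It is not the authors' code. It assumes that FileServicesDisconnectP.csv has a header row with the column names listed in 1.c, that a cookie counts as first-party when its host is the visited site or one of its subdomains, and that a host matches a Disconnect entry when it equals the entry's url or is a subdomain of it. The function names load_disconnect and classify_cookie are hypothetical, and the in-memory sample row is built only from the example values given in this README.

```python
import csv
import io

# Third-party types named in this README; anything else falls back to "other".
KNOWN_TYPES = {"advertising", "analytics", "content", "social"}


def load_disconnect(csv_file):
    """Map each raw host url (e.g. 'doubleclick.net') to its lowercase type."""
    reader = csv.DictReader(csv_file)
    return {row["url"].strip().lower(): row["type"].strip().lower()
            for row in reader}


def classify_cookie(cookie_host, site, host_types):
    """Return one of the six cookie types for a cookie seen while visiting `site`."""
    host = cookie_host.lower().lstrip(".")  # cookie domains often start with "."
    site = site.lower()
    # Assumption: first-party means the host is the site or one of its subdomains.
    if host == site or host.endswith("." + site):
        return "first-party"
    # Try the full host, then each parent domain (never the bare TLD).
    labels = host.split(".")
    for i in range(len(labels) - 1):
        kind = host_types.get(".".join(labels[i:]))
        if kind is not None:
            return kind if kind in KNOWN_TYPES else "other"
    return "other"


# Stand-in for open("FileServicesDisconnectP.csv", newline=""), using the
# example values from section 1.c.
SAMPLE = (
    "type,topDomain,topURL,url\n"
    "Advertising,http://www.google.com/,Google,doubleclick.net\n"
)
host_types = load_disconnect(io.StringIO(SAMPLE))
print(classify_cookie(".ad.doubleclick.net", "example.com", host_types))
```

To run this on the real data, pass an open file handle for FileServicesDisconnectP.csv to load_disconnect in place of the in-memory sample. The site argument here is a placeholder (example.com); in practice it would come from the site column in 1.b.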