Format

====================
====================

1. Methodology: This includes code to fetch the website lists, set up the OpenWPM tool, and shell scripts that orchestrate the machinery for training personas.
-----------
a) Websites from Alexa.com (FileUrlLabels.csv): Columns are:
- Category: Top-level category, such as 'Generations_and_Age_Groups'.
- Type: Sub-category, such as 'Youth', which is used as a training parameter for personas.
- Url: Website for a given category and type.
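
As an illustration of how these columns might be consumed, the sketch below groups URLs by Type, so that each sub-category yields the list of sites for one persona. It assumes the file is comma-separated with a header row naming the columns exactly Category, Type and Url; the sample rows and the helper name urls_by_type are made up for the example.

```python
import csv
import io
from collections import defaultdict


def urls_by_type(csv_file):
    """Map each Type (persona training parameter) to its list of URLs."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Type"]].append(row["Url"])
    return dict(groups)


# Made-up sample rows. With the real file one would instead pass
# open("FileUrlLabels.csv", newline=""), assuming the header shown here.
sample = io.StringIO(
    "Category,Type,Url\n"
    "Generations_and_Age_Groups,Youth,http://example.com/a\n"
    "Generations_and_Age_Groups,Youth,http://example.org/b\n"
    "Generations_and_Age_Groups,Seniors,http://example.net/c\n"
)

personas = urls_by_type(sample)
for persona, urls in personas.items():
    print(persona, len(urls))
```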

-----------
b) Hyper-partisan websites (HPWs) (FileHPWsAlexaRankCookies.csv): Columns are:
- visitID: Sequence in which the HPWs are visited during the crawl.
- site: Name (URL) of the HPW.
- rankGlobal: Alexa.com rank of the HPW in 2019.
- Cookies: Number of cookies set when no history is loaded.
- party: Party label of the HPW (left or right).
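
One way to summarize this file is to average the baseline cookie count per party. The following is only a sketch: it assumes a comma-separated file with a header row using the column names listed above and integer values in Cookies. The sample rows and the helper name mean_cookies_by_party are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_cookies_by_party(csv_file):
    """Average the no-history cookie count over the HPWs of each party."""
    counts = defaultdict(list)
    for row in csv.DictReader(csv_file):
        counts[row["party"]].append(int(row["Cookies"]))
    return {party: mean(values) for party, values in counts.items()}


# Invented rows for illustration only; real values live in
# FileHPWsAlexaRankCookies.csv.
sample = io.StringIO(
    "visitID,site,rankGlobal,Cookies,party\n"
    "1,left-one.example,1000,10,left\n"
    "2,left-two.example,2000,20,left\n"
    "3,right-one.example,3000,30,right\n"
)

averages = mean_cookies_by_party(sample)
print(averages)
```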

-----------
c) Disconnect List (FileServicesDisconnectP.csv): We use a list from https://github.com/disconnectme/disconnect-tracking-protection/blob/master/services.json (Feb 2019) to understand the type of cookies set for users. We break the cookies down into six types: first-party, advertising, analytics, content, social, and other.
- type: Type of domain, such as 'Advertising'.
- topDomain: URL of the top domain, e.g. http://www.google.com/
- topURL: Name of the top domain, e.g. Google
- url: Raw host URL, e.g. doubleclick.net
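
The six-way breakdown can be sketched as a lookup against this list. The code below is an assumption about how such a classification could work, not the procedure actually used for the dataset: a cookie is treated as first-party when its host and the visited site's host share a suffix (a naive check that does not consult the public suffix list), otherwise its host and parent domains are looked up in the url column, and anything unmatched falls into "other". The function names and the single sample row (built from the examples above) are illustrative.

```python
import csv
import io

# The five non-first-party labels named in the description above.
TRACKER_TYPES = {"advertising", "analytics", "content", "social"}


def load_tracker_types(csv_file):
    """Map each raw host (url column) to its lower-cased type."""
    return {row["url"].lower(): row["type"].lower()
            for row in csv.DictReader(csv_file)}


def classify_cookie(cookie_host, site_host, tracker_types):
    """Return one of: first-party, advertising, analytics, content, social, other."""
    cookie_host = cookie_host.lower().lstrip(".")
    site_host = site_host.lower()
    # Naive first-party test: same host, or one is a subdomain of the other.
    if (cookie_host == site_host
            or cookie_host.endswith("." + site_host)
            or site_host.endswith("." + cookie_host)):
        return "first-party"
    # Try the full host, then each parent domain (a.b.c -> b.c).
    labels = cookie_host.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in tracker_types:
            found = tracker_types[candidate]
            return found if found in TRACKER_TYPES else "other"
    return "other"


# One sample row assembled from the column examples above; the column
# order and header names are assumed.
sample = io.StringIO(
    "type,topDomain,topURL,url\n"
    "Advertising,http://www.google.com/,Google,doubleclick.net\n"
)
types = load_tracker_types(sample)
print(classify_cookie("ad.doubleclick.net", "news.example.com", types))
```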


-----------
d) WhoTracks.me List (FileWhoTracksMeDomains.csv): Columns are:
- host_tld: URL of the raw host domain.
- others: As defined by https://github.com/cliqz-oss/whotracks.me/tree/master/whotracksme/data/assets
-----------

====================
====================

2. Machinery 

-----------
a) Build persona: The trained personas (.tar), which can be downloaded and loaded directly into OpenWPM.

b) Crawl logs: Cookies.zip and httpRequests. Columns include:
- visit_id: ID of the HPW, as described in 1.b above.
- docUrl: URL of the raw host domain.
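
Because visit_id refers back to the HPW list in 1.b, the crawl logs can be joined to that list. The sketch below counts log rows per HPW site. It assumes both inputs are available as comma-separated text with header rows, that visit_id in the logs matches visitID in FileHPWsAlexaRankCookies.csv, and that the archives have already been extracted; the actual layout of Cookies.zip and httpRequests may differ. All sample rows and the helper name rows_per_site are invented.

```python
import csv
import io
from collections import Counter


def rows_per_site(hpw_file, log_file):
    """Count crawl-log rows for each HPW by joining visit_id to visitID."""
    site_by_visit = {row["visitID"]: row["site"]
                     for row in csv.DictReader(hpw_file)}
    counts = Counter()
    for row in csv.DictReader(log_file):
        site = site_by_visit.get(row["visit_id"])
        if site is not None:  # skip rows whose visit is not in the HPW list
            counts[site] += 1
    return counts


# Invented rows for illustration only.
hpws = io.StringIO(
    "visitID,site,rankGlobal,Cookies,party\n"
    "1,left-one.example,1000,10,left\n"
    "2,right-one.example,3000,30,right\n"
)
log = io.StringIO(
    "visit_id,docUrl\n"
    "1,doubleclick.net\n"
    "1,tracker.example\n"
    "2,doubleclick.net\n"
    "9,orphan.example\n"
)

counts = rows_per_site(hpws, log)
print(counts.most_common())
```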