Dataset Format
====================
====================
1. Methodology Dataset: This includes various lists of websites used in the machinery.
-----------
a) subpage_url_labels.csv: This dataset maps topical subpage URLs, obtained from the homepages of 112 Indian news websites, to the topical label assigned by DiBETS.
Columns:
- link_no: Unique serial number assigned to each subpage URL.
- website_no: All subpage URLs obtained from a single website's homepage are assigned the same index.
- homepage_url: The homepage URL of the website from which the associated subpage URLs are extracted.
- subpage_url: The single-best subpage URL as output by our model, DiBETS.
- topical_label: The topic predicted for each subpage URL by our model, DiBETS.
-----------
====================
====================
2. Cookies and HTTP Logs Dataset: SQLite data dumps crawled using OpenWPM.
-----------
a) openWPM_crawls.zip: Five stateless crawls of Indian news websites using OpenWPM.
- For more details about the tables in the SQLite database, please refer to https://github.com/mozilla/OpenWPM/blob/master/README.md (Jan 2021).
====================
====================
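
Example: The two files described above could be read roughly as sketched below. This is a minimal sketch under assumptions, not part of the released tooling: it assumes subpage_url_labels.csv is comma-separated, UTF-8 encoded, and has a header row with the column names listed in section 1, and that openWPM_crawls.zip unpacks to ordinary SQLite files. The function names load_subpages and list_tables are hypothetical. Table names are read from the database itself rather than assumed.

```python
import csv
import sqlite3
from collections import defaultdict


def load_subpages(csv_path):
    """Group (subpage_url, topical_label) pairs by homepage_url.

    Assumes a header row containing the column names from section 1.
    """
    by_homepage = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            by_homepage[row["homepage_url"]].append(
                (row["subpage_url"], row["topical_label"])
            )
    return dict(by_homepage)


def list_tables(sqlite_path):
    """Return the names of all tables in a SQLite file, sorted by name."""
    conn = sqlite3.connect(sqlite_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling load_subpages("subpage_url_labels.csv") returns a dictionary keyed by homepage URL. Calling list_tables on one of the extracted crawl databases (the file name inside the archive is not stated here, so substitute the actual path) shows which tables are available before consulting the OpenWPM README for their meaning.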