Dataset Format
====================
====================
1. Annotation Dataset:
-----------
(a) Embeddings (no text and no user data) with spam and ham labels.
-----------
- hamEmbeddings.zip: Size: 3148 rows, 768 features
- spamEmbeddings.zip: Size: 3528 rows, 768 features
-----------
(b) Spam words dictionary (spamWordsDict.zip) in multiple languages. Note: some of these words may be disturbing.
-----------
- language: Language of the message
- word: Spam word in English or an Indian language
- translation: English translation
-----------
(c) List of top spam URLs (spamURLsSet.zip), i.e. URLs posted more than 100 times in spam messages. The top 8 spam URLs were each posted more than 1000 times.
-----------
- url: URL present in spam messages
- subDomain: Extracted from the raw URL
- parentDomain: Top domain
- org: Host
====================
====================
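
Example: how the url, subDomain and parentDomain fields relate
-----------
The sketch below is only an illustration of how a raw URL could be broken into fields like those in spamURLsSet.zip. It is not the procedure used to build the dataset. The function name split_url is hypothetical, and two assumptions are made: subDomain is taken to be the full hostname, and parentDomain is taken to be the last two labels of that hostname. The second assumption is naive and gives wrong answers for multi-part suffixes such as co.in.

```python
from urllib.parse import urlsplit


def split_url(raw_url: str) -> dict:
    """Naively derive subDomain / parentDomain style fields from a raw URL.

    Hypothetical helper: the field semantics are assumptions, not the
    dataset authors' documented method.
    """
    # urlsplit only fills in the hostname when a scheme or a leading "//"
    # is present, so add "//" for bare inputs such as "example.com/path".
    to_parse = raw_url if "//" in raw_url else "//" + raw_url
    host = (urlsplit(to_parse).hostname or "").lower()

    labels = host.split(".") if host else []
    # Assumption: the parent domain is the last two labels of the hostname.
    parent = ".".join(labels[-2:])

    return {"url": raw_url, "subDomain": host, "parentDomain": parent}


if __name__ == "__main__":
    # Illustrative URL only; it does not come from the dataset.
    print(split_url("https://news.example.com/some/path"))
```

A public-suffix-aware parser would be a better choice for real analysis. The two-label rule is used here only to keep the sketch within the standard library.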