Dataset Format

====================
====================

1. Annotation Dataset:
-----------
(a) Embeddings (no text and no user data) with spam and ham labels.
-----------
- hamEmbeddings.zip: Size: 3148 rows, 768 features
- spamEmbeddings.zip: Size: 3528 rows, 768 features
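
To train a classifier, the two embedding sets need to be combined into one labeled set. The sketch below shows one way to do that in plain Python. The function name build_labeled_set, the 0/1 label encoding, and the in-memory row format are assumptions for illustration. The file format inside each zip archive is not described here, so the loading step is left out.

```python
from typing import List, Sequence, Tuple

EMBEDDING_DIM = 768  # feature count stated for both archives


def build_labeled_set(
    ham_rows: Sequence[Sequence[float]],
    spam_rows: Sequence[Sequence[float]],
    dim: int = EMBEDDING_DIM,
) -> Tuple[List[List[float]], List[int]]:
    """Combine ham and spam rows into one feature list with labels.

    Assumption: ham is labeled 0 and spam is labeled 1.
    """
    features: List[List[float]] = []
    labels: List[int] = []
    for label, rows in ((0, ham_rows), (1, spam_rows)):
        for row in rows:
            # Reject rows that do not have the expected number of features.
            if len(row) != dim:
                raise ValueError(f"expected {dim} features, got {len(row)}")
            features.append(list(row))
            labels.append(label)
    return features, labels


# Toy stand-ins for the real data (3148 ham rows, 3528 spam rows).
toy_ham = [[0.0] * EMBEDDING_DIM for _ in range(3)]
toy_spam = [[1.0] * EMBEDDING_DIM for _ in range(2)]
X, y = build_labeled_set(toy_ham, toy_spam)
```

With the full archives loaded, the combined set would hold 3148 + 3528 = 6676 rows.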

-----------
(b) Spam words dictionary (spamWordsDict.zip) in multiple languages. Note that some of these words can be really disturbing!
-----------
- language: language of message
- word: spam word in English and Indian languages
- translation: English translation of the word
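
The three fields above are enough to build a per-language lookup. The following is a minimal sketch, assuming each record has already been read into a dict keyed by the field names. The sample rows, the language codes ("en", "hi"), and the helper names index_by_language and find_spam_words are hypothetical and not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def index_by_language(rows: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Map language -> {spam word (case-folded) -> English translation}."""
    index: Dict[str, Dict[str, str]] = defaultdict(dict)
    for row in rows:
        index[row["language"]][row["word"].casefold()] = row["translation"]
    return dict(index)


def find_spam_words(
    message: str, language: str, index: Dict[str, Dict[str, str]]
) -> List[Tuple[str, str]]:
    """Return (word, translation) pairs for whitespace-separated tokens
    of the message that appear in the dictionary for that language."""
    vocab = index.get(language, {})
    return [(tok, vocab[tok]) for tok in message.casefold().split() if tok in vocab]


# Hypothetical sample rows; the real file holds the actual spam words.
sample_rows = [
    {"language": "en", "word": "lottery", "translation": "lottery"},
    {"language": "hi", "word": "मुफ्त", "translation": "free"},
]
spam_index = index_by_language(sample_rows)
hits = find_spam_words("Win a LOTTERY today", "en", spam_index)
```

Splitting on whitespace is a simplification. Punctuation attached to a word, or multi-word entries, would need a more careful tokenizer.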

-----------
(c) List of top spam URLs (each posted more than 100 times) found in spam messages (spamURLsSet.zip). The top 8 spam URLs were each posted more than 1,000 times!
-----------
- url: URL which is present in spam messages
- subDomain: Extracted from the raw URL
- parentDomain: Top domain
- org: Host
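
To illustrate how the url, subDomain, and parentDomain fields can relate to each other, here is a rough sketch using only the Python standard library. It assumes that subDomain is the full host name and that parentDomain is the last two labels of that host. Both are guesses, and the extraction method actually used for this dataset is not documented here. The helper name split_host and the example URL are hypothetical.

```python
from typing import Dict
from urllib.parse import urlsplit


def split_host(url: str) -> Dict[str, str]:
    """Derive host-level fields from a raw URL.

    Assumption: parentDomain is the last two dot-separated labels.
    This is naive and mishandles suffixes such as .co.in.
    """
    # Prefix scheme-less URLs with "//" so urlsplit parses the host.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    labels = host.split(".")
    parent = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {"url": url, "subDomain": host, "parentDomain": parent}


record = split_host("http://promo.example.com/win?id=1")
```

A public-suffix-aware parser would be needed for reliable results on multi-part suffixes. The org field cannot be derived from the URL string alone, so it is not covered.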
====================
====================