Smart Spam Filter 1.0
A spam filter using Machine Learning.
spam_filter.py File Reference

Model building code.

Functions

def spam_filter.make_Dictionary (emails)
 Method to create the dictionary.

def spam_filter.extract_features (files)
 Method to extract features from all mails.

def spam_filter.mail_features (mail)
 Method to find features of a single mail.

def spam_filter.preprocessor (mail)
 Method to pre-process the mails.

def spam_filter.find_payload (mail_body, all_words)
 Method to recursively find single part payloads.

def spam_filter.split_payload (payload, all_words)
 Method to split large payloads into smaller chunks.

def spam_filter.get_words_plain (content, all_words)
 Method to get words out of plain text content.

def spam_filter.get_words_html (content, all_words)
 Method to get words out of HTML content.
 

Variables

 spam_filter.nlp = spacy.load("en_core_web_sm")
 
 spam_filter.stopWords = spacy.lang.en.stop_words.STOP_WORDS
 
 spam_filter.ham_dir = sys.argv[1]
 
list spam_filter.ham_mails = [os.path.join(ham_dir, f) for f in os.listdir(ham_dir)]
 
 spam_filter.ham_size = len(ham_mails)
 
 spam_filter.spam_dir = sys.argv[2]
 
list spam_filter.spam_mails = [os.path.join(spam_dir, f) for f in os.listdir(spam_dir)]
 
 spam_filter.spam_size = len(spam_mails)
 
list spam_filter.all_mails = ham_mails + spam_mails
 
 spam_filter.all_size = len(all_mails)
 
int spam_filter.dic_size = 3000
 
 spam_filter.dictionary = make_Dictionary(all_mails)
 
 spam_filter.mail_labels = np.zeros(all_size)
 
 spam_filter.mail_feature_matrix = extract_features(all_mails)
 
 spam_filter.ML_model = MultinomialNB()
 

Detailed Description

Model building code.

This code builds and trains a new machine learning model.

Author
Sudhanshu Dubey
Version
1.1
Date
9/7/2019
Parameters
ham_dir: Directory containing ham mails for training
spam_dir: Directory containing spam mails for training
Bug:
No known bugs
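
Example
The two parameters arrive as command-line arguments, and the module-level variables listed above are derived from them in order. The following is a minimal sketch of that chain using only the standard library. The helper names collect_mails and build_labels are hypothetical, a plain list stands in for np.zeros(all_size), and labelling spam as 1 after the ham block is an assumption that this reference does not state.

```python
import os


def collect_mails(mail_dir):
    """Return the full path of every entry in mail_dir (hypothetical helper).

    The listing is sorted here only to make the sketch deterministic.
    """
    return [os.path.join(mail_dir, f) for f in sorted(os.listdir(mail_dir))]


def build_labels(ham_size, spam_size):
    """Return one label per mail: 0 for ham, 1 for spam (assumed convention)."""
    labels = [0] * (ham_size + spam_size)  # stands in for np.zeros(all_size)
    for i in range(ham_size, ham_size + spam_size):
        labels[i] = 1  # assumption: spam mails follow ham mails in all_mails
    return labels
```

With ham_dir and spam_dir taken from sys.argv[1] and sys.argv[2], all_mails would be collect_mails(ham_dir) + collect_mails(spam_dir), and the labels would line up with that order.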

Function Documentation

◆ extract_features()

def spam_filter.extract_features(files)

Method to extract features from all mails.

Parameters
files: Paths of the mails to extract features from
Returns
features_matrix: A NumPy array containing the features of all mails
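
Example
A feature matrix of this kind has one row per mail and one column per dictionary word. The sketch below shows only that stacking step. It assumes a per-mail function is supplied as the hypothetical featurize argument, and it uses nested lists in place of a NumPy array so that it runs without third-party packages.

```python
def extract_features_sketch(files, featurize, dic_size):
    """Stack one feature row per mail into a len(files) x dic_size matrix."""
    # Pre-allocate zeros, mirroring how an array of zeros would be filled in.
    features_matrix = [[0] * dic_size for _ in files]
    for row_index, path in enumerate(files):
        row = featurize(path)
        if len(row) != dic_size:
            raise ValueError("feature row length must equal dic_size")
        features_matrix[row_index] = list(row)
    return features_matrix


# Toy featurizer for illustration only: not how the module computes features.
matrix = extract_features_sketch(["a", "bb"], lambda path: [len(path), 0, 1], 3)
```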

◆ find_payload()

def spam_filter.find_payload(mail_body, all_words)

Method to recursively find single part payloads.

Parameters
mail_body: The complete mail body
all_words: List of all words in the mail
Returns
Nothing
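
Example
Recursing to single part payloads can be sketched with the standard library email package: a multipart message holds a list of sub-messages, and only the leaves carry text. The function below is an illustration under that assumption, not the module's own code. It appends whitespace-separated words to all_words in place and returns nothing, matching the documented contract.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def find_payload_sketch(mail_body, all_words):
    """Recurse through a parsed message and collect words from each leaf."""
    if mail_body.is_multipart():
        # A multipart payload is a list of sub-messages; descend into each.
        for part in mail_body.get_payload():
            find_payload_sketch(part, all_words)
        return
    # Leaf part: undo the transfer encoding, then decode the character set.
    raw = mail_body.get_payload(decode=True)
    if raw is None:
        return
    charset = mail_body.get_content_charset() or "utf-8"
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:  # unknown charset name in the mail header
        text = raw.decode("utf-8", errors="replace")
    all_words.extend(text.split())


# Build a nested message in memory to exercise the recursion.
outer = MIMEMultipart()
outer.attach(MIMEText("hello plain world", "plain"))
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("nested part", "plain"))
outer.attach(inner)

words = []
find_payload_sketch(outer, words)
```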

◆ get_words_html()

def spam_filter.get_words_html(content, all_words)

Method to get words out of HTML content.

Parameters
content: The HTML content
all_words: List of all words in the mail
Returns
Nothing
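
Example
Getting words out of HTML means discarding markup and keeping the visible text. The sketch below does this with the standard library html.parser module. This reference does not say which parser the module uses, so treat the choice as an assumption, along with the decisions to skip script and style content, lowercase the text, and keep only alphabetic runs.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def get_words_html_sketch(content, all_words):
    """Append the visible words of an HTML string to all_words in place."""
    parser = _TextCollector()
    parser.feed(content)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join(parser.chunks).lower()
    all_words.extend(re.findall(r"[a-z]+", text))
```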

◆ get_words_plain()

def spam_filter.get_words_plain(content, all_words)

Method to get words out of plain text content.

Parameters
content: Plain text content
all_words: List of all words in the mail
Returns
Nothing
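
Example
The module loads a spaCy pipeline (nlp) and spaCy's stop-word set (stopWords), which suggests that plain text is tokenized and stop words are dropped. The sketch below approximates that idea without spaCy: a regular expression stands in for the tokenizer, and SAMPLE_STOP_WORDS is a tiny hypothetical set used only for illustration.

```python
import re

# Tiny illustrative sample; the module takes its stop words from spaCy.
SAMPLE_STOP_WORDS = frozenset({"a", "an", "and", "the", "to", "is", "of"})


def get_words_plain_sketch(content, all_words, stop_words=SAMPLE_STOP_WORDS):
    """Append lowercase alphabetic tokens that are not stop words."""
    for token in re.findall(r"[a-z]+", content.lower()):
        if token not in stop_words:
            all_words.append(token)
```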

◆ mail_features()

def spam_filter.mail_features(mail)

Method to find features of a single mail.

Parameters
mail: The path of the mail file
Returns
features_matrix: The features of a single mail
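
Example
Word counts over a fixed vocabulary are a common input for a multinomial naive Bayes model such as the MultinomialNB instance listed under Variables. The sketch below counts dictionary words in one mail. It departs from the documented signature in two labelled ways: it takes the already extracted word list rather than a file path, and it assumes the dictionary is a plain list of words.

```python
def mail_features_sketch(words, dictionary):
    """Return a count vector with one entry per dictionary word."""
    index_of = {word: i for i, word in enumerate(dictionary)}
    features = [0] * len(dictionary)
    for word in words:
        i = index_of.get(word)
        if i is not None:  # words outside the dictionary are ignored
            features[i] += 1
    return features
```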

◆ make_Dictionary()

def spam_filter.make_Dictionary(emails)

Method to create the dictionary.

Parameters
emails: Paths of the mails to build the dictionary from
Returns
dictionary: The dictionary containing the most common words
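
Example
A dictionary of the most common words can be built with collections.Counter. In the sketch below, the default size of 3000 comes from the dic_size variable listed above. Taking pre-extracted word lists as input and returning (word, count) pairs are assumptions made for illustration.

```python
from collections import Counter


def make_dictionary_sketch(word_lists, dic_size=3000):
    """Return the dic_size most frequent words with their counts."""
    counts = Counter()
    for words in word_lists:
        counts.update(words)
    return counts.most_common(dic_size)
```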

◆ preprocessor()

def spam_filter.preprocessor(mail)

Method to pre-process the mails.

Parameters
mail: The path of the mail file
Returns
all_words: List of all words in the mail
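
Example
Pre-processing starts from a file path and ends with a word list. The sketch below parses the file with the standard library email package and uses Message.walk() to visit every part, keeping only text parts. It is an illustration under those assumptions and omits the spaCy processing that the module's variables imply.

```python
import email
from email import policy


def preprocessor_sketch(mail):
    """Parse the mail file at path `mail` and return its words as a list."""
    with open(mail, "rb") as handle:
        message = email.message_from_binary_file(handle, policy=policy.default)
    all_words = []
    for part in message.walk():
        # Multipart containers and non-text attachments carry no words.
        if part.get_content_maintype() != "text":
            continue
        all_words.extend(part.get_content().split())
    return all_words
```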

◆ split_payload()

def spam_filter.split_payload(payload, all_words)

Method to split large payloads into smaller chunks.

Parameters
payload: The complete payload
all_words: List of all words in the mail
Returns
Nothing
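
Example
This reference does not state the chunk size or the reason for splitting. One plausible motivation is that a spaCy pipeline rejects texts longer than its nlp.max_length setting. The sketch below shows the splitting step only, under that assumption. The max_chars parameter is hypothetical, and the function returns the chunks rather than feeding them into all_words.

```python
def split_text_sketch(payload, max_chars):
    """Split payload into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(payload):
        end = min(start + max_chars, len(payload))
        if end < len(payload):
            # Prefer to cut at the last space in the window so that
            # words are not broken across chunks.
            cut = payload.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(payload[start:end])
        start = end
    return chunks
```

Joining the chunks reproduces the original payload, so each chunk can be processed independently without losing text.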