Date: 24th November 2021
Student: Aditya Tripathi
Branch: Information Technology
Group: IT-71
Guide: Mr. Anjani Kumar
E-MAIL SPAM CLASSIFICATION

.01


Spam e-mails are unsolicited commercial messages, sent indiscriminately to large numbers of addresses by all sorts of groups, but mostly by lazy advertisers and criminals.

Spam Mails

Introduction

Spam e-mails are not only annoying but can also be dangerous to the reader. A typical e-mail containing an image can, without permission, access information such as the user's operating system, device type, and network bandwidth. The spam epidemic has been growing for years: according to recent statistics, about 40 percent of all e-mail is spam, which amounts to roughly 15.4 billion spam e-mails a day and costs internet users about $355 million a year.


Spam floods the internet with many copies of the same message in an attempt to force it on people who would not otherwise choose to receive it. Opening and reading these e-mails can open the door for fraudsters into your system; privacy breaches and phishing are common scams carried out through spam mail. Despite the evolution of anti-spam software such as spam filters and spam blockers, the adverse effects of spam e-mails are still felt by individuals and businesses alike, so a concrete and advanced methodology is needed. The e-mail spam classification model can act as a basic solution: a machine learning model that divides e-mails into a spam class and a non-spam class according to different attribute values of spam.

.02

Abstract

  • The main objective of the project is to provide a concrete base for a methodology of classifying spam e-mails.
  • To classify e-mails as spam or ham.
  • To make users aware of which e-mails are unwanted and which are important.

.03

Objectives

The following methodologies/technologies are used in the project; a brief end-to-end sketch follows the list:

  • Removal of Stopwords
  • Stemming
  • TF-IDF Vectorization
  • Encoding
  • Gaussian Naive Bayes
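A rough sketch of how these steps could fit together with scikit-learn is shown below. The file name emails.csv, the column names 'text' and 'category', and the omission of the stemming step are simplifications for illustration, not the project's actual code.

```python
# Illustrative pipeline only: dataset path and column names are assumed,
# and the stemming step is omitted for brevity (it is covered in its own section).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("emails.csv")                        # hypothetical dataset file

# Stopword removal + TF-IDF vectorization (GaussianNB needs a dense array)
X = TfidfVectorizer(stop_words="english").fit_transform(df["text"]).toarray()
# Encode 'ham'/'spam' labels as 0/1
y = LabelEncoder().fit_transform(df["category"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))                    # accuracy on held-out e-mails
```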

.04

Description of Technology


Removal of StopWords

Words that appear in the English stopword list are removed from the final dataset. Removing stopwords does not negatively affect the subsequent results.
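As a minimal sketch, assuming NLTK's English stopword list is used (the report does not name the library), stopword removal could look like this:

```python
# Minimal stopword-removal sketch; the use of NLTK's stopword list is an assumption.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)          # fetch the stopword list once
STOPWORDS = set(stopwords.words("english"))

def remove_stopwords(text: str) -> str:
    """Drop English stopwords from a message, keeping the remaining words."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(remove_stopwords("This is an example of a spam e-mail offer"))
# -> "example spam e-mail offer"
```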

Description of Technology

Stopwords are English words that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence.

Stopwords

.05

Stemming is the process of reducing a word to its word stem, the root to which suffixes and prefixes attach. In simple terms, stemming reduces a word to its base word/stem so that words of a similar kind fall under a common stem.

Stemming


Stemming

Stemming is performed on the final dataset using the SnowballStemmer algorithm. SnowballStemmer is a stemming algorithm better known as the Porter2 stemming algorithm.
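NLTK ships this stemmer; a small illustrative usage (the words are chosen only as examples):

```python
# Illustrative use of NLTK's SnowballStemmer (the Porter2 algorithm) for English.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

for word in ["running", "advertisements", "classified"]:
    print(word, "->", stemmer.stem(word))   # e.g. "running" -> "run"
```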

Description of Technology

.06

Description of Technology

  • TF: the number of times a word appears in a document divided by the total number of words in the document.
  • IDF: the log of the number of documents divided by the number of documents that contain the word, as in the worked example below.
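A small worked example of these two definitions, using made-up counts:

```python
# Worked TF-IDF example with made-up counts: the word appears 3 times in a
# 100-word e-mail and occurs in 10 of the 1000 e-mails in the corpus.
import math

tf = 3 / 100                 # term frequency = 0.03
idf = math.log(1000 / 10)    # inverse document frequency (natural log) ≈ 4.61
print(tf * idf)              # TF-IDF weight ≈ 0.14
```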

.07

TF-IDF Vectorization

After removing all the noise from the text, another strategy is applied to score the relative importance of words: TF-IDF. This TF-IDF vectorization is performed through an algorithm called TfidfVectorizer.
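A short sketch of scikit-learn's TfidfVectorizer on toy messages (the messages are invented for illustration):

```python
# Toy demonstration of scikit-learn's TfidfVectorizer; the messages are invented.
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "win a free prize now",
    "meeting scheduled for monday",
    "claim your free prize today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)       # sparse matrix: rows = messages, cols = terms

print(vectorizer.get_feature_names_out())    # vocabulary learned from the corpus
print(X.toarray().round(2))                  # TF-IDF weight of each term per message
```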


LabelEncoder

Description of Technology

This algorithm has been used to encode the 'category' value in the final dataset. The LabelEncoder encodes the target labels with values between 0 and n_classes - 1.

.08

Encoding

Before feeding the dataset to the model, some features of the dataset are encoded accordingly. In this project the encoding is performed with an algorithm called LabelEncoder.
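A minimal sketch of LabelEncoder applied to a hypothetical 'category' column:

```python
# Encoding a hypothetical 'category' column with scikit-learn's LabelEncoder.
from sklearn.preprocessing import LabelEncoder

categories = ["ham", "spam", "ham", "spam", "ham"]   # assumed label values

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

print(list(encoder.classes_))   # ['ham', 'spam']  (classes sorted alphabetically)
print(list(encoded))            # [0, 1, 0, 1, 0]
```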


Description of Technology

.09

Gaussian Naive Bayes Classifier

In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to follow a Gaussian distribution, also called the normal distribution. When plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature.
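A minimal training sketch with scikit-learn's GaussianNB; the random arrays below stand in for the real TF-IDF features and encoded labels:

```python
# Minimal GaussianNB sketch; random arrays stand in for TF-IDF features and labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(100, 20)             # stand-in for dense TF-IDF features
y = np.random.randint(0, 2, size=100)   # stand-in for encoded ham/spam labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)
print(model.predict(X_test[:5]))        # predicted classes for five test messages
```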


.10

Streamlit

Streamlit is an open-source app framework for creating and deploying data science applications. Streamlit helps data scientists and machine learning engineers to develop performant applications in a few hours, thereby enabling businesses to create custom applications and interact with the data in their models.
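A hypothetical Streamlit front end for the classifier; the widget labels and the placeholder prediction logic are assumptions, not the project's actual code:

```python
# Hypothetical Streamlit front end; the prediction logic below is a placeholder
# for the trained preprocessing + GaussianNB pipeline.
import streamlit as st

st.title("E-Mail Spam Classification")
message = st.text_area("Paste an e-mail message")

if st.button("Classify"):
    # In the real app this would call the trained model's predict() method.
    label = "spam" if "free prize" in message.lower() else "ham"
    st.write(f"Predicted class: {label}")
```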

Application


Ham Classification

Spam Classification

.11

Application

The Gaussian Naive Bayes model used had:

  • Accuracy score: 0.8771300448430494
  • F-beta score (beta = 0.5): 0.59552358
  • Confusion matrix: [[829, 123], [14, 149]]
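Such metrics can be computed with scikit-learn's metric functions; y_test and y_pred below are toy stand-ins for the model's true and predicted labels:

```python
# How such metrics are typically computed; y_test and y_pred are toy stand-ins.
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score

y_test = [0, 0, 1, 1, 0, 1]   # ground-truth labels (0 = ham, 1 = spam)
y_pred = [0, 0, 1, 0, 0, 1]   # model predictions

print(accuracy_score(y_test, y_pred))           # fraction of correct predictions
print(fbeta_score(y_test, y_pred, beta=0.5))    # F-beta, weighting precision more
print(confusion_matrix(y_test, y_pred))         # rows = true class, cols = predicted
```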

.12

Results


This project reported the use of a supervised ML model to classify e-mails into the spam or ham category. The dataset containing the e-mails and their category first went through pre-processing, in which data editing and cleaning were performed to make it fit for use. Stemming, removal of English stopwords, and vectorization were also performed on the dataset. The dataset was then split into training and testing data and fed to our Gaussian Naïve Bayes model. As for the results, the Gaussian Naïve Bayes model achieved an F1 score of 0.97898776. The results also indicated a strong future for the model, as its performance was commendable without any hyperparameter tuning. It can be concluded that the more training data is available, the better the model is at predicting at test time.

.13

Conclusion

[1] T. M. Mitchell, Machine Learning, first ed., McGraw Hill, 1997.
[2] Python - Remove Stopwords (tutorialspoint.com).
[3] Snowball Stemmer - NLP - GeeksforGeeks.
[4] TF IDF | TFIDF Python Example. An example of how to implement TFIDF… | by Cory Maklin | Towards Data Science.
[5] sklearn.preprocessing.LabelEncoder — scikit-learn 1.0.1 documentation.
[6] Gaussian Naive Bayes: What You Need to Know? | upGrad blog.
[7] Streamlit - Crunchbase Company Profile & Funding.

.14

Bibliography

Thank you for your attention

“It is not possible to prepare a project report without the assistance and encouragement of other people. This one is certainly no exception.”

At the outset of this report, I would like to extend my sincere thanks and heartfelt regards to everyone who helped me complete it. Without their guidance, cooperation and encouragement, I would not have been able to accomplish this report. I am greatly indebted to Mr. Anjani Kumar, Assistant Professor, SRMGPC, for giving me the opportunity to prepare a seminar report on the topic “E-Mail Spam Classification” and for his great help. I also extend my gratitude, with a great sense of reverence, to the whole staff, who gave me their precious time, thoughtful discussion and expert suggestions.

Last but not least, my gratitude goes to all my friends who directly or indirectly helped me complete this report by providing a conducive environment to work in.

Aditya Tripathi

Acknowledgement
