Exploratory Data Analysis (EDA):
Let's count the number of spam and ham messages in our dataset.
df['target'].value_counts()
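`value_counts()` simply tallies each distinct label. The same tally done with the standard library's `Counter`, on made-up labels (1 = spam, 0 = ham, purely illustrative — not the real dataset):

```python
from collections import Counter

# made-up target labels: 0 = ham, 1 = spam (illustrative only)
labels = [0, 0, 1, 0, 1, 0, 0]

# Counter tallies each distinct value, like Series.value_counts()
counts = Counter(labels)
print(counts[0], counts[1])  # → 5 2
```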
Since we're on the second step, i.e. EDA, let's plot some charts to understand the dataset.
Seaborn and Matplotlib are Python libraries used for data visualization.
pip install seaborn matplotlib
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='target', data=df)
plt.title('Count of Classes')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()
Let's see what ham and spam messages contain in terms of the number of words, letters, and sentences.
df[df['target']==1]['text'].describe()
From the output above, it appears that there are 653 unique messages, and the most frequent message is "Please call our customer service representative," which appears 4 times in the dataset.
# the describe() method is used for showing summary statistics of a dataset
df[df['target']==0]['text'].describe()
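For a text column, `describe()` reports exactly four fields: count, unique, top (the most frequent value), and freq (how often it occurs). The sketch below reproduces those four numbers with the standard library, on made-up messages standing in for the ham subset:

```python
from collections import Counter

# made-up messages (purely illustrative, not the real dataset)
messages = ["ok", "see you soon", "ok", "ok", "call me later"]

count = len(messages)                             # describe()'s 'count'
unique = len(set(messages))                       # describe()'s 'unique'
top, freq = Counter(messages).most_common(1)[0]   # 'top' and 'freq'
print(count, unique, top, freq)  # → 5 3 ok 3
```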
Now, let's compare the length of the messages in each class with the help of a histogram.
# get the length (character count) of each message for spam and ham
spam = df[df['target']==1]['text'].str.len()
ham = df[df['target']==0]['text'].str.len()

# Plot the histograms
plt.figure(figsize=(12, 6))
sns.histplot(ham)
sns.histplot(spam, color='red')
plt.show()
From the chart above, it's evident that the ham messages are generally longer than the spam messages.
Now, let's create new columns that contain the number of words, sentences, and letters. We'll use the NLTK library to break the messages into these different units.
NLTK stands for Natural Language Toolkit, and it's a popular Python library used for natural language processing tasks, such as tokenization, stemming, tagging, parsing, and semantic reasoning.
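Before using the real tokenizers, it may help to see roughly what they do. The regex sketch below only approximates `word_tokenize` and `sent_tokenize` on a made-up message; NLTK's versions handle many more edge cases:

```python
import re

text = "Please call our customer service representative. Thank you!"

# rough word tokenizer: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

# rough sentence splitter: break after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

print(len(words), len(sentences))  # → 10 2
```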
pip install nltk
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # tokenizer models required by word_tokenize and sent_tokenize

# Create new columns for each kind of count
df['number_of_words'] = df['text'].apply(lambda x: len(word_tokenize(x)))
df['number_of_sentence'] = df['text'].apply(lambda x: len(sent_tokenize(x)))
df['number_of_letters'] = df['text'].apply(len)  # character count, used below
We have successfully added the word, sentence, and letter counts to the dataset.
df[df['target']==0][['number_of_sentence','number_of_words','number_of_letters']].describe()
Take a little time to analyze the table.
df[df['target']==1][['number_of_sentence','number_of_words','number_of_letters']].describe()
Take a little time to analyze the table.
Now let's explore the relationships in the data using a pair plot.
Note: A pair plot, commonly created with the Seaborn library, is a visualization tool that shows the relationships between pairs of variables in a dataset. It generates a grid of scatter plots for each pair of variables, allowing easy comparison and identification of correlations or patterns. The diagonal usually shows the distribution of each variable via histograms or density plots. This type of plot is useful for exploring multidimensional data and understanding interactions between variables.
sns.pairplot(df, hue='target')
Now let's take a look at the correlation between the columns.
Note: Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It quantifies how changes in one variable are associated with changes in another.
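The heatmap below is built from pairwise Pearson coefficients. A minimal hand-rolled version of that formula, evaluated on made-up word and letter counts, shows what a value near +1 means:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# made-up counts: letters grow with words, so r should be close to +1
words = [5, 9, 14, 20]
letters = [22, 40, 61, 88]
print(round(pearson(words, letters), 2))  # → 1.0 (strong positive correlation)
```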
sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only skips the text column
Conclusion:
- The first step of any machine learning project is to clean the data. We didn't have much to clean here, but we did rename the columns and convert the target column values to numerical.
- As we have seen, there are far more ham messages than spam.
- We plotted histograms of the message lengths for spam and ham, and observed that ham messages tend to be longer.
- Finally, we plotted a pair plot to study the relationships between the target and the other columns.
Since this blog is getting long, I will cover the next steps in Part 2.