Spam¶
This example uses the 'Spam' dataset to detect SMS spam. You can download the Jupyter Notebook of the study here.
- v1: the SMS type (spam or ham)
- v2: SMS content
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset. The dataset is available here.
spam = vp.read_csv("data/spam.csv")
display(spam)
Data Exploration and Preparation¶
Our dataset relies on text analysis, so first we should engineer some features. For example, we can compute the SMS length and label-encode 'type' to get a dummy variable (1 if the message is spam, 0 otherwise). We should also convert the message content to lowercase to simplify our analysis.
import verticapy.stats as st
spam["length"] = st.length(spam["content"])
spam["content"].apply("LOWER({})")
spam["type"].decode('spam', 1, 0)
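To make the three transformations concrete, here is a plain-Python sketch of what happens to a single row (illustration only; VerticaPy pushes these operations down to Vertica as SQL, and the `prepare` helper below is hypothetical):

```python
# Plain-Python sketch of the three feature-engineering steps above
# (hypothetical helper; VerticaPy executes the real thing in-database).
def prepare(row):
    """row: dict with 'type' and 'content' keys."""
    return {
        "type": 1 if row["type"] == "spam" else 0,  # label encoding
        "content": row["content"].lower(),          # lowercase the text
        "length": len(row["content"]),              # message length
    }

sample = {"type": "spam", "content": "WIN a FREE prize NOW"}
print(prepare(sample))
```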
Let's compute some statistics using the length of the message.
spam['type'].describe(method = 'cat_stats',
                      numcol = 'length')
Notice that spam tends to be longer than normal messages. First, let's create a view containing only spam. Then, we'll use the CountVectorizer to build a vocabulary and identify keywords.
spams = spam.search(spam["type"] == 1)
from verticapy.learn.preprocessing import CountVectorizer
dict_spams = CountVectorizer("spams_voc")
dict_spams.fit(spams, ["content"])
dict_spams = dict_spams.transform()
display(dict_spams)
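The vocabulary a count-vectorizer produces is essentially a table of token frequencies across the corpus. A minimal pure-Python sketch of the idea, assuming whitespace tokenization (the in-database CountVectorizer does this at scale):

```python
from collections import Counter

# Minimal sketch of a count-vectorizer vocabulary: tokenize each message
# and count token occurrences across the whole corpus.
def build_vocabulary(messages):
    counts = Counter()
    for msg in messages:
        counts.update(msg.lower().split())
    return counts

voc = build_vocabulary(["win a free prize", "free entry win win"])
print(voc.most_common(2))  # most frequent tokens first
```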
Let's add the most frequent words to our vDataFrame and compute the correlation vector.
%matplotlib inline
for elem in dict_spams.head(200).values["token"]:
    spam.regexp(name = elem,
                pattern = elem,
                method = "count",
                column = "content")
x = spam.corr(focus = "type")
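Since 'type' is a 0/1 dummy, correlating each keyword-count feature with it reduces to an ordinary Pearson correlation where one variable is binary. A from-scratch sketch on toy data (the `free_counts` values are made up for illustration):

```python
import math

# Sketch: correlating a keyword-count feature with the binary 'type' label
# is a Pearson correlation where one variable is 0/1.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: occurrences of a keyword per message vs. the spam label.
free_counts = [2, 1, 0, 0]
labels      = [1, 1, 0, 0]
print(round(pearson(free_counts, labels), 3))
```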
Let's keep only the 100 most-correlated features and merge the numeric tokens into a single feature.
spam.drop(columns = x["index"][101:])
for elem in x["index"][1:101]:
    if any(char.isdigit() for char in elem):
        spam[elem].drop()
spam.regexp(column = "content",
            pattern = "([0-9])+",
            method = "count",
            name = "nb_numbers")
Let's narrow down our keyword list to words of more than two characters.
for elem in spam.get_columns():
    if len(elem.replace('"', '')) <= 2:
        spam[elem].drop()
Compute the correlation vector again using the response column.
spam.corr(focus = "type")
We have enough features correlated with our response to build a good model.
Machine Learning¶
The naive Bayes classifier is a powerful, performant algorithm for text analytics and binary classification. Before training it on our data, let's use cross-validation to estimate the model's performance.
from verticapy.learn.naive_bayes import MultinomialNB
model = MultinomialNB("spam_nb")
from verticapy.learn.model_selection import cross_validate
cross_validate(model,
               spam,
               spam.get_columns(exclude_columns = ["type", "content"]),
               "type",
               cv = 5)
We have an excellent model! Let's train it on the entire dataset.
model.fit(spam,
          spam.get_columns(exclude_columns = ["type", "content"]),
          "type")
model.confusion_matrix()
Our model can reliably identify spam.
Conclusion¶
We've solved our problem in a Pandas-like way, all without ever loading data into memory!
About the Author
Badr Ouali
Head of Data Science
Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can embrace data projects end to end through a clear understanding of the "big picture" as well as attention to detail, achieving great business outcomes – a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues and peers, and has a sweet spot for Python. He loves helping customers find the best value in their data and empowering them to solve their use cases.
