Spam¶
This example uses the 'Spam' dataset to detect SMS spam. You can download the Jupyter Notebook of the study here.
- v1: the SMS type (spam or ham)
- v2: SMS content
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset. The dataset is available here.
spam = vp.read_csv("data/spam.csv")
display(spam)
Data Exploration and Preparation¶
Our dataset calls for text analysis, so we should first create some features. For example, we can use the SMS length, and label-encode 'type' to get a dummy variable (1 if the message is spam, 0 otherwise). We should also convert the message content to lowercase to simplify our analysis.
import verticapy.stats as st
spam["length"] = st.length(spam["content"])
spam["content"].apply("LOWER({})")
spam["type"].decode('spam', 1, 0)
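These in-database transformations correspond to simple string operations. As a plain-Python sketch of what each one computes (the sample message below is made up for illustration, not taken from the dataset):

```python
# Plain-Python equivalents of the three transformations above.
# The sample message is a hypothetical illustration.
msg = "WIN a FREE prize NOW"
msg_type = "spam"

length = len(msg)                       # st.length(...)
content = msg.lower()                   # LOWER(...)
label = 1 if msg_type == "spam" else 0  # decode('spam', 1, 0)

print(length, content, label)
```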
Let's compute some statistics using the length of the message.
spam['type'].describe(method = 'cat_stats',
                      numcol = 'length')
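The 'cat_stats' method computes numeric statistics of 'length' within each category of 'type'. A rough stdlib sketch of that per-category aggregation, using hypothetical (type, length) pairs in place of the real table:

```python
from statistics import mean
from collections import defaultdict

# Hypothetical (type, length) pairs standing in for the real table.
rows = [("spam", 150), ("spam", 140), ("ham", 40), ("ham", 60)]

groups = defaultdict(list)
for sms_type, length in rows:
    groups[sms_type].append(length)

# Per-category statistics of the numeric column, as cat_stats does.
stats = {t: {"count": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
         for t, v in groups.items()}
print(stats)
```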
Notice that spam tends to be longer than normal messages. First, let's create a view with just spam. Then, we'll use the CountVectorizer to create a dictionary and identify keywords.
spams = spam.search(spam["type"] == 1)
from verticapy.learn.preprocessing import CountVectorizer
dict_spams = CountVectorizer("spams_voc")
dict_spams.fit(spams, ["content"])
dict_spams = dict_spams.transform()
display(dict_spams)
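At its core, the CountVectorizer builds a vocabulary of tokens with their frequencies. A naive stand-in using `collections.Counter` with whitespace tokenization (the real model's tokenization differs, and the messages below are invented):

```python
from collections import Counter

# Naive stand-in for CountVectorizer: whitespace tokenization + counting.
# The messages are made up for illustration.
messages = ["win a free prize", "free entry win win"]
voc = Counter(token for m in messages for token in m.split())
print(voc.most_common(3))
```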
Let's add the most frequently occurring words to our vDataFrame and compute the correlation vector.
%matplotlib inline
for elem in dict_spams.head(200).values["token"]:
    spam.regexp(name = elem,
                pattern = elem,
                method = "count",
                column = "content")
x = spam.corr(focus = "type")
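The loop above adds, for each frequent token, a feature counting how many times that token occurs in the message. A plain-Python sketch of that counting with `re` (token list and message are illustrative only):

```python
import re

# Count each token's occurrences in a message, as the regexp features do.
# The message and token list are hypothetical.
message = "free free entry to win"
counts = {token: len(re.findall(token, message)) for token in ["free", "win"]}
print(counts)
```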
Let's keep just the 100 most-correlated features and merge all the number tokens into a single feature.
spam.drop(columns = x["index"][101:])
for elem in x["index"][1:101]:
    if any(char.isdigit() for char in elem):
        spam[elem].drop()
spam.regexp(column = "content",
            pattern = "([0-9])+",
            method = "count",
            name = "nb_numbers")
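Each match of the pattern "([0-9])+" is a maximal run of digits, so 'nb_numbers' counts digit groups rather than individual digits. A quick sketch with `re` (the message is a made-up example):

```python
import re

# Each match of "([0-9])+" is a maximal run of digits, so this counts
# digit groups, not individual digits. The message is hypothetical.
message = "Call 08001234 now to claim 500"
nb_numbers = len(list(re.finditer(r"([0-9])+", message)))
print(nb_numbers)
```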