Spam

This example uses the 'Spam' dataset to detect SMS spam. You can download the Jupyter Notebook of the study here.

  • v1: the SMS type (spam or ham)
  • v2: SMS content

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a virtual DataFrame (vDataFrame) of the dataset. The dataset is available here.

In [3]:
spam = vp.read_csv("data/spam.csv")
display(spam)
[Output: the spam vDataFrame with two columns, type (Varchar(20)) and content (Varchar(1820)); the first 100 rows displayed are all labeled 'ham'.]
Rows: 1-100 | Columns: 2

Data Exploration and Preparation

Our dataset relies on text analysis, so we should start by creating some features. For example, we can use the SMS length, and label-encode 'type' to get a dummy variable (1 if the message is spam, 0 otherwise). We should also convert the message content to lowercase to simplify our analysis.
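
For intuition, the same three transformations look like this in plain Python (a sketch only; VerticaPy instead generates the equivalent SQL (LENGTH, LOWER, DECODE) and pushes it down to Vertica, so no data is loaded into memory):

```python
# Plain-Python sketch of the three feature-engineering steps.
# The `engineer` helper is hypothetical, used only for illustration.

def engineer(sms_type: str, content: str) -> dict:
    return {
        "type": 1 if sms_type == "spam" else 0,   # label encoding
        "content": content.lower(),               # lowercase the text
        "length": len(content),                   # message length
    }

row = engineer("spam", "WINNER!! Claim your FREE prize now")
```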

In [4]:
import verticapy.stats as st
spam["length"] = st.length(spam["content"])
spam["content"].apply("LOWER({})")
spam["type"].decode('spam', 1, 0)
Out[4]:
[Output: the vDataFrame now has three columns: type (Integer), content (Varchar(3640)), and length (Integer).]
Rows: 1-100 | Columns: 3

Let's compute some statistics using the length of the message.

In [5]:
spam['type'].describe(method = 'cat_stats', 
                      numcol = 'length')
Out[5]:
type  count  percent  mean     std     min  approx_10%  approx_25%  approx_50%  approx_75%  approx_90%  max
0     4723   86.534   69.597   56.262  2    24.0        33.0        51.0        89.0        143.0       910
1     735    13.466   139.239  29.125  13   101.0       133.5       149.0       158.0       161.0       224
Rows: 1-2 | Columns: 12

Notice that spam tends to be longer than a normal message. First, let's create a view with just the spam. Then, we'll use the CountVectorizer to create a dictionary and identify keywords.
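
To make the upcoming output easier to read, here is a toy pure-Python version of the statistics CountVectorizer returns per token: cnt (occurrences across all documents), df (cnt divided by the total number of tokens), and rnk (rank by descending count). The `count_vectorize` helper is hypothetical; the real computation happens in-database:

```python
from collections import Counter

def count_vectorize(documents):
    """Toy sketch of CountVectorizer's token statistics."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    rows = []
    for rank, (token, cnt) in enumerate(counts.most_common(), start=1):
        rows.append({"token": token, "cnt": cnt, "df": cnt / total, "rnk": rank})
    return rows

voc = count_vectorize(["call now to claim your free prize",
                       "free entry: txt WIN to claim now"])
```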

In [6]:
spams = spam.search(spam["type"] == 1)

from verticapy.learn.preprocessing import CountVectorizer
dict_spams = CountVectorizer("spams_voc")
dict_spams.fit(spams, ["content"])
dict_spams = dict_spams.transform()
display(dict_spams)
token   df        cnt  rnk
to      0.02820   456  1
call    0.01967   318  2
a       0.01794   290  3
you     0.01429   231  4
your    0.01385   224  5
now     0.01107   179  6
or      0.01089   176  7
for     0.01076   174  8
free    0.00990   160  9
the     0.00983   159  10
... [90 more rows: txt, is, 2, have, from, on, u, ur, 4, mobile, claim, and, text, with, stop, ...]
Rows: 1-100 | Columns: 4

Let's add occurrence counts of the most frequent words to our vDataFrame and compute the correlation vector.

In [15]:
%matplotlib inline
for elem in dict_spams.head(200).values["token"]:
    spam.regexp(name = elem,
                pattern = elem,
                method = "count",
                column = "content")
x = spam.corr(focus = "type")

Let's keep only the 100 features most correlated with the response, and merge the numeric tokens into a single feature that counts the numbers in each message.
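
The nb_numbers feature we are about to create counts runs of digits in each message. In plain Python, the same method = "count" logic with pattern "([0-9])+" can be sketched with the re module (VerticaPy instead pushes the regular-expression counting down to Vertica):

```python
import re

def nb_numbers(content: str) -> int:
    # Count non-overlapping matches of one-or-more digits,
    # mirroring regexp(..., pattern="([0-9])+", method="count").
    return len(re.findall(r"[0-9]+", content))

nb_numbers("call 08712460324 now! £1000 cash or £2000 prize")  # → 3
```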

In [16]:
spam.drop(columns = x["index"][101:])
for elem in x["index"][1:101]:
    if any(char.isdigit() for char in elem):
        spam[elem].drop()
spam.regexp(column = "content",
            pattern = "([0-9])+",
            method = "count",
            name = "nb_numbers")
Out[16]:
[Output: the vDataFrame now has 84 columns: type, content, and length, plus one integer occurrence count per kept keyword (to, call, a, your, now, or, free, txt, from, on, u, ur, mobile, claim, text, stop, reply, prize, our, won, new, urgent, ..., ltd) and nb_numbers.]
Rows: 1-100 of 5458 | Columns: 84

Let's narrow down our keyword list to words of more than two characters.

In [17]:
for elem in spam.get_columns():
    if len(elem.replace('"', '')) <= 2:
        spam[elem].drop()

Compute the correlation vector again using the response column.

In [18]:
spam.corr(focus = "type")
Out[18]:
              "type"
"type"        1.0
"nb_numbers"  0.764
"mob"         0.422
"txt"         0.419
"call"        0.414
"length"      0.407
"free"        0.396
"mobile"      0.368
"claim"       0.364
"box"         0.354
"prize"       0.314
"top"         0.309
"stop"        0.301
"all"         0.301
"our"         0.299
"text"        0.29
"service"     0.278
"land"        0.266
"tone"        0.265
"reply"       0.259
"your"        0.246
"line"        0.246
"award"       0.246
"pobox"       0.246
"guaranteed"  0.244
"cash"        0.239
"win"         0.239
"nokia"       0.239
"urgent"      0.233
"contact"     0.232
"from"        0.227
"customer"    0.226
"awarded"     0.212
"collect"     0.212
"msg"         0.211
"mins"        0.205
"voucher"     0.203
"code"        0.2
"ringtone"    0.2
"week"        0.197
"draw"        0.197
"landline"    0.196
"chat"        0.193
"video"       0.191
"latest"      0.19
"live"        0.185
"new"         0.183
"offer"       0.181
"valid"       0.179
"await"       0.178
"camera"      0.175
"collection"  0.175
"poly"        0.175
"shows"       0.173
"cost"        0.172
"apply"       0.17
"now"         0.169
"receive"     0.168
"weekly"      0.168
"winner"      0.165
"won"         0.163
"orange"      0.161
"vouchers"    0.161
"ltd"         0.161
"bonus"       0.158
"rate"        0.158
"tones"       0.158
"entry"       0.157
"phone"       0.156
"network"     0.153
"sae"         0.153
"selected"    0.152
Rows: 1-72 | Columns: 2

We have enough features correlated with our response to create a fantastic model.

Machine Learning

The naive Bayes classifier is a powerful, performant algorithm for text analytics and binary classification. Before training it on our data, let's use cross-validation to estimate our model's performance.
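
As a reminder of why multinomial naive Bayes suits word-count features, here is a toy pure-Python scorer with Laplace smoothing (an illustration only; the `train_mnb`/`predict_mnb` helpers are hypothetical, and the real model below is trained in-database by Vertica):

```python
import math
from collections import Counter

def train_mnb(docs, labels):
    """Toy multinomial naive Bayes with Laplace (add-one) smoothing."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc.lower().split())
    vocab = {tok for cnt in counts.values() for tok in cnt}
    return priors, counts, vocab

def predict_mnb(model, doc):
    priors, counts, vocab = model
    scores = {}
    for c, prior in priors.items():
        total = sum(counts[c].values())
        score = math.log(prior)
        for tok in doc.lower().split():
            # Smoothing keeps unseen tokens from zeroing out the class score.
            score += math.log((counts[c][tok] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

model = train_mnb(["free prize claim now", "call now to claim your prize",
                   "see you at lunch", "ok talk to you later"],
                  [1, 1, 0, 0])
predict_mnb(model, "claim your free prize")  # → 1
```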

In [19]:
from verticapy.learn.naive_bayes import MultinomialNB
model = MultinomialNB("spam_nb")

from verticapy.learn.model_selection import cross_validate
cross_validate(model, 
               spam, 
               spam.get_columns(exclude_columns = ["type", "content"]), 
               "type", 
               cv = 5)

Out[19]:
         auc    prc_auc  accuracy  log_loss  precision  recall  f1_score  mcc    informedness  markedness  csi    time
1-fold   0.956  0.927    0.958     0.050     1.0        0.593   0.744     0.753  0.593         0.955       0.593  8.601
2-fold   0.972  0.943    0.942     0.047     1.0        0.568   0.725     0.730  0.568         0.937       0.568  8.587
3-fold   0.958  0.929    0.922     0.143     0.989      0.509   0.672     0.678  0.508         0.905       0.506  8.588
4-fold   0.955  0.927    0.924     0.074     1.0        0.515   0.680     0.687  0.515         0.917       0.515  8.493
5-fold   0.961  0.920    0.942     0.136     0.988      0.566   0.720     0.724  0.565         0.926       0.563  8.167
avg      0.960  0.929    0.938     0.090     0.995      0.550   0.708     0.714  0.550         0.928       0.549  8.487
std      0.007  0.008    0.015     0.046     0.006      0.037   0.031     0.031  0.037         0.019       0.037  0.184
Rows: 1-7 | Columns: 13

We have an excellent model! Let's now train it on the entire dataset.

In [20]:
model.fit(spam, 
          spam.get_columns(exclude_columns = ["type", "content"]), 
          "type")
model.confusion_matrix()
Out[20]:
        0      1
0    4722      1
1     337    398
Rows: 1-2 | Columns: 3

Our model can reliably identify spam: it almost never flags a legitimate message, though it still lets some spam through.
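
We can make that concrete by deriving the headline metrics from the confusion matrix (4722 true hams, 398 true spams, 1 false positive, 337 false negatives):

```python
# Metrics derived from the confusion matrix:
#            predicted 0   predicted 1
# actual 0        4722            1
# actual 1         337          398
tn, fp, fn, tp = 4722, 1, 337, 398

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # flagged messages that really are spam
recall    = tp / (tp + fn)   # spam we actually catch

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Precision is near-perfect, which matters most for an SMS filter: users rarely lose a real message, even if roughly half the spam slips through.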

Conclusion

We've solved our problem in a Pandas-like way, all without ever loading data into memory!



About the Author


Badr Ouali
Head of Data Science


Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can embrace data projects end to end through a clear understanding of the "big picture" as well as attention to detail, achieving great business outcomes – a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues and peers, and has a sweet spot for Python. He loves helping customers find the best value in their data and empowering them to solve their use cases.