Spam

This example uses the 'Spam' dataset to detect SMS spam. You can download the Jupyter Notebook of the study here. The dataset contains two columns:

  • v1: the SMS type (spam or ham)
  • v2: the SMS content

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset. The dataset is available here.

In [3]:
spam = vp.read_csv("data/spam.csv")
display(spam)
[Output: the first 100 rows of the vDataFrame, with two columns: "type" (Varchar(20)) and "content" (Varchar(1820)). Every row on this first page is labeled "ham".]
Rows: 1-100 | Columns: 2

Data Exploration and Preparation

Our dataset relies on text analysis, so we first need to create some features. For example, we can use the SMS length, and we can label-encode 'type' to get a dummy variable (1 if the message is spam, 0 otherwise). We should also convert the message content to lowercase to simplify our analysis.

In [4]:
import verticapy.stats as st

# New feature: the length of each message
spam["length"] = st.length(spam["content"])
# Lowercase the content to simplify the text analysis
spam["content"].apply("LOWER({})")
# Label encoding: 1 if the message is spam, 0 otherwise
spam["type"].decode('spam', 1, 0)
Out[4]:
[Output: the vDataFrame with three columns: "type" (Integer), "content" (Varchar(3640)), and the new "length" (Integer) feature.]
Rows: 1-100 | Columns: 3
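
As a sanity check on the label encoding, we could look at the distribution of the new dummy; a one-line sketch using the vColumn topk method (output omitted):

# Frequency of each category of 'type' (0 = ham, 1 = spam)
spam["type"].topk()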

Let's compute some statistics using the length of the message.

In [5]:
spam['type'].describe(method = 'cat_stats', 
                      numcol = 'length')
Out[5]:
type | count | percent | mean   | std   | min | approx_10% | approx_25% | approx_50% | approx_75% | approx_90% | max
0    | 4723  | 86.53   | 69.60  | 56.26 | 2   | 24.0       | 33.0       | 51.0       | 89.0       | 143.0      | 910
1    | 735   | 13.47   | 139.24 | 29.13 | 13  | 101.0      | 133.5      | 149.0      | 158.0      | 161.0      | 224
Rows: 1-2 | Columns: 12
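
We could cross-check these statistics with a simple aggregation; a minimal sketch using groupby (the alias is illustrative):

# Average message length per class (0 = ham, 1 = spam)
spam.groupby(columns = ["type"],
             expr = ["AVG(length) AS avg_length"])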

Notice that spam tends to be longer than a normal message. First, let's create a view with just the spam. Then, we'll use the CountVectorizer to create a dictionary of the spam vocabulary and identify keywords.

In [6]:
# Create a view containing only the spam messages
spams = spam.search(spam["type"] == 1)

from verticapy.learn.preprocessing import CountVectorizer

# Compute the vocabulary of the spam corpus and store it in 'spams_voc'
dict_spams = CountVectorizer("spams_voc")
dict_spams.fit(spams, ["content"])
dict_spams = dict_spams.transform()
display(dict_spams)
token | df     | cnt | rnk
to    | 0.0282 | 456 | 1
call  | 0.0197 | 318 | 2
a     | 0.0179 | 290 | 3
you   | 0.0143 | 231 | 4
your  | 0.0139 | 224 | 5
now   | 0.0111 | 179 | 6
or    | 0.0109 | 176 | 7
for   | 0.0108 | 174 | 8
free  | 0.0099 | 160 | 9
the   | 0.0098 | 159 | 10
[... further tokens include "txt", "claim", "stop", "reply", "prize", "urgent", "cash", "win", "guaranteed", "nokia", "customer", "tone", "awarded" ...]
Rows: 1-100 | Columns: 4
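
Generic English stopwords ('to', 'a', 'you', ...) dominate the top of the ranking. If we wanted to restrict the features to more discriminative keywords, we could filter them out first; a minimal sketch (the stopword list is illustrative):

# Keep only tokens that are not common English stopwords
stopwords = {"to", "a", "you", "your", "the", "and", "is", "or", "for", "on", "of"}
tokens = [t for t in dict_spams.head(200).values["token"]
          if t not in stopwords]

In this example, we keep all the tokens and let the correlation with 'type' do the filtering.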

Let's add the number of occurrences of the most frequent words to our vDataFrame and compute the correlation vector.

In [15]:
%matplotlib inline
# Create one feature per token: the number of occurrences of that
# token in each message (top 200 tokens of the spam vocabulary)
for elem in dict_spams.head(200).values["token"]:
    spam.regexp(name = elem,
                pattern = elem,
                method = "count",
                column = "content")
# Correlation of each feature with the response column 'type'
x = spam.corr(focus = "type")
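
Since corr with a focus column returns a tablesample whose "index" entry lists the features by correlation, we can also peek at the result directly:

# 'type' itself ranks first; the next entries are the most correlated features
x["index"][:11]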

Let's keep only the 100 most-correlated features and merge all the tokens that contain digits into a single feature that counts the digit sequences in each message.

In [16]:
# Keep only the 100 most-correlated features
spam.drop(columns = x["index"][101:])
# Drop the tokens that contain digits; they will be replaced
# by a single digit-count feature
for elem in x["index"][1:101]:
    if any(char.isdigit() for char in elem):
        spam[elem].drop()
# Count the sequences of digits in each message
spam.regexp(column = "content",
            pattern = "([0-9])+",
            method = "count",
            name = "nb_numbers")
Out[16]:
[Output: the final vDataFrame with "type", "content", "length", one occurrence-count column per kept keyword ("to", "call", "a", "your", "now", "or", "free", "txt", "from", "on", "u", "ur", "mobile", "claim", "text", "stop", "reply", "prize", "our", "won", "new", "urgent", ..., "winner", "top", "pobox", "wk", "voucher", "poly", "ltd"), and the new digit-count column "nb_numbers".]