VerticaPy

Python API for Vertica Data Science at Scale

Features Engineering

While preparing our data, we need to think constantly about the most suitable features we can use to achieve our overall goals.

Features engineering makes use of many techniques - too many to go over in this short lesson. We'll focus on the most popular ones.

Customized Features Engineering

To build a customized feature, you can use the 'eval' method of the vDataFrame. Let's look at an example with the well-known 'Titanic' dataset.

In [46]:
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

The feature 'parch' corresponds to the number of parents and children on-board. The feature 'sibsp' corresponds to the number of siblings and spouses on-board. We can create the feature 'family size' which is equal to parch + sibsp + 1.

In [47]:
vdf["family_size"] = vdf["parch"] + vdf["sibsp"] + 1
vdf.select(["parch", "sibsp", "family_size"])
Out[47]:
123
parch
Integer
123
sibsp
Integer
123
family_size
Integer
1214
2214
3214
4001
5001
6012
7001
8102
9001
10001
11001
12001
13001
14001
15001
16001
17001
18001
19001
20001
21001
22012
23012
24001
25012
26001
27001
28113
29001
30113
31012
32012
33012
34001
35001
36001
37001
38236
39416
40001
41001
42012
43001
44001
45001
46102
47001
48001
49012
50001
51113
52001
53001
54001
55012
56001
57001
58001
59001
60001
61001
62012
63001
64001
65001
66001
67001
68012
69001
70001
71012
72001
73023
74001
75001
76102
77203
78001
79102
80001
81001
82001
83001
84012
85012
86001
87001
88001
89001
90001
91001
92001
93012
94001
95315
96012
97001
98001
99012
100001
Rows: 1-100 | Columns: 3

When using the 'eval' method, you can enter any SQL expression and VerticaPy will evaluate it!

Regular Expressions

To compute features using regular expressions, we'll use the 'regexp' method.

In [49]:
help(vp.vDataFrame.regexp)
Help on function regexp in module verticapy.vdataframe:

regexp(self, column:str, pattern:str, method:str='substr', position:int=1, occurrence:int=1, replacement:str='', return_position:int=0, name:str='')
    ---------------------------------------------------------------------------
    Computes a new vColumn based on regular expressions. 
    
    Parameters
    ----------
    column: str
        Input vColumn to use to compute the regular expression.
    pattern: str
        The regular expression.
    method: str, optional
        Method to use to compute the regular expressions.
            count     : Returns the number times a regular expression matches 
                each element of the input vColumn. 
            ilike     : Returns True if the vColumn element contains a match 
                for the regular expression.
            instr     : Returns the starting or ending position in a vColumn 
                element where a regular expression matches. 
            like      : Returns True if the vColumn element matches the regular 
                expression.
            not_ilike : Returns True if the vColumn element does not match the 
                case-insensitive regular expression.
            not_like  : Returns True if the vColumn element does not contain a 
                match for the regular expression.
            replace   : Replaces all occurrences of a substring that match a 
                regular expression with another substring.
            substr    : Returns the substring that matches a regular expression 
                within a vColumn.
    position: int, optional
        The number of characters from the start of the string where the function 
        should start searching for matches.
    occurrence: int, optional
        Controls which occurrence of a pattern match in the string to return.
    replacement: str, optional
        The string to replace matched substrings.
    return_position: int, optional
        Sets the position within the string to return.
    name: str, optional
        New feature name. If empty, a name will be generated.
    
    Returns
    -------
    vDataFrame
        self
    
    See Also
    --------
    vDataFrame.eval : Evaluates a customized expression.

Consider the following example: notice that passenger names include their title.

In [50]:
vdf["name"]
Out[50]:
Abc
Varchar(164)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Rows: 1-100 of 1234 | Column: name | Type: Varchar(164)

Let's extract the title using regular expressions.

In [51]:
vdf.regexp(column = "name",
           name = "title",
           pattern = " ([A-Za-z])+\.",
           method = "substr")
vdf.select(["name", "title"])
Out[51]:
Abc
Varchar(164)
Abc
title
Varchar(164)
1 Miss.
2 Mr.
3 Mrs.
4 Mr.
5 Mr.
6 Col.
7 Mr.
8 Mr.
9 Mr.
10 Mr.
11 Mr.
12 Mr.
13 Mr.
14 Mr.
15 Dr.
16 Major.
17 Mr.
18 Mr.
19 Mr.
20 Mr.
21 Mr.
22 Mr.
23 Mr.
24 Mr.
25 Mr.
26 Mr.
27 Mr.
28 Mr.
29 Mr.
30 Capt.
31 Mr.
32 Mr.
33 Mr.
34 Mr.
35 Miss.
36 Mr.
37 Mr.
38 Mr.
39 Mr.
40 Mr.
41 Mr.
42 Mr.
43 Mr.
44 Mr.
45 Mr.
46 Mr.
47 Mr.
48 Mr.
49 Mr.
50 Mr.
51 Mr.
52 Mr.
53 Mr.
54 Mr.
55 Mr.
56 Mr.
57 Miss.
58 Mr.
59 Mr.
60 Mr.
61 Mr.
62 Mr.
63 Mr.
64 Mr.
65 Mr.
66 Mr.
67 Mr.
68 Mr.
69 Mr.
70 Mr.
71 Mr.
72 Mr.
73 Dr.
74 Mr.
75 Mr.
76 Mr.
77 Mr.
78 Mr.
79 Mr.
80 Mr.
81 Mr.
82 Mr.
83 Mr.
84 Mr.
85 Mr.
86 Mr.
87 Jonkheer.
88 Mr.
89 Mr.
90 Mr.
91 Mr.
92 Mr.
93 Mr.
94 Mr.
95 Mr.
96 Mr.
97 Mr.
98 Mr.
99 Mr.
100 Mr.
Rows: 1-100 | Columns: 2

Advanced Analytical Functions

The 'analytic' method contains the many advanced analytical functions in VerticaPy.

In [52]:
help(vp.vDataFrame.analytic)
Help on function analytic in module verticapy.vdataframe:

analytic(self, func:str, columns:Union[str, list]=[], by:list=[], order_by:Union[dict, list]=[], name:str='', offset:int=1, x_smoothing:float=0.5, add_count:bool=True)
    ---------------------------------------------------------------------------
    Adds a new vColumn to the vDataFrame by using an advanced analytical 
    function on one or two specific vColumns.
    
    ⚠ Warning : Some analytical functions can make the vDataFrame 
                     structure more resource intensive. It is best to check 
                     the structure of the vDataFrame using the 'current_relation' 
                     method and to save it using the 'to_db' method with 
                     the parameters 'inplace = True' and 
                     'relation_type = table'
    
    Parameters
    ----------
    func: str
        Function to apply.
            aad          : average absolute deviation
            beta         : Beta Coefficient between 2 vColumns
            count        : number of non-missing elements
            corr         : Pearson's correlation between 2 vColumns
            cov          : covariance between 2 vColumns
            dense_rank   : dense rank
            ema          : exponential moving average
            first_value  : first non null lead
            iqr          : interquartile range
            kurtosis     : kurtosis
            jb           : Jarque-Bera index 
            lead         : next element
            lag          : previous element
            last_value   : first non null lag
            mad          : median absolute deviation
            max          : maximum
            mean         : average
            median       : median
            min          : minimum
            mode         : most occurent element
            q%           : q quantile (ex: 50% for the median)
            pct_change   : ratio between the current value and the previous one
            percent_rank : percent rank
            prod         : product
            range        : difference between the max and the min
            rank         : rank
            row_number   : row number
            sem          : standard error of the mean
            skewness     : skewness
            sum          : sum
            std          : standard deviation
            unique       : cardinality (count distinct)
            var          : variance
                Other analytical functions could work if it is part of 
                the DB version you are using.
    columns: str, optional
        Input vColumns. It can be a list of one or two elements.
    by: list, optional
        vColumns used in the partition.
    order_by: dict / list, optional
        List of the vColumns to use to sort the data using asc order or
        dictionary of all sorting methods. For example, to sort by "column1"
        ASC and "column2" DESC, write {"column1": "asc", "column2": "desc"}
    name: str, optional
        Name of the new vColumn. If empty a default name based on the other
        parameters will be generated.
    offset: int, optional
        Lead/Lag offset if parameter 'func' is the function 'lead'/'lag'.
    x_smoothing: float, optional
        The smoothing parameter of the 'ema' if the function is 'ema'. It must be in [0;1]
    add_count: bool, optional
        If the function is the 'mode' and this parameter is True then another column will 
        be added to the vDataFrame with the mode number of occurences.
    
    Returns
    -------
    vDataFrame
        self
    
    See Also
    --------
    vDataFrame.eval    : Evaluates a customized expression.
    vDataFrame.rolling : Computes a customized moving window.

To demonstrate some of these techniques, let's use the Amazon dataset and perform some computations.

In [53]:
from verticapy.datasets import load_amazon
vdf = load_amazon()
display(vdf)
📅
date
Date
Abc
state
Varchar(32)
123
number
Int
11998-01-01ACRE0
21998-01-01ALAGOAS0
31998-01-01AMAPÁ0
41998-01-01AMAZONAS0
51998-01-01BAHIA0
61998-01-01CEARÁ0
71998-01-01DISTRITO FEDERAL0
81998-01-01ESPÍRITO SANTO0
91998-01-01GOIÁS0
101998-01-01MARANHÃO0
111998-01-01MATO GROSSO0
121998-01-01MATO GROSSO DO SUL0
131998-01-01MINAS GERAIS0
141998-01-01PARANÁ0
151998-01-01PARAÍBA0
161998-01-01PARÁ0
171998-01-01PERNAMBUCO0
181998-01-01PIAUÍ0
191998-01-01RIO DE JANEIRO0
201998-01-01RIO GRANDE DO NORTE0
211998-01-01RIO GRANDE DO SUL0
221998-01-01RONDÔNIA0
231998-01-01RORAIMA0
241998-01-01SANTA CATARINA0
251998-01-01SERGIPE0
261998-01-01SÃO PAULO0
271998-01-01TOCANTINS0
281998-02-01ACRE0
291998-02-01ALAGOAS0
301998-02-01AMAPÁ0
311998-02-01AMAZONAS0
321998-02-01BAHIA0
331998-02-01CEARÁ0
341998-02-01DISTRITO FEDERAL0
351998-02-01ESPÍRITO SANTO0
361998-02-01GOIÁS0
371998-02-01MARANHÃO0
381998-02-01MATO GROSSO0
391998-02-01MATO GROSSO DO SUL0
401998-02-01MINAS GERAIS0
411998-02-01PARANÁ0
421998-02-01PARAÍBA0
431998-02-01PARÁ0
441998-02-01PERNAMBUCO0
451998-02-01PIAUÍ0
461998-02-01RIO DE JANEIRO0
471998-02-01RIO GRANDE DO NORTE0
481998-02-01RIO GRANDE DO SUL0
491998-02-01RONDÔNIA0
501998-02-01RORAIMA0
511998-02-01SANTA CATARINA0
521998-02-01SERGIPE0
531998-02-01SÃO PAULO0
541998-02-01TOCANTINS0
551998-03-01ACRE0
561998-03-01ALAGOAS0
571998-03-01AMAPÁ0
581998-03-01AMAZONAS0
591998-03-01BAHIA0
601998-03-01CEARÁ0
611998-03-01DISTRITO FEDERAL0
621998-03-01ESPÍRITO SANTO0
631998-03-01GOIÁS0
641998-03-01MARANHÃO0
651998-03-01MATO GROSSO0
661998-03-01MATO GROSSO DO SUL0
671998-03-01MINAS GERAIS0
681998-03-01PARANÁ0
691998-03-01PARAÍBA