VerticaPy

Python API for Vertica Data Science at Scale

Features Engineering

While preparing our data, we need to think constantly about the most suitable features we can use to achieve our overall goals.

Features engineering makes use of many techniques - too many to go over in this short lesson. We'll focus on the most popular ones.

Customized Features Engineering

To build a customized feature, you can use the 'eval' method of the vDataFrame. Let's look at an example with the well-known 'Titanic' dataset.

In [1]:
from verticapy import *
vdf = vDataFrame("titanic")
display(vdf)
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

The feature 'parch' corresponds to the number of parents and children on-board. The feature 'sibsp' corresponds to the number of siblings and spouses on-board. We can create the feature 'family size' which is equal to parch + sibsp + 1.

In [2]:
vdf.eval(name = "family_size",
         expr = "parch + sibsp + 1")
vdf.select(["parch", "sibsp", "family_size"])
Out[2]:
123
parch
Int
123
sibsp
Int
123
family_size
Int
1214
2214
3214
4001
5001
6012
7001
8102
9001
10001
11001
12001
13001
14001
15001
16001
17001
18001
19001
20001
21001
22012
23012
24001
25012
26001
27001
28113
29001
30113
31012
32012
33012
34001
35001
36001
37001
38236
39416
40001
41001
42012
43001
44001
45001
46102
47001
48001
49012
50001
51113
52001
53001
54001
55012
56001
57001
58001
59001
60001
61001
62012
63001
64001
65001
66001
67001
68012
69001
70001
71012
72001
73023
74001
75001
76102
77203
78001
79102
80001
81001
82001
83001
84012
85012
86001
87001
88001
89001
90001
91001
92001
93012
94001
95315
96012
97001
98001
99012
100001
Rows: 1-100 | Columns: 3

When using the 'eval' method, you can enter any SQL expression and VerticaPy will evaluate it!

Regular Expressions

To compute features using regular expressions, we'll use the 'regexp' method.

In [3]:
help(vDataFrame.regexp)
Help on function regexp in module verticapy.vdataframe:

regexp(self, column:str, pattern:str, method:str='substr', position:int=1, occurrence:int=1, replacement:str='', return_position:int=0, name:str='')
    ---------------------------------------------------------------------------
    Computes a new vcolumn based on regular expressions. 
    
    Parameters
    ----------
    column: str
        Input vcolumn to use to compute the regular expression.
    pattern: str
        The regular expression.
    method: str, optional
        Method to use to compute the regular expressions.
            count     : Returns the number times a regular expression matches 
                each element of the input vcolumn. 
            ilike     : Returns True if the vcolumn element contains a match 
                for the regular expression.
            instr     : Returns the starting or ending position in a vcolumn 
                element where a regular expression matches. 
            like      : Returns True if the vcolumn element matches the regular 
                expression.
            not_ilike : Returns True if the vcolumn element does not match the 
                case-insensitive regular expression.
            not_like  : Returns True if the vcolumn element does not contain a 
                match for the regular expression.
            replace   : Replaces all occurrences of a substring that match a 
                regular expression with another substring.
            substr    : Returns the substring that matches a regular expression 
                within a vcolumn.
    position: int, optional
        The number of characters from the start of the string where the function 
        should start searching for matches.
    occurrence: int, optional
        Controls which occurrence of a pattern match in the string to return.
    replacement: str, optional
        The string to replace matched substrings.
    return_position: int, optional
        Sets the position within the string to return.
    name: str, optional
        New feature name. If empty, a name will be generated.
    
    Returns
    -------
    vDataFrame
        self
    
    See Also
    --------
    vDataFrame.eval : Evaluates a customized expression.

Consider the following example: notice that passenger names include their title.

In [4]:
vdf["name"]
Out[4]:
Abc
Varchar(164)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Rows: 1-100 of 1234 | Column: name | Type: Varchar(164)

Let's extract the title using regular expressions.

In [5]:
vdf.regexp(column = "name",
           name = "title",
           pattern = " ([A-Za-z])+\.",
           method = "substr")
vdf.select(["name", "title"])
Out[5]:
Abc
Varchar(164)
Abc
title
Varchar(164)
1 Miss.
2 Mr.
3 Mrs.
4 Mr.
5 Mr.
6 Col.
7 Mr.
8 Mr.
9 Mr.
10 Mr.
11 Mr.
12 Mr.
13 Mr.
14 Mr.
15 Dr.
16 Major.
17 Mr.
18 Mr.
19 Mr.
20 Mr.
21 Mr.
22 Mr.
23 Mr.
24 Mr.
25 Mr.
26 Mr.
27 Mr.
28 Mr.
29 Mr.
30 Capt.
31 Mr.
32 Mr.
33 Mr.
34 Mr.
35 Miss.
36 Mr.
37 Mr.
38 Mr.
39 Mr.
40 Mr.
41 Mr.
42 Mr.
43 Mr.
44 Mr.
45 Mr.
46 Mr.
47 Mr.
48 Mr.
49 Mr.
50 Mr.
51 Mr.
52 Mr.
53 Mr.
54 Mr.
55 Mr.
56 Mr.
57 Miss.
58 Mr.
59 Mr.
60 Mr.
61 Mr.
62 Mr.
63 Mr.
64 Mr.
65 Mr.
66 Mr.
67 Mr.
68 Mr.
69 Mr.
70 Mr.
71 Mr.
72 Mr.
73 Dr.
74 Mr.
75 Mr.
76 Mr.
77 Mr.
78 Mr.
79 Mr.
80 Mr.
81 Mr.
82 Mr.
83 Mr.
84 Mr.
85 Mr.
86 Mr.
87 Jonkheer.
88 Mr.
89 Mr.
90 Mr.
91 Mr.
92 Mr.
93 Mr.
94 Mr.
95 Mr.
96 Mr.
97 Mr.
98 Mr.
99 Mr.
100 Mr.
Rows: 1-100 | Columns: 2

Advanced Analytical Functions

The 'analytic' method contains the many advanced analytical functions in VerticaPy.

In [6]:
help(vDataFrame.analytic)
Help on function analytic in module verticapy.vdataframe:

analytic(self, func:str, columns:(<class 'str'>, <class 'list'>)=[], by:list=[], order_by:(<class 'dict'>, <class 'list'>)=[], name:str='', offset:int=1, x_smoothing:float=0.5, add_count:bool=True)
    ---------------------------------------------------------------------------
    Adds a new vcolumn to the vDataFrame by using an advanced analytical function 
    on one or two specific vcolumns.
    
    ⚠ Warning : Some analytical functions can make the vDataFrame structure 
                     heavier. It is recommended to always check the current structure 
                     using the 'current_relation' method and to save it using the 
                     'to_db' method with the parameters 'inplace = True' and 
                     'relation_type = table'
    
    Parameters
    ----------
    func: str
        Function to apply.
            aad          : average absolute deviation
            beta         : Beta Coefficient between 2 vcolumns
            count        : number of non-missing elements
            corr         : Pearson correlation between 2 vcolumns
            cov          : covariance between 2 vcolumns
            dense_rank   : dense rank
            ema          : exponential moving average
            first_value  : first non null lead
            iqr          : interquartile range
            kurtosis     : kurtosis
            jb           : Jarque Bera index 
            lead         : next element
            lag          : previous element
            last_value   : first non null lag
            mad          : median absolute deviation
            max          : maximum
            mean         : average
            median       : median
            min          : minimum
            mode         : most occurent element
            q%           : q quantile (ex: 50% for the median)
            pct_change   : ratio between the current value and the previous one
            percent_rank : percent rank
            prod         : product
            range        : difference between the max and the min
            rank         : rank
            row_number   : row number
            sem          : standard error of the mean
            skewness     : skewness
            sum          : sum
            std          : standard deviation
            unique       : cardinality (count distinct)
            var          : variance
                Other analytical functions could work if it is part of 
                the DB version you are using.
    columns: str, optional
        Input vcolumns. It can be a list of one or two elements.
    by: list, optional
        vcolumns used in the partition.
    order_by: dict / list, optional
        List of the vcolumns to use to sort the data using asc order or
        dictionary of all the sorting methods. For example, to sort by "column1"
        ASC and "column2" DESC, write {"column1": "asc", "column2": "desc"}
    name: str, optional
        Name of the new vcolumn. If empty a default name based on the other
        parameters will be generated.
    offset: int, optional
        Lead/Lag offset if parameter 'func' is the function 'lead'/'lag'.
    x_smoothing: float, optional
        The smoothing parameter of the 'ema' if the function is 'ema'. It must be in [0;1]
    add_count: bool, optional
        If the function is the 'mode' and this parameter is True then another column will 
        be added to the vDataFrame with the mode number of occurences.
    
    Returns
    -------
    vDataFrame
        self
    
    See Also
    --------
    vDataFrame.eval    : Evaluates a customized expression.
    vDataFrame.rolling : Computes a customized moving window.

To demonstrate some of these techniques, let's use the 'USA 2015 Flights' dataset and perform some computations.

In [7]:
from verticapy import *
vdf = vDataFrame("usa_flights")
display(vdf)
Abc
destination_airport
Varchar(20)
📅
scheduled_departure
Timestamp
123
departure_delay
Int
123
arrival_delay
Int
Abc
origin_airport
Varchar(20)
Abc
airline
Varchar(20)
1101352015-10-01 10:09:00-9-211433EV
2101352015-10-01 10:27:00-3-1410397EV
3101352015-10-01 13:57:00-4613930EV
4101352015-10-01 14:02:00-3-811433EV
5101352015-10-01 14:44:000-110397EV
6101352015-10-01 20:02:00-3-1111433EV
7101352015-10-01 21:02:00-3413930EV
8101352015-10-01 21:06:00-3-1410397DL
9101352015-10-02 10:09:00-7-1411433EV
10101352015-10-02 10:27:0012410397EV
11101352015-10-02 13:57:00-3-213930EV
12101352015-10-02 14:02:00-15-1911433EV
13101352015-10-02 14:44:00-4-610397EV
14101352015-10-02 20:02:00-3-311433EV
15101352015-10-02 21:02:00585913930EV
16101352015-10-02 21:06:00-2-1410397DL
17101352015-10-03 10:27:00-4-210397EV
18101352015-10-03 21:02:00121113930EV
19101352015-10-03 21:06:00-2-1610397DL
20101352015-10-04 10:09:0042311433EV
21101352015-10-04 10:27:0001110397EV
22101352015-10-04 13:57:00141113930EV
23101352015-10-04 14:02:00-6-1011433EV
24101352015-10-04 14:44:00181810397EV
25101352015-10-04 20:02:00191211433EV
26101352015-10-04 21:02:00-2513930EV
27101352015-10-04 21:06:00-7-1310397DL
28101352015-10-05 10:09:00-8-311433EV
29101352015-10-05 13:57:00-10-2013930EV
30101352015-10-05 14:02:00-11-1711433EV
31101352015-10-05 14:44:00102010397EV
32101352015-10-05 20:02:00-3-1111433EV
33101352015-10-05 21:02:00-6-913930EV
34101352015-10-05 21:06:001110397DL
3510135