Features Engineering

While preparing our data, we need to think constantly about the most suitable features we can use to achieve our overall goals.

Features engineering makes use of many techniques - too many to go over in this short lesson. We’ll focus on the most popular ones.

Customized Features Engineering

To build a customized feature, you can use the ‘eval’ method of the vDataFrame. Let’s look at an example with the well-known ‘Titanic’ dataset.

[1]:
import verticapy as vp
from verticapy.datasets import load_titanic

vdf = load_titanic()
display(vdf)
123
pclass
Integer
123
survived
Integer
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(8)
123
sibsp
Integer
123
parch
Integer
Abc
ticket
Varchar(36)
123
fare
Numeric(12)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Integer
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

The feature ‘parch’ corresponds to the number of parents and children on-board. The feature ‘sibsp’ corresponds to the number of siblings and spouses on-board. We can create the feature ‘family size’ which is equal to parch + sibsp + 1.

[2]:
vdf["family_size"] = vdf["parch"] + vdf["sibsp"] + 1
vdf.select(["parch", "sibsp", "family_size"])
[2]:
123
parch
Integer
123
sibsp
Integer
123
family_size
Integer
1214
2214
3214
4001
5001
6012
7001
8102
9001
10001
11001
12001
13001
14001
15001
16001
17001
18001
19001
20001
21001
22012
23012
24001
25012
26001
27001
28113
29001
30113
31012
32012
33012
34001
35001
36001
37001
38236
39416
40001
41001
42012
43001
44001
45001
46102
47001
48001
49012
50001
51113
52001
53001
54001
55012
56001
57001
58001
59001
60001
61001
62012
63001
64001
65001
66001
67001
68012
69001
70001
71012
72001
73023
74001
75001
76102
77203
78001
79102
80001
81001
82001
83001
84012
85012
86001
87001
88001
89001
90001
91001
92001
93012
94001
95315
96012
97001
98001
99012
100001
Rows: 1-100 | Columns: 3

When using the ‘eval’ method, you can enter any SQL expression and VerticaPy will evaluate it!

Regular Expressions

To compute features using regular expressions, we’ll use the ‘regexp’ method.

[3]:
help(vp.vDataFrame.regexp)
Help on function regexp in module verticapy.core.vdataframe._text:

regexp(self, column: str, pattern: str, method: Literal['count', 'ilike', 'instr', 'like', 'not_ilike', 'not_like', 'replace', 'substr'] = 'substr', position: int = 1, occurrence: int = 1, replacement: Optional[str] = None, return_position: int = 0, name: Optional[str] = None) -> 'vDataFrame'
    Computes a new vDataColumn based on regular expressions.

    Parameters
    ----------
    column: str
        Input vDataColumn  used  to compute the  regular
        expression.
    pattern: str
        The regular expression.
    method: str, optional
        Method used to compute the regular  expressions.
            count     : Returns the number of times a
                        regular expression matches each
                        element of the input vDataColumn.
            ilike     : Returns  True if  the  vDataColumn
                        element  contains a match for  the
                        regular expression.
            instr     : Returns  the  starting  or  ending
                        position in  a vDataColumn element
                        where a regular expression matches.
            like      : Returns  True  if the  vDataColumn
                        element    matches   the   regular
                        expression.
            not_ilike : Returns  True  if the  vDataColumn
                        element  does  not match the  case
                        -insensitive  regular   expression.
            not_like  : Returns  True if  the  vDataColumn
                        element  does not contain a  match
                        for the regular expression.
            replace   : Replaces   all  occurrences  of  a
                        substring  that  match  a  regular
                        expression  with another substring.
            substr    : Returns the substring that matches
                        a  regular   expression  within  a
                        vDataColumn.
    position: int, optional
        The number of characters from the start of the string
        where the function should start searching for matches.
    occurrence: int, optional
        Controls  which occurrence of a pattern match in  the
        string to return.
    replacement: str, optional
        The string to replace matched substrings.
    return_position: int, optional
        Sets the position within the string to return.
    name: str, optional
        New feature name. If empty, a name is generated.

    Returns
    -------
    vDataFrame
        self

Consider the following example: notice that passenger names include their title.

[4]:
vdf["name"]
[4]:
Abc
Varchar(164)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
Rows: 1-100 of 1234 | Column: name | Type: Varchar(164)

Let’s extract the title using regular expressions.

[5]:
vdf.regexp(column = "name",
           name = "title",
           pattern = " ([A-Za-z])+\.",
           method = "substr")
vdf.select(["name", "title"])
[5]:
Abc
Varchar(164)
Abc
title
Varchar(164)
1 Miss.
2 Mr.
3 Mrs.
4 Mr.
5 Mr.
6 Col.
7 Mr.
8 Mr.
9 Mr.
10 Mr.
11 Mr.
12 Mr.
13 Mr.
14 Mr.
15 Dr.
16 Major.
17 Mr.
18 Mr.
19 Mr.
20 Mr.
21 Mr.
22 Mr.
23 Mr.
24 Mr.
25 Mr.
26 Mr.
27 Mr.
28 Mr.
29 Mr.
30 Capt.
31 Mr.
32 Mr.
33 Mr.
34 Mr.
35 Miss.
36 Mr.
37 Mr.
38 Mr.
39 Mr.
40 Mr.
41 Mr.
42 Mr.
43 Mr.
44 Mr.
45 Mr.
46 Mr.
47 Mr.
48 Mr.
49 Mr.
50 Mr.
51 Mr.
52 Mr.
53 Mr.
54 Mr.
55 Mr.
56 Mr.
57 Miss.
58 Mr.
59 Mr.
60 Mr.
61 Mr.
62 Mr.
63 Mr.
64 Mr.
65 Mr.
66 Mr.
67 Mr.
68 Mr.
69 Mr.
70 Mr.
71 Mr.
72 Mr.
73 Dr.
74 Mr.
75 Mr.
76 Mr.
77 Mr.
78 Mr.
79 Mr.
80 Mr.
81 Mr.
82 Mr.
83 Mr.
84 Mr.
85 Mr.
86 Mr.
87 Jonkheer.
88 Mr.
89 Mr.
90 Mr.
91 Mr.
92 Mr.
93 Mr.
94 Mr.
95 Mr.
96 Mr.
97 Mr.
98 Mr.
99 Mr.
100 Mr.
Rows: 1-100 | Columns: 2

Advanced Analytical Functions

The ‘analytic’ method contains the many advanced analytical functions in VerticaPy.

[6]:
help(vp.vDataFrame.analytic)
Help on function analytic in module verticapy.core.vdataframe._math:

analytic(self, func: str, columns: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, by: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, order_by: Union[NoneType, Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns'], dict] = None, name: Optional[str] = None, offset: int = 1, x_smoothing: float = 0.5, add_count: bool = True) -> 'vDataFrame'
    Adds a new vDataColumn to the vDataFrame by using an advanced
    analytical function on one or two specific vDataColumns.

    ⚠ Warning : Some analytical  functions can make the vDataFrame
                     structure  more resource intensive. It is best  to
                     check  the structure of  the vDataFrame with  the
                     'current_relation' method and save it with the
                     'to_db'  method,  uisng  the  parameters
                     'inplace = True' and 'relation_type = table'.

    Parameters
    ----------
    func: str
        Function to apply.
            aad          : average absolute deviation
            beta         : Beta Coefficient between 2 vDataColumns
            count        : number of non-missing elements
            corr         : Pearson's correlation between 2 vDataColumns
            cov          : covariance between 2 vDataColumns
            dense_rank   : dense rank
            ema          : exponential moving average
            first_value  : first non null lead
            iqr          : interquartile range
            kurtosis     : kurtosis
            jb           : Jarque-Bera index
            lead         : next element
            lag          : previous element
            last_value   : first non null lag
            mad          : median absolute deviation
            max          : maximum
            mean         : average
            median       : median
            min          : minimum
            mode         : most occurent element
            q%           : q quantile (ex: 50% for the median)
            pct_change   : ratio between the current value and the previous one
            percent_rank : percent rank
            prod         : product
            range        : difference between the max and the min
            rank         : rank
            row_number   : row number
            sem          : standard error of the mean
            skewness     : skewness
            sum          : sum
            std          : standard deviation
            unique       : cardinality (count distinct)
            var          : variance
        Other analytical functions could work if they are part of your DB
        version.
    columns: SQLColumns, optional
        Input vDataColumns. Must be a list of one or two elements.
    by: SQLColumns, optional
        vDataColumns used in the partition.
    order_by: dict / list, optional
        Either a list of the vDataColumns used to sort (in ascending order)
        the data, or a dictionary of vDataColumns and their sorting
        methods. For example, to sort by "column1" ASC and "column2" DESC,
        write: {"column1": "asc", "column2": "desc"}
    name: str, optional
        Name of  the new vDataColumn. If empty, a default name based on the
        other parameters is generated.
    offset: int, optional
        Lead/Lag  offset  if parameter 'func' is the function  'lead'/'lag'.
    x_smoothing: float, optional
        The  smoothing parameter of the 'ema' if the  function is 'ema'. It
        must be a float in the range [0;1].
    add_count: bool, optional
        If the 'func' is set to 'mode' and this parameter is True, a column
        with the mode number of occurences is added to the vDataFrame.

    Returns
    -------
    vDataFrame
        self

To demonstrate some of these techniques, let’s use the Amazon dataset and perform some computations.

[7]:
from verticapy.datasets import load_amazon
vdf = load_amazon()
display(vdf)
📅
date
Date
Abc
state
Varchar(32)
123
number
Integer
11998-01-01ACRE0
21998-01-01ALAGOAS0
31998-01-01AMAPÁ0
41998-01-01AMAZONAS0
51998-01-01BAHIA0
61998-01-01CEARÁ0
71998-01-01DISTRITO FEDERAL0
81998-01-01ESPÍRITO SANTO0
91998-01-01GOIÁS0
101998-01-01MARANHÃO0
111998-01-01MATO GROSSO0
121998-01-01MATO GROSSO DO SUL0
131998-01-01MINAS GERAIS0
141998-01-01PARANÁ0
151998-01-01PARAÍBA0
161998-01-01PARÁ0
171998-01-01PERNAMBUCO0
181998-01-01PIAUÍ0
191998-01-01RIO DE JANEIRO0
201998-01-01RIO GRANDE DO NORTE0
211998-01-01RIO GRANDE DO SUL0
221998-01-01RONDÔNIA0
231998-01-01RORAIMA0
241998-01-01SANTA CATARINA0
251998-01-01SERGIPE0
261998-01-01SÃO PAULO0
271998-01-01TOCANTINS0
281998-02-01ACRE0
291998-02-01ALAGOAS0
301998-02-01AMAPÁ0
311998-02-01AMAZONAS0
321998-02-01BAHIA0
331998-02-01CEARÁ0
341998-02-01DISTRITO FEDERAL0
351998-02-01ESPÍRITO SANTO0
361998-02-01GOIÁS0
371998-02-01MARANHÃO0
381998-02-01MATO GROSSO0
391998-02-01MATO GROSSO DO SUL0
401998-02-01MINAS GERAIS0
411998-02-01PARANÁ0
421998-02-01PARAÍBA0
431998-02-01PARÁ0
441998-02-01PERNAMBUCO0
451998-02-01PIAUÍ0
461998-02-01RIO DE JANEIRO0
471998-02-01RIO GRANDE DO NORTE0
481998-02-01RIO GRANDE DO SUL0
491998-02-01RONDÔNIA0
501998-02-01RORAIMA0
511998-02-01SANTA CATARINA0
521998-02-01SERGIPE0
531998-02-01SÃO PAULO0
541998-02-01TOCANTINS0
551998-03-01ACRE0
561998-03-01ALAGOAS0
571998-03-01AMAPÁ0
581998-03-01AMAZONAS0
591998-03-01BAHIA0
601998-03-01CEARÁ0
611998-03-01DISTRITO FEDERAL0
621998-03-01ESPÍRITO SANTO0
631998-03-01GOIÁS0
641998-03-01MARANHÃO0
651998-03-01MATO GROSSO0
661998-03-01MATO GROSSO DO SUL0
671998-03-01MINAS GERAIS0
681998-03-01PARANÁ0
691998-03-01PARAÍBA0
701998-03-01PARÁ0
711998-03-01PERNAMBUCO0
721998-03-01PIAUÍ0
731998-03-01RIO DE JANEIRO0
741998-03-01RIO GRANDE DO NORTE0
751998-03-01RIO GRANDE DO SUL0
761998-03-01RONDÔNIA0
771998-03-01RORAIMA0
781998-03-01SANTA CATARINA0
791998-03-01SERGIPE0
801998-03-01SÃO PAULO0
811998-03-01TOCANTINS0
821998-04-01ACRE0
831998-04-01ALAGOAS0
841998-04-01AMAPÁ0
851998-04-01AMAZONAS0
861998-04-01BAHIA0
871998-04-01CEARÁ0
881998-04-01DISTRITO FEDERAL0
891998-04-01ESPÍRITO SANTO0
901998-04-01GOIÁS0
911998-04-01MARANHÃO0
921998-04-01MATO GROSSO0
931998-04-01MATO GROSSO DO SUL0
941998-04-01MINAS GERAIS0
951998-04-01PARANÁ0
961998-04-01PARAÍBA0
971998-04-01PARÁ0
981998-04-01PERNAMBUCO0
991998-04-01PIAUÍ0
1001998-04-01RIO DE JANEIRO0
Rows: 1-100 | Columns: 3

For each state, let’s compute the previous number of forest fires.

[8]:
vdf.analytic(name = "previous_number",
             func = "lag",
             columns = "number",
             by = ["state"],
             order_by = {"date": "asc"})
[8]:
📅
date
Date
Abc
state
Varchar(32)
123
number
Integer
123
previous_number
Integer
11998-01-01PARÁ0[null]
21998-02-01PARÁ00
31998-03-01PARÁ00
41998-04-01PARÁ00
51998-05-01PARÁ00
61998-06-01PARÁ580
71998-07-01PARÁ63858
81998-08-01PARÁ6815638
91998-09-01PARÁ62376815
101998-10-01PARÁ36656237
111998-11-01PARÁ18613665
121998-12-01PARÁ10081861
131999-01-01PARÁ871008
141999-02-01PARÁ28587
151999-03-01PARÁ7285
161999-04-01PARÁ67
171999-05-01PARÁ46
181999-06-01PARÁ374
191999-07-01PARÁ98237
201999-08-01PARÁ6178982
211999-09-01PARÁ44196178
221999-10-01PARÁ50264419
231999-11-01PARÁ26825026
241999-12-01PARÁ7662682
252000-01-01PARÁ30766
262000-02-01PARÁ2930
272000-03-01PARÁ18129
282000-04-01PARÁ40181
292000-05-01PARÁ1240
302000-06-01PARÁ5912
312000-07-01PARÁ42659
322000-08-01PARÁ5484426
332000-09-01PARÁ25695484
342000-10-01PARÁ43132569
352000-11-01PARÁ33664313
362000-12-01PARÁ16983366
372001-01-01PARÁ341698
382001-02-01PARÁ1634
392001-03-01PARÁ016
402001-04-01PARÁ00
412001-05-01PARÁ70
422001-06-01PARÁ427
432001-07-01PARÁ70842
442001-08-01PARÁ9043708
452001-09-01PARÁ61399043
462001-10-01PARÁ50966139
472001-11-01PARÁ49385096
482001-12-01PARÁ25644938
492002-01-01PARÁ1442564
502002-02-01PARÁ120144
512002-03-01PARÁ14120
522002-04-01PARÁ1114
532002-05-01PARÁ1811
542002-06-01PARÁ14918
552002-07-01PARÁ772149
562002-08-01PARÁ15664772
572002-09-01PARÁ560915664
582002-10-01PARÁ51895609
592002-11-01PARÁ79965189
602002-12-01PARÁ31187996
612003-01-01PARÁ9003118
622003-02-01PARÁ34900
632003-03-01PARÁ2734
642003-04-01PARÁ1427
652003-05-01PARÁ314
662003-06-01PARÁ643
672003-07-01PARÁ157164
682003-08-01PARÁ62551571
692003-09-01PARÁ84476255
702003-10-01PARÁ38038447
712003-11-01PARÁ61953803
722003-12-01PARÁ47876195
732004-01-01PARÁ5794787
742004-02-01PARÁ20579
752004-03-01PARÁ2220
762004-04-01PARÁ2322
772004-05-01PARÁ4423
782004-06-01PARÁ34044
792004-07-01PARÁ4436340
802004-08-01PARÁ128124436
812004-09-01PARÁ947112812
822004-10-01PARÁ47269471
832004-11-01PARÁ98424726
842004-12-01PARÁ55089842
852005-01-01PARÁ17685508
862005-02-01PARÁ851768
872005-03-01PARÁ2185
882005-04-01PARÁ1421
892005-05-01PARÁ3714
902005-06-01PARÁ30437
912005-07-01PARÁ4364304
922005-08-01PARÁ130554364
932005-09-01PARÁ903213055
942005-10-01PARÁ50839032
952005-11-01PARÁ49685083
962005-12-01PARÁ14954968
972006-01-01PARÁ4031495
982006-02-01PARÁ49403
992006-03-01PARÁ1049
1002006-04-01PARÁ210
Rows: 1-100 | Columns: 4

Moving Windows

Moving windows are powerful features. Moving windows are managed by the ‘rolling’ method in VerticaPy.

[9]:
help(vp.vDataFrame.rolling)
Help on function rolling in module verticapy.core.vdataframe._rolling:

rolling(self, func: str, window: Union[list, tuple], columns: Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns'], by: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, order_by: Union[NoneType, dict, list] = None, name: Optional[str] = None) -> 'vDataFrame'
    Adds a new vDataColumn to the vDataFrame by using an
    advanced  analytical  window function on one or  two
    specific vDataColumns.

    ⚠ Warning : Some window functions can make the
                     vDataFrame structure heavier. It is
                     recommended to always check the current
                     structure with the 'current_relation'
                     method and to save it with the 'to_db'
                     method, using the parameters 'inplace
                     = True' and 'relation_type = table'.

    Parameters
    ----------
    func: str
        Function to use.
            aad         : average absolute deviation
            beta        : Beta Coefficient between 2 vDataColumns
            count       : number of non-missing elements
            corr        : Pearson correlation between 2 vDataColumns
            cov         : covariance between 2 vDataColumns
            kurtosis    : kurtosis
            jb          : Jarque-Bera index
            max         : maximum
            mean        : average
            min         : minimum
            prod        : product
            range       : difference between the max and the min
            sem         : standard error of the mean
            skewness    : skewness
            sum         : sum
            std         : standard deviation
            var         : variance
                Other window functions could work if it is part of
                the DB version you are using.
    window: list / tuple
        Window Frame Range.
        If set to two integers, computes a Row Window, otherwise
        it computes a Time  Window. For example, if set  to
        (-5, 1),  the moving  windows will take 5 rows  preceding
        and one following. If set to ('- 5 minutes', '0 minutes'),
        the  moving window  will take all elements of the last  5
        minutes.
    columns: SQLColumns
        Input vDataColumns. Must be a list of one or two elements.
    by: SQLColumns, optional
        vDataColumns used in the partition.
    order_by: dict / list, optional
        List of the vDataColumns used to sort the data using
        ascending/descending order or a dictionary of all the
        sorting methods.
        For example, to sort by "column1" ASC and "column2" DESC,
        use: {"column1": "asc", "column2": "desc"}.
    name: str, optional
        Name of the new vDataColumn.  If empty, a default name is
        generated.

    Returns
    -------
    vDataFrame
        self

Let’s look at forest fires for each state three months preceding two months following the examined period.

[10]:
vdf.rolling(name = "number_3mp_2mf",
            func = "sum",
            window = ("- 3 months", "2 months"),
            columns = "number",
            by = ["state"],
            order_by = {"date": "asc"},)
[10]:
📅
date
Date
Abc
state
Varchar(32)
123
number
Integer
123
previous_number
Integer
123
number_3mp_2mf
Integer
11998-01-01PARÁ0[null]0
21998-02-01PARÁ000
31998-03-01PARÁ000
41998-04-01PARÁ000
51998-05-01PARÁ0058
61998-06-01PARÁ580696
71998-07-01PARÁ638587511
81998-08-01PARÁ681563813748
91998-09-01PARÁ6237681517355
101998-10-01PARÁ3665623718578
111998-11-01PARÁ1861366512771
121998-12-01PARÁ100818616621
131999-01-01PARÁ8710083248
141999-02-01PARÁ285871393
151999-03-01PARÁ72851393
161999-04-01PARÁ67389
171999-05-01PARÁ46339
181999-06-01PARÁ3741029
191999-07-01PARÁ982377201
201999-08-01PARÁ617898211616
211999-09-01PARÁ4419617816605
221999-10-01PARÁ5026441918305
231999-11-01PARÁ2682502612893
241999-12-01PARÁ76626828504
252000-01-01PARÁ307663688
262000-02-01PARÁ29301046
272000-03-01PARÁ18129280
282000-04-01PARÁ40181262
292000-05-01PARÁ1240321
302000-06-01PARÁ5912537
312000-07-01PARÁ426595981
322000-08-01PARÁ54844268538
332000-09-01PARÁ2569548412792
342000-10-01PARÁ4313256915732
352000-11-01PARÁ3366431311946
362000-12-01PARÁ169833669411
372001-01-01PARÁ3416985114
382001-02-01PARÁ16341748
392001-03-01PARÁ0161748
402001-04-01PARÁ0057
412001-05-01PARÁ7065
422001-06-01PARÁ427757
432001-07-01PARÁ708429800
442001-08-01PARÁ904370815932
452001-09-01PARÁ6139904320986
462001-10-01PARÁ5096613925216
472001-11-01PARÁ4938509618737
482001-12-01PARÁ2564493812742
492002-01-01PARÁ14425647780
502002-02-01PARÁ1201442853
512002-03-01PARÁ141202853
522002-04-01PARÁ1114307
532002-05-01PARÁ1811312
542002-06-01PARÁ14918950
552002-07-01PARÁ77214916603
562002-08-01PARÁ1566477222194
572002-09-01PARÁ56091566427234
582002-10-01PARÁ5189560934458
592002-11-01PARÁ7996518921912
602002-12-01PARÁ3118799617203
612003-01-01PARÁ900311812075
622003-02-01PARÁ349004093
632003-03-01PARÁ27344093
642003-04-01PARÁ1427978
652003-05-01PARÁ314142
662003-06-01PARÁ6431652
672003-07-01PARÁ1571647893
682003-08-01PARÁ6255157116337
692003-09-01PARÁ8447625520076
702003-10-01PARÁ3803844724700
712003-11-01PARÁ6195380323232
722003-12-01PARÁ4787619515364
732004-01-01PARÁ579478711603
742004-02-01PARÁ205795431
752004-03-01PARÁ2220644
762004-04-01PARÁ2322109
772004-05-01PARÁ4423449
782004-06-01PARÁ340444843
792004-07-01PARÁ443634017632
802004-08-01PARÁ12812443627059
812004-09-01PARÁ94711281231445
822004-10-01PARÁ4726947136851
832004-11-01PARÁ9842472629547
842004-12-01PARÁ5508984221844
852005-01-01PARÁ1768550817224
862005-02-01PARÁ8517687396
872005-03-01PARÁ21857396
882005-04-01PARÁ14211925
892005-05-01PARÁ3714461
902005-06-01PARÁ304374719
912005-07-01PARÁ436430417760
922005-08-01PARÁ13055436426755
932005-09-01PARÁ90321305531534
942005-10-01PARÁ5083903232138
952005-11-01PARÁ4968508320578
962005-12-01PARÁ1495496811949
972006-01-01PARÁ40314956925
982006-02-01PARÁ494031959
992006-03-01PARÁ10491959
1002006-04-01PARÁ210464
Rows: 1-100 | Columns: 5

Moving windows give us infinite possibilities for creating new features.

After we’ve finished preparing our data, our next task is to create a machine learning model.