-
Home / Workshop / Data Prep / Features Engineering / Index

VerticaPy
Features Engineering¶
While preparing our data, we need to think constantly about the most suitable features we can use to achieve our overall goals.
Features engineering makes use of many techniques - too many to go over in this short lesson. We'll focus on the most popular ones.
Customized Features Engineering¶
To build a customized feature, you can use the 'eval' method of the vDataFrame. Let's look at an example with the well-known 'Titanic' dataset.
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
123 pclassInt | 123 survivedInt | Abc Varchar(164) | Abc sexVarchar(20) | 123 ageNumeric(6,3) | 123 sibspInt | 123 parchInt | Abc ticketVarchar(36) | 123 fareNumeric(10,5) | Abc cabinVarchar(30) | Abc embarkedVarchar(20) | Abc boatVarchar(100) | 123 bodyInt | Abc home.destVarchar(100) | |
1 | 1 | 0 | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | [null] | Montreal, PQ / Chesterville, ON | |
2 | 1 | 0 | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | 135 | Montreal, PQ / Chesterville, ON | |
3 | 1 | 0 | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | [null] | Montreal, PQ / Chesterville, ON | |
4 | 1 | 0 | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | [null] | [null] | Belfast, NI | |
5 | 1 | 0 | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | [null] | C | [null] | 22 | Montevideo, Uruguay | |
6 | 1 | 0 | male | 47.0 | 1 | 0 | PC 17757 | 227.525 | C62 C64 | C | [null] | 124 | New York, NY | |
7 | 1 | 0 | male | [null] | 0 | 0 | PC 17318 | 25.925 | [null] | S | [null] | [null] | New York, NY | |
8 | 1 | 0 | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | [null] | [null] | Montreal, PQ | |
9 | 1 | 0 | male | 36.0 | 0 | 0 | 13050 | 75.2417 | C6 | C | A | [null] | Winnipeg, MN | |
10 | 1 | 0 | male | 25.0 | 0 | 0 | 13905 | 26.0 | [null] | C | [null] | 148 | San Francisco, CA | |
11 | 1 | 0 | male | 45.0 | 0 | 0 | 113784 | 35.5 | T | S | [null] | [null] | Trenton, NJ | |
12 | 1 | 0 | male | 42.0 | 0 | 0 | 110489 | 26.55 | D22 | S | [null] | [null] | London / Winnipeg, MB | |
13 | 1 | 0 | male | 41.0 | 0 | 0 | 113054 | 30.5 | A21 | S | [null] | [null] | Pomeroy, WA | |
14 | 1 | 0 | male | 48.0 | 0 | 0 | PC 17591 | 50.4958 | B10 | C | [null] | 208 | Omaha, NE | |
15 | 1 | 0 | male | [null] | 0 | 0 | 112379 | 39.6 | [null] | C | [null] | [null] | Philadelphia, PA | |
16 | 1 | 0 | male | 45.0 | 0 | 0 | 113050 | 26.55 | B38 | S | [null] | [null] | Washington, DC | |
17 | 1 | 0 | male | [null] | 0 | 0 | 113798 | 31.0 | [null] | S | [null] | [null] | [null] | |
18 | 1 | 0 | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S | [null] | [null] | New York, NY | |
19 | 1 | 0 | male | 28.0 | 0 | 0 | 113059 | 47.1 | [null] | S | [null] | [null] | Montevideo, Uruguay | |
20 | 1 | 0 | male | 17.0 | 0 | 0 | 113059 | 47.1 | [null] | S | [null] | [null] | Montevideo, Uruguay | |
21 | 1 | 0 | male | 49.0 | 0 | 0 | 19924 | 26.0 | [null] | S | [null] | [null] | Ascot, Berkshire / Rochester, NY | |
22 | 1 | 0 | male | 36.0 | 1 | 0 | 19877 | 78.85 | C46 | S | [null] | 172 | Little Onn Hall, Staffs | |
23 | 1 | 0 | male | 46.0 | 1 | 0 | W.E.P. 5734 | 61.175 | E31 | S | [null] | [null] | Amenia, ND | |
24 | 1 | 0 | male | [null] | 0 | 0 | 112051 | 0.0 | [null] | S | [null] | [null] | Liverpool, England / Belfast | |
25 | 1 | 0 | male | 27.0 | 1 | 0 | 13508 | 136.7792 | C89 | C | [null] | [null] | Los Angeles, CA | |
26 | 1 | 0 | male | [null] | 0 | 0 | 110465 | 52.0 | A14 | S | [null] | [null] | Stoughton, MA | |
27 | 1 | 0 | male | 47.0 | 0 | 0 | 5727 | 25.5875 | E58 | S | [null] | [null] | Victoria, BC | |
28 | 1 | 0 | male | 37.0 | 1 | 1 | PC 17756 | 83.1583 | E52 | C | [null] | [null] | Lakewood, NJ | |
29 | 1 | 0 | male | [null] | 0 | 0 | 113791 | 26.55 | [null] | S | [null] | [null] | Roachdale, IN | |
30 | 1 | 0 | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0 | B22 | S | [null] | 269 | Milwaukee, WI | |
31 | 1 | 0 | male | 39.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | [null] | [null] | New York, NY | |
32 | 1 | 0 | male | 31.0 | 1 | 0 | F.C. 12750 | 52.0 | B71 | S | [null] | [null] | Montreal, PQ | |
33 | 1 | 0 | male | 50.0 | 1 | 0 | PC 17761 | 106.425 | C86 | C | [null] | 62 | Deephaven, MN / Cedar Rapids, IA | |
34 | 1 | 0 | male | 39.0 | 0 | 0 | PC 17580 | 29.7 | A18 | C | [null] | 133 | Philadelphia, PA | |
35 | 1 | 0 | female | 36.0 | 0 | 0 | PC 17531 | 31.6792 | A29 | C | [null] | [null] | New York, NY | |
36 | 1 | 0 | male | [null] | 0 | 0 | PC 17483 | 221.7792 | C95 | S | [null] | [null] | [null] | |
37 | 1 | 0 | male | 30.0 | 0 | 0 | 113051 | 27.75 | C111 | C | [null] | [null] | New York, NY | |
38 | 1 | 0 | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S | [null] | [null] | Winnipeg, MB | |
39 | 1 | 0 | male | 64.0 | 1 | 4 | 19950 | 263.0 | C23 C25 C27 | S | [null] | [null] | Winnipeg, MB | |
40 | 1 | 0 | male | [null] | 0 | 0 | 113778 | 26.55 | D34 | S | [null] | [null] | Westcliff-on-Sea, Essex | |
41 | 1 | 0 | male | [null] | 0 | 0 | 112058 | 0.0 | B102 | S | [null] | [null] | [null] | |
42 | 1 | 0 | male | 37.0 | 1 | 0 | 113803 | 53.1 | C123 | S | [null] | [null] | Scituate, MA | |
43 | 1 | 0 | male | 47.0 | 0 | 0 | 111320 | 38.5 | E63 | S | [null] | 275 | St Anne's-on-Sea, Lancashire | |
44 | 1 | 0 | male | 24.0 | 0 | 0 | PC 17593 | 79.2 | B86 | C | [null] | [null] | [null] | |
45 | 1 | 0 | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | [null] | [null] | New York, NY | |
46 | 1 | 0 | male | 38.0 | 0 | 1 | PC 17582 | 153.4625 | C91 | S | [null] | 147 | Winnipeg, MB | |
47 | 1 | 0 | male | 46.0 | 0 | 0 | PC 17593 | 79.2 | B82 B84 | C | [null] | [null] | New York, NY | |
48 | 1 | 0 | male | [null] | 0 | 0 | 113796 | 42.4 | [null] | S | [null] | [null] | [null] | |
49 | 1 | 0 | male | 45.0 | 1 | 0 | 36973 | 83.475 | C83 | S | [null] | [null] | New York, NY | |
50 | 1 | 0 | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | [null] | 110 | [null] | |
51 | 1 | 0 | male | 55.0 | 1 | 1 | 12749 | 93.5 | B69 | S | [null] | 307 | Montreal, PQ | |
52 | 1 | 0 | male | 42.0 | 0 | 0 | 113038 | 42.5 | B11 | S | [null] | [null] | London / Middlesex | |
53 | 1 | 0 | male | [null] | 0 | 0 | 17463 | 51.8625 | E46 | S | [null] | [null] | Brighton, MA | |
54 | 1 | 0 | male | 55.0 | 0 | 0 | 680 | 50.0 | C39 | S | [null] | [null] | London / Birmingham | |
55 | 1 | 0 | male | 42.0 | 1 | 0 | 113789 | 52.0 | [null] | S | [null] | 38 | New York, NY | |
56 | 1 | 0 | male | [null] | 0 | 0 | PC 17600 | 30.6958 | [null] | C | 14 | [null] | New York, NY | |
57 | 1 | 0 | female | 50.0 | 0 | 0 | PC 17595 | 28.7125 | C49 | C | [null] | [null] | Paris, France New York, NY | |
58 | 1 | 0 | male | 46.0 | 0 | 0 | 694 | 26.0 | [null] | S | [null] | 80 | Bennington, VT | |
59 | 1 | 0 | male | 50.0 | 0 | 0 | 113044 | 26.0 | E60 | S | [null] | [null] | London | |
60 | 1 | 0 | male | 32.5 | 0 | 0 | 113503 | 211.5 | C132 | C | [null] | 45 | [null] | |
61 | 1 | 0 | male | 58.0 | 0 | 0 | 11771 | 29.7 | B37 | C | [null] | 258 | Buffalo, NY | |
62 | 1 | 0 | male | 41.0 | 1 | 0 | 17464 | 51.8625 | D21 | S | [null] | [null] | Southington / Noank, CT | |
63 | 1 | 0 | male | [null] | 0 | 0 | 113028 | 26.55 | C124 | S | [null] | [null] | Portland, OR | |
64 | 1 | 0 | male | [null] | 0 | 0 | PC 17612 | 27.7208 | [null] | C | [null] | [null] | Chicago, IL | |
65 | 1 | 0 | male | 29.0 | 0 | 0 | 113501 | 30.0 | D6 | S | [null] | 126 | Springfield, MA | |
66 | 1 | 0 | male | 30.0 | 0 | 0 | 113801 | 45.5 | [null] | S | [null] | [null] | London / New York, NY | |
67 | 1 | 0 | male | 30.0 | 0 | 0 | 110469 | 26.0 | C106 | S | [null] | [null] | Brockton, MA | |
68 | 1 | 0 | male | 19.0 | 1 | 0 | 113773 | 53.1 | D30 | S | [null] | [null] | New York, NY | |
69 | 1 | 0 | male | 46.0 | 0 | 0 | 13050 | 75.2417 | C6 | C | [null] | 292 | Vancouver, BC | |
70 | 1 | 0 | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | [null] | 175 | Dorchester, MA | |
71 | 1 | 0 | male | 28.0 | 1 | 0 | PC 17604 | 82.1708 | [null] | C | [null] | [null] | New York, NY | |
72 | 1 | 0 | male | 65.0 | 0 | 0 | 13509 | 26.55 | E38 | S | [null] | 249 | East Bridgewater, MA | |
73 | 1 | 0 | male | 44.0 | 2 | 0 | 19928 | 90.0 | C78 | Q | [null] | 230 | Fond du Lac, WI | |
74 | 1 | 0 | male | 55.0 | 0 | 0 | 113787 | 30.5 | C30 | S | [null] | [null] | Montreal, PQ | |
75 | 1 | 0 | male | 47.0 | 0 | 0 | 113796 | 42.4 | [null] | S | [null] | [null] | Washington, DC | |
76 | 1 | 0 | male | 37.0 | 0 | 1 | PC 17596 | 29.7 | C118 | C | [null] | [null] | Brooklyn, NY | |
77 | 1 | 0 | male | 58.0 | 0 | 2 | 35273 | 113.275 | D48 | C | [null] | 122 | Lexington, MA | |
78 | 1 | 0 | male | 64.0 | 0 | 0 | 693 | 26.0 | [null] | S | [null] | 263 | Isle of Wight, England | |
79 | 1 | 0 | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C | [null] | 234 | Providence, RI | |
80 | 1 | 0 | male | 28.5 | 0 | 0 | PC 17562 | 27.7208 | D43 | C | [null] | 189 | ?Havana, Cuba | |
81 | 1 | 0 | male | [null] | 0 | 0 | 112052 | 0.0 | [null] | S | [null] | [null] | Belfast | |
82 | 1 | 0 | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S | [null] | 166 | Surbiton Hill, Surrey | |
83 | 1 | 0 | male | 23.0 | 0 | 0 | 12749 | 93.5 | B24 | S | [null] | [null] | Montreal, PQ | |
84 | 1 | 0 | male | 29.0 | 1 | 0 | 113776 | 66.6 | C2 | S | [null] | [null] | Isleworth, England | |
85 | 1 | 0 | male | 18.0 | 1 | 0 | PC 17758 | 108.9 | C65 | C | [null] | [null] | Madrid, Spain | |
86 | 1 | 0 | male | 47.0 | 0 | 0 | 110465 | 52.0 | C110 | S | [null] | 207 | Worcester, MA | |
87 | 1 | 0 | male | 38.0 | 0 | 0 | 19972 | 0.0 | [null] | S | [null] | [null] | Rotterdam, Netherlands | |
88 | 1 | 0 | male | 22.0 | 0 | 0 | PC 17760 | 135.6333 | [null] | C | [null] | 232 | [null] | |
89 | 1 | 0 | male | [null] | 0 | 0 | PC 17757 | 227.525 | [null] | C | [null] | [null] | [null] | |
90 | 1 | 0 | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S | [null] | [null] | Trenton, NJ | |
91 | 1 | 0 | male | [null] | 0 | 0 | 113767 | 50.0 | A32 | S | [null] | [null] | Seattle, WA | |
92 | 1 | 0 | male | 36.0 | 0 | 0 | 13049 | 40.125 | A10 | C | [null] | [null] | Winnipeg, MB | |
93 | 1 | 0 | male | 55.0 | 1 | 0 | PC 17603 | 59.4 | [null] | C | [null] | [null] | New York, NY | |
94 | 1 | 0 | male | 33.0 | 0 | 0 | 113790 | 26.55 | [null] | S | [null] | 109 | London | |
95 | 1 | 0 | male | 61.0 | 1 | 3 | PC 17608 | 262.375 | B57 B59 B63 B66 | C | [null] | [null] | Haverford, PA / Cooperstown, NY | |
96 | 1 | 0 | male | 50.0 | 1 | 0 | 13507 | 55.9 | E44 | S | [null] | [null] | Duluth, MN | |
97 | 1 | 0 | male | 56.0 | 0 | 0 | 113792 | 26.55 | [null] | S | [null] | [null] | New York, NY | |
98 | 1 | 0 | male | 56.0 | 0 | 0 | 17764 | 30.6958 | A7 | C | [null] | [null] | St James, Long Island, NY | |
99 | 1 | 0 | male | 24.0 | 1 | 0 | 13695 | 60.0 | C31 | S | [null] | [null] | Huntington, WV | |
100 | 1 | 0 | male | [null] | 0 | 0 | 113056 | 26.0 | A19 | S | [null] | [null] | Streatham, Surrey |
The feature 'parch' corresponds to the number of parents and children on-board. The feature 'sibsp' corresponds to the number of siblings and spouses on-board. We can create the feature 'family size' which is equal to parch + sibsp + 1.
vdf["family_size"] = vdf["parch"] + vdf["sibsp"] + 1
vdf.select(["parch", "sibsp", "family_size"])
123 parchInteger | 123 sibspInteger | 123 family_sizeInteger | |
1 | 2 | 1 | 4 |
2 | 2 | 1 | 4 |
3 | 2 | 1 | 4 |
4 | 0 | 0 | 1 |
5 | 0 | 0 | 1 |
6 | 0 | 1 | 2 |
7 | 0 | 0 | 1 |
8 | 1 | 0 | 2 |
9 | 0 | 0 | 1 |
10 | 0 | 0 | 1 |
11 | 0 | 0 | 1 |
12 | 0 | 0 | 1 |
13 | 0 | 0 | 1 |
14 | 0 | 0 | 1 |
15 | 0 | 0 | 1 |
16 | 0 | 0 | 1 |
17 | 0 | 0 | 1 |
18 | 0 | 0 | 1 |
19 | 0 | 0 | 1 |
20 | 0 | 0 | 1 |
21 | 0 | 0 | 1 |
22 | 0 | 1 | 2 |
23 | 0 | 1 | 2 |
24 | 0 | 0 | 1 |
25 | 0 | 1 | 2 |
26 | 0 | 0 | 1 |
27 | 0 | 0 | 1 |
28 | 1 | 1 | 3 |
29 | 0 | 0 | 1 |
30 | 1 | 1 | 3 |
31 | 0 | 1 | 2 |
32 | 0 | 1 | 2 |
33 | 0 | 1 | 2 |
34 | 0 | 0 | 1 |
35 | 0 | 0 | 1 |
36 | 0 | 0 | 1 |
37 | 0 | 0 | 1 |
38 | 2 | 3 | 6 |
39 | 4 | 1 | 6 |
40 | 0 | 0 | 1 |
41 | 0 | 0 | 1 |
42 | 0 | 1 | 2 |
43 | 0 | 0 | 1 |
44 | 0 | 0 | 1 |
45 | 0 | 0 | 1 |
46 | 1 | 0 | 2 |
47 | 0 | 0 | 1 |
48 | 0 | 0 | 1 |
49 | 0 | 1 | 2 |
50 | 0 | 0 | 1 |
51 | 1 | 1 | 3 |
52 | 0 | 0 | 1 |
53 | 0 | 0 | 1 |
54 | 0 | 0 | 1 |
55 | 0 | 1 | 2 |
56 | 0 | 0 | 1 |
57 | 0 | 0 | 1 |
58 | 0 | 0 | 1 |
59 | 0 | 0 | 1 |
60 | 0 | 0 | 1 |
61 | 0 | 0 | 1 |
62 | 0 | 1 | 2 |
63 | 0 | 0 | 1 |
64 | 0 | 0 | 1 |
65 | 0 | 0 | 1 |
66 | 0 | 0 | 1 |
67 | 0 | 0 | 1 |
68 | 0 | 1 | 2 |
69 | 0 | 0 | 1 |
70 | 0 | 0 | 1 |
71 | 0 | 1 | 2 |
72 | 0 | 0 | 1 |
73 | 0 | 2 | 3 |
74 | 0 | 0 | 1 |
75 | 0 | 0 | 1 |
76 | 1 | 0 | 2 |
77 | 2 | 0 | 3 |
78 | 0 | 0 | 1 |
79 | 1 | 0 | 2 |
80 | 0 | 0 | 1 |
81 | 0 | 0 | 1 |
82 | 0 | 0 | 1 |
83 | 0 | 0 | 1 |
84 | 0 | 1 | 2 |
85 | 0 | 1 | 2 |
86 | 0 | 0 | 1 |
87 | 0 | 0 | 1 |
88 | 0 | 0 | 1 |
89 | 0 | 0 | 1 |
90 | 0 | 0 | 1 |
91 | 0 | 0 | 1 |
92 | 0 | 0 | 1 |
93 | 0 | 1 | 2 |
94 | 0 | 0 | 1 |
95 | 3 | 1 | 5 |
96 | 0 | 1 | 2 |
97 | 0 | 0 | 1 |
98 | 0 | 0 | 1 |
99 | 0 | 1 | 2 |
100 | 0 | 0 | 1 |
When using the 'eval' method, you can enter any SQL expression and VerticaPy will evaluate it!
Regular Expressions¶
To compute features using regular expressions, we'll use the 'regexp' method.
help(vp.vDataFrame.regexp)
Help on function regexp in module verticapy.vdataframe: regexp(self, column:str, pattern:str, method:str='substr', position:int=1, occurrence:int=1, replacement:str='', return_position:int=0, name:str='') --------------------------------------------------------------------------- Computes a new vColumn based on regular expressions. Parameters ---------- column: str Input vColumn to use to compute the regular expression. pattern: str The regular expression. method: str, optional Method to use to compute the regular expressions. count : Returns the number times a regular expression matches each element of the input vColumn. ilike : Returns True if the vColumn element contains a match for the regular expression. instr : Returns the starting or ending position in a vColumn element where a regular expression matches. like : Returns True if the vColumn element matches the regular expression. not_ilike : Returns True if the vColumn element does not match the case-insensitive regular expression. not_like : Returns True if the vColumn element does not contain a match for the regular expression. replace : Replaces all occurrences of a substring that match a regular expression with another substring. substr : Returns the substring that matches a regular expression within a vColumn. position: int, optional The number of characters from the start of the string where the function should start searching for matches. occurrence: int, optional Controls which occurrence of a pattern match in the string to return. replacement: str, optional The string to replace matched substrings. return_position: int, optional Sets the position within the string to return. name: str, optional New feature name. If empty, a name will be generated. Returns ------- vDataFrame self See Also -------- vDataFrame.eval : Evaluates a customized expression.
Consider the following example: notice that passenger names include their title.
vdf["name"]
Abc Varchar(164) | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 | |
12 | |
13 | |
14 | |
15 | |
16 | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 | |
23 | |
24 | |
25 | |
26 | |
27 | |
28 | |
29 | |
30 | |
31 | |
32 | |
33 | |
34 | |
35 | |
36 | |
37 | |
38 | |
39 | |
40 | |
41 | |
42 | |
43 | |
44 | |
45 | |
46 | |
47 | |
48 | |
49 | |
50 | |
51 | |
52 | |
53 | |
54 | |
55 | |
56 | |
57 | |
58 | |
59 | |
60 | |
61 | |
62 | |
63 | |
64 | |
65 | |
66 | |
67 | |
68 | |
69 | |
70 | |
71 | |
72 | |
73 | |
74 | |
75 | |
76 | |
77 | |
78 | |
79 | |
80 | |
81 | |
82 | |
83 | |
84 | |
85 | |
86 | |
87 | |
88 | |
89 | |
90 | |
91 | |
92 | |
93 | |
94 | |
95 | |
96 | |
97 | |
98 | |
99 | |
100 |
Let's extract the title using regular expressions.
vdf.regexp(column = "name",
name = "title",
pattern = " ([A-Za-z])+\.",
method = "substr")
vdf.select(["name", "title"])
Abc Varchar(164) | Abc titleVarchar(164) | |
1 | Miss. | |
2 | Mr. | |
3 | Mrs. | |
4 | Mr. | |
5 | Mr. | |
6 | Col. | |
7 | Mr. | |
8 | Mr. | |
9 | Mr. | |
10 | Mr. | |
11 | Mr. | |
12 | Mr. | |
13 | Mr. | |
14 | Mr. | |
15 | Dr. | |
16 | Major. | |
17 | Mr. | |
18 | Mr. | |
19 | Mr. | |
20 | Mr. | |
21 | Mr. | |
22 | Mr. | |
23 | Mr. | |
24 | Mr. | |
25 | Mr. | |
26 | Mr. | |
27 | Mr. | |
28 | Mr. | |
29 | Mr. | |
30 | Capt. | |
31 | Mr. | |
32 | Mr. | |
33 | Mr. | |
34 | Mr. | |
35 | Miss. | |
36 | Mr. | |
37 | Mr. | |
38 | Mr. | |
39 | Mr. | |
40 | Mr. | |
41 | Mr. | |
42 | Mr. | |
43 | Mr. | |
44 | Mr. | |
45 | Mr. | |
46 | Mr. | |
47 | Mr. | |
48 | Mr. | |
49 | Mr. | |
50 | Mr. | |
51 | Mr. | |
52 | Mr. | |
53 | Mr. | |
54 | Mr. | |
55 | Mr. | |
56 | Mr. | |
57 | Miss. | |
58 | Mr. | |
59 | Mr. | |
60 | Mr. | |
61 | Mr. | |
62 | Mr. | |
63 | Mr. | |
64 | Mr. | |
65 | Mr. | |
66 | Mr. | |
67 | Mr. | |
68 | Mr. | |
69 | Mr. | |
70 | Mr. | |
71 | Mr. | |
72 | Mr. | |
73 | Dr. | |
74 | Mr. | |
75 | Mr. | |
76 | Mr. | |
77 | Mr. | |
78 | Mr. | |
79 | Mr. | |
80 | Mr. | |
81 | Mr. | |
82 | Mr. | |
83 | Mr. | |
84 | Mr. | |
85 | Mr. | |
86 | Mr. | |
87 | Jonkheer. | |
88 | Mr. | |
89 | Mr. | |
90 | Mr. | |
91 | Mr. | |
92 | Mr. | |
93 | Mr. | |
94 | Mr. | |
95 | Mr. | |
96 | Mr. | |
97 | Mr. | |
98 | Mr. | |
99 | Mr. | |
100 | Mr. |
Advanced Analytical Functions¶
The 'analytic' method contains the many advanced analytical functions in VerticaPy.
help(vp.vDataFrame.analytic)
Help on function analytic in module verticapy.vdataframe: analytic(self, func:str, columns:Union[str, list]=[], by:list=[], order_by:Union[dict, list]=[], name:str='', offset:int=1, x_smoothing:float=0.5, add_count:bool=True) --------------------------------------------------------------------------- Adds a new vColumn to the vDataFrame by using an advanced analytical function on one or two specific vColumns. ⚠ Warning : Some analytical functions can make the vDataFrame structure more resource intensive. It is best to check the structure of the vDataFrame using the 'current_relation' method and to save it using the 'to_db' method with the parameters 'inplace = True' and 'relation_type = table' Parameters ---------- func: str Function to apply. aad : average absolute deviation beta : Beta Coefficient between 2 vColumns count : number of non-missing elements corr : Pearson's correlation between 2 vColumns cov : covariance between 2 vColumns dense_rank : dense rank ema : exponential moving average first_value : first non null lead iqr : interquartile range kurtosis : kurtosis jb : Jarque-Bera index lead : next element lag : previous element last_value : first non null lag mad : median absolute deviation max : maximum mean : average median : median min : minimum mode : most occurent element q% : q quantile (ex: 50% for the median) pct_change : ratio between the current value and the previous one percent_rank : percent rank prod : product range : difference between the max and the min rank : rank row_number : row number sem : standard error of the mean skewness : skewness sum : sum std : standard deviation unique : cardinality (count distinct) var : variance Other analytical functions could work if it is part of the DB version you are using. columns: str, optional Input vColumns. It can be a list of one or two elements. by: list, optional vColumns used in the partition. order_by: dict / list, optional List of the vColumns to use to sort the data using asc order or dictionary of all sorting methods. For example, to sort by "column1" ASC and "column2" DESC, write {"column1": "asc", "column2": "desc"} name: str, optional Name of the new vColumn. If empty a default name based on the other parameters will be generated. offset: int, optional Lead/Lag offset if parameter 'func' is the function 'lead'/'lag'. x_smoothing: float, optional The smoothing parameter of the 'ema' if the function is 'ema'. It must be in [0;1] add_count: bool, optional If the function is the 'mode' and this parameter is True then another column will be added to the vDataFrame with the mode number of occurences. Returns ------- vDataFrame self See Also -------- vDataFrame.eval : Evaluates a customized expression. vDataFrame.rolling : Computes a customized moving window.
To demonstrate some of these techniques, let's use the Amazon dataset and perform some computations.
from verticapy.datasets import load_amazon
vdf = load_amazon()
display(vdf)
📅 dateDate | Abc stateVarchar(32) | 123 numberInt | |
1 | 1998-01-01 | ACRE | 0 |
2 | 1998-01-01 | ALAGOAS | 0 |
3 | 1998-01-01 | AMAPÁ | 0 |
4 | 1998-01-01 | AMAZONAS | 0 |
5 | 1998-01-01 | BAHIA | 0 |
6 | 1998-01-01 | CEARÁ | 0 |
7 | 1998-01-01 | DISTRITO FEDERAL | 0 |
8 | 1998-01-01 | ESPÍRITO SANTO | 0 |
9 | 1998-01-01 | GOIÁS | 0 |
10 | 1998-01-01 | MARANHÃO | 0 |
11 | 1998-01-01 | MATO GROSSO | 0 |
12 | 1998-01-01 | MATO GROSSO DO SUL | 0 |
13 | 1998-01-01 | MINAS GERAIS | 0 |
14 | 1998-01-01 | PARANÁ | 0 |
15 | 1998-01-01 | PARAÍBA | 0 |
16 | 1998-01-01 | PARÁ | 0 |
17 | 1998-01-01 | PERNAMBUCO | 0 |
18 | 1998-01-01 | PIAUÍ | 0 |
19 | 1998-01-01 | RIO DE JANEIRO | 0 |
20 | 1998-01-01 | RIO GRANDE DO NORTE | 0 |
21 | 1998-01-01 | RIO GRANDE DO SUL | 0 |
22 | 1998-01-01 | RONDÔNIA | 0 |
23 | 1998-01-01 | RORAIMA | 0 |
24 | 1998-01-01 | SANTA CATARINA | 0 |
25 | 1998-01-01 | SERGIPE | 0 |
26 | 1998-01-01 | SÃO PAULO | 0 |
27 | 1998-01-01 | TOCANTINS | 0 |
28 | 1998-02-01 | ACRE | 0 |
29 | 1998-02-01 | ALAGOAS | 0 |
30 | 1998-02-01 | AMAPÁ | 0 |
31 | 1998-02-01 | AMAZONAS | 0 |
32 | 1998-02-01 | BAHIA | 0 |
33 | 1998-02-01 | CEARÁ | 0 |
34 | 1998-02-01 | DISTRITO FEDERAL | 0 |
35 | 1998-02-01 | ESPÍRITO SANTO | 0 |
36 | 1998-02-01 | GOIÁS | 0 |
37 | 1998-02-01 | MARANHÃO | 0 |
38 | 1998-02-01 | MATO GROSSO | 0 |
39 | 1998-02-01 | MATO GROSSO DO SUL | 0 |
40 | 1998-02-01 | MINAS GERAIS | 0 |
41 | 1998-02-01 | PARANÁ | 0 |
42 | 1998-02-01 | PARAÍBA | 0 |
43 | 1998-02-01 | PARÁ | 0 |
44 | 1998-02-01 | PERNAMBUCO | 0 |
45 | 1998-02-01 | PIAUÍ | 0 |
46 | 1998-02-01 | RIO DE JANEIRO | 0 |
47 | 1998-02-01 | RIO GRANDE DO NORTE | 0 |
48 | 1998-02-01 | RIO GRANDE DO SUL | 0 |
49 | 1998-02-01 | RONDÔNIA | 0 |
50 | 1998-02-01 | RORAIMA | 0 |
51 | 1998-02-01 | SANTA CATARINA | 0 |
52 | 1998-02-01 | SERGIPE | 0 |
53 | 1998-02-01 | SÃO PAULO | 0 |
54 | 1998-02-01 | TOCANTINS | 0 |
55 | 1998-03-01 | ACRE | 0 |
56 | 1998-03-01 | ALAGOAS | 0 |
57 | 1998-03-01 | AMAPÁ | 0 |
58 | 1998-03-01 | AMAZONAS | 0 |
59 | 1998-03-01 | BAHIA | 0 |
60 | 1998-03-01 | CEARÁ | 0 |
61 | 1998-03-01 | DISTRITO FEDERAL | 0 |
62 | 1998-03-01 | ESPÍRITO SANTO | 0 |
63 | 1998-03-01 | GOIÁS | 0 |
64 | 1998-03-01 | MARANHÃO | 0 |
65 | 1998-03-01 | MATO GROSSO | 0 |
66 | 1998-03-01 | MATO GROSSO DO SUL | 0 |
67 | 1998-03-01 | MINAS GERAIS | 0 |
68 | 1998-03-01 | PARANÁ | 0 |
69 | 1998-03-01 | PARAÍBA |