Feature Engineering¶
While preparing our data, we constantly need to think about the most suitable features for achieving our overall goals.
Feature engineering draws on many techniques - too many to cover in this short lesson - so we’ll focus on the most popular ones.
Custom Feature Engineering¶
To build a custom feature, you can operate directly on vDataColumns or use the ‘eval’ method of the vDataFrame. Let’s look at an example with the well-known ‘Titanic’ dataset.
[1]:
import verticapy as vp
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
|  | pclass (Integer) | survived (Integer) | name (Varchar(164)) | sex (Varchar(20)) | age (Numeric(8)) | sibsp (Integer) | parch (Integer) | ticket (Varchar(36)) | fare (Numeric(12)) | cabin (Varchar(30)) | embarked (Varchar(20)) | boat (Varchar(100)) | body (Integer) | home.dest (Varchar(100)) |
| 1 | 1 | 0 | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | [null] | Montreal, PQ / Chesterville, ON | |
| 2 | 1 | 0 | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | 135 | Montreal, PQ / Chesterville, ON | |
| 3 | 1 | 0 | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | [null] | [null] | Montreal, PQ / Chesterville, ON | |
| 4 | 1 | 0 | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | [null] | [null] | Belfast, NI | |
| 5 | 1 | 0 | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | [null] | C | [null] | 22 | Montevideo, Uruguay | |
| 6 | 1 | 0 | male | 47.0 | 1 | 0 | PC 17757 | 227.525 | C62 C64 | C | [null] | 124 | New York, NY | |
| 7 | 1 | 0 | male | [null] | 0 | 0 | PC 17318 | 25.925 | [null] | S | [null] | [null] | New York, NY | |
| 8 | 1 | 0 | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | [null] | [null] | Montreal, PQ | |
| 9 | 1 | 0 | male | 36.0 | 0 | 0 | 13050 | 75.2417 | C6 | C | A | [null] | Winnipeg, MN | |
| 10 | 1 | 0 | male | 25.0 | 0 | 0 | 13905 | 26.0 | [null] | C | [null] | 148 | San Francisco, CA | |
| 11 | 1 | 0 | male | 45.0 | 0 | 0 | 113784 | 35.5 | T | S | [null] | [null] | Trenton, NJ | |
| 12 | 1 | 0 | male | 42.0 | 0 | 0 | 110489 | 26.55 | D22 | S | [null] | [null] | London / Winnipeg, MB | |
| 13 | 1 | 0 | male | 41.0 | 0 | 0 | 113054 | 30.5 | A21 | S | [null] | [null] | Pomeroy, WA | |
| 14 | 1 | 0 | male | 48.0 | 0 | 0 | PC 17591 | 50.4958 | B10 | C | [null] | 208 | Omaha, NE | |
| 15 | 1 | 0 | male | [null] | 0 | 0 | 112379 | 39.6 | [null] | C | [null] | [null] | Philadelphia, PA | |
| 16 | 1 | 0 | male | 45.0 | 0 | 0 | 113050 | 26.55 | B38 | S | [null] | [null] | Washington, DC | |
| 17 | 1 | 0 | male | [null] | 0 | 0 | 113798 | 31.0 | [null] | S | [null] | [null] | [null] | |
| 18 | 1 | 0 | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S | [null] | [null] | New York, NY | |
| 19 | 1 | 0 | male | 28.0 | 0 | 0 | 113059 | 47.1 | [null] | S | [null] | [null] | Montevideo, Uruguay | |
| 20 | 1 | 0 | male | 17.0 | 0 | 0 | 113059 | 47.1 | [null] | S | [null] | [null] | Montevideo, Uruguay | |
| 21 | 1 | 0 | male | 49.0 | 0 | 0 | 19924 | 26.0 | [null] | S | [null] | [null] | Ascot, Berkshire / Rochester, NY | |
| 22 | 1 | 0 | male | 36.0 | 1 | 0 | 19877 | 78.85 | C46 | S | [null] | 172 | Little Onn Hall, Staffs | |
| 23 | 1 | 0 | male | 46.0 | 1 | 0 | W.E.P. 5734 | 61.175 | E31 | S | [null] | [null] | Amenia, ND | |
| 24 | 1 | 0 | male | [null] | 0 | 0 | 112051 | 0.0 | [null] | S | [null] | [null] | Liverpool, England / Belfast | |
| 25 | 1 | 0 | male | 27.0 | 1 | 0 | 13508 | 136.7792 | C89 | C | [null] | [null] | Los Angeles, CA | |
| 26 | 1 | 0 | male | [null] | 0 | 0 | 110465 | 52.0 | A14 | S | [null] | [null] | Stoughton, MA | |
| 27 | 1 | 0 | male | 47.0 | 0 | 0 | 5727 | 25.5875 | E58 | S | [null] | [null] | Victoria, BC | |
| 28 | 1 | 0 | male | 37.0 | 1 | 1 | PC 17756 | 83.1583 | E52 | C | [null] | [null] | Lakewood, NJ | |
| 29 | 1 | 0 | male | [null] | 0 | 0 | 113791 | 26.55 | [null] | S | [null] | [null] | Roachdale, IN | |
| 30 | 1 | 0 | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0 | B22 | S | [null] | 269 | Milwaukee, WI | |
| 31 | 1 | 0 | male | 39.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | [null] | [null] | New York, NY | |
| 32 | 1 | 0 | male | 31.0 | 1 | 0 | F.C. 12750 | 52.0 | B71 | S | [null] | [null] | Montreal, PQ | |
| 33 | 1 | 0 | male | 50.0 | 1 | 0 | PC 17761 | 106.425 | C86 | C | [null] | 62 | Deephaven, MN / Cedar Rapids, IA | |
| 34 | 1 | 0 | male | 39.0 | 0 | 0 | PC 17580 | 29.7 | A18 | C | [null] | 133 | Philadelphia, PA | |
| 35 | 1 | 0 | female | 36.0 | 0 | 0 | PC 17531 | 31.6792 | A29 | C | [null] | [null] | New York, NY | |
| 36 | 1 | 0 | male | [null] | 0 | 0 | PC 17483 | 221.7792 | C95 | S | [null] | [null] | [null] | |
| 37 | 1 | 0 | male | 30.0 | 0 | 0 | 113051 | 27.75 | C111 | C | [null] | [null] | New York, NY | |
| 38 | 1 | 0 | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S | [null] | [null] | Winnipeg, MB | |
| 39 | 1 | 0 | male | 64.0 | 1 | 4 | 19950 | 263.0 | C23 C25 C27 | S | [null] | [null] | Winnipeg, MB | |
| 40 | 1 | 0 | male | [null] | 0 | 0 | 113778 | 26.55 | D34 | S | [null] | [null] | Westcliff-on-Sea, Essex | |
| 41 | 1 | 0 | male | [null] | 0 | 0 | 112058 | 0.0 | B102 | S | [null] | [null] | [null] | |
| 42 | 1 | 0 | male | 37.0 | 1 | 0 | 113803 | 53.1 | C123 | S | [null] | [null] | Scituate, MA | |
| 43 | 1 | 0 | male | 47.0 | 0 | 0 | 111320 | 38.5 | E63 | S | [null] | 275 | St Anne's-on-Sea, Lancashire | |
| 44 | 1 | 0 | male | 24.0 | 0 | 0 | PC 17593 | 79.2 | B86 | C | [null] | [null] | [null] | |
| 45 | 1 | 0 | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | [null] | [null] | New York, NY | |
| 46 | 1 | 0 | male | 38.0 | 0 | 1 | PC 17582 | 153.4625 | C91 | S | [null] | 147 | Winnipeg, MB | |
| 47 | 1 | 0 | male | 46.0 | 0 | 0 | PC 17593 | 79.2 | B82 B84 | C | [null] | [null] | New York, NY | |
| 48 | 1 | 0 | male | [null] | 0 | 0 | 113796 | 42.4 | [null] | S | [null] | [null] | [null] | |
| 49 | 1 | 0 | male | 45.0 | 1 | 0 | 36973 | 83.475 | C83 | S | [null] | [null] | New York, NY | |
| 50 | 1 | 0 | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | [null] | 110 | [null] | |
| 51 | 1 | 0 | male | 55.0 | 1 | 1 | 12749 | 93.5 | B69 | S | [null] | 307 | Montreal, PQ | |
| 52 | 1 | 0 | male | 42.0 | 0 | 0 | 113038 | 42.5 | B11 | S | [null] | [null] | London / Middlesex | |
| 53 | 1 | 0 | male | [null] | 0 | 0 | 17463 | 51.8625 | E46 | S | [null] | [null] | Brighton, MA | |
| 54 | 1 | 0 | male | 55.0 | 0 | 0 | 680 | 50.0 | C39 | S | [null] | [null] | London / Birmingham | |
| 55 | 1 | 0 | male | 42.0 | 1 | 0 | 113789 | 52.0 | [null] | S | [null] | 38 | New York, NY | |
| 56 | 1 | 0 | male | [null] | 0 | 0 | PC 17600 | 30.6958 | [null] | C | 14 | [null] | New York, NY | |
| 57 | 1 | 0 | female | 50.0 | 0 | 0 | PC 17595 | 28.7125 | C49 | C | [null] | [null] | Paris, France New York, NY | |
| 58 | 1 | 0 | male | 46.0 | 0 | 0 | 694 | 26.0 | [null] | S | [null] | 80 | Bennington, VT | |
| 59 | 1 | 0 | male | 50.0 | 0 | 0 | 113044 | 26.0 | E60 | S | [null] | [null] | London | |
| 60 | 1 | 0 | male | 32.5 | 0 | 0 | 113503 | 211.5 | C132 | C | [null] | 45 | [null] | |
| 61 | 1 | 0 | male | 58.0 | 0 | 0 | 11771 | 29.7 | B37 | C | [null] | 258 | Buffalo, NY | |
| 62 | 1 | 0 | male | 41.0 | 1 | 0 | 17464 | 51.8625 | D21 | S | [null] | [null] | Southington / Noank, CT | |
| 63 | 1 | 0 | male | [null] | 0 | 0 | 113028 | 26.55 | C124 | S | [null] | [null] | Portland, OR | |
| 64 | 1 | 0 | male | [null] | 0 | 0 | PC 17612 | 27.7208 | [null] | C | [null] | [null] | Chicago, IL | |
| 65 | 1 | 0 | male | 29.0 | 0 | 0 | 113501 | 30.0 | D6 | S | [null] | 126 | Springfield, MA | |
| 66 | 1 | 0 | male | 30.0 | 0 | 0 | 113801 | 45.5 | [null] | S | [null] | [null] | London / New York, NY | |
| 67 | 1 | 0 | male | 30.0 | 0 | 0 | 110469 | 26.0 | C106 | S | [null] | [null] | Brockton, MA | |
| 68 | 1 | 0 | male | 19.0 | 1 | 0 | 113773 | 53.1 | D30 | S | [null] | [null] | New York, NY | |
| 69 | 1 | 0 | male | 46.0 | 0 | 0 | 13050 | 75.2417 | C6 | C | [null] | 292 | Vancouver, BC | |
| 70 | 1 | 0 | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | [null] | 175 | Dorchester, MA | |
| 71 | 1 | 0 | male | 28.0 | 1 | 0 | PC 17604 | 82.1708 | [null] | C | [null] | [null] | New York, NY | |
| 72 | 1 | 0 | male | 65.0 | 0 | 0 | 13509 | 26.55 | E38 | S | [null] | 249 | East Bridgewater, MA | |
| 73 | 1 | 0 | male | 44.0 | 2 | 0 | 19928 | 90.0 | C78 | Q | [null] | 230 | Fond du Lac, WI | |
| 74 | 1 | 0 | male | 55.0 | 0 | 0 | 113787 | 30.5 | C30 | S | [null] | [null] | Montreal, PQ | |
| 75 | 1 | 0 | male | 47.0 | 0 | 0 | 113796 | 42.4 | [null] | S | [null] | [null] | Washington, DC | |
| 76 | 1 | 0 | male | 37.0 | 0 | 1 | PC 17596 | 29.7 | C118 | C | [null] | [null] | Brooklyn, NY | |
| 77 | 1 | 0 | male | 58.0 | 0 | 2 | 35273 | 113.275 | D48 | C | [null] | 122 | Lexington, MA | |
| 78 | 1 | 0 | male | 64.0 | 0 | 0 | 693 | 26.0 | [null] | S | [null] | 263 | Isle of Wight, England | |
| 79 | 1 | 0 | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C | [null] | 234 | Providence, RI | |
| 80 | 1 | 0 | male | 28.5 | 0 | 0 | PC 17562 | 27.7208 | D43 | C | [null] | 189 | ?Havana, Cuba | |
| 81 | 1 | 0 | male | [null] | 0 | 0 | 112052 | 0.0 | [null] | S | [null] | [null] | Belfast | |
| 82 | 1 | 0 | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S | [null] | 166 | Surbiton Hill, Surrey | |
| 83 | 1 | 0 | male | 23.0 | 0 | 0 | 12749 | 93.5 | B24 | S | [null] | [null] | Montreal, PQ | |
| 84 | 1 | 0 | male | 29.0 | 1 | 0 | 113776 | 66.6 | C2 | S | [null] | [null] | Isleworth, England | |
| 85 | 1 | 0 | male | 18.0 | 1 | 0 | PC 17758 | 108.9 | C65 | C | [null] | [null] | Madrid, Spain | |
| 86 | 1 | 0 | male | 47.0 | 0 | 0 | 110465 | 52.0 | C110 | S | [null] | 207 | Worcester, MA | |
| 87 | 1 | 0 | male | 38.0 | 0 | 0 | 19972 | 0.0 | [null] | S | [null] | [null] | Rotterdam, Netherlands | |
| 88 | 1 | 0 | male | 22.0 | 0 | 0 | PC 17760 | 135.6333 | [null] | C | [null] | 232 | [null] | |
| 89 | 1 | 0 | male | [null] | 0 | 0 | PC 17757 | 227.525 | [null] | C | [null] | [null] | [null] | |
| 90 | 1 | 0 | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S | [null] | [null] | Trenton, NJ | |
| 91 | 1 | 0 | male | [null] | 0 | 0 | 113767 | 50.0 | A32 | S | [null] | [null] | Seattle, WA | |
| 92 | 1 | 0 | male | 36.0 | 0 | 0 | 13049 | 40.125 | A10 | C | [null] | [null] | Winnipeg, MB | |
| 93 | 1 | 0 | male | 55.0 | 1 | 0 | PC 17603 | 59.4 | [null] | C | [null] | [null] | New York, NY | |
| 94 | 1 | 0 | male | 33.0 | 0 | 0 | 113790 | 26.55 | [null] | S | [null] | 109 | London | |
| 95 | 1 | 0 | male | 61.0 | 1 | 3 | PC 17608 | 262.375 | B57 B59 B63 B66 | C | [null] | [null] | Haverford, PA / Cooperstown, NY | |
| 96 | 1 | 0 | male | 50.0 | 1 | 0 | 13507 | 55.9 | E44 | S | [null] | [null] | Duluth, MN | |
| 97 | 1 | 0 | male | 56.0 | 0 | 0 | 113792 | 26.55 | [null] | S | [null] | [null] | New York, NY | |
| 98 | 1 | 0 | male | 56.0 | 0 | 0 | 17764 | 30.6958 | A7 | C | [null] | [null] | St James, Long Island, NY | |
| 99 | 1 | 0 | male | 24.0 | 1 | 0 | 13695 | 60.0 | C31 | S | [null] | [null] | Huntington, WV | |
| 100 | 1 | 0 | male | [null] | 0 | 0 | 113056 | 26.0 | A19 | S | [null] | [null] | Streatham, Surrey |
The feature ‘parch’ is the number of parents and children on board, and ‘sibsp’ is the number of siblings and spouses on board. We can create the feature ‘family_size’, which is equal to parch + sibsp + 1.
[2]:
vdf["family_size"] = vdf["parch"] + vdf["sibsp"] + 1
vdf.select(["parch", "sibsp", "family_size"])
[2]:
|  | parch (Integer) | sibsp (Integer) | family_size (Integer) |
| 1 | 2 | 1 | 4 |
| 2 | 2 | 1 | 4 |
| 3 | 2 | 1 | 4 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |
| 6 | 0 | 1 | 2 |
| 7 | 0 | 0 | 1 |
| 8 | 1 | 0 | 2 |
| 9 | 0 | 0 | 1 |
| 10 | 0 | 0 | 1 |
| 11 | 0 | 0 | 1 |
| 12 | 0 | 0 | 1 |
| 13 | 0 | 0 | 1 |
| 14 | 0 | 0 | 1 |
| 15 | 0 | 0 | 1 |
| 16 | 0 | 0 | 1 |
| 17 | 0 | 0 | 1 |
| 18 | 0 | 0 | 1 |
| 19 | 0 | 0 | 1 |
| 20 | 0 | 0 | 1 |
| 21 | 0 | 0 | 1 |
| 22 | 0 | 1 | 2 |
| 23 | 0 | 1 | 2 |
| 24 | 0 | 0 | 1 |
| 25 | 0 | 1 | 2 |
| 26 | 0 | 0 | 1 |
| 27 | 0 | 0 | 1 |
| 28 | 1 | 1 | 3 |
| 29 | 0 | 0 | 1 |
| 30 | 1 | 1 | 3 |
| 31 | 0 | 1 | 2 |
| 32 | 0 | 1 | 2 |
| 33 | 0 | 1 | 2 |
| 34 | 0 | 0 | 1 |
| 35 | 0 | 0 | 1 |
| 36 | 0 | 0 | 1 |
| 37 | 0 | 0 | 1 |
| 38 | 2 | 3 | 6 |
| 39 | 4 | 1 | 6 |
| 40 | 0 | 0 | 1 |
| 41 | 0 | 0 | 1 |
| 42 | 0 | 1 | 2 |
| 43 | 0 | 0 | 1 |
| 44 | 0 | 0 | 1 |
| 45 | 0 | 0 | 1 |
| 46 | 1 | 0 | 2 |
| 47 | 0 | 0 | 1 |
| 48 | 0 | 0 | 1 |
| 49 | 0 | 1 | 2 |
| 50 | 0 | 0 | 1 |
| 51 | 1 | 1 | 3 |
| 52 | 0 | 0 | 1 |
| 53 | 0 | 0 | 1 |
| 54 | 0 | 0 | 1 |
| 55 | 0 | 1 | 2 |
| 56 | 0 | 0 | 1 |
| 57 | 0 | 0 | 1 |
| 58 | 0 | 0 | 1 |
| 59 | 0 | 0 | 1 |
| 60 | 0 | 0 | 1 |
| 61 | 0 | 0 | 1 |
| 62 | 0 | 1 | 2 |
| 63 | 0 | 0 | 1 |
| 64 | 0 | 0 | 1 |
| 65 | 0 | 0 | 1 |
| 66 | 0 | 0 | 1 |
| 67 | 0 | 0 | 1 |
| 68 | 0 | 1 | 2 |
| 69 | 0 | 0 | 1 |
| 70 | 0 | 0 | 1 |
| 71 | 0 | 1 | 2 |
| 72 | 0 | 0 | 1 |
| 73 | 0 | 2 | 3 |
| 74 | 0 | 0 | 1 |
| 75 | 0 | 0 | 1 |
| 76 | 1 | 0 | 2 |
| 77 | 2 | 0 | 3 |
| 78 | 0 | 0 | 1 |
| 79 | 1 | 0 | 2 |
| 80 | 0 | 0 | 1 |
| 81 | 0 | 0 | 1 |
| 82 | 0 | 0 | 1 |
| 83 | 0 | 0 | 1 |
| 84 | 0 | 1 | 2 |
| 85 | 0 | 1 | 2 |
| 86 | 0 | 0 | 1 |
| 87 | 0 | 0 | 1 |
| 88 | 0 | 0 | 1 |
| 89 | 0 | 0 | 1 |
| 90 | 0 | 0 | 1 |
| 91 | 0 | 0 | 1 |
| 92 | 0 | 0 | 1 |
| 93 | 0 | 1 | 2 |
| 94 | 0 | 0 | 1 |
| 95 | 3 | 1 | 5 |
| 96 | 0 | 1 | 2 |
| 97 | 0 | 0 | 1 |
| 98 | 0 | 0 | 1 |
| 99 | 0 | 1 | 2 |
| 100 | 0 | 0 | 1 |
Behind the scenes, VerticaPy translates these operations into SQL. You can also use the ‘eval’ method to pass any SQL expression directly, and VerticaPy will evaluate it!
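For example, here is a minimal sketch of the same feature built with ‘eval’ (assuming the method accepts a column name and a SQL expression; the column name ‘family_size_sql’ is only illustrative):
# Hypothetical example: pass the SQL expression directly to 'eval'.
vdf.eval(name = "family_size_sql",
         expr = "parch + sibsp + 1")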
Regular Expressions¶
To compute features using regular expressions, we’ll use the ‘regexp’ method.
[3]:
help(vp.vDataFrame.regexp)
Help on function regexp in module verticapy.core.vdataframe._text:
regexp(self, column: str, pattern: str, method: Literal['count', 'ilike', 'instr', 'like', 'not_ilike', 'not_like', 'replace', 'substr'] = 'substr', position: int = 1, occurrence: int = 1, replacement: Optional[str] = None, return_position: int = 0, name: Optional[str] = None) -> 'vDataFrame'
Computes a new vDataColumn based on regular expressions.
Parameters
----------
column: str
Input vDataColumn used to compute the regular
expression.
pattern: str
The regular expression.
method: str, optional
Method used to compute the regular expressions.
count : Returns the number of times a
regular expression matches each
element of the input vDataColumn.
ilike : Returns True if the vDataColumn
element contains a match for the
regular expression.
instr : Returns the starting or ending
position in a vDataColumn element
where a regular expression matches.
like : Returns True if the vDataColumn
element matches the regular
expression.
not_ilike : Returns True if the vDataColumn
element does not match the case-insensitive regular expression.
not_like : Returns True if the vDataColumn
element does not contain a match
for the regular expression.
replace : Replaces all occurrences of a
substring that match a regular
expression with another substring.
substr : Returns the substring that matches
a regular expression within a
vDataColumn.
position: int, optional
The number of characters from the start of the string
where the function should start searching for matches.
occurrence: int, optional
Controls which occurrence of a pattern match in the
string to return.
replacement: str, optional
The string to replace matched substrings.
return_position: int, optional
Sets the position within the string to return.
name: str, optional
New feature name. If empty, a name is generated.
Returns
-------
vDataFrame
self
Consider the following example: notice that passenger names include their title.
[4]:
vdf["name"]
[4]:
(Output: the first 100 rows of the ‘name’ column, Varchar(164).)
Let’s extract the title using regular expressions.
[5]:
vdf.regexp(column = "name",
           name = "title",
           pattern = r" ([A-Za-z]+)\.",
           method = "substr")
vdf.select(["name", "title"])
[5]:
|  | name (Varchar(164)) | title (Varchar(164)) |
| 1 | Miss. | |
| 2 | Mr. | |
| 3 | Mrs. | |
| 4 | Mr. | |
| 5 | Mr. | |
| 6 | Col. | |
| 7 | Mr. | |
| 8 | Mr. | |
| 9 | Mr. | |
| 10 | Mr. | |
| 11 | Mr. | |
| 12 | Mr. | |
| 13 | Mr. | |
| 14 | Mr. | |
| 15 | Dr. | |
| 16 | Major. | |
| 17 | Mr. | |
| 18 | Mr. | |
| 19 | Mr. | |
| 20 | Mr. | |
| 21 | Mr. | |
| 22 | Mr. | |
| 23 | Mr. | |
| 24 | Mr. | |
| 25 | Mr. | |
| 26 | Mr. | |
| 27 | Mr. | |
| 28 | Mr. | |
| 29 | Mr. | |
| 30 | Capt. | |
| 31 | Mr. | |
| 32 | Mr. | |
| 33 | Mr. | |
| 34 | Mr. | |
| 35 | Miss. | |
| 36 | Mr. | |
| 37 | Mr. | |
| 38 | Mr. | |
| 39 | Mr. | |
| 40 | Mr. | |
| 41 | Mr. | |
| 42 | Mr. | |
| 43 | Mr. | |
| 44 | Mr. | |
| 45 | Mr. | |
| 46 | Mr. | |
| 47 | Mr. | |
| 48 | Mr. | |
| 49 | Mr. | |
| 50 | Mr. | |
| 51 | Mr. | |
| 52 | Mr. | |
| 53 | Mr. | |
| 54 | Mr. | |
| 55 | Mr. | |
| 56 | Mr. | |
| 57 | Miss. | |
| 58 | Mr. | |
| 59 | Mr. | |
| 60 | Mr. | |
| 61 | Mr. | |
| 62 | Mr. | |
| 63 | Mr. | |
| 64 | Mr. | |
| 65 | Mr. | |
| 66 | Mr. | |
| 67 | Mr. | |
| 68 | Mr. | |
| 69 | Mr. | |
| 70 | Mr. | |
| 71 | Mr. | |
| 72 | Mr. | |
| 73 | Dr. | |
| 74 | Mr. | |
| 75 | Mr. | |
| 76 | Mr. | |
| 77 | Mr. | |
| 78 | Mr. | |
| 79 | Mr. | |
| 80 | Mr. | |
| 81 | Mr. | |
| 82 | Mr. | |
| 83 | Mr. | |
| 84 | Mr. | |
| 85 | Mr. | |
| 86 | Mr. | |
| 87 | Jonkheer. | |
| 88 | Mr. | |
| 89 | Mr. | |
| 90 | Mr. | |
| 91 | Mr. | |
| 92 | Mr. | |
| 93 | Mr. | |
| 94 | Mr. | |
| 95 | Mr. | |
| 96 | Mr. | |
| 97 | Mr. | |
| 98 | Mr. | |
| 99 | Mr. | |
| 100 | Mr. |
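The other ‘regexp’ methods work the same way. As a minimal sketch (not part of the original example; the ‘is_doctor’ column name is illustrative), the case-insensitive ‘ilike’ method could flag passengers whose name contains the title ‘Dr.’:
# Hypothetical example: True when the name matches the pattern, ignoring case.
vdf.regexp(column = "name",
           name = "is_doctor",
           pattern = r"Dr\.",
           method = "ilike")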
Advanced Analytical Functions¶
The ‘analytic’ method gives you access to many advanced analytical functions in VerticaPy.
[6]:
help(vp.vDataFrame.analytic)
Help on function analytic in module verticapy.core.vdataframe._math:
analytic(self, func: str, columns: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, by: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, order_by: Union[NoneType, Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns'], dict] = None, name: Optional[str] = None, offset: int = 1, x_smoothing: float = 0.5, add_count: bool = True) -> 'vDataFrame'
Adds a new vDataColumn to the vDataFrame by using an advanced
analytical function on one or two specific vDataColumns.
⚠ Warning : Some analytical functions can make the vDataFrame
structure more resource intensive. It is best to
check the structure of the vDataFrame with the
'current_relation' method and save it with the
'to_db' method, using the parameters
'inplace = True' and 'relation_type = table'.
Parameters
----------
func: str
Function to apply.
aad : average absolute deviation
beta : Beta Coefficient between 2 vDataColumns
count : number of non-missing elements
corr : Pearson's correlation between 2 vDataColumns
cov : covariance between 2 vDataColumns
dense_rank : dense rank
ema : exponential moving average
first_value : first non null lead
iqr : interquartile range
kurtosis : kurtosis
jb : Jarque-Bera index
lead : next element
lag : previous element
last_value : first non null lag
mad : median absolute deviation
max : maximum
mean : average
median : median
min : minimum
mode : most frequent element
q% : q quantile (ex: 50% for the median)
pct_change : ratio between the current value and the previous one
percent_rank : percent rank
prod : product
range : difference between the max and the min
rank : rank
row_number : row number
sem : standard error of the mean
skewness : skewness
sum : sum
std : standard deviation
unique : cardinality (count distinct)
var : variance
Other analytical functions could work if they are part of your DB
version.
columns: SQLColumns, optional
Input vDataColumns. Must be a list of one or two elements.
by: SQLColumns, optional
vDataColumns used in the partition.
order_by: dict / list, optional
Either a list of the vDataColumns used to sort (in ascending order)
the data, or a dictionary of vDataColumns and their sorting
methods. For example, to sort by "column1" ASC and "column2" DESC,
write: {"column1": "asc", "column2": "desc"}
name: str, optional
Name of the new vDataColumn. If empty, a default name based on the
other parameters is generated.
offset: int, optional
Lead/Lag offset if parameter 'func' is the function 'lead'/'lag'.
x_smoothing: float, optional
The smoothing parameter of the 'ema' if the function is 'ema'. It
must be a float in the range [0;1].
add_count: bool, optional
If the 'func' is set to 'mode' and this parameter is True, a column
with the mode number of occurrences is added to the vDataFrame.
Returns
-------
vDataFrame
self
To demonstrate some of these techniques, let’s use the Amazon dataset and perform some computations.
[7]:
from verticapy.datasets import load_amazon
vdf = load_amazon()
display(vdf)
|  | date (Date) | state (Varchar(32)) | number (Integer) |
| 1 | 1998-01-01 | ACRE | 0 |
| 2 | 1998-01-01 | ALAGOAS | 0 |
| 3 | 1998-01-01 | AMAPÁ | 0 |
| 4 | 1998-01-01 | AMAZONAS | 0 |
| 5 | 1998-01-01 | BAHIA | 0 |
| 6 | 1998-01-01 | CEARÁ | 0 |
| 7 | 1998-01-01 | DISTRITO FEDERAL | 0 |
| 8 | 1998-01-01 | ESPÍRITO SANTO | 0 |
| 9 | 1998-01-01 | GOIÁS | 0 |
| 10 | 1998-01-01 | MARANHÃO | 0 |
| 11 | 1998-01-01 | MATO GROSSO | 0 |
| 12 | 1998-01-01 | MATO GROSSO DO SUL | 0 |
| 13 | 1998-01-01 | MINAS GERAIS | 0 |
| 14 | 1998-01-01 | PARANÁ | 0 |
| 15 | 1998-01-01 | PARAÍBA | 0 |
| 16 | 1998-01-01 | PARÁ | 0 |
| 17 | 1998-01-01 | PERNAMBUCO | 0 |
| 18 | 1998-01-01 | PIAUÍ | 0 |
| 19 | 1998-01-01 | RIO DE JANEIRO | 0 |
| 20 | 1998-01-01 | RIO GRANDE DO NORTE | 0 |
| 21 | 1998-01-01 | RIO GRANDE DO SUL | 0 |
| 22 | 1998-01-01 | RONDÔNIA | 0 |
| 23 | 1998-01-01 | RORAIMA | 0 |
| 24 | 1998-01-01 | SANTA CATARINA | 0 |
| 25 | 1998-01-01 | SERGIPE | 0 |
| 26 | 1998-01-01 | SÃO PAULO | 0 |
| 27 | 1998-01-01 | TOCANTINS | 0 |
| 28 | 1998-02-01 | ACRE | 0 |
| 29 | 1998-02-01 | ALAGOAS | 0 |
| 30 | 1998-02-01 | AMAPÁ | 0 |
| 31 | 1998-02-01 | AMAZONAS | 0 |
| 32 | 1998-02-01 | BAHIA | 0 |
| 33 | 1998-02-01 | CEARÁ | 0 |
| 34 | 1998-02-01 | DISTRITO FEDERAL | 0 |
| 35 | 1998-02-01 | ESPÍRITO SANTO | 0 |
| 36 | 1998-02-01 | GOIÁS | 0 |
| 37 | 1998-02-01 | MARANHÃO | 0 |
| 38 | 1998-02-01 | MATO GROSSO | 0 |
| 39 | 1998-02-01 | MATO GROSSO DO SUL | 0 |
| 40 | 1998-02-01 | MINAS GERAIS | 0 |
| 41 | 1998-02-01 | PARANÁ | 0 |
| 42 | 1998-02-01 | PARAÍBA | 0 |
| 43 | 1998-02-01 | PARÁ | 0 |
| 44 | 1998-02-01 | PERNAMBUCO | 0 |
| 45 | 1998-02-01 | PIAUÍ | 0 |
| 46 | 1998-02-01 | RIO DE JANEIRO | 0 |
| 47 | 1998-02-01 | RIO GRANDE DO NORTE | 0 |
| 48 | 1998-02-01 | RIO GRANDE DO SUL | 0 |
| 49 | 1998-02-01 | RONDÔNIA | 0 |
| 50 | 1998-02-01 | RORAIMA | 0 |
| 51 | 1998-02-01 | SANTA CATARINA | 0 |
| 52 | 1998-02-01 | SERGIPE | 0 |
| 53 | 1998-02-01 | SÃO PAULO | 0 |
| 54 | 1998-02-01 | TOCANTINS | 0 |
| 55 | 1998-03-01 | ACRE | 0 |
| 56 | 1998-03-01 | ALAGOAS | 0 |
| 57 | 1998-03-01 | AMAPÁ | 0 |
| 58 | 1998-03-01 | AMAZONAS | 0 |
| 59 | 1998-03-01 | BAHIA | 0 |
| 60 | 1998-03-01 | CEARÁ | 0 |
| 61 | 1998-03-01 | DISTRITO FEDERAL | 0 |
| 62 | 1998-03-01 | ESPÍRITO SANTO | 0 |
| 63 | 1998-03-01 | GOIÁS | 0 |
| 64 | 1998-03-01 | MARANHÃO | 0 |
| 65 | 1998-03-01 | MATO GROSSO | 0 |
| 66 | 1998-03-01 | MATO GROSSO DO SUL | 0 |
| 67 | 1998-03-01 | MINAS GERAIS | 0 |
| 68 | 1998-03-01 | PARANÁ | 0 |
| 69 | 1998-03-01 | PARAÍBA | 0 |
| 70 | 1998-03-01 | PARÁ | 0 |
| 71 | 1998-03-01 | PERNAMBUCO | 0 |
| 72 | 1998-03-01 | PIAUÍ | 0 |
| 73 | 1998-03-01 | RIO DE JANEIRO | 0 |
| 74 | 1998-03-01 | RIO GRANDE DO NORTE | 0 |
| 75 | 1998-03-01 | RIO GRANDE DO SUL | 0 |
| 76 | 1998-03-01 | RONDÔNIA | 0 |
| 77 | 1998-03-01 | RORAIMA | 0 |
| 78 | 1998-03-01 | SANTA CATARINA | 0 |
| 79 | 1998-03-01 | SERGIPE | 0 |
| 80 | 1998-03-01 | SÃO PAULO | 0 |
| 81 | 1998-03-01 | TOCANTINS | 0 |
| 82 | 1998-04-01 | ACRE | 0 |
| 83 | 1998-04-01 | ALAGOAS | 0 |
| 84 | 1998-04-01 | AMAPÁ | 0 |
| 85 | 1998-04-01 | AMAZONAS | 0 |
| 86 | 1998-04-01 | BAHIA | 0 |
| 87 | 1998-04-01 | CEARÁ | 0 |
| 88 | 1998-04-01 | DISTRITO FEDERAL | 0 |
| 89 | 1998-04-01 | ESPÍRITO SANTO | 0 |
| 90 | 1998-04-01 | GOIÁS | 0 |
| 91 | 1998-04-01 | MARANHÃO | 0 |
| 92 | 1998-04-01 | MATO GROSSO | 0 |
| 93 | 1998-04-01 | MATO GROSSO DO SUL | 0 |
| 94 | 1998-04-01 | MINAS GERAIS | 0 |
| 95 | 1998-04-01 | PARANÁ | 0 |
| 96 | 1998-04-01 | PARAÍBA | 0 |
| 97 | 1998-04-01 | PARÁ | 0 |
| 98 | 1998-04-01 | PERNAMBUCO | 0 |
| 99 | 1998-04-01 | PIAUÍ | 0 |
| 100 | 1998-04-01 | RIO DE JANEIRO | 0 |
For each state, let’s compute the number of forest fires recorded in the previous period.
[8]:
vdf.analytic(name = "previous_number",
func = "lag",
columns = "number",
by = ["state"],
order_by = {"date": "asc"})
[8]:
|  | date (Date) | state (Varchar(32)) | number (Integer) | previous_number (Integer) |
| 1 | 1998-01-01 | PARÁ | 0 | [null] |
| 2 | 1998-02-01 | PARÁ | 0 | 0 |
| 3 | 1998-03-01 | PARÁ | 0 | 0 |
| 4 | 1998-04-01 | PARÁ | 0 | 0 |
| 5 | 1998-05-01 | PARÁ | 0 | 0 |
| 6 | 1998-06-01 | PARÁ | 58 | 0 |
| 7 | 1998-07-01 | PARÁ | 638 | 58 |
| 8 | 1998-08-01 | PARÁ | 6815 | 638 |
| 9 | 1998-09-01 | PARÁ | 6237 | 6815 |
| 10 | 1998-10-01 | PARÁ | 3665 | 6237 |
| 11 | 1998-11-01 | PARÁ | 1861 | 3665 |
| 12 | 1998-12-01 | PARÁ | 1008 | 1861 |
| 13 | 1999-01-01 | PARÁ | 87 | 1008 |
| 14 | 1999-02-01 | PARÁ | 285 | 87 |
| 15 | 1999-03-01 | PARÁ | 7 | 285 |
| 16 | 1999-04-01 | PARÁ | 6 | 7 |
| 17 | 1999-05-01 | PARÁ | 4 | 6 |
| 18 | 1999-06-01 | PARÁ | 37 | 4 |
| 19 | 1999-07-01 | PARÁ | 982 | 37 |
| 20 | 1999-08-01 | PARÁ | 6178 | 982 |
| 21 | 1999-09-01 | PARÁ | 4419 | 6178 |
| 22 | 1999-10-01 | PARÁ | 5026 | 4419 |
| 23 | 1999-11-01 | PARÁ | 2682 | 5026 |
| 24 | 1999-12-01 | PARÁ | 766 | 2682 |
| 25 | 2000-01-01 | PARÁ | 30 | 766 |
| 26 | 2000-02-01 | PARÁ | 29 | 30 |
| 27 | 2000-03-01 | PARÁ | 181 | 29 |
| 28 | 2000-04-01 | PARÁ | 40 | 181 |
| 29 | 2000-05-01 | PARÁ | 12 | 40 |
| 30 | 2000-06-01 | PARÁ | 59 | 12 |
| 31 | 2000-07-01 | PARÁ | 426 | 59 |
| 32 | 2000-08-01 | PARÁ | 5484 | 426 |
| 33 | 2000-09-01 | PARÁ | 2569 | 5484 |
| 34 | 2000-10-01 | PARÁ | 4313 | 2569 |
| 35 | 2000-11-01 | PARÁ | 3366 | 4313 |
| 36 | 2000-12-01 | PARÁ | 1698 | 3366 |
| 37 | 2001-01-01 | PARÁ | 34 | 1698 |
| 38 | 2001-02-01 | PARÁ | 16 | 34 |
| 39 | 2001-03-01 | PARÁ | 0 | 16 |
| 40 | 2001-04-01 | PARÁ | 0 | 0 |
| 41 | 2001-05-01 | PARÁ | 7 | 0 |
| 42 | 2001-06-01 | PARÁ | 42 | 7 |
| 43 | 2001-07-01 | PARÁ | 708 | 42 |
| 44 | 2001-08-01 | PARÁ | 9043 | 708 |
| 45 | 2001-09-01 | PARÁ | 6139 | 9043 |
| 46 | 2001-10-01 | PARÁ | 5096 | 6139 |
| 47 | 2001-11-01 | PARÁ | 4938 | 5096 |
| 48 | 2001-12-01 | PARÁ | 2564 | 4938 |
| 49 | 2002-01-01 | PARÁ | 144 | 2564 |
| 50 | 2002-02-01 | PARÁ | 120 | 144 |
| 51 | 2002-03-01 | PARÁ | 14 | 120 |
| 52 | 2002-04-01 | PARÁ | 11 | 14 |
| 53 | 2002-05-01 | PARÁ | 18 | 11 |
| 54 | 2002-06-01 | PARÁ | 149 | 18 |
| 55 | 2002-07-01 | PARÁ | 772 | 149 |
| 56 | 2002-08-01 | PARÁ | 15664 | 772 |
| 57 | 2002-09-01 | PARÁ | 5609 | 15664 |
| 58 | 2002-10-01 | PARÁ | 5189 | 5609 |
| 59 | 2002-11-01 | PARÁ | 7996 | 5189 |
| 60 | 2002-12-01 | PARÁ | 3118 | 7996 |
| 61 | 2003-01-01 | PARÁ | 900 | 3118 |
| 62 | 2003-02-01 | PARÁ | 34 | 900 |
| 63 | 2003-03-01 | PARÁ | 27 | 34 |
| 64 | 2003-04-01 | PARÁ | 14 | 27 |
| 65 | 2003-05-01 | PARÁ | 3 | 14 |
| 66 | 2003-06-01 | PARÁ | 64 | 3 |
| 67 | 2003-07-01 | PARÁ | 1571 | 64 |
| 68 | 2003-08-01 | PARÁ | 6255 | 1571 |
| 69 | 2003-09-01 | PARÁ | 8447 | 6255 |
| 70 | 2003-10-01 | PARÁ | 3803 | 8447 |
| 71 | 2003-11-01 | PARÁ | 6195 | 3803 |
| 72 | 2003-12-01 | PARÁ | 4787 | 6195 |
| 73 | 2004-01-01 | PARÁ | 579 | 4787 |
| 74 | 2004-02-01 | PARÁ | 20 | 579 |
| 75 | 2004-03-01 | PARÁ | 22 | 20 |
| 76 | 2004-04-01 | PARÁ | 23 | 22 |
| 77 | 2004-05-01 | PARÁ | 44 | 23 |
| 78 | 2004-06-01 | PARÁ | 340 | 44 |
| 79 | 2004-07-01 | PARÁ | 4436 | 340 |
| 80 | 2004-08-01 | PARÁ | 12812 | 4436 |
| 81 | 2004-09-01 | PARÁ | 9471 | 12812 |
| 82 | 2004-10-01 | PARÁ | 4726 | 9471 |
| 83 | 2004-11-01 | PARÁ | 9842 | 4726 |
| 84 | 2004-12-01 | PARÁ | 5508 | 9842 |
| 85 | 2005-01-01 | PARÁ | 1768 | 5508 |
| 86 | 2005-02-01 | PARÁ | 85 | 1768 |
| 87 | 2005-03-01 | PARÁ | 21 | 85 |
| 88 | 2005-04-01 | PARÁ | 14 | 21 |
| 89 | 2005-05-01 | PARÁ | 37 | 14 |
| 90 | 2005-06-01 | PARÁ | 304 | 37 |
| 91 | 2005-07-01 | PARÁ | 4364 | 304 |
| 92 | 2005-08-01 | PARÁ | 13055 | 4364 |
| 93 | 2005-09-01 | PARÁ | 9032 | 13055 |
| 94 | 2005-10-01 | PARÁ | 5083 | 9032 |
| 95 | 2005-11-01 | PARÁ | 4968 | 5083 |
| 96 | 2005-12-01 | PARÁ | 1495 | 4968 |
| 97 | 2006-01-01 | PARÁ | 403 | 1495 |
| 98 | 2006-02-01 | PARÁ | 49 | 403 |
| 99 | 2006-03-01 | PARÁ | 10 | 49 |
| 100 | 2006-04-01 | PARÁ | 2 | 10 |
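The ‘analytic’ method also covers simple partitioned aggregates. As a minimal sketch (the ‘avg_number_by_state’ column name is illustrative), we could add each state’s average number of fires:
# Hypothetical example: mean of 'number' partitioned by 'state'.
vdf.analytic(name = "avg_number_by_state",
             func = "mean",
             columns = "number",
             by = ["state"])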
Moving Windows¶
Moving windows are a powerful tool. In VerticaPy, they are managed by the ‘rolling’ method.
[9]:
help(vp.vDataFrame.rolling)
Help on function rolling in module verticapy.core.vdataframe._rolling:
rolling(self, func: str, window: Union[list, tuple], columns: Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns'], by: Optional[Annotated[Union[str, list[str]], 'STRING representing one column or a list of columns']] = None, order_by: Union[NoneType, dict, list] = None, name: Optional[str] = None) -> 'vDataFrame'
Adds a new vDataColumn to the vDataFrame by using an
advanced analytical window function on one or two
specific vDataColumns.
⚠ Warning : Some window functions can make the
vDataFrame structure heavier. It is
recommended to always check the current
structure with the 'current_relation'
method and to save it with the 'to_db'
method, using the parameters 'inplace
= True' and 'relation_type = table'.
Parameters
----------
func: str
Function to use.
aad : average absolute deviation
beta : Beta Coefficient between 2 vDataColumns
count : number of non-missing elements
corr : Pearson correlation between 2 vDataColumns
cov : covariance between 2 vDataColumns
kurtosis : kurtosis
jb : Jarque-Bera index
max : maximum
mean : average
min : minimum
prod : product
range : difference between the max and the min
sem : standard error of the mean
skewness : skewness
sum : sum
std : standard deviation
var : variance
Other window functions could work if they are part of
the DB version you are using.
window: list / tuple
Window Frame Range.
If set to two integers, computes a Row Window, otherwise
it computes a Time Window. For example, if set to
(-5, 1), the moving windows will take 5 rows preceding
and one following. If set to ('- 5 minutes', '0 minutes'),
the moving window will take all elements of the last 5
minutes.
columns: SQLColumns
Input vDataColumns. Must be a list of one or two elements.
by: SQLColumns, optional
vDataColumns used in the partition.
order_by: dict / list, optional
List of the vDataColumns used to sort the data using
ascending/descending order or a dictionary of all the
sorting methods.
For example, to sort by "column1" ASC and "column2" DESC,
use: {"column1": "asc", "column2": "desc"}.
name: str, optional
Name of the new vDataColumn. If empty, a default name is
generated.
Returns
-------
vDataFrame
self
For each state, let’s compute the total number of forest fires over a window spanning the three months preceding and the two months following each date.
[10]:
vdf.rolling(name = "number_3mp_2mf",
func = "sum",
window = ("- 3 months", "2 months"),
columns = "number",
by = ["state"],
order_by = {"date": "asc"},)
[10]:
|  | date (Date) | state (Varchar(32)) | number (Integer) | previous_number (Integer) | number_3mp_2mf (Integer) |
| 1 | 1998-01-01 | PARÁ | 0 | [null] | 0 |
| 2 | 1998-02-01 | PARÁ | 0 | 0 | 0 |
| 3 | 1998-03-01 | PARÁ | 0 | 0 | 0 |
| 4 | 1998-04-01 | PARÁ | 0 | 0 | 0 |
| 5 | 1998-05-01 | PARÁ | 0 | 0 | 58 |
| 6 | 1998-06-01 | PARÁ | 58 | 0 | 696 |
| 7 | 1998-07-01 | PARÁ | 638 | 58 | 7511 |
| 8 | 1998-08-01 | PARÁ | 6815 | 638 | 13748 |
| 9 | 1998-09-01 | PARÁ | 6237 | 6815 | 17355 |
| 10 | 1998-10-01 | PARÁ | 3665 | 6237 | 18578 |
| 11 | 1998-11-01 | PARÁ | 1861 | 3665 | 12771 |
| 12 | 1998-12-01 | PARÁ | 1008 | 1861 | 6621 |
| 13 | 1999-01-01 | PARÁ | 87 | 1008 | 3248 |
| 14 | 1999-02-01 | PARÁ | 285 | 87 | 1393 |
| 15 | 1999-03-01 | PARÁ | 7 | 285 | 1393 |
| 16 | 1999-04-01 | PARÁ | 6 | 7 | 389 |
| 17 | 1999-05-01 | PARÁ | 4 | 6 | 339 |
| 18 | 1999-06-01 | PARÁ | 37 | 4 | 1029 |
| 19 | 1999-07-01 | PARÁ | 982 | 37 | 7201 |
| 20 | 1999-08-01 | PARÁ | 6178 | 982 | 11616 |
| 21 | 1999-09-01 | PARÁ | 4419 | 6178 | 16605 |
| 22 | 1999-10-01 | PARÁ | 5026 | 4419 | 18305 |
| 23 | 1999-11-01 | PARÁ | 2682 | 5026 | 12893 |
| 24 | 1999-12-01 | PARÁ | 766 | 2682 | 8504 |
| 25 | 2000-01-01 | PARÁ | 30 | 766 | 3688 |
| 26 | 2000-02-01 | PARÁ | 29 | 30 | 1046 |
| 27 | 2000-03-01 | PARÁ | 181 | 29 | 280 |
| 28 | 2000-04-01 | PARÁ | 40 | 181 | 262 |
| 29 | 2000-05-01 | PARÁ | 12 | 40 | 321 |
| 30 | 2000-06-01 | PARÁ | 59 | 12 | 537 |
| 31 | 2000-07-01 | PARÁ | 426 | 59 | 5981 |
| 32 | 2000-08-01 | PARÁ | 5484 | 426 | 8538 |
| 33 | 2000-09-01 | PARÁ | 2569 | 5484 | 12792 |
| 34 | 2000-10-01 | PARÁ | 4313 | 2569 | 15732 |
| 35 | 2000-11-01 | PARÁ | 3366 | 4313 | 11946 |
| 36 | 2000-12-01 | PARÁ | 1698 | 3366 | 9411 |
| 37 | 2001-01-01 | PARÁ | 34 | 1698 | 5114 |
| 38 | 2001-02-01 | PARÁ | 16 | 34 | 1748 |
| 39 | 2001-03-01 | PARÁ | 0 | 16 | 1748 |
| 40 | 2001-04-01 | PARÁ | 0 | 0 | 57 |
| 41 | 2001-05-01 | PARÁ | 7 | 0 | 65 |
| 42 | 2001-06-01 | PARÁ | 42 | 7 | 757 |
| 43 | 2001-07-01 | PARÁ | 708 | 42 | 9800 |
| 44 | 2001-08-01 | PARÁ | 9043 | 708 | 15932 |
| 45 | 2001-09-01 | PARÁ | 6139 | 9043 | 20986 |
| 46 | 2001-10-01 | PARÁ | 5096 | 6139 | 25216 |
| 47 | 2001-11-01 | PARÁ | 4938 | 5096 | 18737 |
| 48 | 2001-12-01 | PARÁ | 2564 | 4938 | 12742 |
| 49 | 2002-01-01 | PARÁ | 144 | 2564 | 7780 |
| 50 | 2002-02-01 | PARÁ | 120 | 144 | 2853 |
| 51 | 2002-03-01 | PARÁ | 14 | 120 | 2853 |
| 52 | 2002-04-01 | PARÁ | 11 | 14 | 307 |
| 53 | 2002-05-01 | PARÁ | 18 | 11 | 312 |
| 54 | 2002-06-01 | PARÁ | 149 | 18 | 950 |
| 55 | 2002-07-01 | PARÁ | 772 | 149 | 16603 |
| 56 | 2002-08-01 | PARÁ | 15664 | 772 | 22194 |
| 57 | 2002-09-01 | PARÁ | 5609 | 15664 | 27234 |
| 58 | 2002-10-01 | PARÁ | 5189 | 5609 | 34458 |
| 59 | 2002-11-01 | PARÁ | 7996 | 5189 | 21912 |
| 60 | 2002-12-01 | PARÁ | 3118 | 7996 | 17203 |
| 61 | 2003-01-01 | PARÁ | 900 | 3118 | 12075 |
| 62 | 2003-02-01 | PARÁ | 34 | 900 | 4093 |
| 63 | 2003-03-01 | PARÁ | 27 | 34 | 4093 |
| 64 | 2003-04-01 | PARÁ | 14 | 27 | 978 |
| 65 | 2003-05-01 | PARÁ | 3 | 14 | 142 |
| 66 | 2003-06-01 | PARÁ | 64 | 3 | 1652 |
| 67 | 2003-07-01 | PARÁ | 1571 | 64 | 7893 |
| 68 | 2003-08-01 | PARÁ | 6255 | 1571 | 16337 |
| 69 | 2003-09-01 | PARÁ | 8447 | 6255 | 20076 |
| 70 | 2003-10-01 | PARÁ | 3803 | 8447 | 24700 |
| 71 | 2003-11-01 | PARÁ | 6195 | 3803 | 23232 |
| 72 | 2003-12-01 | PARÁ | 4787 | 6195 | 15364 |
| 73 | 2004-01-01 | PARÁ | 579 | 4787 | 11603 |
| 74 | 2004-02-01 | PARÁ | 20 | 579 | 5431 |
| 75 | 2004-03-01 | PARÁ | 22 | 20 | 644 |
| 76 | 2004-04-01 | PARÁ | 23 | 22 | 109 |
| 77 | 2004-05-01 | PARÁ | 44 | 23 | 449 |
| 78 | 2004-06-01 | PARÁ | 340 | 44 | 4843 |
| 79 | 2004-07-01 | PARÁ | 4436 | 340 | 17632 |
| 80 | 2004-08-01 | PARÁ | 12812 | 4436 | 27059 |
| 81 | 2004-09-01 | PARÁ | 9471 | 12812 | 31445 |
| 82 | 2004-10-01 | PARÁ | 4726 | 9471 | 36851 |
| 83 | 2004-11-01 | PARÁ | 9842 | 4726 | 29547 |
| 84 | 2004-12-01 | PARÁ | 5508 | 9842 | 21844 |
| 85 | 2005-01-01 | PARÁ | 1768 | 5508 | 17224 |
| 86 | 2005-02-01 | PARÁ | 85 | 1768 | 7396 |
| 87 | 2005-03-01 | PARÁ | 21 | 85 | 7396 |
| 88 | 2005-04-01 | PARÁ | 14 | 21 | 1925 |
| 89 | 2005-05-01 | PARÁ | 37 | 14 | 461 |
| 90 | 2005-06-01 | PARÁ | 304 | 37 | 4719 |
| 91 | 2005-07-01 | PARÁ | 4364 | 304 | 17760 |
| 92 | 2005-08-01 | PARÁ | 13055 | 4364 | 26755 |
| 93 | 2005-09-01 | PARÁ | 9032 | 13055 | 31534 |
| 94 | 2005-10-01 | PARÁ | 5083 | 9032 | 32138 |
| 95 | 2005-11-01 | PARÁ | 4968 | 5083 | 20578 |
| 96 | 2005-12-01 | PARÁ | 1495 | 4968 | 11949 |
| 97 | 2006-01-01 | PARÁ | 403 | 1495 | 6925 |
| 98 | 2006-02-01 | PARÁ | 49 | 403 | 1959 |
| 99 | 2006-03-01 | PARÁ | 10 | 49 | 1959 |
| 100 | 2006-04-01 | PARÁ | 2 | 10 | 464 |
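Row-based windows work the same way: passing two integers instead of time intervals defines the frame in rows. As a minimal sketch (the ‘number_ma_6rows’ column name is illustrative), a six-row moving average would look like this:
# Hypothetical example: 5 preceding rows plus the current row.
vdf.rolling(name = "number_ma_6rows",
            func = "mean",
            window = (-5, 0),
            columns = "number",
            by = ["state"],
            order_by = {"date": "asc"})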
Moving windows give us infinite possibilities for creating new features.
After we’ve finished preparing our data, our next task is to create a machine learning model.