VerticaPy

Python API for Vertica Data Science at Scale

Quick Start

First, install the API using the pip command.

root@ubuntu:~$ pip3 install verticapy

You can then choose which type of connection you want to use.

  • vertica_python (Native Python Client)
  • pyodbc (ODBC)
  • jaydebeapi (JDBC)

These modules create a Database Cursor, which you can use to communicate with your Vertica database.

For example, use the following command to install the vertica_python module.

root@ubuntu:~$ pip3 install vertica_python

If you already have a database DSN, you can use it to create database cursor.

In [3]:
from verticapy import *
cur = vertica_conn("VerticaDSN").cursor()

The DSN (Data Source Name) will be used to create a Native connection.

Otherwise, you can set up a permanent auto-connection.

In [4]:
from verticapy.connections.connect import *
new_auto_connection({"host": "10.211.55.14", 
                     "port": "5433", 
                     "database": "testdb", 
                     "password": "XxX", 
                     "user": "dbadmin"},
                    name = "VerticaDSN")
change_auto_connection("VerticaDSN")

You can start experimenting with the data in your database.

In [5]:
vDataFrame("public.iris")
123
SepalLengthCm
Numeric(5,2)
Abc
Species
Varchar(30)
123
PetalLengthCm
Numeric(5,2)
123
SepalWidthCm
Numeric(5,2)
123
PetalWidthCm
Numeric(5,2)
14.30Iris-setosa1.103.000.10
24.40Iris-setosa1.402.900.20
34.40Iris-setosa1.303.000.20
44.40Iris-setosa1.303.200.20
54.50Iris-setosa1.302.300.30
64.60Iris-setosa1.503.100.20
74.60Iris-setosa1.403.200.20
84.60Iris-setosa1.403.400.30
94.60Iris-setosa1.003.600.20
104.70Iris-setosa1.303.200.20
114.70Iris-setosa1.603.200.20
124.80Iris-setosa1.403.000.10
134.80Iris-setosa1.403.000.30
144.80Iris-setosa1.603.100.20
154.80Iris-setosa1.603.400.20
164.80Iris-setosa1.903.400.20
174.90Iris-versicolor3.302.401.00
184.90Iris-virginica4.502.501.70
194.90Iris-setosa1.403.000.20
204.90Iris-setosa1.503.100.10
214.90Iris-setosa1.503.100.10
224.90Iris-setosa1.503.100.10
235.00Iris-versicolor3.502.001.00
245.00Iris-versicolor3.302.301.00
255.00Iris-setosa1.603.000.20
265.00Iris-setosa1.203.200.20
275.00Iris-setosa1.403.300.20
285.00Iris-setosa1.503.400.20
295.00Iris-setosa1.603.400.40
305.00Iris-setosa1.303.500.30
315.00Iris-setosa1.603.500.60
325.00Iris-setosa1.403.600.20
335.10Iris-versicolor3.002.501.10
345.10Iris-setosa1.703.300.50
355.10Iris-setosa1.503.400.20
365.10Iris-setosa1.403.500.20
375.10Iris-setosa1.403.500.30
385.10Iris-setosa1.503.700.40
395.10Iris-setosa1.503.800.30
405.10Iris-setosa1.603.800.20
415.10Iris-setosa1.903.800.40
425.20Iris-versicolor3.902.701.40
435.20Iris-setosa1.403.400.20
445.20Iris-setosa1.503.500.20
455.20Iris-setosa1.504.100.10
465.30Iris-setosa1.503.700.20
475.40Iris-versicolor4.503.001.50
485.40Iris-setosa1.503.400.40
495.40Iris-setosa1.703.400.20
505.40Iris-setosa1.503.700.20
515.40Iris-setosa1.303.900.40
525.40Iris-setosa1.703.900.40
535.50Iris-versicolor4.002.301.30
545.50Iris-versicolor3.702.401.00
555.50Iris-versicolor3.802.401.10
565.50Iris-versicolor4.002.501.30
575.50Iris-versicolor4.402.601.20
585.50Iris-setosa1.303.500.20
595.50Iris-setosa1.404.200.20
605.60Iris-versicolor3.902.501.10
615.60Iris-versicolor4.202.701.30
625.60Iris-virginica4.902.802.00
635.60Iris-versicolor3.602.901.30
645.60Iris-versicolor4.103.001.30
655.60Iris-versicolor4.503.001.50
665.70Iris-virginica5.002.502.00
675.70Iris-versicolor3.502.601.00
685.70Iris-versicolor4.102.801.30
695.70Iris-versicolor4.502.801.30
705.70Iris-versicolor4.202.901.30
715.70Iris-versicolor4.203.001.20
725.70Iris-setosa1.703.800.30
735.70Iris-setosa1.504.400.40
745.80Iris-versicolor4.002.601.20
755.80Iris-versicolor3.902.701.20
765.80Iris-versicolor4.102.701.00
775.80Iris-virginica5.102.701.90
785.80Iris-virginica5.102.701.90
795.80Iris-virginica5.102.802.40
805.80Iris-setosa1.204.000.20
815.90Iris-versicolor4.203.001.50
825.90Iris-virginica5.103.001.80
835.90Iris-versicolor4.803.201.80
846.00Iris-versicolor4.002.201.00
856.00Iris-virginica5.002.201.50
866.00Iris-versicolor5.102.701.60
876.00Iris-versicolor4.502.901.50
886.00Iris-virginica4.803.001.80
896.00Iris-versicolor4.503.401.60
906.10Iris-virginica5.602.601.40
916.10Iris-versicolor4.002.801.30
926.10Iris-versicolor4.702.801.20
936.10Iris-versicolor4.702.901.40
946.10Iris-versicolor4.603.001.40
956.10Iris-virginica4.903.001.80
966.20Iris-versicolor4.502.201.50
976.20Iris-virginica4.802.801.80
986.20Iris-versicolor4.302.901.30
996.20Iris-virginica5.403.402.30
1006.30Iris-versicolor4.402.301.30
Out[5]:
Rows: 1-100 of 150 | Columns: 5

If you don't have data on hand, you can load some well-known datasets.

In [19]:
from verticapy.learn.datasets import load_titanic
vdf = load_titanic()

We can start exploring our data.

In [7]:
vdf
123
fare
Numeric(10,5)
123
survived
Int
Abc
sex
Varchar(20)
Abc
boat
Varchar(100)
123
pclass
Int
123
age
Numeric(6,3)
Abc
ticket
Varchar(36)
Abc
Varchar(164)
Abc
embarked
Varchar(20)
Abc
cabin
Varchar(30)
123
body
Int
123
parch
Int
Abc
home.dest
Varchar(100)
123
sibsp
Int
1151.550000female[null]12.000113781SC22 C26[null]2Montreal, PQ / Chesterville, ON1
2151.550000male[null]130.000113781SC22 C261352Montreal, PQ / Chesterville, ON1
3151.550000female[null]125.000113781SC22 C26[null]2Montreal, PQ / Chesterville, ON1
40.000000male[null]139.000112050SA36[null]0Belfast, NI0
549.504200male[null]171.000PC 17609C[null]220Montevideo, Uruguay0
6227.525000male[null]147.000PC 17757CC62 C641240New York, NY1
725.925000male[null]1[null]PC 17318S[null][null]0New York, NY0
8247.520800male[null]124.000PC 17558CB58 B60[null]1Montreal, PQ0
975.241700maleA136.00013050CC6[null]0Winnipeg, MN0
1026.000000male[null]125.00013905C[null]1480San Francisco, CA0
1135.500000male[null]145.000113784ST[null]0Trenton, NJ0
1226.550000male[null]142.000110489SD22[null]0London / Winnipeg, MB0
1330.500000male[null]141.000113054SA21[null]0Pomeroy, WA0
1450.495800male[null]148.000PC 17591CB102080Omaha, NE0
1539.600000male[null]1[null]112379C[null][null]0Philadelphia, PA0
1626.550000male[null]145.000113050SB38[null]0Washington, DC0
1731.000000male[null]1[null]113798S[null][null]0[null]0
185.000000male[null]133.000695SB51 B53 B55[null]0New York, NY0
1947.100000male[null]128.000113059S[null][null]0Montevideo, Uruguay0
2047.100000male[null]117.000113059S[null][null]0Montevideo, Uruguay0
2126.000000male[null]149.00019924S[null][null]0Ascot, Berkshire / Rochester, NY0
2278.850000male[null]136.00019877SC461720Little Onn Hall, Staffs1
2361.175000male[null]146.000W.E.P. 5734SE31[null]0Amenia, ND1
240.000000male[null]1[null]112051S[null][null]0Liverpool, England / Belfast0
25136.779200male[null]127.00013508CC89[null]0Los Angeles, CA1
2652.000000male[null]1[null]110465SA14[null]0Stoughton, MA0
2725.587500male[null]147.0005727SE58[null]0Victoria, BC0
2883.158300male[null]137.000PC 17756CE52[null]1Lakewood, NJ1
2926.550000male[null]1[null]113791S[null][null]0Roachdale, IN0
3071.000000male[null]170.000WE/P 5735SB222691Milwaukee, WI1
3171.283300male[null]139.000PC 17599CC85[null]0New York, NY1
3252.000000male[null]131.000F.C. 12750SB71[null]0Montreal, PQ1
33106.425000male[null]150.000PC 17761CC86620Deephaven, MN / Cedar Rapids, IA1
3429.700000male[null]139.000PC 17580CA181330Philadelphia, PA0
3531.679200female[null]136.000PC 17531CA29[null]0New York, NY0
36221.779200male[null]1[null]PC 17483SC95[null]0[null]0
3727.750000male[null]130.000113051CC111[null]0New York, NY0
38263.000000male[null]119.00019950SC23 C25 C27[null]2Winnipeg, MB3
39263.000000male[null]164.00019950SC23 C25 C27[null]4Winnipeg, MB1
4026.550000male[null]1[null]113778SD34[null]0Westcliff-on-Sea, Essex0
410.000000male[null]1[null]112058SB102[null]0[null]0
4253.100000male[null]137.000113803SC123[null]0Scituate, MA1
4338.500000male[null]147.000111320SE632750St Anne's-on-Sea, Lancashire0
4479.200000male[null]124.000PC 17593CB86[null]0[null]0
4534.654200male[null]171.000PC 17754CA5[null]0New York, NY0
46153.462500male[null]138.000PC 17582SC911471Winnipeg, MB0
4779.200000male[null]146.000PC 17593CB82 B84[null]0New York, NY0
4842.400000male[null]1[null]113796S[null][null]0[null]0
4983.475000male[null]145.00036973SC83[null]0New York, NY1
500.000000male[null]140.000112059SB941100[null]0
5193.500000male[null]155.00012749SB693071Montreal, PQ1
5242.500000male[null]142.000113038SB11[null]0London / Middlesex0
5351.862500male[null]1[null]17463SE46[null]0Brighton, MA0
5450.000000male[null]155.000680SC39[null]0London / Birmingham0
5552.000000male[null]142.000113789S[null]380New York, NY1
5630.695800male141[null]PC 17600C[null][null]0New York, NY0
5728.712500female[null]150.000PC 17595CC49[null]0Paris, France New York, NY0
5826.000000male[null]146.000694S[null]800Bennington, VT0
5926.000000male[null]150.000113044SE60[null]0London0
60211.500000male[null]132.500113503CC132450[null]0
6129.700000male[null]158.00011771CB372580Buffalo, NY0
6251.862500male[null]141.00017464SD21[null]0Southington / Noank, CT1
6326.550000male[null]1[null]113028SC124[null]0Portland, OR0
6427.720800male[null]1[null]PC 17612C[null][null]0Chicago, IL0
6530.000000male[null]129.000113501SD61260Springfield, MA0
6645.500000male[null]130.000113801S[null][null]0London / New York, NY0
6726.000000male[null]130.000110469SC106[null]0Brockton, MA0
6853.100000male[null]119.000113773SD30[null]0New York, NY1
6975.241700male[null]146.00013050CC62920Vancouver, BC0
7051.862500male[null]154.00017463SE461750Dorchester, MA0
7182.170800male[null]128.000PC 17604C[null][null]0New York, NY1
7226.550000male[null]165.00013509SE382490East Bridgewater, MA0
7390.000000male[null]144.00019928QC782300Fond du Lac, WI2
7430.500000male[null]155.000113787SC30[null]0Montreal, PQ0
7542.400000male[null]147.000113796S[null][null]0Washington, DC0
7629.700000male[null]137.000PC 17596CC118[null]1Brooklyn, NY0
77113.275000male[null]158.00035273CD481222Lexington, MA0
7826.000000male[null]164.000693S[null]2630Isle of Wight, England0
7961.979200male[null]165.000113509CB302341Providence, RI0
8027.720800male[null]128.500PC 17562CD431890?Havana, Cuba0
810.000000male[null]1[null]112052S[null][null]0Belfast0
8228.500000male[null]145.500113043SC1241660Surbiton Hill, Surrey0
8393.500000male[null]123.00012749SB24[null]0Montreal, PQ0
8466.600000male[null]129.000113776SC2[null]0Isleworth, England1
85108.900000male[null]118.000PC 17758CC65[null]0Madrid, Spain1
8652.000000male[null]147.000110465SC1102070Worcester, MA0
870.000000male[null]138.00019972S[null][null]0Rotterdam, Netherlands0
88135.633300male[null]122.000PC 17760C[null]2320[null]0
89227.525000male[null]1[null]PC 17757C[null][null]0[null]0
9050.495800male[null]131.000PC 17590SA24[null]0Trenton, NJ0
9150.000000male[null]1[null]113767SA32[null]0Seattle, WA0
9240.125000male[null]136.00013049CA10[null]0Winnipeg, MB0
9359.400000male[null]155.000PC 17603C[null][null]0New York, NY1
9426.550000male[null]133.000113790S[null]1090London0
95262.375000male[null]161.000PC 17608CB57 B59 B63 B66[null]3Haverford, PA / Cooperstown, NY1
9655.900000male[null]150.00013507SE44[null]0Duluth, MN1
9726.550000male[null]156.000113792S[null][null]0New York, NY0
9830.695800male[null]156.00017764CA7[null]0St James, Long Island, NY0
9960.000000male[null]124.00013695SC31[null]0Huntington, WV1
10026.000000male[null]1[null]113056SA19[null]0Streatham, Surrey0
Out[7]:
Rows: 1-100 of 1234 | Columns: 14

With the exception of aggregations, everything is happening in the database and nothing is loaded in memory. You can view the generated SQL code using the sql_on_off method.

In [8]:
vdf.sql_on_off()
vdf.describe()

Computes the descriptive statistics of all the numerical columns using SUMMARIZE_NUMCOL.

  SELECT
    SUMMARIZE_NUMCOL("fare", "survived", "pclass", "age", "body", "parch", "sibsp") OVER ()  
  FROM
"public"."titanic"

Computes the different aggregations.

  SELECT
    COUNT(DISTINCT "age"),
    COUNT(DISTINCT "body"),
    COUNT(DISTINCT "fare"),
    COUNT(DISTINCT "parch"),
    COUNT(DISTINCT "pclass"),
    COUNT(DISTINCT "sibsp"),
    COUNT(DISTINCT "survived")  
  FROM
"public"."titanic" LIMIT 1
countmeanstdmin25%50%75%maxunique
age99730.152457372116314.43530462991590.3321.028.039.080.096.0
body118164.1440677966196.57602075578081.079.25160.5257.5328.0118.0
fare123333.96379367396652.64607298312930.07.895814.454231.3875512.3292277.0
parch12340.3784440842787680.8686047077903930.00.00.00.09.08.0
pclass12342.284440842787680.8424856361902921.01.03.03.03.03.0
sibsp12340.5040518638573741.041117272416290.00.00.01.08.07.0
survived12340.3646677471636960.4815320186412880.00.00.01.01.02.0
Out[8]:

VerticaPy is smart enough to not recompute the same aggregation twice. Each virtual column has its own catalog which will be updated as the user modifies it.

In [9]:
vdf["age"].catalog
Out[9]:
{'25%': 21,
 '50%': 28,
 '75%': 39,
 'avg': 30.1524573721163,
 'biserial': {},
 'count': 997,
 'cov': {},
 'cramer': {},
 'kendall': {},
 'max': 80,
 'min': 0.33,
 'pearson': {},
 'regr_avgx': {},
 'regr_avgy': {},
 'regr_count': {},
 'regr_intercept': {},
 'regr_r2': {},
 'regr_slope': {},
 'regr_sxx': {},
 'regr_sxy': {},
 'regr_syy': {},
 'spearman': {},
 'std': 14.4353046299159,
 'unique': 96}

Draw the dataset Correlation Matrix.

In [5]:
vdf.corr(method = "spearman")
"pclass"
"survived"
"age"
"sibsp"
"parch"
"fare"
"body"
"split"
"pclass"1.0-0.335634606444129-0.167006853613121-0.0691851178284916-0.0429882472107741-0.7225221647915550.03060035832153550.00608935050829333
"survived"-0.3356346064441291.0-0.09097639691725840.09343035183924840.1716914555791240.3224096131114810.2110607214419230.0594649418525532
"age"-0.167006853613121-0.09097639691725841.0-0.158699476099267-0.2353446209476840.00451935857538285-0.02066172112957970.0103956591655203
"sibsp"-0.06918511782849160.0934303518392484-0.1586994760992671.00.4365249859077550.4439429239672830.0332791733089436-0.0220509572388223
"parch"-0.04298824721077410.171691455579124-0.2353446209476840.4365249859077551.00.4023733695150660.0606076641278389-0.0414420260192756
"fare"-0.7225221647915550.3224096131114810.004519358575382850.4439429239672830.4023733695150661.0-0.000856180961294435-0.0174756880618063
"body"0.03060035832153550.211060721441923-0.02066172112957970.03327917330894360.0606076641278389-0.0008561809612944351.00.0123031213548403
"split"0.006089350508293330.05946494185255320.0103956591655203-0.0220509572388223-0.0414420260192756-0.01747568806180630.01230312135484031.0
Out[5]:

Use statistical tests.

In [7]:
vdf.testjb("age")
We test the following hypothesis:
(H0) The distribution of "age" is not normal
(H1) The distribution of "age" is normal
👍 - The distribution of "age" might be normal
jb = 28.5338631758186
p_value = 0.0
Out[7]:
(True, 28.5338631758186, 0.0)

Two lines of codes are enough to prepare your data and to train your ML models.

In [3]:
# Modules
from verticapy.learn.model_selection import cross_validate
from verticapy.learn.ensemble import RandomForestClassifier

# Droping the existing relation.
drop_view("titanic_clean")
The view titanic_clean was successfully dropped.
In [20]:
# Data Preparation
vdf["sex"].label_encode()["boat"].fillna(method = "0ifnull")["name"].str_extract(
    ' ([A-Za-z]+)\.').eval("family_size", expr = "parch + sibsp + 1").drop(
    columns = ["cabin", "body", "ticket", "home.dest"])["fare"].fill_outliers().fillna(
    ).to_db("titanic_clean")

# Model Evaluation
cross_validate(RandomForestClassifier("rf_titanic", max_leaf_nodes = 100, n_estimators = 30), 
               "titanic_clean", 
               ["age", "family_size", "sex", "pclass", "fare", "boat"], 
               "survived", 
               cutoff = 0.35)
795 element(s) was/were filled
aucprc_aucaccuracylog_lossprecisionrecallf1_scoremccinformednessmarkednesscsi
1-fold0.99415799432355760.98862802520383630.9628770301624130.03909312650874710.93548387096774190.96026490066225170.96227110728769820.9191317978207570.92455061494796590.91374474053295930.9006211180124224
2-fold0.99249664429530180.98881399082494490.96992481203007520.04229941211317390.97241379310344820.94630872483221470.96478638184245660.93559819028891160.93030872483221480.94091773011132230.9215686274509803
3-fold0.99618110236220460.99320768209056250.98267326732673270.03161893119621040.97986577181208060.97333333333333340.98070490004549480.96284998687352730.96152230971128620.96417949730227660.954248366013072
avg0.99427858032702130.99021656603978130.9718250365064070.037670489939377130.96258781196109020.95996898627593330.96925412972521650.93919332499439870.93879388316382230.93961398931551940.9254793704921582
std0.00184518658944420370.0025920507759281350.010033986807936090.0054805190685542690.023766588095405260.0135147341872752460.0099960870953227820.022079713736287170.019892837589252240.025242642049556340.027026669520704615
Out[20]:

VerticaPy lets you to bring the logic to the data with just a few lines of codes.

Enjoy!