VerticaPy

Python API for Vertica Data Science at Scale

Best Practices

As with all software, VerticaPy has a performance cost. To get the best performance, we need to understand the architectures of Vertica and VerticaPy. In this lesson, we'll go through some optimization steps.

1. Optimize your architecture at the database-level

At the end of the day, VerticaPy is an abstraction of SQL, so any database-level optimizations you make carry over to VerticaPy.

Optimizing Vertica mostly comes down to optimizing projections. Plan your data architecture before creating a vDataFrame so that you only select the essential columns.

In the following example, we use the 'usecols' parameter of the vDataFrame to select only the essential columns in our dataset.

In [61]:
from verticapy import *
vdf = vDataFrame("public.titanic",
                 usecols = ["survived", "pclass", "age"])
display(vdf)
[Output: table with columns "survived" (Int), "pclass" (Int), and "age" (Numeric(6,3))]
Rows: 1-100 of 1234 | Columns: 3

2. Save the current relation when you can

The vDataFrame works just like a view. If the final generated relation uses a lot of different functions, it will drastically increase the computation time for each method call.

Smaller transformations won't slow down the process much, but heavier transformations (multiple joins, heavy use of advanced analytical functions, moving windows, etc.) can cause noticeable slowdowns. If you make these kinds of changes, you should save the vDataFrame structure. Let's look at an example.
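To see why this matters, here's a toy sketch (pure Python string-building with an invented column expression, not actual VerticaPy internals) of how each transformation wraps the previous relation in another subquery, which is exactly why the generated SQL nests several VERTICAPY_SUBTABLE levels:

```python
# Toy illustration: each vDataFrame transformation wraps the previous
# relation in another subquery, so the generated SQL grows deeper
# with every method call.
def wrap(relation, select_clause):
    return f"(SELECT {select_clause} FROM {relation}) VERTICAPY_SUBTABLE"

rel = '"public"."titanic"'
for clause in ['*',                                # plain projection
               'COALESCE("age", 30) AS "age", *',  # a fillna-like step
               '*']:                               # another pass
    rel = wrap(rel, clause)

# Three transformations -> three nested subqueries.
print(rel.count("VERTICAPY_SUBTABLE"))  # 3
```

Each extra nesting level is SQL that Vertica must re-evaluate on every method call, which is the motivation for materializing the relation once it gets heavy.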

In [62]:
# Perform multiple operations on the vDataFrame
vdf = vDataFrame("public.titanic")
vdf["sex"].label_encode()["boat"].fillna(method = "0ifnull")["name"].str_extract(
    ' ([A-Za-z]+)\.').eval("family_size", expr = "parch + sibsp + 1").drop(
    columns = ["cabin", "body", "ticket", "home.dest"])["fare"].fill_outliers().fillna()
print(vdf.current_relation())
795 element(s) was/were filled
(
   SELECT
     COALESCE("fare", 32.9113074018842) AS "fare",
     "survived",
     "sex",
     "boat",
     "pclass",
     "age",
     "name",
     "embarked",
     "parch",
     "sibsp",
     "family_size" 
   FROM
 (
   SELECT
     (CASE WHEN "fare" < -176.6204982585513 THEN -176.6204982585513 WHEN "fare" > 244.5480856064831 THEN 244.5480856064831 ELSE "fare" END) AS "fare",
     "survived",
     DECODE("sex", 'female', 0, 'male', 1, 2) AS "sex",
     DECODE("boat", NULL, 0, 1) AS "boat",
     "pclass",
     COALESCE("age", 30.1524573721163) AS "age",
     REGEXP_SUBSTR("name", ' ([A-Za-z]+)\.') AS "name",
     COALESCE("embarked", 'S') AS "embarked",
     "parch",
     "sibsp",
     parch + sibsp + 1 AS "family_size" 
   FROM
 (
   SELECT
     "fare",
     "survived",
     "sex",
     "boat",
     "pclass",
     "age",
     "name",
     "embarked",
     "parch",
     "sibsp" 
   FROM
 "public"."titanic") 
VERTICAPY_SUBTABLE) 
VERTICAPY_SUBTABLE) 
VERTICAPY_SUBTABLE

We can look at the query plan of the new relation. This will help us understand how Vertica will execute the different aggregations.

In [3]:
print(vdf.explain())
------------------------------ 
QUERY PLAN DESCRIPTION: 

EXPLAIN SELECT * FROM (SELECT "survived", "boat", "embarked", "sibsp", COALESCE("fare", 32.9113074018842) AS "fare", "sex", "pclass", "age", "name", "parch", "family_size" FROM (SELECT "survived", DECODE("boat", NULL, 0, 1) AS "boat", COALESCE("embarked", 'S') AS "embarked", "sibsp", (CASE WHEN "fare" < -176.6204982585513 THEN -176.6204982585513 WHEN "fare" > 244.5480856064831 THEN 244.5480856064831 ELSE "fare" END) AS "fare", DECODE("sex", 'female', 0, 'male', 1, 2) AS "sex", "pclass", COALESCE("age", 30.1524573721163) AS "age", REGEXP_SUBSTR("name", ' ([A-Za-z]+)\.') AS "name", "parch", parch + sibsp + 1 AS "family_size" FROM (SELECT "survived", "boat", "embarked", "sibsp", "fare", "sex", "pclass", "age", "name", "parch" FROM "public"."titanic") VERTICAPY_SUBTABLE) VERTICAPY_SUBTABLE) VERTICAPY_SUBTABLE

Access Path:
+-STORAGE ACCESS for titanic [Cost: 6K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Projection: public.titanic_super
|  Materialize: titanic.pclass, titanic.survived, titanic.name, titanic.sex, titanic.age, titanic.sibsp, titanic.parch, titanic.fare, titanic.embarked, titanic.boat


----------------------------------------------- 
PLAN: BASE QUERY PLAN (GraphViz Format)
----------------------------------------------- 
digraph G {
graph [rankdir=BT, label = "BASE QUERY PLAN
	Query: EXPLAIN SELECT * FROM (SELECT \"survived\", \"boat\", \"embarked\", \"sibsp\", COALESCE(\"fare\", 32.9113074018842) AS \"fare\", \"sex\", \"pclass\", \"age\", \"name\", \"parch\", \"family_size\" FROM (SELECT \"survived\", DECODE(\"boat\", NULL, 0, 1) AS \"boat\", COALESCE(\"embarked\", \'S\') AS \"embarked\", \"sibsp\", (CASE WHEN \"fare\" \< -176.6204982585513 THEN -176.6204982585513 WHEN \"fare\" \> 244.5480856064831 THEN 244.5480856064831 ELSE \"fare\" END) AS \"fare\", DECODE(\"sex\", \'female\', 0, \'male\', 1, 2) AS \"sex\", \"pclass\", COALESCE(\"age\", 30.1524573721163) AS \"age\", REGEXP_SUBSTR(\"name\", \' ([A-Za-z]+)\.\') AS \"name\", \"parch\", parch + sibsp + 1 AS \"family_size\" FROM (SELECT \"survived\", \"boat\", \"embarked\", \"sibsp\", \"fare\", \"sex\", \"pclass\", \"age\", \"name\", \"parch\" FROM \"public\".\"titanic\") VERTICAPY_SUBTABLE) VERTICAPY_SUBTABLE) VERTICAPY_SUBTABLE
	
	All Nodes Vector: 
	
	  node[0]=v_testdb_node0001 (initiator) Up
	", labelloc=t, labeljust=l ordering=out]
0[label = "Root 
	OutBlk=[UncTuple(11)]", color = "green", shape = "house"];
1[label = "NewEENode 
	OutBlk=[UncTuple(11)]", color = "green", shape = "box"];
2[label = "StorageUnionStep: titanic_super
	Unc: Integer(8)
	Unc: Integer(8)
	Unc: Varchar(20)
	Unc: Integer(8)
	Unc: Numeric(18, 13)
	Unc: Integer(8)
	Unc: Integer(8)
	Unc: Numeric(16, 13)
	Unc: Varchar(164)
	Unc: Integer(8)
	Unc: Integer(8)", color = "purple", shape = "box"];
3[label = "ExprEval: 
	  titanic.survived
	  CASE titanic.boat WHEN NULLSEQUAL NULL THEN 0 ELSE 1 END
	  coalesce(titanic.embarked, \'S\')
	  titanic.sibsp
	  coalesce(CASE WHEN (titanic.fare \< (-176.6204982585513)) THEN (-176.6204982585513) WHEN (titanic.fare \> 244.5480856064831) THEN 244.5480856064831 ELSE titanic.fare END, 32.9113074018842)
	  CASE titanic.sex WHEN NULLSEQUAL \'female\' THEN 0 WHEN NULLSEQUAL \'male\' THEN 1 ELSE 2 END
	  titanic.pclass
	  coalesce(titanic.age, 30.1524573721163)
	  regexp_substr(titanic.name, E\' ([A-Za-z]+)\\.\', 1, 1, \'\', 0)
	  titanic.parch
	  ((titanic.parch + titanic.sibsp) + 1)
	Unc: Integer(8)
	Unc: Integer(8)
	Unc: Varchar(20)
	Unc: Integer(8)
	Unc: Numeric(18, 13)
	Unc: Integer(8)
	Unc: Integer(8)
	Unc: Numeric(16, 13)
	Unc: Varchar(164)
	Unc: Integer(8)
	Unc: Integer(8)", color = "brown", shape = "box"];
4[label = "ScanStep: titanic_super
	pclass
	survived
	name
	sex
	age
	sibsp
	parch
	fare
	embarked
	boat
	Unc: Integer(8)
	Unc: Integer(8)
	Unc: Varchar(164)
	Unc: Varchar(20)
	Unc: Numeric(6, 3)
	Unc: Integer(8)
	Unc: Integer(8)
	Unc: Numeric(10, 5)
	Unc: Varchar(20)
	Unc: Varchar(100)", color = "brown", shape = "box"];
1->0 [label = "V[0] C=11", color = "black", style="bold", arrowtail="inv"];
2->1 [label = "0", color = "blue"];
3->2 [label = "0", color = "blue"];
4->3 [label = "0", color = "blue"];}

We performed plenty of operations, and each subsequent method call will reuse this heavily nested relation for its computations. We can save the result as a table in the Vertica database and use the 'inplace' parameter to replace the current relation of the vDataFrame with the new one.

In [6]:
drop("public.titanic_clean")
vdf.to_db("public.titanic_clean",
          relation_type = "table",
          inplace = True)
print(vdf.current_relation())
"public"."titanic_clean"

When we're dealing with very large datasets, we have to think a bit before saving transformations. Ideally, you'll want to do a proper data exploration first and then perform the heavier transformations only when they're really needed.

3. Stick to essential columns

Columnar databases perform faster queries when there are fewer columns. Since Vertica is a columnar MPP database, most of its optimizations are made in projections. VerticaPy doesn't manage this part, so it's important that the data you're working with is well-organized, particularly for larger volumes of data.

Most vDataFrame methods will automatically pick up all numerical columns - even if doing so has a significant performance impact - so it's important to be a little picky and only select the essential columns for any given use case. Let's look at an example.

In [65]:
set_option("sql_on", True)
vdf.avg()

Computes the different aggregations.

  SELECT
    AVG("fare"),
    AVG("survived"),
    AVG("pclass"),
    AVG("age"),
    AVG("body"),
    AVG("parch"),
    AVG("sibsp")  
  FROM
"public"."titanic" LIMIT 1
            avg
"fare"      33.9637936739659
"survived"  0.364667747163695
"pclass"    2.28444084278768
"age"       30.1524573721163
"body"      164.14406779661
"parch"     0.378444084278768
"sibsp"     0.504051863857374
Out[65]:

Here, we didn't use the 'columns' parameter to pick any specific columns, so we ended up computing the average of every numerical column of the vDataFrame. This isn't a big deal for smaller volumes of data (less than a TB), but with larger volumes, we have to be more careful about which columns we use.

In [66]:
vdf.avg(columns = ["age", "survived"])
            avg
"age"       30.1524573721163
"survived"  0.364667747163695
Out[66]:

If you just want to exclude a few columns, you can list all the columns with the 'get_columns' method and specify the unwanted ones with its 'exclude_columns' parameter.

In [9]:
vdf.get_columns()
Out[9]:
['"survived"',
 '"embarked"',
 '"fare"',
 '"sibsp"',
 '"age"',
 '"pclass"',
 '"sex"',
 '"name"',
 '"family_size"',
 '"boat"',
 '"parch"']
In [10]:
vdf.get_columns(exclude_columns = ["boat", "embarked"])
Out[10]:
['"survived"',
 '"fare"',
 '"sibsp"',
 '"age"',
 '"pclass"',
 '"sex"',
 '"name"',
 '"family_size"',
 '"parch"']

If you only want numerical columns, you can use the 'numcol' method. It works the same way as 'get_columns', so you can exclude columns in the same way.

In [11]:
vdf.numcol()
Out[11]:
['"survived"',
 '"fare"',
 '"sibsp"',
 '"age"',
 '"pclass"',
 '"sex"',
 '"family_size"',
 '"boat"',
 '"parch"']
In [12]:
vdf.numcol(exclude_columns = ["body", "sibsp"])
Out[12]:
['"survived"',
 '"fare"',
 '"age"',
 '"pclass"',
 '"sex"',
 '"family_size"',
 '"boat"',
 '"parch"']

Let's compute a correlation matrix of our numerical columns, excluding 'parch' and 'sibsp'.

In [14]:
vdf.corr(columns = vdf.numcol(exclude_columns = ["parch", "sibsp"]))

4. Use the help function

The 'help' function is very useful for quickly viewing a method's parameters.

In [15]:
help(vdf.agg)
Help on method agg in module verticapy.vdataframe:

agg(func:list, columns:list=[]) method of verticapy.vdataframe.vDataFrame instance
    ---------------------------------------------------------------------------
    Aggregates the vDataFrame using the input functions.
    
    Parameters
    ----------
    func: list
            List of the different aggregation.
                    approx_unique  : approximative cardinality
                    count          : number of non-missing elements
                    cvar           : conditional value at risk
                    dtype          : virtual column type
                    iqr            : interquartile range
                    kurtosis       : kurtosis
                    jb             : Jarque Bera index 
                    mad            : median absolute deviation
                    mae            : mean absolute error (deviation)
                    max            : maximum
                    mean           : average
                    median         : median
                    min            : minimum
                    mode           : most occurent element
                    percent        : percent of non-missing elements
                    q%             : q quantile (ex: 50% for the median)
                    prod           : product
                    range          : difference between the max and the min
                    sem            : standard error of the mean
                    skewness       : skewness
                    sum            : sum
                    std            : standard deviation
                    topk           : kth most occurent element (ex: top1 for the mode)
                    topk_percent   : kth most occurent element density
                    unique         : cardinality (count distinct)
                    var            : variance
                            Other aggregations could work if it is part of 
                            the DB version you are using.
    columns: list, optional
            List of the vcolumns names. If empty, all the vcolumns 
            or only numerical vcolumns will be used depending on the
            aggregations.
    
    Returns
    -------
    tablesample
            An object containing the result. For more information, see
            utilities.tablesample.
    
    See Also
    --------
    vDataFrame.analytic : Adds a new vcolumn to the vDataFrame by using an advanced 
            analytical function on a specific vcolumn.

5. Managing database cursors

More open connections to the database increase contention on the system, so close your connections when you're done with them. VerticaPy simplifies the connection process by letting you create an auto-connection, but it's generally best to manage your own cursor when you can.

To demonstrate, let's create a database connection. Once we're done, we'll call 'close()' on our connection and cursor.

In [17]:
import vertica_python

conn_info = {'host': "10.211.55.14", 
             'port': 5433, 
             'user': "dbadmin", 
             'password': "XxX", 
             'database': "testdb"}
conn = vertica_python.connect(**conn_info)
cur = conn.cursor()

We can use it to create a vDataFrame and perform some operations on the data.

In [67]:
vdf = vDataFrame("public.titanic", cur)
vdf["sex"].label_encode()["boat"].fillna(method = "0ifnull")["name"].str_extract(
    ' ([A-Za-z]+)\.').eval("family_size", expr = "parch + sibsp + 1").drop(
    columns = ["cabin", "body", "ticket", "home.dest"])["fare"].fill_outliers().fillna()
795 element(s) was/were filled
[Output: table with columns "fare" (Float), "survived" (Int), "sex" (Int), "boat" (Bool), "pclass" (Int), "age" (Float), "name" (Varchar(164)), "embarked" (Varchar(20)), "parch" (Int), "sibsp" (Int), "family_size" (Integer)]
Out[67]:
Rows: 1-100 of 1234 | Columns: 11

We can then close the connection when we are done.

In [19]:
cur.close()
conn.close()

It is very important to follow this process when you're working in an environment with multiple users.
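Closing can also be automated with context managers so the cursor and connection are released even if an exception occurs mid-query. Below is a minimal sketch of the pattern using Python's standard-library sqlite3 (so it runs anywhere); the same structure applies to a vertica_python connection and cursor:

```python
from contextlib import closing
import sqlite3

# The pattern: open a connection, use it, and guarantee it is closed.
# sqlite3 stands in for vertica_python here; with Vertica you would
# wrap vertica_python.connect(**conn_info) the same way.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("CREATE TABLE titanic_demo (survived INTEGER)")
        cur.executemany("INSERT INTO titanic_demo VALUES (?)",
                        [(1,), (0,), (1,)])
        cur.execute("SELECT SUM(survived) FROM titanic_demo")
        total = cur.fetchone()[0]
# Both the cursor and the connection are closed here, even on error.
print(total)  # 2
```

Explicit `cur.close()` and `conn.close()` calls, as in the cells above, work just as well; the context-manager form simply makes it harder to forget.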

6. Understand the time complexity of methods

Some techniques are more computationally expensive than others. For example, a 'kendall' correlation is far more expensive than a 'pearson' correlation: 'kendall' uses a cross join, giving it a time complexity of O(n^2), where n is the number of rows. We'll demonstrate this with the 'titanic' dataset.

In [68]:
vdf = vDataFrame("public.titanic")
set_option("time_on", True)
print("Pearson")
x = vdf.corr(method = "pearson", show = False)
print("Kendall")
x = vdf.corr(method = "kendall", show = False)
Pearson
Execution: 0.012s
Kendall
Execution: 1.997s

As you can see, the Kendall correlation matrix is noticeably slower (here, more than 100 times slower than Pearson) because of its quadratic time complexity. Keep this in mind when applying such methods to larger datasets.
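To see where the quadratic cost comes from, here's a self-contained, brute-force Kendall tau-a in pure Python (an illustrative toy, not VerticaPy's actual SQL implementation): every one of the n*(n-1)/2 row pairs must be compared, whereas a Pearson correlation needs only a single pass over the rows.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Brute-force Kendall tau-a: compares all n*(n-1)/2 pairs (O(n^2))."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        d = (x[i] - x[j]) * (y[i] - y[j])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Perfectly concordant series -> tau = 1.0
print(kendall_tau_a([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
# Perfectly discordant series -> tau = -1.0
print(kendall_tau_a([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```

Doubling the number of rows roughly quadruples the pair count, which matches the slowdown observed above.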

7. Limit the number of elements in a plot

Graphics are a powerful way to understand data, but graphs with too many elements can be difficult to parse. Let's draw a multi-histogram where one column is categorical with thousands of categories.

In [28]:
vdf.hist(["name", "survived"])
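One way to keep such plots readable (a plain-Python sketch, not a VerticaPy API) is to reduce a high-cardinality column to its top-k categories plus an 'others' bucket before plotting:

```python
from collections import Counter

def top_k_with_others(values, k=5):
    """Keep the k most frequent categories; lump the rest into 'others'."""
    counts = Counter(values)
    top = dict(counts.most_common(k))
    others = sum(c for cat, c in counts.items() if cat not in top)
    if others:
        top["others"] = others
    return top

# Toy stand-in for the extracted titles in the 'name' column.
names = ["Mr."] * 50 + ["Miss."] * 20 + ["Mrs."] * 15 + ["Dr."] * 3 + ["Col."] * 2
print(top_k_with_others(names, k=3))
# {'Mr.': 50, 'Miss.': 20, 'Mrs.': 15, 'others': 5}
```

A histogram over the few resulting buckets stays legible, while the long tail of rare categories is summarized rather than drawn as thousands of near-empty bars.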