VerticaPy

Python API for Vertica Data Science at Scale

Descriptive Statistics

The easiest way to understand data is to aggregate it. An aggregation is a number or a category which summarizes the data. VerticaPy lets you compute all well-known aggregation in a single line.

The 'agg' method is the best way to compute multiple aggregations on multiple columns at the same time.

In [17]:
import verticapy as vp
help(vp.vDataFrame.agg)
Help on function aggregate in module verticapy.vdataframe:

aggregate(self, func:list, columns:list=[], ncols_block:int=20, processes:int=1)
    ---------------------------------------------------------------------------
    Aggregates the vDataFrame using the input functions.
    
    Parameters
    ----------
    func: list
        List of the different aggregations.
            aad            : average absolute deviation
            approx_median  : approximate median
            approx_q%      : approximate q quantile 
                             (ex: approx_50% for the approximate median)
            approx_unique  : approximative cardinality
            count          : number of non-missing elements
            cvar           : conditional value at risk
            dtype          : virtual column type
            iqr            : interquartile range
            kurtosis       : kurtosis
            jb             : Jarque-Bera index 
            mad            : median absolute deviation
            max            : maximum
            mean           : average
            median         : median
            min            : minimum
            mode           : most occurent element
            percent        : percent of non-missing elements 
            q%             : q quantile (ex: 50% for the median)
                             Use the 'approx_q%' (approximate quantile) 
                             aggregation to get better performances.
            prod           : product
            range          : difference between the max and the min
            sem            : standard error of the mean
            skewness       : skewness
            sum            : sum
            std            : standard deviation
            topk           : kth most occurent element (ex: top1 for the mode)
            topk_percent   : kth most occurent element density
            unique         : cardinality (count distinct)
            var            : variance
                Other aggregations will work if supported by your version of 
                the database.
    columns: list, optional
        List of the vColumn's names. If empty, depending on the aggregations,
        all or only numerical vColumns will be used.
    ncols_block: int, optional
        Number of columns used per query. Setting this parameter divides
        what would otherwise be one large query into many smaller queries called
        "blocks." The size of each block is determined by the ncols_block parameter.
    processes: int, optional
        Number of child processes to create. Setting this with the ncols_block parameter
        lets you parallelize a single query into many smaller queries, where each child 
        process creates its own connection to the database and sends one query. This can 
        improve query performance, but consumes more resources. If processes is set to 1, 
        the queries are sent iteratively from a single process.
    
    Returns
    -------
    tablesample
        An object containing the result. For more information, see
        utilities.tablesample.
    
    See Also
    --------
    vDataFrame.analytic : Adds a new vColumn to the vDataFrame by using an advanced 
        analytical function on a specific vColumn.

This is a tremendously useful function for understanding your data. Let's use the churn dataset.

In [18]:
vdf = vp.read_csv("data/churn.csv")
vdf.agg(func = ["min", "10%", "median", "90%", "max", "kurtosis", "unique"])
Out[18]:
min
10%
median
90%
max
kurtosis
unique
"SeniorCitizen"000.0111.362595895793912
"Partner"000.011-1.99595342119472
"Dependents"000.011-1.23437805716952
"tenure"0229.06972-1.3873716359716973
"PhoneService"011.0115.438907555087062
"PaperlessBilling"001.011-1.859606185608842
"MonthlyCharges"18.2520.0570.35102.6118.75-1.257259694549511585
"TotalCharges"18.884.61397.4755976.648684.8-0.2317987608693626530
"Churn"000.011-0.8702113423319812
Rows: 1-9 | Columns: 8

Some methods, like 'describe', are abstractions of the 'agg' method; they simplify the call to computing specific aggregations.

In [19]:
vdf.describe()
Out[19]:
count
mean
std
min
approx_25%
approx_50%
approx_75%
max
"SeniorCitizen"70430.1621468124378820.3686116056100130.00.00.00.01.0
"Partner"70430.4830327985233560.4997475107199860.00.00.01.01.0
"Dependents"70430.2995882436461730.4581101675100150.00.00.01.01.0
"tenure"704332.371148658242324.55948102309440.09.029.055.072.0
"PhoneService"70430.9031662643759750.2957522317836350.01.01.01.01.0
"PaperlessBilling"70430.5922192247621750.4914569240494070.00.01.01.01.0
"MonthlyCharges"704364.76169246059930.090047097678518.2535.570.321428571428689.85118.75
"TotalCharges"70322283.300440841882266.7713618831518.8402.6833333333331397.4753798.23758684.8
"Churn"70430.2653698707936940.4415613051219470.00.00.01.01.0
Rows: 1-9 | Columns: 9
In [20]:
vdf.describe(method = "all")
Out[20]:
123
"SeniorCitizen"
Int
100%
010
"Partner"
Boolean
100%
010
"Dependents"
Boolean
100%
123
"tenure"
Int
100%
010
"PhoneService"
Boolean
100%
010
"PaperlessBilling"
Boolean
100%
123
"MonthlyCharges"
Numeric(8,3)
100%
123
"TotalCharges"
Numeric(9,3)
99%
010
"Churn"
Boolean
100%
Abc
"customerID"
Varchar(20)
100%
Abc
"gender"
Varchar(20)
100%
Abc
"MultipleLines"
Varchar(100)
100%
Abc
"InternetService"
Varchar(22)
100%
Abc
"OnlineSecurity"
Varchar(38)
100%
Abc
"OnlineBackup"
Varchar(38)
100%
Abc
"DeviceProtection"
Varchar(38)
100%
Abc
"TechSupport"
Varchar(38)
100%
Abc
"StreamingTV"
Varchar(38)
100%
Abc
"StreamingMovies"
Varchar(38)
100%
Abc
"Contract"
Varchar(28)
100%
Abc
"PaymentMethod"
Varchar(50)
100%
dtypeintbooleanbooleanintbooleanbooleannumeric(8,3)numeric(9,3)booleanvarchar(20)varchar(20)varchar(100)varchar(22)varchar(38)varchar(38)varchar(38)varchar(38)varchar(38)varchar(38)varchar(28)varchar(50)
percent100.0100.0100.0100.0100.0100.0100.099.844100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0100.0
count704370437043704370437043704370327043704370437043704370437043704370437043704370437043
top0
1
20.05[null]
0002-ORFBOMaleNoFiber opticNoNoNoNoNoNoMonth-to-monthElectronic check
top_percent83.78551.69770.0418.70490.31759.2220.8660.15673.4630.01450.47648.13343.95949.66643.84543.94449.31139.89839.54355.01933.579
avg0.1621468124378820.4830327985233560.29958824364617332.37114865824230.9031662643759750.59221922476217564.7616924605992283.300440841880.26536987079369410.04.990487008377113.77750958398416.300014198494965.970041175635386.028255004969476.027261110322315.973590799375276.067726820956986.0712764446968611.301150078091718.5702115575749
stddev0.3686116056100130.4997475107199860.45811016751001524.55948102309440.2957522317836350.49145692404940730.09004709767852266.771361883150.4415613051219470.01.00002574729434.030386526393034.178818085360946.866548314487536.836814337332836.837327235013846.864753089095936.816296591249626.814437236183972.985058424526545.04042209794139
min00000018.2518.8010422222222812
approx_25%00091035.5402.683333333333010423222222816
approx_50%000291170.32142857142861397.4750104333333331416
approx_75%011551189.853798.237511063113333331423
max1117211118.758684.8110616111919191919191425
range1117211100.58666.0102149171717171717613
empty[null][null][null][null][null][null][null][null][null]000000000000
Rows: 1-14 | Columns: 22
In [21]:
vdf.describe(method = "categorical")

Out[21]:
dtype
count
top
top_percent
"customerID"varchar(20)70430002-ORFBO0.014
"gender"varchar(20)7043Male50.476
"SeniorCitizen"int7043083.785
"Partner"boolean7043051.697
"Dependents"boolean7043070.041
"tenure"int704318.704
"PhoneService"boolean7043190.317
"MultipleLines"varchar(100)7043No48.133
"InternetService"varchar(22)7043Fiber optic43.959
"OnlineSecurity"varchar(38)7043No49.666
"OnlineBackup"varchar(38)7043No43.845
"DeviceProtection"varchar(38)7043No43.944
"TechSupport"varchar(38)7043No49.311
"StreamingTV"varchar(38)7043No39.898
"StreamingMovies"varchar(38)7043No39.543
"Contract"varchar(28)7043Month-to-month55.019
"PaperlessBilling"boolean7043159.222
"PaymentMethod"varchar(50)7043Electronic check33.579
"MonthlyCharges"numeric(8,3)704320.050.866
"TotalCharges"numeric(9,3)7032[null]0.156
"Churn"boolean7043073.463
Rows: 1-21 | Columns: 5

Multi-column aggregations can also be called with many built-in methods. For example, you can compute the 'avg' of all the numerical columns in just one line.

In [22]:
vdf.avg()
Out[22]:
avg
"SeniorCitizen"0.162146812437882
"Partner"0.483032798523356
"Dependents"0.299588243646173
"tenure"32.3711486582423
"PhoneService"0.903166264375975
"PaperlessBilling"0.592219224762175
"MonthlyCharges"64.761692460599
"TotalCharges"2283.30044084188
"Churn"0.265369870793694
Rows: 1-9 | Columns: 2

Or just the 'median' of a specific column.

In [23]:
vdf["tenure"].median()
Out[23]:
29.0

The approximate median is automatically computed. Set the parameter 'approx' to False to get the exact median.

In [25]:
vdf["tenure"].median(approx=False)
Out[25]:
29.0

You can also use the 'groupby' method to compute customized aggregations.

In [35]:
# SQL way
vdf.groupby(["gender",
             "Contract"],
            ["AVG(DECODE(Churn, 'Yes', 1, 0)) AS Churn"])
Abc
gender
Varchar(6)
Abc
Contract
Varchar(20)
123
Churn
Float
1MaleOne year0.120689655172414
2FemaleTwo year0.0261904761904762
3FemaleMonth-to-month0.437402597402597
4FemaleOne year0.104456824512535
5MaleTwo year0.0307692307692308
6MaleMonth-to-month0.416923076923077
Out[35]:
Rows: 6 | Columns: 3
In [27]:
# Pythonic way
import verticapy.stats as st
vdf.groupby([vdf["gender"],
             vdf["Contract"]],
            [st.min(vdf["tenure"])._as("min_tenure"),
             st.max(vdf["tenure"])._as("max_tenure")])
Out[27]:
Abc
gender
Varchar(20)
Abc
Contract
Varchar(28)
123
min_tenure
Integer
123
max_tenure
Integer
1FemaleOne year172
2MaleOne year072
3FemaleTwo year072
4FemaleMonth-to-month171
5MaleTwo year072
6MaleMonth-to-month172
Rows: 1-6 | Columns: 4

Computing many aggregations at the same time can be resource intensive. You can use the parameters 'ncols_block' and 'processes' to manage the ressources.

For example, the parameter 'ncols_block' will divide the main query into smaller using a specific number of columns. The parameter 'processes' allows you to manage the number of queries you want to send at the same time. An entire example is available in the vDataFrame.agg documentation.