
vDataFrame#

class verticapy.vDataFrame(input_relation: str | list | dict | DataFrame | ndarray | TableSample, usecols: str | list[str] | None = None, schema: str | None = None, external: bool = False, symbol: str = '$', sql_push_ext: bool = True, _empty: bool = False, _is_sql_magic: int = 0, _clean_query: bool = True)#

An object that records all user modifications, allowing users to manipulate the relation without mutating the underlying data in Vertica. When changes are made, the vDataFrame queries the Vertica database, which aggregates and returns the final result. For each column of the relation, the vDataFrame creates a Virtual Column (vDataColumn) that stores the column alias and all user transformations.

Parameters#

input_relation: str | TableSample | pandas.DataFrame | list | numpy.ndarray | dict, optional

If input_relation is of type str, it must represent the relation (view, table, or temporary table) used to create the object. To reference a relation in a specific schema, the string must include both the schema and the relation name: 'schema.relation' or '"schema"."relation"'. Alternatively, you can use the schema parameter, in which case input_relation must exclude the schema name. The string can also be a SQL query used to create the vDataFrame. If input_relation is a pandas.DataFrame, a temporary local table is created. Otherwise, the vDataFrame is created using the generated SQL code of multiple UNIONs.

usecols: SQLColumns, optional

When input_relation is not an array-like type: list of columns used to create the object. Since Vertica is a columnar database, including fewer columns makes the process faster; do not hesitate to exclude unneeded columns. Otherwise: list of column names.

schema: str, optional

The schema of the relation. Specifying a schema allows you to specify a table within a particular schema, or to specify a schema and relation name that contain period ‘.’ characters. If specified, the input_relation cannot include a schema.

external: bool, optional

A boolean to indicate whether it is an external table. If set to True, a Connection Identifier Database must be defined.

symbol: str, optional

Symbol used to identify the external connection. One of the following: "$", "€", "£", "%", "@", "&", "§", "?", "!"

sql_push_ext: bool, optional

If set to True, the external vDataFrame attempts to push the entire query to the external table (only DQL statements - SELECT; for other statements, use SQL Magic directly). This can increase performance but might increase the error rate. For instance, some DBs might not support the same SQL as Vertica.
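To illustrate how a qualified relation name such as 'schema.relation' or '"schema"."relation"' could be resolved, here is a simplified sketch. The helper split_relation is hypothetical and is not part of VerticaPy's API:

```python
# Hypothetical helper (not part of VerticaPy) illustrating how a qualified
# relation name could be split into its schema and relation parts.
def split_relation(name: str):
    # Quoted identifiers may themselves contain '.', so handle the
    # '"schema"."relation"' form first.
    if name.startswith('"') and '"."' in name:
        schema, relation = name.split('"."', 1)
        return schema.strip('"'), relation.strip('"')
    if "." in name:
        schema, relation = name.split(".", 1)
        return schema, relation
    # No schema in the string: the 'schema' parameter would apply instead.
    return None, name

print(split_relation("public.titanic"))          # ('public', 'titanic')
print(split_relation('"my.schema"."my.table"'))  # ('my.schema', 'my.table')
```

This is why the quoted form is needed whenever a schema or relation name itself contains a period.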

Attributes#

vDataColumns: vDataColumn

Each vDataColumn of the vDataFrame is accessible by specifying its name between brackets. For example, to access the vDataColumn “myVC”: vDataFrame["myVC"].

Examples#

In this example, we will look at some of the ways to create a vDataFrame:

  • From dictionary

  • From numpy.array

  • From pandas.DataFrame

  • From SQL Query

  • From a table

After that we will also look at the mathematical operators that are available:

  • Pandas-Like

  • SQL-Like

Lastly, we will look at some examples of functions that can be applied directly on the vDataFrame.


Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
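The same precaution applies to any library with generic function names. As a generic Python illustration (using the standard statistics module in place of verticapy), an alias keeps a library's median reachable even when the bare name is taken:

```python
# Generic illustration of the aliasing advice, using the standard library's
# statistics module in place of verticapy.
import statistics as st

# A bare `from statistics import *` would make this assignment shadow the
# library's function; with an alias, the two names never collide.
median = "a name already used elsewhere in our script"

print(st.median([1, 3, 5]))  # 3
```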

Dictionary#

This is the most direct way to create a vDataFrame:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

     cats        reps
     Varchar(1)  Integer
1    A           2
2    B           4
3    C           8

NumPy Array#

We can also use a numpy.array:

import numpy as np

vdf = vp.vDataFrame(
    np.array(
        [
            [1, 2, 3],
            [4, 5, 6],
            [7, 8, 9],
        ],
    ),
    usecols = [
        "col_A",
        "col_B",
        "col_C",
    ],
)

     col_A  col_B  col_C
1    1      2      3
2    4      5      6
3    7      8      9
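As mentioned in the input_relation description, array-like inputs are ingested through SQL built from multiple UNIONs. The following rough sketch conveys the idea; rows_to_union_sql is a hypothetical helper, not VerticaPy's actual generator:

```python
# Hypothetical sketch: build one SELECT per row and chain them with
# UNION ALL, which is the general idea behind ingesting array-like data.
def rows_to_union_sql(rows, colnames):
    selects = []
    for row in rows:
        cols = ", ".join(f"{value} AS {name}" for value, name in zip(row, colnames))
        selects.append(f"SELECT {cols}")
    return " UNION ALL ".join(selects)

print(rows_to_union_sql([[1, 2, 3], [4, 5, 6]], ["col_A", "col_B", "col_C"]))
# SELECT 1 AS col_A, 2 AS col_B, 3 AS col_C UNION ALL SELECT 4 AS col_A, 5 AS col_B, 6 AS col_C
```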

Pandas DataFrame#

We can also use a pandas.DataFrame object:

# Import Pandas library
import pandas as pd

# Create the data dictionary
data = {
    'Name': ['John', 'Ali', 'Pheona'],
    'Age': [25, 30, 22],
    'City': ['New York', 'Gaza', 'Los Angeles'],
}


# Create the Pandas DataFrame object
df = pd.DataFrame(data)

# Create a vDataFrame
vdf = vp.vDataFrame(df)
     Name    Age  City
1    Ali     30   Gaza
2    John    25   New York
3    Pheona  22   Los Angeles

SQL Query#

We can also use a SQL Query:

# Write a SQL Query to fetch three rows from the Titanic table
sql_query = "SELECT age, sex FROM public.titanic LIMIT 3;"

# Create a vDataFrame
vdf = vp.vDataFrame(sql_query)
     age   sex
1    2.0   female
2    30.0  male
3    25.0  female

Table#

A table can also be directly ingested:

# Create a vDataFrame from the titanic table in public schema
vdf = vp.vDataFrame("public.titanic")
     pclass  ...  survived  home.dest
1    1       ...  0         Montreal, PQ / Chesterville, ON
2    1       ...  0         Montreal, PQ / Chesterville, ON
3    1       ...  0         Montreal, PQ / Chesterville, ON
4    1       ...  0         Belfast, NI
5    1       ...  0         Montevideo, Uruguay
6    1       ...  0         New York, NY
7    1       ...  0         New York, NY
8    1       ...  0         Montreal, PQ
9    1       ...  0         Winnipeg, MN
10   1       ...  0         San Francisco, CA
11   1       ...  0         Trenton, NJ
12   1       ...  0         London / Winnipeg, MB
13   1       ...  0         Pomeroy, WA
14   1       ...  0         Omaha, NE
15   1       ...  0         Philadelphia, PA
16   1       ...  0         Washington, DC
17   1       ...  0         [null]
18   1       ...  0         New York, NY
19   1       ...  0         Montevideo, Uruguay
20   1       ...  0         Montevideo, Uruguay

Mathematical Operators#

We can use all the common mathematical operators on the vDataFrame.

Pandas-Like#

First let us re-create a simple vDataFrame:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

To filter on a specific string value in a specific column:

result = vdf[vdf["cats"] == "A"]
     cats  reps
1    A     2

Similarly, we can filter using mathematical comparisons on numerical columns:

result = vdf[vdf["reps"] > 2]
     cats  reps
1    B     4
2    C     8

Both filters can also be combined:

result = vdf[vdf["reps"] > 2][vdf["cats"] == "C"]
     cats  reps
1    C     8

We can also perform mathematical calculations on the elements inside the vDataFrame quite conveniently:

vdf["new"] = abs(vdf["reps"] * 4 - 100)
     cats  reps  new
1    A     2     92
2    B     4     84
3    C     8     68
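Such arithmetic is not computed client-side: the vDataFrame records it as a SQL expression and lets Vertica evaluate it. A toy illustration of the idea (the Expr class is hypothetical, not VerticaPy's implementation):

```python
# Toy expression wrapper (hypothetical): arithmetic on the object builds a
# SQL string instead of computing values locally.
class Expr:
    def __init__(self, sql):
        self.sql = sql

    def __mul__(self, x):
        return Expr(f"({self.sql}) * {x}")

    def __sub__(self, x):
        return Expr(f"({self.sql}) - {x}")

    def __abs__(self):
        return Expr(f"ABS({self.sql})")

new = abs(Expr("reps") * 4 - 100)
print(new.sql)  # ABS(((reps) * 4) - 100)
```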

SQL-Like#

SQL queries can be applied directly to the vDataFrame using StringSQL, which adds a new level of flexibility. StringSQL allows the user to generate formatted SQL queries in string form. Since any SQL condition in string format can be passed to the vDataFrame, you can seamlessly pass the output of StringSQL directly to the vDataFrame.

# Create the SQL Query using StringSQL
sql_query = vp.StringSQL("reps > 2")

# Get the output as a vDataFrame
result = vdf[sql_query]
     cats  reps  new
1    B     4     84
2    C     8     68

Note

Have a look at StringSQL for more details.
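To convey the idea behind such a wrapper, here is a minimal sketch of a StringSQL-like class (SQLCondition is hypothetical; see StringSQL for the real API): it stores a SQL boolean expression as text and lets conditions be composed:

```python
# Minimal sketch of a StringSQL-like wrapper (hypothetical class).
class SQLCondition:
    def __init__(self, condition: str):
        self.condition = condition

    def __and__(self, other):
        # Combine two conditions into one SQL boolean expression.
        return SQLCondition(f"({self.condition}) AND ({other.condition})")

    def __str__(self):
        return self.condition

cond = SQLCondition("reps > 2") & SQLCondition("cats = 'B'")
print(cond)  # (reps > 2) AND (cats = 'B')
```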

Here is a slightly more advanced SQL query:

# Create the SQL Query using StringSQL
sql_query = vp.StringSQL("reps BETWEEN 3 AND 8 AND cats = 'B'")

# Get the output as a vDataFrame
result = vdf[sql_query]
     cats  reps  new
1    B     4     84

Direct Functions#

There are many methods that can be used directly on a vDataFrame. Let us look at how conveniently we can call them. Here is an example of the vDataFrame.describe() method:

# Import the dataset
from verticapy.datasets import load_titanic

# Create vDataFrame
vdf = load_titanic()

# Summarize the vDataFrame
vdf.describe()
            ...    approx_75%         max
"pclass"    ...           3.0         3.0
"survived"  ...           1.0         1.0
"age"       ...          39.0        80.0
"sibsp"     ...           1.0         8.0
"parch"     ...           0.0         9.0
"fare"      ...       31.3875    512.3292
"body"      ...         257.5       328.0
Rows: 1-7 | Columns: 9

Note

Explore the various vDataFrame and vDataColumn methods to see more examples.

See also

vDataColumn : Columns of vDataFrame object.
class verticapy.vDataColumn(alias: str, transformations: list | None = None, parent: vDataFrame | None = None, catalog: dict | None = None)#

Python object that stores all user transformations. If the vDataFrame represents the entire relation, a vDataColumn can be seen as one column of that relation. Through its abstractions, the vDataColumn simplifies several processes.

Parameters#

alias: str

vDataColumn alias.

transformations: list, optional

List of the different transformations. Each transformation must be similar to the following: (function, type, category)

parent: vDataFrame, optional

Parent of the vDataColumn. One vDataFrame can have multiple children vDataColumn, whereas one vDataColumn can only have one parent.

catalog: dict, optional

Catalog where each key corresponds to an aggregation. vDataColumn will memorize the already computed aggregations to increase performance. The catalog is updated when the parent vDataFrame is modified.
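The catalog behaves like a per-column memoization cache: an aggregation is computed once, served from the catalog afterwards, and discarded when the data changes. A simplified sketch (AggCatalog is hypothetical, not VerticaPy's implementation):

```python
# Simplified sketch of the aggregation catalog (hypothetical class).
class AggCatalog:
    def __init__(self):
        self.cache = {}
        self.calls = 0  # number of real computations, for illustration

    def get(self, agg, compute):
        # Compute the aggregation only if it is not already memorized.
        if agg not in self.cache:
            self.calls += 1
            self.cache[agg] = compute()
        return self.cache[agg]

    def invalidate(self):
        # Called when the parent vDataFrame is modified.
        self.cache.clear()

catalog = AggCatalog()
data = [2, 4, 8]
catalog.get("max", lambda: max(data))  # computed
catalog.get("max", lambda: max(data))  # served from the catalog
print(catalog.calls)  # 1
```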

Attributes#

alias, str:

vDataColumn alias.

catalog, dict:

Catalog of pre-computed aggregations.

parent, vDataFrame:

Parent of the vDataColumn.

transformations, list:

List of the different transformations.

Examples#

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let’s create a vDataFrame with two vDataColumn:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

     cats        reps
     Varchar(1)  Integer
1    A           2
2    B           4
3    C           8

“cats” and “reps” are vDataColumn objects. They can be accessed in the same way as the entries of a dictionary or the columns of a pandas.DataFrame. They represent the columns of the entire relation.

For example, the following code will access the vDataColumn “cats”:

vdf["cats"]

Note

vDataColumn objects are columns inside a vDataFrame; they have their own methods but cannot exist without a parent vDataFrame. Please refer to vDataFrame to see a complete example.
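The parent/child relationship described in the note can be sketched with two toy classes (Frame and Column are hypothetical, not VerticaPy's): one frame owns many columns, while each column points back to exactly one parent:

```python
# Toy sketch of the parent/child relationship (hypothetical classes).
class Column:
    def __init__(self, alias, parent):
        self.alias = alias
        self.parent = parent  # exactly one parent frame

class Frame:
    def __init__(self, colnames):
        # One frame owns multiple columns, accessible with brackets.
        self.columns = {name: Column(name, parent=self) for name in colnames}

    def __getitem__(self, name):
        return self.columns[name]

frame = Frame(["cats", "reps"])
print(frame["cats"].alias)            # cats
print(frame["cats"].parent is frame)  # True
```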

See also

vDataFrame : Main VerticaPy dataset object.

Plotting#

There are three main plotting libraries available in VerticaPy:

  • Plotly

  • Highcharts

  • Matplotlib

The base classes of all the plotting libraries are listed below.

Note

The documentation for these classes is provided solely to enhance the user’s understanding of the implementations. Users are not required to interact directly with these classes, and we do not recommend doing so.

Plotting Base Classes

ACFPlot(*args, **kwargs)

BarChart(*args, **kwargs)

BarChart2D(*args, **kwargs)

BoxPlot(*args, **kwargs)

CandleStick(*args, **kwargs)

ChampionChallengerPlot(*args, **kwargs)

ContourPlot(*args, **kwargs)

CutoffCurve(*args, **kwargs)

DensityPlot(*args, **kwargs)

ElbowCurve(*args, **kwargs)

HeatMap(*args, **kwargs)

Histogram(*args, **kwargs)

HorizontalBarChart(*args, **kwargs)

HorizontalBarChart2D(*args, **kwargs)

ImportanceBarChart(*args, **kwargs)

LiftChart(*args, **kwargs)

LinePlot(*args, **kwargs)

LogisticRegressionPlot(*args, **kwargs)

LOFPlot(*args, **kwargs)

MultiDensityPlot(*args, **kwargs)

MultiLinePlot(*args, **kwargs)

NestedPieChart(*args, **kwargs)

OutliersPlot(*args, **kwargs)

PCACirclePlot(*args, **kwargs)

PieChart(*args, **kwargs)

PlotlyBase(*args, **kwargs)

Plotly Base Class.

PRCCurve(*args, **kwargs)

RangeCurve(*args, **kwargs)

RegressionPlot(*args, **kwargs)

RegressionTreePlot(*args, **kwargs)

ROCCurve(*args, **kwargs)

ScatterPlot(*args, **kwargs)

SpiderChart(*args, **kwargs)

StepwisePlot(*args, **kwargs)

SVMClassifierPlot(*args, **kwargs)

TSPlot(*args, **kwargs)

VoronoiPlot(*args, **kwargs)

ACFPlot(*args, **kwargs)

ACFPACFPlot(*args, **kwargs)

BarChart(*args, **kwargs)

BarChart2D(*args, **kwargs)

BoxPlot(*args, **kwargs)

CandleStick(*args, **kwargs)

ChampionChallengerPlot(*args, **kwargs)

ContourPlot(*args, **kwargs)

CutoffCurve(*args, **kwargs)

DensityPlot(*args, **kwargs)

ElbowCurve(*args, **kwargs)

HeatMap(*args, **kwargs)

Histogram(*args, **kwargs)

HorizontalBarChart(*args, **kwargs)

HorizontalBarChart2D(*args, **kwargs)

ImportanceBarChart(*args, **kwargs)

LiftChart(*args, **kwargs)

LinePlot(*args, **kwargs)

LogisticRegressionPlot(*args, **kwargs)

LOFPlot(*args, **kwargs)

MultiDensityPlot(*args, **kwargs)

MultiLinePlot(*args, **kwargs)

NestedPieChart(*args, **kwargs)

OutliersPlot(*args, **kwargs)

PCACirclePlot(*args, **kwargs)

PieChart(*args, **kwargs)

PRCCurve(*args, **kwargs)

RangeCurve(*args, **kwargs)

RegressionPlot(*args, **kwargs)

RegressionTreePlot(*args, **kwargs)

ROCCurve(*args, **kwargs)

ScatterPlot(*args, **kwargs)

SpiderChart(*args, **kwargs)

StepwisePlot(*args, **kwargs)

SVMClassifierPlot(*args, **kwargs)

TSPlot(*args, **kwargs)

ACFPlot(*args, **kwargs)

ACFPACFPlot(*args, **kwargs)

AnimatedBarChart(*args, **kwargs)

AnimatedBubblePlot(*args, **kwargs)

AnimatedLinePlot(*args, **kwargs)

AnimatedPieChart(*args, **kwargs)

BarChart(*args, **kwargs)

BarChart2D(*args, **kwargs)

BoxPlot(*args, **kwargs)

CandleStick(*args, **kwargs)

ChampionChallengerPlot(*args, **kwargs)

ContourPlot(*args, **kwargs)

CutoffCurve(*args, **kwargs)

DensityPlot(*args, **kwargs)

DensityPlot2D(*args, **kwargs)

ElbowCurve(*args, **kwargs)

HeatMap(*args, **kwargs)

Histogram(*args, **kwargs)

HorizontalBarChart(*args, **kwargs)

HorizontalBarChart2D(*args, **kwargs)

ImportanceBarChart(*args, **kwargs)

LiftChart(*args, **kwargs)

LinePlot(*args, **kwargs)

LogisticRegressionPlot(*args, **kwargs)

LOFPlot(*args, **kwargs)

MultiDensityPlot(*args, **kwargs)

MultiLinePlot(*args, **kwargs)

NestedPieChart(*args, **kwargs)

OutliersPlot(*args, **kwargs)

PCACirclePlot(*args, **kwargs)

PieChart(*args, **kwargs)

PRCCurve(*args, **kwargs)

RangeCurve(*args, **kwargs)

RegressionPlot(*args, **kwargs)

RegressionTreePlot(*args, **kwargs)

ROCCurve(*args, **kwargs)

ScatterMatrix(*args, **kwargs)

ScatterPlot(*args, **kwargs)

SpiderChart(*args, **kwargs)

StepwisePlot(*args, **kwargs)

SVMClassifierPlot(*args, **kwargs)

TSPlot(*args, **kwargs)

VoronoiPlot(*args, **kwargs)

General#

vDataFrame.func(...)

bar(columns[, method, of, max_cardinality, ...])

Draws the bar chart of the input vDataColumns based on an aggregation.

barh(columns[, method, of, max_cardinality, ...])

Draws the horizontal bar chart of the input vDataColumns based on an aggregation.

boxplot([columns, q, max_nb_fliers, whis, chart])

Draws the Box Plot of the input vDataColumns.

contour(columns, func[, nbins, chart])

Draws the contour plot of the input function using two input vDataColumns.

density([columns, bandwidth, kernel, nbins, ...])

Draws the vDataColumns Density Plot.

heatmap(columns[, method, of, h, chart])

Draws the Heatmap of the two input vDataColumns.

hexbin(columns[, method, of, bbox, img, chart])

Draws the Hexbin of the input vDataColumns based on an aggregation.

hist(columns[, method, of, h, chart])

Draws the histograms of the input vDataColumns based on an aggregation.

outliers_plot(columns[, threshold, ...])

Draws the global outliers plot of one or two columns based on their ZSCORE.

pie(columns[, method, of, max_cardinality, ...])

Draws the nested pie chart of the input vDataColumns.

pivot_table(columns[, method, of, ...])

Draws the pivot table of one or two columns based on an aggregation.

plot(ts[, columns, start_date, end_date, ...])

Draws the time series.

scatter(columns[, by, size, cmap_col, ...])

Draws the scatter plot of the input vDataColumns.

scatter_matrix([columns, max_nb_points])

Draws the scatter matrix of the vDataFrame.

pivot_table_chi2(response[, columns, nbins, ...])

Returns the chi-square term using the pivot table of the response vDataColumn against the input vDataColumn.

range_plot(columns, ts[, q, start_date, ...])

Draws the range plot of the input vDataColumns.

vDataFrame[].func(...)

bar([method, of, max_cardinality, nbins, h, ...])

Draws the bar chart of the vDataColumn based on an aggregation.

barh([method, of, max_cardinality, nbins, ...])

Draws the horizontal bar chart of the vDataColumn based on an aggregation.

candlestick(ts[, method, q, start_date, ...])

Draws the Time Series of the vDataColumn.

boxplot([by, q, h, max_cardinality, ...])

Draws the box plot of the vDataColumn.

density([by, bandwidth, kernel, nbins, ...])

Draws the vDataColumn Density Plot.

hist([by, method, of, h, h_by, ...])

Draws the histogram of the input vDataColumn based on an aggregation.

pie([method, of, max_cardinality, h, kind, ...])

Draws the pie chart of the vDataColumn based on an aggregation.

plot(ts[, by, start_date, end_date, kind, chart])

Draws the Time Series of the vDataColumn.

range_plot(ts[, q, start_date, end_date, ...])

Draws the range plot of the vDataColumn.

spider([by, method, of, max_cardinality, h, ...])

Draws the spider plot of the input vDataColumn based on an aggregation.

Animated#

vDataFrame.func(...)

animated_bar(ts, columns[, by, start_date, ...])

Draws the animated bar chart (bar race).

animated_pie(ts, columns[, by, start_date, ...])

Draws the animated pie chart.

animated_plot(ts[, columns, by, start_date, ...])

Draws the animated line plot.

animated_scatter(ts, columns[, by, ...])

Draws the animated scatter plot.


Descriptive Statistics#

vDataFrame.func(...)

aad([columns])

Utilizes the aad (Average Absolute Deviation) aggregation method to analyze the vDataColumn.

aggregate(func[, columns, ncols_block, ...])

Aggregates the vDataFrame using the input functions.

all(columns, **agg_kwargs)

Applies the BOOL_AND aggregation method to the vDataFrame.

any(columns, **agg_kwargs)

Uses the BOOL_OR aggregation method in the vDataFrame.

avg([columns])

This operation aggregates the vDataFrame using the AVG aggregation, which calculates the average value for the selected column or columns.

count([columns])

This operation aggregates the vDataFrame using the COUNT aggregation, providing the count of non-missing values for specified columns.

count_percent([columns, sort_result, desc])

Performs aggregation on the vDataFrame using a list of aggregate functions, including count and percent.

describe([method, columns, unique, ...])

This function aggregates the vDataFrame using multiple statistical aggregations such as minimum (min), maximum (max), median, cardinality (unique), and other relevant statistics.

duplicated([columns, count, limit])

This function returns a list or set of values that occur more than once within the dataset.

kurtosis([columns])

Calculates the kurtosis of the vDataFrame to obtain a measure of the data's peakedness or tailedness.

mad([columns])

Utilizes the mad (Median Absolute Deviation) aggregation method with the vDataFrame.

max([columns])

Aggregates the vDataFrame by applying the MAX aggregation, which calculates the maximum value, for the specified columns.

median([columns, approx])

Aggregates the vDataFrame using the MEDIAN or APPROX_MEDIAN aggregation, which calculates the median value for the specified columns.

min([columns])

Aggregates the vDataFrame by applying the MIN aggregation, which calculates the minimum value, for the specified columns.

nunique([columns, approx])

When aggregating the vDataFrame using nunique (cardinality), VerticaPy employs the COUNT DISTINCT function to determine the number of unique values in a particular column.

product([columns])

Aggregates the vDataFrame by applying the product aggregation function.

quantile(q[, columns, approx])

Aggregates the vDataFrame using specified quantile.

score(y_true, y_score, metric)

Computes the score using the input columns and the input metric.

sem([columns])

Leverages the sem (Standard Error of the Mean) aggregation technique to perform analysis and aggregation on the vDataFrame.

skewness([columns])

Utilizes the skewness aggregation method to analyze and aggregate the vDataFrame.

std([columns])

Aggregates the vDataFrame using STDDEV aggregation (Standard Deviation), providing insights into the spread or variability of data for the selected columns.

sum([columns])

Aggregates the vDataFrame using SUM aggregation, which computes the total sum of values for the specified columns, providing a cumulative view of numerical data.

var([columns])

Aggregates the vDataFrame using VAR aggregation (Variance), providing insights into the spread or variability of data for the selected columns.

vDataFrame[].func(...)

aad()

Utilizes the aad (Average Absolute Deviation) aggregation method to analyze the vDataColumn.

aggregate(func)

Aggregates the vDataFrame using the input functions.

avg()

This operation aggregates the vDataFrame using the AVG aggregation, which calculates the average value for the input column.

count()

This operation aggregates the vDataFrame using the COUNT aggregation, providing the count of non-missing values for the input column.

describe([method, max_cardinality, numcol])

This function aggregates the vDataColumn using multiple statistical aggregations such as minimum (min), maximum (max), median, cardinality (unique), and other relevant statistics.

distinct(**kwargs)

This function returns the distinct categories or unique values within a vDataColumn.

kurtosis()

Calculates the kurtosis of the vDataColumn to obtain a measure of the data's peakedness or tailedness.

mad()

Utilizes the mad (Median Absolute Deviation) aggregation method with the vDataFrame.

max()

Aggregates the vDataFrame by applying the 'MAX' aggregation, which calculates the maximum value, for the input column.

median([approx])

Aggregates the vDataFrame using the MEDIAN or APPROX_MEDIAN aggregation, which calculates the median value for the specified columns.

min()

Aggregates the vDataFrame by applying the MIN aggregation, which calculates the minimum value, for the input column.

mode([dropna, n])

This function returns the nth most frequently occurring element in the vDataColumn.

nlargest([n])

Returns the n largest vDataColumn elements.

nsmallest([n])

Returns the n smallest elements in the vDataColumn.

nunique([approx])

When aggregating the vDataFrame using nunique (cardinality), VerticaPy employs the COUNT DISTINCT function to determine the number of unique values in particular columns.

product()

Aggregates the vDataColumn by applying the product aggregation function.

quantile(q[, approx])

Aggregates the vDataColumn using a specified quantile.

sem()

Leverages the sem (Standard Error of the Mean) aggregation technique to perform analysis and aggregation on the vDataColumn.

skewness()

Utilizes the skewness aggregation method to analyze and aggregate the vDataColumn.

std()

Aggregates the vDataFrame using STDDEV aggregation (Standard Deviation), providing insights into the spread or variability of data for the input column.

sum()

Aggregates the vDataFrame using SUM aggregation, which computes the total sum of values for the specified columns, providing a cumulative view of numerical data.

topk([k, dropna])

This function returns the k most frequently occurring elements in a column, along with their distribution expressed as percentages.

value_counts([k])

This function returns the k most frequently occurring elements in a column, along with information about how often they occur.

var()

Aggregates the vDataFrame using VAR aggregation (Variance), providing insights into the spread or variability of data for the input column.


Correlation & Dependency#

General#

vDataFrame.func(...)

acf(column, ts[, by, p, unit, method, ...])

Calculates the correlations between the specified vDataColumn and its various time lags.

corr([columns, method, mround, focus, show, ...])

Calculates the Correlation Matrix for the vDataFrame.

corr_pvalue(column1, column2[, method])

Computes the Correlation Coefficient between two input vDataColumns, along with its associated p-value.

cov([columns, focus, show, chart])

Computes the covariance matrix of the vDataFrame.

iv_woe(y[, columns, nbins, show, chart])

Calculates the Information Value (IV) Table, a powerful tool for assessing the predictive capability of an independent variable concerning a dependent variable.

pacf(column, ts[, by, p, unit, method, ...])

Computes the partial autocorrelations of the specified vDataColumn.

regr([columns, method, show, chart])

Calculates the regression matrix for the given vDataFrame.

vDataFrame[].func(...)

iv_woe(y[, nbins])

Calculates the Information Value (IV) / Weight Of Evidence (WOE) Table.

Time-series#

vDataFrame.func(...)

acf(column, ts[, by, p, unit, method, ...])

Calculates the correlations between the specified vDataColumn and its various time lags.

pacf(column, ts[, by, p, unit, method, ...])

Computes the partial autocorrelations of the specified vDataColumn.


Preprocessing#

Encoding#

vDataFrame.func(...)

case_when(name, *args)

Creates a new feature by evaluating the provided conditions.

one_hot_encode([columns, max_cardinality, ...])

Encodes the vDataColumns using the One Hot Encoding algorithm.

vDataFrame[].func(...)

cut(breaks[, labels, include_lowest, right])

Discretizes the vDataColumn using the input list.

decode(*args)

Encodes the vDataColumn using a user-defined encoding.

discretize([method, h, nbins, k, ...])

Discretizes the vDataColumn using the input method.

label_encode()

Encodes the vDataColumn using a bijection from the different categories to [0, n - 1] (n being the vDataColumn cardinality).

mean_encode(response)

Encodes the vDataColumn using the average of the response partitioned by the different vDataColumn categories.

one_hot_encode([prefix, prefix_sep, ...])

Encodes the vDataColumn with the One-Hot Encoding algorithm.

Dealing With Missing Values#

vDataFrame.func(...)

dropna([columns])

Filters the specified vDataColumns in a vDataFrame for missing values.

fillna([val, method, numeric_only])

Fills missing elements in vDataColumn using specific rules.

interpolate(ts, rule[, method, by])

Computes a regular time interval vDataFrame by interpolating the missing values using different techniques.

vDataFrame[].func(...)

dropna()

Filters the vDataFrame where the vDataColumn is missing.

fillna([val, method, expr, by, order_by])

Fills missing elements in the vDataColumn with a user-specified rule.

Duplicate Values#

vDataFrame.func(...)

drop_duplicates([columns])

Filters the duplicates using a partition by the input vDataColumns.

Normalization and Global Outliers#

vDataFrame.func(...)

outliers([columns, name, threshold, robust])

Adds a new vDataColumn labeled with 0 or 1, where 1 indicates that the record is a global outlier.

scale([columns, method])

Scales the input vDataColumns using the input method.

vDataFrame[].func(...)

clip([lower, upper])

Clips the vDataColumn by transforming the values less than the lower bound to the lower bound value and the values higher than the upper bound to the upper bound value.

fill_outliers([method, threshold, ...])

Fills the vDataColumns outliers using the input method.

normalize([method, by, return_trans])

Scales the input vDataColumns using the input method.

Data Types Conversion#

vDataFrame.func(...)

astype(dtype)

Converts the vDataColumns to the input types.

bool_to_int()

Converts all booleans vDataColumns to integers.

vDataFrame[].func(...)

astype(dtype)

Converts the vDataColumn to the input type.

Formatting#

vDataFrame.func(...)

format_colnames(*args[, columns, ...])

Method used to format the input columns by using the vDataFrame columns' names.

get_match_index(x, col_list[, str_check])

Returns the matching index.

is_colname_in(column)

Method used to check if the input column name is used by the vDataFrame.

merge_similar_names(skip_word)

Merges columns with similar names.

explode_array(index, column[, prefix, delimiter])

Returns exploded vDataFrame of array-like columns in a vDataFrame.

vDataFrame[].func(...)

astype(dtype)

Converts the vDataColumn to the input type.

rename(new_name)

Renames the vDataColumn by dropping the current vDataColumn and creating a copy with the specified name.

Splitting into Train/Test#

vDataFrame.func(...)

train_test_split([test_size, order_by, ...])

Creates two vDataFrames (train/test), which can be used to evaluate a model.

Working with Weights#

vDataFrame.func(...)

add_duplicates(weight[, use_gcd])

Duplicates the vDataFrame using the input weight.

Complete Disjunctive Table#

vDataFrame.func(...)

cdt([columns, max_cardinality, nbins, tcdt, ...])

Returns the complete disjunctive table of the vDataFrame.


Features Engineering#

Analytic Functions#

vDataFrame.func(...)

analytic(func[, columns, by, order_by, ...])

Adds a new vDataColumn to the vDataFrame by using an advanced analytical function on one or two specific vDataColumns.

interpolate(ts, rule[, method, by])

Computes a regular time interval vDataFrame by interpolating the missing values using different techniques.

sessionize(ts[, by, session_threshold, name])

Adds a new vDataColumn to the vDataFrame that corresponds to sessions (user activity during a specific time).

Custom Features Creation#

vDataFrame.func(...)

case_when(name, *args)

Creates a new feature by evaluating the provided conditions.

eval(name, expr)

Evaluates a customized expression.

Features Transformations#

vDataFrame.func(...)

abs([columns])

Applies the absolute value function to all input vDataColumns.

apply(func)

Applies each function of the dictionary to the input vDataColumns.

applymap(func[, numeric_only])

Applies a function to all vDataColumns.

polynomial_comb([columns, r])

Returns a vDataFrame containing the different product combinations of the input vDataColumn.

swap(column1, column2)

Swaps the two input vDataColumns.

vDataFrame[].func(...)

abs()

Applies the absolute value function to the input vDataColumn.

add(x)

Adds the input element to the vDataColumn.

apply(func[, copy_name])

Applies a function to the vDataColumn.

apply_fun(func[, x])

Applies a default function to the vDataColumn.

date_part(field)

Extracts a specific TS field from the vDataColumn (only if the vDataColumn type is date like).

div(x)

Divides the vDataColumn by the input element.

mul(x)

Multiplies the vDataColumn by the input element.

round(n)

Rounds the vDataColumn by keeping only the input number of digits after the decimal point.

slice(length[, unit, start])

Slices and transforms the vDataColumn using a time series rule.

sub(x)

Subtracts the input element from the vDataColumn.

Moving Windows#

vDataFrame.func(...)

cummax(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative maximum of the input vDataColumn.

cummin(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative minimum of the input vDataColumn.

cumprod(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative product of the input vDataColumn.

cumsum(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative sum of the input vDataColumn.

rolling(func, window, columns[, by, ...])

Adds a new vDataColumn to the vDataFrame by using an advanced analytical window function on one or two specific vDataColumns.

Working with Text#

vDataFrame.func(...)

regexp(column, pattern[, method, position, ...])

Computes a new vDataColumn based on regular expressions.

vDataFrame[].func(...)

str_contains(pat)

Checks whether the regular expression matches each of the vDataColumn records.

str_count(pat)

Computes the number of matches for the regular expression in each record of the vDataColumn.

str_extract(pat)

Extracts the substring matching the regular expression from each record of the vDataColumn.

str_replace(to_replace[, value])

Replaces the regular expression matches in each of the vDataColumn records with an input value.

str_slice(start, step)

Slices the vDataColumn.

Binary Operator Functions#

vDataFrame[].func(...)

add(x)

Adds the input element to the vDataColumn.

div(x)

Divides the vDataColumn by the input element.

mul(x)

Multiplies the vDataColumn by the input element.

sub(x)

Subtracts the input element from the vDataColumn.

Basic Feature Selection#

vDataFrame.func(...)

chaid(response, columns[, nbins, method, ...])

Returns a CHAID (Chi-square Automatic Interaction Detector) tree.

chaid_columns([columns, max_cardinality])

Helper that returns the columns eligible for use in a CHAID tree.


Join, sort and transform#

vDataFrame.func(...)

append(input_relation[, expr1, expr2, union_all])

Merges the vDataFrame with another vDataFrame or an input relation, and returns a new vDataFrame.

copy()

Returns a deep copy of the vDataFrame.

flat_vmap([vmap_col, limit, exclude_columns])

Flattens the selected VMap.

groupby(columns[, expr, rollup, having])

Aggregates the vDataFrame by grouping its elements based on one or more specified criteria.

join(input_relation[, on, on_interpolate, ...])

Joins the vDataFrame with another one or an input_relation.

narrow(index[, columns, col_name, val_name])

Returns the Narrow Table of the vDataFrame using the input vDataColumns.

pivot(index, columns, values[, aggr, prefix])

Returns the Pivot of the vDataFrame using the input aggregation.

recommend(unique_id, item_id[, method, ...])

Recommend items based on the Collaborative Filtering (CF) technique.

sort(columns)

Sorts the vDataFrame using the input vDataColumns.

vDataFrame[].func(...)

add_copy(name)

Adds a copy vDataColumn to the parent vDataFrame.


Filter and Sample#

Sample#

vDataFrame.func(...)

sample([n, x, method, by])

Downsamples the input vDataFrame.

Balance#

vDataFrame.func(...)

balance(column[, method, x, order_by])

Balances the dataset using the input method.

Filter Columns#

vDataFrame.func(...)

drop([columns])

Drops the input vDataColumns from the vDataFrame.

select(columns)

Returns a copy of the vDataFrame with only the selected vDataColumns.

vDataFrame[].func(...)

drop([add_history])

Drops the vDataColumn from the vDataFrame.

drop_outliers([threshold, use_threshold, alpha])

Drops outliers in the vDataColumn.

Filter Records#

vDataFrame.func(...)

at_time(ts, time)

Filters the vDataFrame by only keeping the records at the input time.

between(column[, start, end, inplace])

Filters the vDataFrame by only keeping the records between two input elements.

between_time(ts[, start_time, end_time, inplace])

Filters the vDataFrame by only keeping the records between two input times.

filter([conditions])

Filters the vDataFrame using the input expressions.

first(ts, offset)

Filters the vDataFrame by only keeping the first records.

isin(val)

Checks whether specific records are in the vDataFrame and returns a new vDataFrame with the matching records.

last(ts, offset)

Filters the vDataFrame by only keeping the last records.

vDataFrame[].func(...)

isin(val, *args)

Checks whether specific records are in the vDataColumn and returns a new vDataFrame with the matching records.


Serialization#

General Format#

vDataFrame.func(...)

to_csv([path, sep, na_rep, quotechar, ...])

Creates a CSV file or folder of CSV files of the current vDataFrame relation.

to_json([path, usecols, order_by, n_files])

Creates a JSON file or folder of JSON files of the current vDataFrame relation.

to_shp(name, path[, usecols, overwrite, shape])

Creates a SHP file of the current vDataFrame relation.

In-memory Object#

vDataFrame.func(...)

to_numpy()

Converts the vDataFrame to a numpy.array.

to_pandas()

Converts the vDataFrame to a pandas.DataFrame.

to_list()

Converts the vDataFrame to a Python list.

to_geopandas(geometry)

Converts the vDataFrame to a Geopandas DataFrame.

Databases#

vDataFrame.func(...)

to_db(name[, usecols, relation_type, ...])

Saves the vDataFrame current relation to the Vertica database.

Binary Format#

vDataFrame.func(...)

to_pickle(name)

Saves the vDataFrame to a Python pickle file.

Utilities#

Information#

vDataFrame.func(...)

catcol([max_cardinality])

Returns the vDataFrame categorical vDataColumns.

current_relation([reindent, split])

Returns the current vDataFrame relation.

datecol()

Returns a list of the vDataColumns of type date in the vDataFrame.

dtypes()

Returns the different vDataColumns types.

empty()

Returns True if the vDataFrame is empty.

explain([digraph])

Provides information on how Vertica is computing the current vDataFrame relation.

get_columns([exclude_columns])

Returns the vDataFrame vDataColumns.

head([limit])

Returns the vDataFrame head.

idisplay()

Displays the vDataFrame as an interactive table.

iloc([limit, offset, columns])

Returns a part of the vDataFrame (delimited by an offset and a limit).

info()

Displays information about the different vDataFrame transformations.

memory_usage()

Returns the vDataFrame memory usage.

expected_store_usage([unit])

Returns the vDataFrame expected store usage.

numcol([exclude_columns])

Returns a list of names of the numerical vDataColumns in the vDataFrame.

shape()

Returns the number of rows and columns of the vDataFrame.

tail([limit])

Returns the tail of the vDataFrame.

vDataFrame[].func(...)

category()

Returns the category of the vDataColumn.

ctype()

Returns the vDataColumn DB type.

dtype()

Returns the vDataColumn DB type.

get_len()

Returns a new vDataColumn that represents the length of each element.

head([limit])

Returns the head of the vDataColumn.

iloc([limit, offset])

Returns a part of the vDataColumn (delimited by an offset and a limit).

isarray()

Returns True if the vDataColumn is an array, False otherwise.

isbool()

Returns True if the vDataColumn is boolean, False otherwise.

isdate()

Returns True if the vDataColumn category is date, False otherwise.

isnum()

Returns True if the vDataColumn is numerical, False otherwise.

isvmap()

Returns True if the vDataColumn category is VMap, False otherwise.

memory_usage()

Returns the vDataColumn memory usage.

store_usage()

Returns the vDataColumn expected store usage (unit: b).

tail([limit])

Returns the tail of the vDataColumn.

Management#

vDataFrame.func(...)

del_catalog()

Deletes the current vDataFrame catalog.

load([offset])

Loads a previous structure of the vDataFrame.

save()

Saves the current structure of the vDataFrame.