
vDataFrame#

class verticapy.vDataFrame(input_relation: str | list | dict | DataFrame | ndarray | TableSample, usecols: str | list[str] | None = None, schema: str | None = None, external: bool = False, symbol: str = '$', sql_push_ext: bool = True, _empty: bool = False, _is_sql_magic: int = 0, _clean_query: bool = True)#

An object that records all user modifications, allowing users to manipulate the relation without mutating the underlying data in Vertica. When changes are made, the vDataFrame queries the Vertica database, which aggregates and returns the final result. For each column of the relation, the vDataFrame creates a Virtual Column (vDataColumn) that stores the column alias and all user transformations.

Parameters#

input_relation: str | TableSample | pandas.DataFrame | list | numpy.ndarray | dict, optional

If input_relation is of type str, it must represent the relation (view, table, or temporary table) used to create the object. To reference a relation in a specific schema, the string must include both the schema and the relation name: 'schema.relation' or '"schema"."relation"'. Alternatively, you can use the schema parameter, in which case input_relation must exclude the schema name. The string can also be a SQL query used to create the vDataFrame. If input_relation is a pandas.DataFrame, a temporary local table is created. Otherwise, the vDataFrame is created using the generated SQL code of multiple UNIONs.

usecols: SQLColumns, optional

When input_relation is not an array-like type: list of columns used to create the object. Since Vertica is a columnar database, including fewer columns makes the process faster; do not hesitate to exclude unneeded columns. Otherwise: list of column names.

schema: str, optional

The schema of the relation. Specifying a schema allows you to specify a table within a particular schema, or to specify a schema and relation name that contain period ‘.’ characters. If specified, the input_relation cannot include a schema.

external: bool, optional

A boolean to indicate whether it is an external table. If set to True, a Connection Identifier Database must be defined.

symbol: str, optional

Symbol used to identify the external connection. One of the following: "$", "€", "£", "%", "@", "&", "§", "?", "!"

sql_push_ext: bool, optional

If set to True, the external vDataFrame attempts to push the entire query to the external table (only DQL statements - SELECT; for other statements, use SQL Magic directly). This can increase performance but might increase the error rate. For instance, some DBs might not support the same SQL as Vertica.
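To illustrate how a qualified relation name such as 'schema.relation' or '"schema"."relation"' could be resolved, here is a simplified sketch. The helper split_relation is hypothetical and is not part of VerticaPy's API:

```python
# Hypothetical helper (not part of VerticaPy) illustrating how a qualified
# relation name could be split into its schema and relation parts.
def split_relation(name: str):
    # Quoted identifiers may themselves contain '.', so handle the
    # '"schema"."relation"' form first.
    if name.startswith('"') and '"."' in name:
        schema, relation = name.split('"."', 1)
        return schema.strip('"'), relation.strip('"')
    if "." in name:
        schema, relation = name.split(".", 1)
        return schema, relation
    # No schema in the string: the 'schema' parameter would apply instead.
    return None, name

print(split_relation("public.titanic"))          # ('public', 'titanic')
print(split_relation('"my.schema"."my.table"'))  # ('my.schema', 'my.table')
```

This is why the quoted form is needed whenever a schema or relation name itself contains a period.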

Attributes#

vDataColumns: vDataColumn

Each vDataColumn of the vDataFrame is accessible by specifying its name between brackets. For example, to access the vDataColumn “myVC”: vDataFrame["myVC"].

Examples#

In this example, we will look at some of the ways to create a vDataFrame:

  • From dictionary

  • From numpy.array

  • From pandas.DataFrame

  • From SQL Query

  • From a table

After that we will also look at the mathematical operators that are available:

  • Pandas-Like

  • SQL-Like

Lastly, we will look at some examples of functions that can be applied directly on the vDataFrame.


Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.
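The same precaution applies to any library with generic function names. As a generic Python illustration (using the standard statistics module in place of verticapy), an alias keeps a library's median reachable even when the bare name is taken:

```python
# Generic illustration of the aliasing advice, using the standard library's
# statistics module in place of verticapy.
import statistics as st

# A bare `from statistics import *` would make this assignment shadow the
# library's function; with an alias, the two names never collide.
median = "a name already used elsewhere in our script"

print(st.median([1, 3, 5]))  # 3
```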

Dictionary#

This is the most direct way to create a vDataFrame:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

     cats        reps
     Varchar(1)  Integer
1    A           2
2    B           4
3    C           8

NumPy Array#

We can also use a numpy.array:

import numpy as np

vdf = vp.vDataFrame(
    np.array(
        [
            [1, 2, 3],
            [4, 5, 6],
            [7, 8, 9],
        ],
    ),
    usecols = [
        "col_A",
        "col_B",
        "col_C",
    ],
)

     col_A  col_B  col_C
1    1      2      3
2    4      5      6
3    7      8      9
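As mentioned in the input_relation description, array-like inputs are ingested through SQL built from multiple UNIONs. The following rough sketch conveys the idea; rows_to_union_sql is a hypothetical helper, not VerticaPy's actual generator:

```python
# Hypothetical sketch: build one SELECT per row and chain them with
# UNION ALL, which is the general idea behind ingesting array-like data.
def rows_to_union_sql(rows, colnames):
    selects = []
    for row in rows:
        cols = ", ".join(f"{value} AS {name}" for value, name in zip(row, colnames))
        selects.append(f"SELECT {cols}")
    return " UNION ALL ".join(selects)

print(rows_to_union_sql([[1, 2, 3], [4, 5, 6]], ["col_A", "col_B", "col_C"]))
# SELECT 1 AS col_A, 2 AS col_B, 3 AS col_C UNION ALL SELECT 4 AS col_A, 5 AS col_B, 6 AS col_C
```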

Pandas DataFrame#

We can also use a pandas.DataFrame object:

# Import Pandas library
import pandas as pd

# Create the data dictionary
data = {
    'Name': ['John', 'Ali', 'Pheona'],
    'Age': [25, 30, 22],
    'City': ['New York', 'Gaza', 'Los Angeles'],
}


# Create the Pandas DataFrame object
df = pd.DataFrame(data)

# Create a vDataFrame
vdf = vp.vDataFrame(df)
     Name    Age  City
1    Ali     30   Gaza
2    John    25   New York
3    Pheona  22   Los Angeles

SQL Query#

We can also use a SQL Query:

# Write a SQL Query to fetch three rows from the Titanic table
sql_query = "SELECT age, sex FROM public.titanic LIMIT 3;"

# Create a vDataFrame
vdf = vp.vDataFrame(sql_query)
     age   sex
1    2.0   female
2    30.0  male
3    25.0  female

Table#

A table can also be directly ingested:

# Create a vDataFrame from the titanic table in public schema
vdf = vp.vDataFrame("public.titanic")
     pclass  ...  survived  home.dest
1    1       ...  0         Montreal, PQ / Chesterville, ON
2    1       ...  0         Montreal, PQ / Chesterville, ON
3    1       ...  0         Montreal, PQ / Chesterville, ON
4    1       ...  0         Belfast, NI
5    1       ...  0         Montevideo, Uruguay
6    1       ...  0         New York, NY
7    1       ...  0         New York, NY
8    1       ...  0         Montreal, PQ
9    1       ...  0         Winnipeg, MN
10   1       ...  0         San Francisco, CA
11   1       ...  0         Trenton, NJ
12   1       ...  0         London / Winnipeg, MB
13   1       ...  0         Pomeroy, WA
14   1       ...  0         Omaha, NE
15   1       ...  0         Philadelphia, PA
16   1       ...  0         Washington, DC
17   1       ...  0         [null]
18   1       ...  0         New York, NY
19   1       ...  0         Montevideo, Uruguay
20   1       ...  0         Montevideo, Uruguay

Mathematical Operators#

We can use all the common mathematical operators on the vDataFrame.

Pandas-Like#

First let us re-create a simple vDataFrame:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

To filter on a specific string value in a specific column:

result = vdf[vdf["cats"] == "A"]
     cats  reps
1    A     2

Similarly, we can filter using mathematical comparisons on numerical columns:

result = vdf[vdf["reps"] > 2]
     cats  reps
1    B     4
2    C     8

Both filters can also be combined:

result = vdf[vdf["reps"] > 2][vdf["cats"] == "C"]
     cats  reps
1    C     8

We can also perform mathematical calculations on the elements inside the vDataFrame quite conveniently:

vdf["new"] = abs(vdf["reps"] * 4 - 100)
     cats  reps  new
1    A     2     92
2    B     4     84
3    C     8     68
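Such arithmetic is not computed client-side: the vDataFrame records it as a SQL expression and lets Vertica evaluate it. A toy illustration of the idea (the Expr class is hypothetical, not VerticaPy's implementation):

```python
# Toy expression wrapper (hypothetical): arithmetic on the object builds a
# SQL string instead of computing values locally.
class Expr:
    def __init__(self, sql):
        self.sql = sql

    def __mul__(self, x):
        return Expr(f"({self.sql}) * {x}")

    def __sub__(self, x):
        return Expr(f"({self.sql}) - {x}")

    def __abs__(self):
        return Expr(f"ABS({self.sql})")

new = abs(Expr("reps") * 4 - 100)
print(new.sql)  # ABS(((reps) * 4) - 100)
```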

SQL-Like#

SQL queries can be applied directly to the vDataFrame using StringSQL, which adds a new level of flexibility. StringSQL allows the user to generate formatted SQL queries in string form. Since any SQL condition in string format can be passed to the vDataFrame, you can seamlessly pass the output of StringSQL directly to the vDataFrame.

# Create the SQL Query using StringSQL
sql_query = vp.StringSQL("reps > 2")

# Get the output as a vDataFrame
result = vdf[sql_query]
     cats  reps  new
1    B     4     84
2    C     8     68

Note

Have a look at StringSQL for more details.
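To convey the idea behind such a wrapper, here is a minimal sketch of a StringSQL-like class (SQLCondition is hypothetical; see StringSQL for the real API): it stores a SQL boolean expression as text and lets conditions be composed:

```python
# Minimal sketch of a StringSQL-like wrapper (hypothetical class).
class SQLCondition:
    def __init__(self, condition: str):
        self.condition = condition

    def __and__(self, other):
        # Combine two conditions into one SQL boolean expression.
        return SQLCondition(f"({self.condition}) AND ({other.condition})")

    def __str__(self):
        return self.condition

cond = SQLCondition("reps > 2") & SQLCondition("cats = 'B'")
print(cond)  # (reps > 2) AND (cats = 'B')
```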

Here is a slightly more advanced SQL query:

# Create the SQL Query using StringSQL
sql_query = vp.StringSQL("reps BETWEEN 3 AND 8 AND cats = 'B'")

# Get the output as a vDataFrame
result = vdf[sql_query]
     cats  reps  new
1    B     4     84

Direct Functions#

There are many methods that can be used directly on a vDataFrame. Let us look at how conveniently we can call them. Here is an example of the vDataFrame.describe() method:

# Import the dataset
from verticapy.datasets import load_titanic

# Create vDataFrame
vdf = load_titanic()

# Summarize the vDataFrame
vdf.describe()
            ...    approx_75%         max
"pclass"    ...           3.0         3.0
"survived"  ...           1.0         1.0
"age"       ...          39.0        80.0
"sibsp"     ...           1.0         8.0
"parch"     ...           0.0         9.0
"fare"      ...       31.3875    512.3292
"body"      ...         257.5       328.0
Rows: 1-7 | Columns: 9

Note

Explore the various vDataFrame and vDataColumn methods to see more examples.

See also

vDataColumn : Columns of vDataFrame object.
class verticapy.vDataColumn(alias: str, transformations: list | None = None, parent: vDataFrame | None = None, catalog: dict | None = None)#

Python object that stores all user transformations. If the vDataFrame represents the entire relation, a vDataColumn can be seen as one column of that relation. Through its abstractions, the vDataColumn simplifies several processes.

Parameters#

alias: str

vDataColumn alias.

transformations: list, optional

List of the different transformations. Each transformation must be similar to the following: (function, type, category)

parent: vDataFrame, optional

Parent of the vDataColumn. One vDataFrame can have multiple children vDataColumn, whereas one vDataColumn can only have one parent.

catalog: dict, optional

Catalog where each key corresponds to an aggregation. vDataColumn will memorize the already computed aggregations to increase performance. The catalog is updated when the parent vDataFrame is modified.
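The catalog behaves like a per-column memoization cache: an aggregation is computed once, served from the catalog afterwards, and discarded when the data changes. A simplified sketch (AggCatalog is hypothetical, not VerticaPy's implementation):

```python
# Simplified sketch of the aggregation catalog (hypothetical class).
class AggCatalog:
    def __init__(self):
        self.cache = {}
        self.calls = 0  # number of real computations, for illustration

    def get(self, agg, compute):
        # Compute the aggregation only if it is not already memorized.
        if agg not in self.cache:
            self.calls += 1
            self.cache[agg] = compute()
        return self.cache[agg]

    def invalidate(self):
        # Called when the parent vDataFrame is modified.
        self.cache.clear()

catalog = AggCatalog()
data = [2, 4, 8]
catalog.get("max", lambda: max(data))  # computed
catalog.get("max", lambda: max(data))  # served from the catalog
print(catalog.calls)  # 1
```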

Attributes#

alias, str:

vDataColumn alias.

catalog, dict:

Catalog of pre-computed aggregations.

parent, vDataFrame:

Parent of the vDataColumn.

transformations, list:

List of the different transformations.

Examples#

Let’s begin by importing VerticaPy.

import verticapy as vp

Hint

By assigning an alias to verticapy, we mitigate the risk of code collisions with other libraries. This precaution is necessary because verticapy uses commonly known function names like “average” and “median”, which can potentially lead to naming conflicts. The use of an alias ensures that the functions from verticapy are used as intended without interfering with functions from other libraries.

Let’s create a vDataFrame with two vDataColumn:

vdf = vp.vDataFrame(
    {
        "cats": ["A", "B", "C"],
        "reps": [2, 4, 8],
    },
)

     cats        reps
     Varchar(1)  Integer
1    A           2
2    B           4
3    C           8

“cats” and “reps” are vDataColumn objects. They can be accessed in the same way as the entries of a dictionary or the columns of a pandas.DataFrame. They represent the columns of the entire relation.

For example, the following code will access the vDataColumn “cats”:

vdf["cats"]

Note

vDataColumn objects are columns inside a vDataFrame; they have their own methods but cannot exist without a parent vDataFrame. Please refer to vDataFrame to see a complete example.
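The parent/child relationship described in the note can be sketched with two toy classes (Frame and Column are hypothetical, not VerticaPy's): one frame owns many columns, while each column points back to exactly one parent:

```python
# Toy sketch of the parent/child relationship (hypothetical classes).
class Column:
    def __init__(self, alias, parent):
        self.alias = alias
        self.parent = parent  # exactly one parent frame

class Frame:
    def __init__(self, colnames):
        # One frame owns multiple columns, accessible with brackets.
        self.columns = {name: Column(name, parent=self) for name in colnames}

    def __getitem__(self, name):
        return self.columns[name]

frame = Frame(["cats", "reps"])
print(frame["cats"].alias)            # cats
print(frame["cats"].parent is frame)  # True
```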

See also

vDataFrame : Main VerticaPy dataset object.

Plotting#

There are three main plotting libraries available in VerticaPy:

  • Plotly

  • Highcharts

  • Matplotlib

The base classes of all the plotting libraries are listed below.

Note

The documentation for these classes is provided solely to enhance the user’s understanding of the implementations. Users are not required to interact directly with these classes, and we do not recommend doing so.

Plotting Base Classes

ACFPlot(*args, **kwargs)

BarChart(*args, **kwargs)

BarChart2D(*args, **kwargs)

BoxPlot(*args, **kwargs)

CandleStick(*args, **kwargs)

ChampionChallengerPlot(*args, **kwargs)

ContourPlot(*args, **kwargs)

CutoffCurve(*args, **kwargs)

DensityPlot(*args, **kwargs)

ElbowCurve(*args, **kwargs)

HeatMap(*args, **kwargs)

Histogram(*args, **kwargs)

HorizontalBarChart(*args, **kwargs)

HorizontalBarChart2D(*args, **kwargs)

ImportanceBarChart(*args, **kwargs)

LiftChart(*args, **kwargs)

LinePlot(*args, **kwargs)

LogisticRegressionPlot(*args, **kwargs)

LOFPlot(*args, **kwargs)

MultiDensityPlot(*args, **kwargs)

MultiLinePlot(*args, **kwargs)

NestedPieChart(*args, **kwargs)

OutliersPlot(*args, **kwargs)

PCACirclePlot(*args, **kwargs)

PieChart(*args, **kwargs)

PlotlyBase(*args, **kwargs)

Plotly Base Class.

PRCCurve(*args, **kwargs)

RangeCurve(*args, **kwargs)

RegressionPlot(*args, **kwargs)

RegressionTreePlot(*args, **kwargs)

ROCCurve(*args, **kwargs)

ScatterPlot(*args, **kwargs)

SpiderChart(*args, **kwargs)

StepwisePlot(*args, **kwargs)

SVMClassifierPlot(*args, **kwargs)

TSPlot(*args, **kwargs)

VoronoiPlot(*args, **kwargs)

ACFPlot(*args, **kwargs)

ACFPACFPlot(*args, **kwargs)

BarChart(*args, **kwargs)

BarChart2D(*args, **kwargs)

BoxPlot(*args, **kwargs)

CandleStick(*args, **kwargs)

ChampionChallengerPlot(*args, **kwargs)

ContourPlot(*args, **kwargs)

CutoffCurve(*args, **kwargs)

DensityPlot(*args, **kwargs)

ElbowCurve(*args, **kwargs)

HeatMap(*args, **kwargs)

Histogram(*args, **kwargs)

HorizontalBarChart(*args, **kwargs)

HorizontalBarChart2D(*args, **kwargs)

ImportanceBarChart(*args, **kwargs)

LiftChart(*args, **kwargs)

LinePlot(*args, **kwargs)

LogisticRegressionPlot(*args, **kwargs)

LOFPlot(*args, **kwargs)

MultiDensityPlot(*args, **kwargs)

MultiLinePlot(*args, **kwargs)

NestedPieChart(*args, **kwargs)

OutliersPlot(*args, **kwargs)

PCACirclePlot(*args, **kwargs)

PieChart(*args, **kwargs)

PRCCurve(*args, **kwargs)

RangeCurve(*args, **kwargs)

RegressionPlot(*args, **kwargs)

RegressionTreePlot(*args, **kwargs)

ROCCurve(*args, **kwargs)

ScatterPlot(*args, **kwargs)

SpiderChart(*args, **kwargs)

StepwisePlot(*args, **kwargs)

SVMClassifierPlot(*args, **kwargs)

TSPlot(*args, **kwargs)

ACFPlot(*args, **kwargs)

ACFPACFPlot(*args, **kwargs)

AnimatedBarChart(*args, **kwargs)

AnimatedBubblePlot(*args, **kwargs)

AnimatedLinePlot(*args, **kwargs)

AnimatedPieChart(*args, **kwargs)

BarChart(*args, **kwargs)

BarChart2D(*args, **kwargs)

BoxPlot(*args, **kwargs)

CandleStick(*args, **kwargs)

ChampionChallengerPlot(*args, **kwargs)

ContourPlot(*args, **kwargs)

CutoffCurve(*args, **kwargs)

DensityPlot(*args, **kwargs)

DensityPlot2D(*args, **kwargs)

ElbowCurve(*args, **kwargs)

HeatMap(*args, **kwargs)

Histogram(*args, **kwargs)

HorizontalBarChart(*args, **kwargs)

HorizontalBarChart2D(*args, **kwargs)

ImportanceBarChart(*args, **kwargs)

LiftChart(*args, **kwargs)

LinePlot(*args, **kwargs)

LogisticRegressionPlot(*args, **kwargs)

LOFPlot(*args, **kwargs)

MultiDensityPlot(*args, **kwargs)

MultiLinePlot(*args, **kwargs)

NestedPieChart(*args, **kwargs)

OutliersPlot(*args, **kwargs)

PCACirclePlot(*args, **kwargs)

PieChart(*args, **kwargs)

PRCCurve(*args, **kwargs)

RangeCurve(*args, **kwargs)

RegressionPlot(*args, **kwargs)

RegressionTreePlot(*args, **kwargs)

ROCCurve(*args, **kwargs)

ScatterMatrix(*args, **kwargs)

ScatterPlot(*args, **kwargs)

SpiderChart(*args, **kwargs)

StepwisePlot(*args, **kwargs)

SVMClassifierPlot(*args, **kwargs)

TSPlot(*args, **kwargs)

VoronoiPlot(*args, **kwargs)

General#

vDataFrame.func(...)

bar(columns[, method, of, max_cardinality, ...])

Draws the bar chart of the input vDataColumns based on an aggregation.

barh(columns[, method, of, max_cardinality, ...])

Draws the horizontal bar chart of the input vDataColumns based on an aggregation.

boxplot([columns, q, max_nb_fliers, whis, chart])

Draws the Box Plot of the input vDataColumns.

contour(columns, func[, nbins, chart])

Draws the contour plot of the input function using two input vDataColumns.

density([columns, bandwidth, kernel, nbins, ...])

Draws the vDataColumns Density Plot.

heatmap(columns[, method, of, h, chart])

Draws the Heatmap of the two input vDataColumns.

hexbin(columns[, method, of, bbox, img, chart])

Draws the Hexbin of the input vDataColumns based on an aggregation.

hist(columns[, method, of, h, chart])

Draws the histograms of the input vDataColumns based on an aggregation.

outliers_plot(columns[, threshold, ...])

Draws the global outliers plot of one or two columns based on their ZSCORE.

pie(columns[, method, of, max_cardinality, ...])

Draws the nested pie chart of the input vDataColumns.

pivot_table(columns[, method, of, ...])

Draws the pivot table of one or two columns based on an aggregation.

plot(ts[, columns, start_date, end_date, ...])

Draws the time series.

scatter(columns[, by, size, cmap_col, ...])

Draws the scatter plot of the input vDataColumns.

scatter_matrix([columns, max_nb_points])

Draws the scatter matrix of the vDataFrame.

pivot_table_chi2(response[, columns, nbins, ...])

Returns the chi-square term using the pivot table of the response vDataColumn against the input vDataColumn.

range_plot(columns, ts[, q, start_date, ...])

Draws the range plot of the input vDataColumns.

vDataFrame[].func(...)

bar([method, of, max_cardinality, nbins, h, ...])

Draws the bar chart of the vDataColumn based on an aggregation.

barh([method, of, max_cardinality, nbins, ...])

Draws the horizontal bar chart of the vDataColumn based on an aggregation.

candlestick(ts[, method, q, start_date, ...])

Draws the Time Series of the vDataColumn.

boxplot([by, q, h, max_cardinality, ...])

Draws the box plot of the vDataColumn.

density([by, bandwidth, kernel, nbins, ...])

Draws the vDataColumn Density Plot.

hist([by, method, of, h, h_by, ...])

Draws the histogram of the input vDataColumn based on an aggregation.

pie([method, of, max_cardinality, h, kind, ...])

Draws the pie chart of the vDataColumn based on an aggregation.

plot(ts[, by, start_date, end_date, kind, chart])

Draws the Time Series of the vDataColumn.

range_plot(ts[, q, start_date, end_date, ...])

Draws the range plot of the vDataColumn.

spider([by, method, of, max_cardinality, h, ...])

Draws the spider plot of the input vDataColumn based on an aggregation.

Animated#

vDataFrame.func(...)

animated_bar(ts, columns[, by, start_date, ...])

Draws the animated bar chart (bar race).

animated_pie(ts, columns[, by, start_date, ...])

Draws the animated pie chart.

animated_plot(ts[, columns, by, start_date, ...])

Draws the animated line plot.

animated_scatter(ts, columns[, by, ...])

Draws the animated scatter plot.


Descriptive Statistics#

vDataFrame.func(...)

aad([columns])

Utilizes the aad (Average Absolute Deviation) aggregation method to analyze the vDataColumn.

aggregate(func[, columns, ncols_block, ...])

Aggregates the vDataFrame using the input functions.

all(columns, **agg_kwargs)

Applies the BOOL_AND aggregation method to the vDataFrame.

any(columns, **agg_kwargs)

Uses the BOOL_OR aggregation method in the vDataFrame.

avg([columns])

This operation aggregates the vDataFrame using the AVG aggregation, which calculates the average value for the selected column or columns.

count([columns])

This operation aggregates the vDataFrame using the COUNT aggregation, providing the count of non-missing values for specified columns.

count_percent([columns, sort_result, desc])

Performs aggregation on the vDataFrame using a list of aggregate functions, including count and percent.

describe([method, columns, unique, ...])

This function aggregates the vDataFrame using multiple statistical aggregations such as minimum (min), maximum (max), median, cardinality (unique), and other relevant statistics.

duplicated([columns, count, limit])

This function returns a list or set of values that occur more than once within the dataset.

kurtosis([columns])

Calculates the kurtosis of the vDataFrame to obtain a measure of the data's peakedness or tailedness.

mad([columns])

Utilizes the mad (Median Absolute Deviation) aggregation method with the vDataFrame.

max([columns])

Aggregates the vDataFrame by applying the MAX aggregation, which calculates the maximum value, for the specified columns.

median([columns, approx])

Aggregates the vDataFrame using the MEDIAN or APPROX_MEDIAN aggregation, which calculates the median value for the specified columns.

min([columns])

Aggregates the vDataFrame by applying the MIN aggregation, which calculates the minimum value, for the specified columns.

nunique([columns, approx])

When aggregating the vDataFrame using nunique (cardinality), VerticaPy employs the COUNT DISTINCT function to determine the number of unique values in a particular column.

product([columns])

Aggregates the vDataFrame by applying the product aggregation function.

quantile(q[, columns, approx])

Aggregates the vDataFrame using specified quantile.

score(y_true, y_score, metric)

Computes the score using the input columns and the input metric.

sem([columns])

Leverages the sem (Standard Error of the Mean) aggregation technique to perform analysis and aggregation on the vDataFrame.

skewness([columns])

Utilizes the skewness aggregation method to analyze and aggregate the vDataFrame.

std([columns])

Aggregates the vDataFrame using STDDEV aggregation (Standard Deviation), providing insights into the spread or variability of data for the selected columns.

sum([columns])

Aggregates the vDataFrame using SUM aggregation, which computes the total sum of values for the specified columns, providing a cumulative view of numerical data.

var([columns])

Aggregates the vDataFrame using VAR aggregation (Variance), providing insights into the spread or variability of data for the selected columns.

vDataFrame[].func(...)

aad()

Utilizes the aad (Average Absolute Deviation) aggregation method to analyze the vDataColumn.

aggregate(func)

Aggregates the vDataFrame using the input functions.

avg()

This operation aggregates the vDataFrame using the AVG aggregation, which calculates the average value for the input column.

count()

This operation aggregates the vDataFrame using the COUNT aggregation, providing the count of non-missing values for the input column.

describe([method, max_cardinality, numcol])

This function aggregates the vDataColumn using multiple statistical aggregations such as minimum (min), maximum (max), median, cardinality (unique), and other relevant statistics.

distinct(**kwargs)

This function returns the distinct categories or unique values within a vDataColumn.

kurtosis()

Calculates the kurtosis of the vDataColumn to obtain a measure of the data's peakedness or tailedness.

mad()

Utilizes the mad (Median Absolute Deviation) aggregation method with the vDataFrame.

max()

Aggregates the vDataFrame by applying the 'MAX' aggregation, which calculates the maximum value, for the input column.

median([approx])

Aggregates the vDataFrame using the MEDIAN or APPROX_MEDIAN aggregation, which calculates the median value for the specified columns.

min()

Aggregates the vDataFrame by applying the MIN aggregation, which calculates the minimum value, for the input column.

mode([dropna, n])

This function returns the nth most frequently occurring element in the vDataColumn.

nlargest([n])

Returns the n largest vDataColumn elements.

nsmallest([n])

Returns the n smallest elements in the vDataColumn.

nunique([approx])

When aggregating the vDataFrame using nunique (cardinality), VerticaPy employs the COUNT DISTINCT function to determine the number of unique values in particular columns.

product()

Aggregates the vDataColumn by applying the product aggregation function.

quantile(q[, approx])

Aggregates the vDataColumn using a specified quantile.

sem()

Leverages the sem (Standard Error of the Mean) aggregation technique to perform analysis and aggregation on the vDataColumn.

skewness()

Utilizes the skewness aggregation method to analyze and aggregate the vDataColumn.

std()

Aggregates the vDataFrame using STDDEV aggregation (Standard Deviation), providing insights into the spread or variability of data for the input column.

sum()

Aggregates the vDataFrame using SUM aggregation, which computes the total sum of values for the specified columns, providing a cumulative view of numerical data.

topk([k, dropna])

This function returns the k most frequently occurring elements in a column, along with their distribution expressed as percentages.

value_counts([k])

This function returns the k most frequently occurring elements in a column, along with information about how often they occur.

var()

Aggregates the vDataFrame using VAR aggregation (Variance), providing insights into the spread or variability of data for the input column.


Correlation & Dependency#

General#

vDataFrame.func(...)

acf(column, ts[, by, p, unit, method, ...])

Calculates the correlations between the specified vDataColumn and its various time lags.

corr([columns, method, mround, focus, show, ...])

Calculates the Correlation Matrix for the vDataFrame.

corr_pvalue(column1, column2[, method])

Computes the Correlation Coefficient between two input vDataColumns, along with its associated p-value.

cov([columns, focus, show, chart])

Computes the covariance matrix of the vDataFrame.

iv_woe(y[, columns, nbins, show, chart])

Calculates the Information Value (IV) Table, a powerful tool for assessing the predictive capability of an independent variable concerning a dependent variable.

pacf(column, ts[, by, p, unit, method, ...])

Computes the partial autocorrelations of the specified vDataColumn.

regr([columns, method, show, chart])

Calculates the regression matrix for the given vDataFrame.

vDataFrame[].func(...)

iv_woe(y[, nbins])

Calculates the Information Value (IV) / Weight Of Evidence (WOE) Table.

Time-series#

vDataFrame.func(...)

acf(column, ts[, by, p, unit, method, ...])

Calculates the correlations between the specified vDataColumn and its various time lags.

pacf(column, ts[, by, p, unit, method, ...])

Computes the partial autocorrelations of the specified vDataColumn.


Preprocessing#

Encoding#

vDataFrame.func(...)

case_when(name, *args)

Creates a new feature by evaluating the provided conditions.

one_hot_encode([columns, max_cardinality, ...])

Encodes the vDataColumns using the One Hot Encoding algorithm.

vDataFrame[].func(...)

cut(breaks[, labels, include_lowest, right])

Discretizes the vDataColumn using the input list.

decode(*args)

Encodes the vDataColumn using a user-defined encoding.

discretize([method, h, nbins, k, ...])

Discretizes the vDataColumn using the input method.

label_encode()

Encodes the vDataColumn using a bijection from the different categories to [0, n - 1] (n being the vDataColumn cardinality).

mean_encode(response)

Encodes the vDataColumn using the average of the response partitioned by the different vDataColumn categories.

one_hot_encode([prefix, prefix_sep, ...])

Encodes the vDataColumn with the One-Hot Encoding algorithm.

Dealing With Missing Values#

vDataFrame.func(...)

dropna([columns])

Filters the specified vDataColumns in a vDataFrame for missing values.

fillna([val, method, numeric_only])

Fills missing elements in vDataColumn using specific rules.

interpolate(ts, rule[, method, by])

Computes a regular time interval vDataFrame by interpolating the missing values using different techniques.

vDataFrame[].func(...)

dropna()

Filters the vDataFrame where the vDataColumn is missing.

fillna([val, method, expr, by, order_by])

Fills missing elements in the vDataColumn with a user-specified rule.

Duplicate Values#

vDataFrame.func(...)

drop_duplicates([columns])

Filters the duplicates using a partition by the input vDataColumns.

Normalization and Global Outliers#

vDataFrame.func(...)

outliers([columns, name, threshold, robust])

Adds a new vDataColumn labeled with 0 or 1, where 1 indicates that the record is a global outlier.

scale([columns, method])

Scales the input vDataColumns using the input method.

vDataFrame[].func(...)

clip([lower, upper])

Clips the vDataColumn by transforming the values less than the lower bound to the lower bound value and the values higher than the upper bound to the upper bound value.

fill_outliers([method, threshold, ...])

Fills the vDataColumns outliers using the input method.

normalize([method, by, return_trans])

Scales the input vDataColumns using the input method.

Data Types Conversion#

vDataFrame.func(...)

astype(dtype)

Converts the vDataColumns to the input types.

bool_to_int()

Converts all booleans vDataColumns to integers.

vDataFrame[].func(...)

astype(dtype)

Converts the vDataColumn to the input type.

Formatting#

vDataFrame.func(...)

format_colnames(*args[, columns, ...])

Method used to format the input columns by using the vDataFrame columns' names.

get_match_index(x, col_list[, str_check])

Returns the matching index.

is_colname_in(column)

Method used to check if the input column name is used by the vDataFrame.

merge_similar_names(skip_word)

Merges columns with similar names.

explode_array(index, column[, prefix, delimiter])

Returns exploded vDataFrame of array-like columns in a vDataFrame.

vDataFrame[].func(...)

astype(dtype)

Converts the vDataColumn to the input type.

rename(new_name)

Renames the vDataColumn by dropping the current vDataColumn and creating a copy with the specified name.

Splitting into Train/Test#

vDataFrame.func(...)

train_test_split([test_size, order_by, ...])

Creates two vDataFrames (train/test), which can be used to evaluate a model.

Working with Weights#

vDataFrame.func(...)

add_duplicates(weight[, use_gcd])

Duplicates the vDataFrame using the input weight.

Complete Disjunctive Table#

vDataFrame.func(...)

cdt([columns, max_cardinality, nbins, tcdt, ...])

Returns the complete disjunctive table of the vDataFrame.


Features Engineering#

Analytic Functions#

vDataFrame.func(...)

analytic(func[, columns, by, order_by, ...])

Adds a new vDataColumn to the vDataFrame by using an advanced analytical function on one or two specific vDataColumns.

interpolate(ts, rule[, method, by])

Computes a regular time interval vDataFrame by interpolating the missing values using different techniques.

sessionize(ts[, by, session_threshold, name])

Adds a new vDataColumn to the vDataFrame that corresponds to sessions (user activity during a specific time).

Custom Features Creation#

vDataFrame.func(...)

case_when(name, *args)

Creates a new feature by evaluating the provided conditions.

eval(name, expr)

Evaluates a customized expression.

Features Transformations#

vDataFrame.func(...)

abs([columns])

Applies the absolute value function to all input vDataColumns.

apply(func)

Applies each function of the dictionary to the input vDataColumns.

applymap(func[, numeric_only])

Applies a function to all vDataColumns.

polynomial_comb([columns, r])

Returns a vDataFrame containing the different product combinations of the input vDataColumn.

swap(column1, column2)

Swaps the two input vDataColumns.

vDataFrame[].func(...)

abs()

Applies the absolute value function to the input vDataColumn.

add(x)

Adds the input element to the vDataColumn.

apply(func[, copy_name])

Applies a function to the vDataColumn.

apply_fun(func[, x])

Applies a default function to the vDataColumn.

date_part(field)

Extracts a specific TS field from the vDataColumn (only if the vDataColumn type is date like).

div(x)

Divides the vDataColumn by the input element.

mul(x)

Multiplies the vDataColumn by the input element.

round(n)

Rounds the vDataColumn by keeping only the input number of digits after the decimal point.

slice(length[, unit, start])

Slices and transforms the vDataColumn using a time series rule.

sub(x)

Subtracts the input element from the vDataColumn.

Moving Windows#

vDataFrame.func(...)

cummax(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative maximum of the input vDataColumn.

cummin(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative minimum of the input vDataColumn.

cumprod(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative product of the input vDataColumn.

cumsum(column[, by, order_by, name])

Adds a new vDataColumn to the vDataFrame by computing the cumulative sum of the input vDataColumn.

rolling(func, window, columns[, by, ...])

Adds a new vDataColumn to the vDataFrame by using an advanced analytical window function on one or two specific vDataColumns.

Working with Text#

vDataFrame.func(...)

regexp(column, pattern[, method, position, ...])

Computes a new vDataColumn based on regular expressions.

vDataFrame[].func(...)

str_contains(pat)

Checks whether the regular expression matches each of the vDataColumn records.

str_count(pat)

Computes the number of matches for the regular expression in each record of the vDataColumn.

str_extract(pat)

Extracts the substring matching the regular expression from each record of the vDataColumn.

str_replace(to_replace[, value])

Replaces the regular expression matches in each of the vDataColumn records with an input value.

str_slice(start, step)

Slices the vDataColumn.

Binary Operator Functions#

vDataFrame[].func(...)

add(x)

Adds the input element to the vDataColumn.

div(x)

Divides the vDataColumn by the input element.

mul(x)

Multiplies the vDataColumn by the input element.

sub(x)

Subtracts the input element from the vDataColumn.

Basic Feature Selection#

vDataFrame.func(...)

chaid(response, columns[, nbins, method, ...])

Returns a CHAID (Chi-square Automatic Interaction Detector) tree.

chaid_columns([columns, max_cardinality])

Helper that returns the columns eligible for use in a CHAID tree.


Join, sort and transform#

vDataFrame.func(...)

append(input_relation[, expr1, expr2, union_all])

Merges the vDataFrame with another vDataFrame or an input relation, and returns a new vDataFrame.

copy()

Returns a deep copy of the vDataFrame.

flat_vmap([vmap_col, limit, exclude_columns])

Flattens the selected VMap.

groupby(columns[, expr, rollup, having])

Aggregates the vDataFrame by grouping its elements based on one or more specified criteria.

join(input_relation[, on, on_interpolate, ...])

Joins the vDataFrame with another one or an input_relation.

narrow(index[, columns, col_name, val_name])

Returns the Narrow Table of the vDataFrame using the input vDataColumns.

pivot(index, columns, values[, aggr, prefix])

Returns the Pivot of the vDataFrame using the input aggregation.

recommend(unique_id, item_id[, method, ...])

Recommend items based on the Collaborative Filtering (CF) technique.

sort(columns)

Sorts the vDataFrame using the input vDataColumns.

vDataFrame[].func(...)

add_copy(name)

Adds a copy vDataColumn to the parent vDataFrame.


Filter and Sample#

Sample#

vDataFrame.func(...)

sample([n, x, method, by])

Downsamples the input vDataFrame.

Balance#

vDataFrame.func(...)

balance(column[, method, x, order_by])

Balances the dataset using the input method.

Filter Columns#

vDataFrame.func(...)

drop([columns])

Drops the input vDataColumns from the vDataFrame.

select(columns)

Returns a copy of the vDataFrame with only the selected vDataColumns.

vDataFrame[].func(...)

drop([add_history])

Drops the vDataColumn from the vDataFrame.

drop_outliers([threshold, use_threshold, alpha])

Drops outliers in the vDataColumn.

Filter Records#

vDataFrame.func(...)

at_time(ts, time)

Filters the vDataFrame by only keeping the records at the input time.

between(column[, start, end, inplace])

Filters the vDataFrame by only keeping the records between two input elements.

between_time(ts[, start_time, end_time, inplace])

Filters the vDataFrame by only keeping the records between two input times.

filter([conditions])

Filters the vDataFrame using the input expressions.

first(ts, offset)

Filters the vDataFrame by only keeping the first records.

isin(val)

Checks whether specific records are in the vDataFrame and returns a new vDataFrame with the matching records.

last(ts, offset)

Filters the vDataFrame by only keeping the last records.

vDataFrame[].func(...)

isin(val, *args)

Checks whether specific records are in the vDataColumn and returns a new vDataFrame with the matching records.


Serialization#

General Format#

vDataFrame.func(...)

to_csv([path, sep, na_rep, quotechar, ...])

Creates a CSV file or folder of CSV files of the current vDataFrame relation.

to_json([path, usecols, order_by, n_files])

Creates a JSON file or folder of JSON files of the current vDataFrame relation.

to_shp(name, path[, usecols, overwrite, shape])

Creates a SHP file of the current vDataFrame relation.

In-memory Object#

vDataFrame.func(...)

to_numpy()

Converts the vDataFrame to a numpy.array.

to_pandas()

Converts the vDataFrame to a pandas.DataFrame.

to_list()

Converts the vDataFrame to a Python list.

to_geopandas(geometry)

Converts the vDataFrame to a Geopandas DataFrame.

Databases#

vDataFrame.func(...)

to_db(name[, usecols, relation_type, ...])

Saves the vDataFrame current relation to the Vertica database.

Binary Format#

vDataFrame.func(...)

to_pickle(name)

Saves the vDataFrame to a Python pickle file.

Utilities#

Information#

vDataFrame.func(...)

catcol([max_cardinality])

Returns the vDataFrame categorical vDataColumns.

current_relation([reindent, split])

Returns the current vDataFrame relation.

datecol()

Returns a list of the vDataColumns of type date in the vDataFrame.

dtypes()

Returns the different vDataColumns types.

empty()

Returns True if the vDataFrame is empty.

explain([digraph])

Provides information on how Vertica is computing the current vDataFrame relation.

get_columns([exclude_columns])

Returns the vDataFrame vDataColumns.

head([limit])

Returns the vDataFrame head.

idisplay()

Displays the vDataFrame as an interactive table.

iloc([limit, offset, columns])

Returns a part of the vDataFrame (delimited by an offset and a limit).

info()

Displays information about the different vDataFrame transformations.

memory_usage()

Returns the vDataFrame memory usage.

expected_store_usage([unit])

Returns the vDataFrame expected store usage.

numcol([exclude_columns])

Returns a list of names of the numerical vDataColumns in the vDataFrame.

shape()

Returns the number of rows and columns of the vDataFrame.

tail([limit])

Returns the tail of the vDataFrame.

vDataFrame[].func(...)

category()

Returns the category of the vDataColumn.

ctype()

Returns the vDataColumn DB type.

dtype()

Returns the vDataColumn DB type.

get_len()

Returns a new vDataColumn that represents the length of each element.

head([limit])

Returns the head of the vDataColumn.

iloc([limit, offset])

Returns a part of the vDataColumn (delimited by an offset and a limit).

isarray()

Returns True if the vDataColumn is an array, False otherwise.

isbool()

Returns True if the vDataColumn is boolean, False otherwise.

isdate()

Returns True if the vDataColumn category is date, False otherwise.

isnum()

Returns True if the vDataColumn is numerical, False otherwise.

isvmap()

Returns True if the vDataColumn category is VMap, False otherwise.

memory_usage()

Returns the vDataColumn memory usage.

store_usage()

Returns the vDataColumn expected store usage (unit: b).

tail([limit])

Returns the tail of the vDataColumn.

Management#

vDataFrame.func(...)

del_catalog()

Deletes the current vDataFrame catalog.

load([offset])

Loads a previous structure of the vDataFrame.

save()

Saves the current structure of the vDataFrame.