Example#

The vDataFrame a powerful Python object that lies at the heart of VerticaPy. vDataFrames consist of vColumn objects that represent columns in the dataset.

You can find all vDataFrame’s methods inside the folder verticapy/core/vdataframe. Note that similar methods have been clubbed together inside one module/file. For examples, all methods pertaining to aggregates are in the ‘_aggregate.py’ file.

You can define any new vDataFrame method inside these modules depending on the nature of the method. The same applies to vColumns. You can use any of the developed classes to inherit properties.

When defining a function, you should specify the ‘type’ hints for every variable:

For variables of multiple types, use the Union operator.
For variables that are optional, use the Optional operator.
For variables that require literal input, use the Literal operator.

There are examples of such hints throughout the code.

@save_verticapy_logs
def pie(
    self,
    columns: SQLColumns,
    max_cardinality: Union[None, int, tuple] = None,
    h: Union[None, int, tuple] = None,
    chart: Optional[PlottingObject] = None,
    **style_kwargs,
) -> PlottingObject:

Be sure to write a detailed description for each function that explains how it works.

"""
Draws the nested density pie chart of the input
vDataColumns.

Parameters
----------
columns: SQLColumns
    List of the vDataColumns names.
max_cardinality: int / tuple, optional
    Maximum number of distinct elements for
    vDataColumns 1  and  2  to be used as
    categorical. For these elements, no  h
    is picked or computed.
    If  of type tuple, represents the
    'max_cardinality' of each column.
h: int / tuple, optional
    Interval  width  of the bar. If empty,  an
    optimized h will be computed.
    If  of type tuple, it must represent  each
    column's 'h'.
chart: PlottingObject, optional
    The chart object to plot on.
**style_kwargs
    Any  optional  parameter  to  pass to  the
    plotting functions.
"""

Important

For a detailed explaination of how to write doc-strings, please refer to Automatic Documentation

Important: the vDataFrame.get_columns() and vDataFrame.format_colnames() functions are essential for correctly formatting input column names.

from verticapy.datasets import load_titanic

titanic = load_titanic()

titanic.get_columns()
Out[3]: 
['"pclass"',
 '"survived"',
 '"name"',
 '"sex"',
 '"age"',
 '"sibsp"',
 '"parch"',
 '"ticket"',
 '"fare"',
 '"cabin"',
 '"embarked"',
 '"boat"',
 '"body"',
 '"home.dest"']

Use the _genSQL method to get the current vDataFrame relation.

titanic._genSQL()
Out[4]: '"public"."titanic"'

And the _executeSQL_ function to execute a SQL query.

from verticapy._utils._sql._sys import _executeSQL

_executeSQL(f"SELECT * FROM {titanic._genSQL()} LIMIT 2")
Out[6]: <vertica_python.vertica.cursor.Cursor at 0x7f54a2b5a500>

The result of the query is accessible using one of the methods of the ‘executeSQL’ parameter.

_executeSQL(f"SELECT * FROM {titanic._genSQL()} LIMIT 2",method="fetchall")
Out[7]: 
[[1,
  0,
  'Allison, Miss. Helen Loraine',
  'female',
  Decimal('2.000'),
  1,
  2,
  '113781',
  Decimal('151.55000'),
  'C22 C26',
  'S',
  None,
  None,
  'Montreal, PQ / Chesterville, ON'],
 [1,
  0,
  'Allison, Mr. Hudson Joshua Creighton',
  'male',
  Decimal('30.000'),
  1,
  2,
  '113781',
  Decimal('151.55000'),
  'C22 C26',
  'S',
  None,
  135,
  'Montreal, PQ / Chesterville, ON']]

The @save_verticapy_logs decorator saves information about a specified VerticaPy method to the QUERY_PROFILES table in the Vertica database. You can use this to collect usage statistics on methods and their parameters.

For example, to create a method to compute the correlations between two vDataFrame columns:

# Example correlation method for a vDataFrame

# Add type hints + @save_verticapy_logs decorator
@save_verticapy_logs
def pearson(self, column1: str, column2: str):
    # Describe the function
    """
    ---------------------------------------------------------------------------
    Computes the Pearson Correlation Coefficient of the two input vColumns.

    Parameters
    ----------
    column1: str
        Input vColumn.
    column2: str
        Input vColumn.

    Returns
    -------
    Float
        Pearson Correlation Coefficient

    See Also
    --------
    vDataFrame.corr : Computes the Correlation Matrix of the vDataFrame.
        """
    # Check data types
    # Format the columns
    column1, column2 = self.format_colnames([column1, column2])
    # Get the current vDataFrame relation
    table = self._genSQL()
    # Create the SQL statement - Label the query when possible
    query = f"SELECT /*+LABEL(vDataFrame.pearson)*/ CORR({column1}, {column2}) FROM {table};"
    # Execute the SQL query and get the result
    result = _executeSQL(query,
                        title = "Computing Pearson coefficient",
                        method="fetchfirstelem")
    # Return the result
    return result

Same can be done with vColumn methods.

# Example Method for a vColumn

# Add types hints + @save_verticapy_logs decorator
@save_verticapy_logs
def pearson(self, column: str,):
    # Describe the function
    """
    ---------------------------------------------------------------------------
    Computes the Pearson Correlation Coefficient of the vColumn and the input
    vColumn.

    Parameters
    ----------
    column: str
        Input vColumn.

    Returns
    -------
    Float
        Pearson Correlation Coefficient

    See Also
    --------
    vDataFrame.corr : Computes the Correlation Matrix of the vDataFrame.
        """
    # Format the column
    column1 = self.parent.format_colnames([column])[0]
    # Get the current vColumn name
    column2 = self.alias
    # Get the current vDataFrame relation
    table = self.parent._genSQL()
    # Create the SQL statement - Label the query when possible
    query = f"SELECT /*+LABEL(vColumn.pearson)*/ CORR({column1}, {column2}) FROM {table};"
    # Execute the SQL query and get the result
    result = executeSQL(query,
                        title = "Computing Pearson coefficient",
                        method="fetchfirstelem")
    # Return the result
    return result

Functions will work exactly the same.