VerticaPy

Python API for Vertica Data Science at Scale

Data Ingestion

The first step of the data science process (excluding business understanding) is data ingestion. Vertica allows the ingestion of data in many formats and types thanks to several built-in parsers.

Most of the time, we know in-advance the file types and we need to write the entire SQL query to ingest it, but sometimes we don't have this luxury. To solve this problem, Vertica allows users to create flex tables, which act as efficient ways to ingest any data file without knowing its column types or even structure!

VerticaPy uses flex tables for auto-ingestion of JSON and CSV files. To injest other formats, you'll still want to use direct SQL queries. Be careful when using the following functions as the detected data types may not be optimal, and it's always best to write SQL queries using optimized types and segmentations.

Remember: VerticaPy uses Vertica SQL as its backend so optimizing the structure of your projections in your database will pass any performance gains to VerticaPy.

Ingesting CSV

CSV is a common format and has an internal structure that makes it easy to ingest. To ingest a CSV file, use the 'read_csv' function.

In [1]:
import verticapy as vp
help(vp.read_csv)
Help on function read_csv in module verticapy.utilities:

read_csv(path:str, schema:str='', table_name:str='', sep:str=',', header:bool=True, header_names:list=[], dtype:dict={}, na_rep:str='', quotechar:str='"', escape:str='\x17', genSQL:bool=False, parse_nrows:int=-1, insert:bool=False, temporary_table:bool=False, temporary_local_table:bool=True, gen_tmp_table_name:bool=True, ingest_local:bool=True)
    ---------------------------------------------------------------------------
    Ingests a CSV file using flex tables.
    
    Parameters
    ----------
    path: str
            Absolute path where the CSV file is located.
    schema: str, optional
            Schema where the CSV file will be ingested.
    table_name: str, optional
            The final relation/table name. If unspecified, the the name is set to the 
        name of the file or parent directory.
    sep: str, optional
            Column separator.
    header: bool, optional
            If set to False, the parameter 'header_names' will be to use to name the 
            different columns.
    header_names: list, optional
            List of the columns names.
    dtype: dict, optional
        Dictionary of the user types. Providing a dictionary can increase 
        ingestion speed and precision; instead of parsing the file to guess 
        the different types, VerticaPy will use the input types.
    na_rep: str, optional
            Missing values representation.
    quotechar: str, optional
            Char which is enclosing the str values.
    escape: str, optional
            Separator between each record.
    genSQL: bool, optional
            If set to True, the SQL code for creating the final table will be 
            generated but not executed. It is a good way to change the final
            relation types or to customize the data ingestion.
    parse_nrows: int, optional
            If this parameter is greater than 0. A new file of 'parse_nrows' lines
            will be created and ingested first to identify the data types. It will be
            then dropped and the entire file will be ingested. The data types identification
            will be less precise but this parameter can make the process faster if the
            file is heavy.
    insert: bool, optional
            If set to True, the data will be ingested to the input relation. Be sure
            that your file has a header corresponding to the name of the relation
            columns, otherwise ingestion will fail.
    temporary_table: bool, optional
        If set to True, a temporary table will be created.
    temporary_local_table: bool, optional
        If set to True, a temporary local table will be created. The parameter 'schema'
        must be empty, otherwise this parameter is ignored.
    gen_tmp_table_name: bool, optional
        Sets the name of the temporary table. This parameter is only used when the 
        parameter 'temporary_local_table' is set to True and if the parameters 
        "table_name" and "schema" are unspecified.
    ingest_local: bool, optional
        If set to True, the file will be ingested from the local machine.
    
    Returns
    -------
    vDataFrame
            The vDataFrame of the relation.
    
    See Also
    --------
    read_json : Ingests a JSON file into the Vertica database.

In this example, we use 'read_csv' and specify the CSV file (we will use the iris dataset), the table's schema and name, and the delimiter character.

In [7]:
vp.drop("public.iris", relation_type = "table")
vp.read_csv("data/iris.csv",
            schema = "public",
            table_name = "iris",
            sep = ",")
The table "public"."iris" has been successfully created.
Out[7]:
123
Id
Int
123
SepalLengthCm
Numeric(5,2)
123
SepalWidthCm
Numeric(5,2)
123
PetalLengthCm
Numeric(5,2)
123
PetalWidthCm
Numeric(5,2)
Abc
Species
Varchar(30)
115.13.51.40.2Iris-setosa
224.93.01.40.2Iris-setosa
334.73.21.30.2Iris-setosa
444.63.11.50.2Iris-setosa
555.03.61.40.2Iris-setosa
665.43.91.70.4Iris-setosa
774.63.41.40.3Iris-setosa
885.03.41.50.2Iris-setosa
994.42.91.40.2Iris-setosa
10104.93.11.50.1Iris-setosa
11115.43.71.50.2Iris-setosa
12124.83.41.60.2Iris-setosa
13134.83.01.40.1Iris-setosa
14144.33.01.10.1Iris-setosa
15155.84.01.20.2Iris-setosa
16165.74.41.50.4Iris-setosa
17175.43.91.30.4Iris-setosa
18185.13.51.40.3Iris-setosa
19195.73.81.70.3Iris-setosa
20205.13.81.50.3Iris-setosa
21215.43.41.70.2Iris-setosa
22225.13.71.50.4Iris-setosa
23234.63.61.00.2Iris-setosa
24245.13.31.70.5Iris-setosa
25254.83.41.90.2Iris-setosa
26265.03.01.60.2Iris-setosa
27275.03.41.60.4Iris-setosa
28285.23.51.50.2Iris-setosa
29295.23.41.40.2Iris-setosa
30304.73.21.60.2Iris-setosa
31314.83.11.60.2Iris-setosa
32325.43.41.50.4Iris-setosa
33335.24.11.50.1Iris-setosa
34345.54.21.40.2Iris-setosa
35354.93.11.50.1Iris-setosa
36365.03.21.20.2Iris-setosa
37375.53.51.30.2Iris-setosa
38384.93.11.50.1Iris-setosa
39394.43.01.30.2Iris-setosa
40405.13.41.50.2Iris-setosa
41415.03.51.30.3Iris-setosa
42424.52.31.30.3Iris-setosa
43434.43.21.30.2Iris-setosa
44445.03.51.60.6Iris-setosa
45455.13.81.90.4Iris-setosa
46464.83.01.40.3Iris-setosa
47475.13.81.60.2Iris-setosa
48484.63.21.40.2Iris-setosa
49495.33.71.50.2Iris-setosa
50505.03.31.40.2Iris-setosa
51517.03.24.71.4Iris-versicolor
52526.43.24.51.5Iris-versicolor
53536.93.14.91.5Iris-versicolor
54545.52.34.01.3Iris-versicolor
55556.52.84.61.5Iris-versicolor
56565.72.84.51.3Iris-versicolor
57576.33.34.71.6Iris-versicolor
58584.92.43.31.0Iris-versicolor
59596.62.94.61.3Iris-versicolor
60605.22.73.91.4Iris-versicolor
61615.02.03.51.0Iris-versicolor
62625.93.04.21.5Iris-versicolor
63636.02.24.01.0Iris-versicolor
64646.12.94.71.4Iris-versicolor
65655.62.93.61.3Iris-versicolor
66666.73.14.41.4Iris-versicolor
67675.63.04.51.5Iris-versicolor
68685.82.74.11.0Iris-versicolor
69696.22.24.51.5Iris-versicolor
70705.62.53.91.1Iris-versicolor
71715.93.24.81.8Iris-versicolor
72726.12.84.01.3Iris-versicolor
73736.32.54.91.5Iris-versicolor
74746.12.84.71.2Iris-versicolor
75756.42.94.31.3Iris-versicolor
76766.63.04.41.4Iris-versicolor
77776.82.84.81.4Iris-versicolor
78786.73.05.01.7Iris-versicolor
79796.02.94.51.5Iris-versicolor
80805.72.63.51.0Iris-versicolor
81815.52.43.81.1Iris-versicolor
82825.52.43.71.0Iris-versicolor
83835.82.73.91.2Iris-versicolor
84846.02.75.11.6Iris-versicolor
85855.43.04.51.5Iris-versicolor
86866.03.44.51.6Iris-versicolor
87876.73.14.71.5Iris-versicolor
88886.32.34.41.3Iris-versicolor
89895.63.04.11.3Iris-versicolor
90905.52.54.01.3Iris-versicolor
91915.52.64.41.2Iris-versicolor
92926.13.04.61.4Iris-versicolor
93935.82.64.01.2Iris-versicolor
94945.02.33.31.0Iris-versicolor
95955.62.74.21.3Iris-versicolor
96965.73.04.21.2Iris-versicolor
97975.72.94.21.3Iris-versicolor
98986.22.94.31.3Iris-versicolor
99995.12.53.01.1Iris-versicolor
1001005.72.84.11.3Iris-versicolor
Rows: 1-100 | Columns: 6

If you don't specify a schema, it will use the 'public' schema. If you don't specify a 'table_name' it will use the name of the CSV file.

You can also forgo ingesting the file and instead simply generate the SQL query used to create the table.

In [8]:
result = vp.read_csv("data/iris.csv",
                     schema = "public",
                     table_name = "iris",
                     sep = ",",
                     genSQL = True)
print(result[0])
print(result[1])
CREATE TABLE "public"."iris"("Id" Integer, "SepalLengthCm" Numeric(5,2), "SepalWidthCm" Numeric(5,2), "PetalLengthCm" Numeric(5,2), "PetalWidthCm" Numeric(5,2), "Species" Varchar(30))
COPY "public"."iris"("Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species") FROM {} DELIMITER ',' NULL '' ENCLOSED BY '"' ESCAPE AS '' SKIP 1

You can also use the 'insert' parameter to insert new data into an existing relation.

In [9]:
vp.read_csv("data/iris.csv",
            schema = "public",
            table_name = "iris",
            sep = ",",
            insert = True).shape()
Out[9]:
(300, 6)

Ingesting JSON

JSON is also another popular format. You can ingest JSON files using the 'read_json' function.

In [10]:
help(vp.read_json)
Help on function read_json in module verticapy.utilities:

read_json(path:str, schema:str='', table_name:str='', usecols:list=[], new_name:dict={}, insert:bool=False, temporary_table:bool=False, temporary_local_table:bool=True, gen_tmp_table_name:bool=True, ingest_local:bool=True)
    ---------------------------------------------------------------------------
    Ingests a JSON file using flex tables.
    
    Parameters
    ----------
    path: str
            Absolute path where the JSON file is located.
    schema: str, optional
            Schema where the JSON file will be ingested.
    table_name: str, optional
            Final relation name.
    usecols: list, optional
            List of the JSON parameters to ingest. The other ones will be ignored. If
            empty all the JSON parameters will be ingested.
    new_name: dict, optional
            Dictionary of the new columns name. If the JSON file is nested, it is advised
            to change the final names as special characters will be included.
            For example, {"param": {"age": 3, "name": Badr}, "date": 1993-03-11} will 
            create 3 columns: "param.age", "param.name" and "date". You can rename these 
            columns using the 'new_name' parameter with the following dictionary:
            {"param.age": "age", "param.name": "name"}
    insert: bool, optional
            If set to True, the data will be ingested to the input relation. The JSON
            parameters must be the same than the input relation otherwise they will
            not be ingested.
    temporary_table: bool, optional
        If set to True, a temporary table will be created.
    temporary_local_table: bool, optional
        If set to True, a temporary local table will be created. The parameter 'schema'
        must be empty, otherwise this parameter is ignored.
    gen_tmp_table_name: bool, optional
        Sets the name of the temporary table. This parameter is only used when the 
        parameter 'temporary_local_table' is set to True and if the parameters 
        "table_name" and "schema" are unspecified.
    ingest_local: bool, optional
        If set to True, the file will be ingested from the local machine.
    
    Returns
    -------
    vDataFrame
            The vDataFrame of the relation.
    
    See Also
    --------
    read_csv : Ingests a CSV file into the Vertica database.

This function works the same way as 'read_csv', but doesn't include the delimiter parameter since it's a standardized format. We will use the JSON file of the iris dataset.

In [16]:
vp.drop("public.iris", relation_type = "table")
vp.read_json("data/iris.json",
             schema = "public",
             table_name = "iris")
The table "public"."iris" has been successfully created.
Out[16]:
123
Id
Int
123
PetalLengthCm
Numeric(6,3)
123
PetalWidthCm
Numeric(6,3)
123
SepalLengthCm
Numeric(6,3)
123
SepalWidthCm
Numeric(6,3)
Abc
Species
Varchar(30)
111.40.25.13.5Iris-setosa
221.40.24.93.0Iris-setosa
331.30.24.73.2Iris-setosa
441.50.24.63.1Iris-setosa
551.40.25.03.6Iris-setosa
661.70.45.43.9Iris-setosa
771.40.34.63.4Iris-setosa
881.50.25.03.4Iris-setosa
991.40.24.42.9Iris-setosa
10101.50.14.93.1Iris-setosa
11111.50.25.43.7Iris-setosa
12121.60.24.83.4Iris-setosa
13131.40.14.83.0Iris-setosa
14141.10.14.33.0Iris-setosa
15151.20.25.84.0Iris-setosa
16161.50.45.74.4Iris-setosa
17171.30.45.43.9Iris-setosa
18181.40.35.13.5Iris-setosa
19191.70.35.73.8Iris-setosa
20201.50.35.13.8Iris-setosa
21211.70.25.43.4Iris-setosa
22221.50.45.13.7Iris-setosa
23231.00.24.63.6Iris-setosa
24241.70.55.13.3Iris-setosa
25251.90.24.83.4Iris-setosa
26261.60.25.03.0Iris-setosa
27271.60.45.03.4Iris-setosa
28281.50.25.23.5Iris-setosa
29291.40.25.23.4Iris-setosa
30301.60.24.73.2Iris-setosa
31311.60.24.83.1Iris-setosa
32321.50.45.43.4Iris-setosa
33331.50.15.24.1Iris-setosa
34341.40.25.54.2Iris-setosa
35351.50.14.93.1Iris-setosa
36361.20.25.03.2Iris-setosa
37371.30.25.53.5Iris-setosa
38381.50.14.93.1Iris-setosa
39391.30.24.43.0Iris-setosa
40401.50.25.13.4Iris-setosa
41411.30.35.03.5Iris-setosa
42421.30.34.52.3Iris-setosa
43431.30.24.43.2Iris-setosa
44441.60.65.03.5Iris-setosa
45451.90.45.13.8Iris-setosa
46461.40.34.83.0Iris-setosa
47471.60.25.13.8Iris-setosa
48481.40.24.63.2Iris-setosa
49491.50.25.33.7Iris-setosa
50501.40.25.03.3Iris-setosa
51514.71.47.03.2Iris-versicolor
52524.51.56.43.2Iris-versicolor
53534.91.56.93.1Iris-versicolor
54544.01.35.52.3Iris-versicolor
55554.61.56.52.8Iris-versicolor
56564.51.35.72.8Iris-versicolor
57574.71.66.33.3Iris-versicolor
58583.31.04.92.4Iris-versicolor
59594.61.36.62.9Iris-versicolor
60603.91.45.22.7Iris-versicolor
61613.51.05.02.0Iris-versicolor
62624.21.55.93.0Iris-versicolor
63634.01.06.02.2Iris-versicolor
64644.71.46.12.9Iris-versicolor
65653.61.35.62.9Iris-versicolor
66664.41.46.73.1Iris-versicolor
67674.51.55.63.0Iris-versicolor
68684.11.05.82.7Iris-versicolor
69694.51.56.22.2Iris-versicolor
70703.91.15.62.5Iris-versicolor
71714.81.85.93.2Iris-versicolor
72724.01.36.12.8Iris-versicolor
73734.91.56.32.5Iris-versicolor
74744.71.26.12.8Iris-versicolor
75754.31.36.42.9Iris-versicolor
76764.41.46.63.0Iris-versicolor
77774.81.46.82.8Iris-versicolor
78785.01.76.73.0Iris-versicolor
79794.51.56.02.9Iris-versicolor
80803.51.05.72.6Iris-versicolor
81813.81.15.52.4Iris-versicolor
82823.71.05.52.4Iris-versicolor
83833.91.25.82.7Iris-versicolor
84845.11.66.02.7Iris-versicolor
85854.51.55.43.0Iris-versicolor
86864.51.66.03.4Iris-versicolor
87874.71.56.73.1Iris-versicolor
88884.41.36.32.3Iris-versicolor
89894.11.35.63.0Iris-versicolor
90904.01.35.52.5Iris-versicolor
91914.41.25.52.6Iris-versicolor
92924.61.46.13.0Iris-versicolor
93934.01.25.82.6Iris-versicolor
94943.31.05.02.3Iris-versicolor
95954.21.35.62.7Iris-versicolor
96964.21.25.73.0Iris-versicolor
97974.21.35.72.9Iris-versicolor
98984.31.36.22.9Iris-versicolor
99993.01.15.12.5Iris-versicolor
1001004.11.35.72.8Iris-versicolor
Rows: 1-100 | Columns: 6

More examples of the JSON function are available in the read_json documentation.