
VerticaPy
Data Ingestion¶
The first step of the data science process (excluding business understanding) is data ingestion. Vertica allows the ingestion of data in many formats and types thanks to several built-in parsers.
Most of the time, we know in-advance the file types and we need to write the entire SQL query to ingest it, but sometimes we don't have this luxury. To solve this problem, Vertica allows users to create flex tables, which act as efficient ways to ingest any data file without knowing its column types or even structure!
VerticaPy uses flex tables for auto-ingestion of JSON and CSV files. To injest other formats, you'll still want to use direct SQL queries. Be careful when using the following functions as the detected data types may not be optimal, and it's always best to write SQL queries using optimized types and segmentations.
Remember: VerticaPy uses Vertica SQL as its backend so optimizing the structure of your projections in your database will pass any performance gains to VerticaPy.
Ingesting CSV¶
CSV is a common format and has an internal structure that makes it easy to ingest. To ingest a CSV file, use the 'read_csv' function.
import verticapy as vp
help(vp.read_csv)
In this example, we use 'read_csv' and specify the CSV file (we will use the iris dataset), the table's schema and name, and the delimiter character.
vp.drop("public.iris", relation_type = "table")
vp.read_csv("data/iris.csv",
schema = "public",
table_name = "iris",
sep = ",")
If you don't specify a schema, it will use the 'public' schema. If you don't specify a 'table_name' it will use the name of the CSV file.
You can also forgo ingesting the file and instead simply generate the SQL query used to create the table.
result = vp.read_csv("data/iris.csv",
schema = "public",
table_name = "iris",
sep = ",",
genSQL = True)
print(result[0])
print(result[1])
You can also use the 'insert' parameter to insert new data into an existing relation.
vp.read_csv("data/iris.csv",
schema = "public",
table_name = "iris",
sep = ",",
insert = True).shape()
Ingesting JSON¶
JSON is also another popular format. You can ingest JSON files using the 'read_json' function.
help(vp.read_json)
This function works the same way as 'read_csv', but doesn't include the delimiter parameter since it's a standardized format. We will use the JSON file of the iris dataset.
vp.drop("public.iris", relation_type = "table")
vp.read_json("data/iris.json",
schema = "public",
table_name = "iris")
More examples of the JSON function are available in the read_json documentation.