
VerticaPy
In order to work with complex data types in VerticaPy, you'll need to complete the following three setup tasks:
- Import relevant libraries:
import verticapy as vp
- Connect to Vertica:
vp.new_connection({"host": "10.211.55.14",
"port": "5433",
"database": "testdb",
"password": "XxX",
"user": "dbadmin"},
name = "Vertica_New_Connection")
- Check your VerticaPy version to make sure you have access to the right functions:
vp.__version__
You can make it easier to keep track of your work by creating a custom schema:
Note: Because some tables are repeated in this demonstration, tables with the same names are dropped.
vp.drop("complex_vmap_test", method = "schema")
vp.create_schema("complex_vmap_test")
We also set the path to our data:
path= "/home/dbadmin/"
You can download the demo datasets by clicking here.
Loading Complex Data¶
There are two ways to load a nested data file:
- Load directly using read_json. In this case, you will need to use an additional parameter to identify all the data types. The function loads the data using flex tables and VMaps (Native Vertica MAPS, which are flexible but not optimally performant).
- Load using read_file. The function preidcts the complex data structure.
Let's try both:
import verticapy as vp
data = vp.read_json(path + "laliga/2008.json",
schema = "public",
use_complex_dt = True,)
data = vp.read_file(path = path + "laliga/2005.json",
schema = "complex_vmap_test",)
data
We can also use the handy 'genSQL' parameter to generate (but not execute) the SQL needed to create the final relation:
Note: This is a great way to customize the data ingestion or alter the final relation types.
sql = vp.read_file(path = path + "laliga/2005.json",
schema = "complex_vmap_test",
table_name = "laliga_2005",
genSQL = True,
varchar_varbinary_length = 60,
dtype = {"away_score": "float"})
for query in sql:
print(query)
Feature Exploration¶
data["competition_stage"].astype("json")
As with a normal vDataFrame, we can easily extract the values from the sub-columns:
data["away_team"]["away_team_gender"]
We can view any nested data structure by index:
data["competition"]["competition_id"]
These nested structures can be used to create features:
data["name_home"] = data["home_team"]["home_team_name"]
We can even flatten the nested structure inside a json file, either flattening the entire file or just particular columns:
data = vp.read_json(path = path + "laliga/2008.json",
table_name = "laliga_flat",
schema = "complex_vmap_test",
flatten_maps = True,)
data