VerticaPy

Python API for Vertica Data Science at Scale

Working with Complex Data Types & VMaps

Setup

In order to work with complex data types in VerticaPy, you'll need to complete the following three setup tasks:

  • Import relevant libraries:
In [1]:
import verticapy as vp
  • Connect to Vertica:
In [ ]:
vp.new_connection({"host": "10.211.55.14", 
                   "port": "5433", 
                   "database": "testdb", 
                   "password": "XxX", 
                   "user": "dbadmin"},
                   name = "Vertica_New_Connection")
  • Check your VerticaPy version to make sure you have access to the right functions:
In [2]:
vp.__version__
Out[2]:
'0.11.0'

You can make it easier to keep track of your work by creating a custom schema:
Note: Because some tables are repeated in this demonstration, tables with the same names are dropped.

In [3]:
vp.drop("complex_vmap_test", method = "schema")
vp.create_schema("complex_vmap_test")
Out[3]:
True

We also set the path to our data:

In [ ]:
path= "/home/dbadmin/"

You can download the demo datasets by clicking here.

Loading Complex Data

There are two ways to load a nested data file:

  • Load directly using read_json. In this case, you will need to use an additional parameter to identify all the data types. The function loads the data using flex tables and VMaps (Native Vertica MAPS, which are flexible but not optimally performant).
  • Load using read_file. The function preidcts the complex data structure.

Let's try both:

In [4]:
import verticapy as vp
data = vp.read_json(path + "laliga/2008.json",
                    schema = "public",
                    use_complex_dt = True,)
Out[4]:
123
away_score
Int
🛠
Row(away_team_gender varchar(80),away_team_group varchar(80),away_team_id int,away_team_name varchar(80),country row(id int,name
🛠
Row(competition_id int,competition_name varchar(80),country_name varchar(80))
🛠
Row(id int,name varchar(80))
123
home_score
Int
🛠
Row(country row(id int,name varchar(80)),home_team_gender varchar(80),home_team_group varchar(80),home_team_id int,home_team_nam
📅
kick_off
Time
📅
last_updated
Date
📅
match_date
Date
123
match_id
Int
Abc
match_status
Varchar(80)
123
match_week
Int
🛠
Row(data_version date,shot_fidelity_version int,xy_fidelity_version int)
🛠
Row(season_id int,season_name varchar(80))
10119:00:002019-12-162008-08-3169147available1
20220:00:002019-12-162009-03-0769216available26
30220:00:002019-12-162009-04-1169222available30
40520:00:002019-12-162009-01-1769183available19
50522:00:002019-12-162008-10-2569232available8
60619:00:002019-12-162009-03-2269142available28
71020:00:002019-12-162009-04-0469273available29
82021:00:002019-12-162008-11-1669285available11
92117:00:002019-12-162009-02-0169209available21
104120:00:002019-12-162008-11-0169235available9
116121:00:002019-12-162008-09-2169138available3
121422:00:002019-12-162009-01-2469189available20
131622:00:002019-12-162008-10-0469212available6
142021:03:002019-12-162009-03-1569177available27
152320:00:002019-12-162008-09-2469207available4
163221:00:002019-12-162009-01-1169230available18
170222:00:002019-12-162008-12-1369225available15
180622:00:002019-12-162008-11-0869279available10
191120:00:002019-12-162008-09-1369215available2
201319:00:002019-12-162009-02-0869231available22
212120:00:002019-12-162009-02-2169139available24
222220:00:002019-12-162009-02-1469195available23
232222:00:002019-12-162009-04-2569251available33
243022:00:002019-12-162008-11-2969171available13
253319:03:002019-12-162009-05-1069217available35
263419:00:002019-12-162009-03-0169185available25
270422:00:002019-12-162008-12-0669149available14
281020:00:002019-12-162009-04-1869223available31
292119:00:002019-12-162008-12-2169228available16
302122:00:002019-12-162008-09-2769211available5
316220:00:002019-12-162009-05-0269249available34
Rows: 1-31 | Columns: 14

Similar to the use of read_json above, we can use read_file to ingest the complex data directly:

In [6]:
data = vp.read_file(path = path + "laliga/2005.json",
                    schema = "complex_vmap_test",)
data
Out[6]:
123
away_score
Int
🛠
Row(away_team_gender varchar(80),away_team_group varchar(80),away_team_id int,away_team_name varchar(80),country row(id int,name
🛠
Row(competition_id int,competition_name varchar(80),country_name varchar(80))
🛠
Row(id int,name varchar(80))
123
home_score
Int
🛠
Row(country row(id int,name varchar(80)),home_team_gender varchar(80),home_team_group varchar(80),home_team_id int,home_team_nam
📅
kick_off
Time
📅
last_updated
Date
📅
match_date
Date
123
match_id
Int
Abc
match_status
Varchar(80)
123
match_week
Int
🛠
Row(data_version date,shot_fidelity_version int,xy_fidelity_version int)
🛠
Row(season_id int,season_name varchar(80))
10322:00:002019-12-162005-10-2268318available8
21219:00:002019-12-162006-01-1568342available19
31522:00:002019-12-162006-02-1869172available24
43121:00:002019-12-162005-11-0668339available10
50219:00:002019-12-162006-01-2268324available20
60220:30:002019-12-162005-10-2669164available11
71221:00:002020-02-272005-12-1168321available15
82122:00:002019-12-162006-01-0768348available18
92220:00:002019-12-162005-10-0168317available6
103122:00:002019-12-162005-12-1768322available16
110221:00:002019-12-162005-12-2068323available17
122022:00:002019-12-162006-02-2568351available25
133020:00:002019-12-162005-11-1968319available12
143119:00:002019-12-162006-02-0568350available22
151421:00:002019-12-162005-11-2768354available13
162019:00:002019-12-162005-12-0468320available14
173019:00:002019-12-162006-01-2968325available21
Rows: 1-17 | Columns: 14

We can also use the handy 'genSQL' parameter to generate (but not execute) the SQL needed to create the final relation:
Note: This is a great way to customize the data ingestion or alter the final relation types.

In [9]:
sql = vp.read_file(path = path + "laliga/2005.json",
                   schema = "complex_vmap_test",
                   table_name = "laliga_2005",
                   genSQL = True,
                   varchar_varbinary_length = 60,
                   dtype = {"away_score": "float"})
for query in sql:
    print(query)
CREATE TABLE "complex_vmap_test"."laliga_2005"("away_score" float, "away_team" Row("away_team_gender" varchar(60), "away_team_group" varchar(60), "away_team_id" int, "away_team_name" varchar(60), "country" Row("id" int, "name" varchar(60))), "competition" Row("competition_id" int, "competition_name" varchar(60), "country_name" varchar(60)), "competition_stage" Row("id" int, "name" varchar(60)), "home_score" int, "home_team" Row("country" Row("id" int, "name" varchar(60)), "home_team_gender" varchar(60), "home_team_group" varchar(60), "home_team_id" int, "home_team_name" varchar(60)), "kick_off" time, "last_updated" date, "match_date" date, "match_id" int, "match_status" varchar(60), "match_week" int, "metadata" Row("data_version" date, "shot_fidelity_version" int, "xy_fidelity_version" int), "season" Row("season_id" int, "season_name" varchar(60)))
copy "complex_vmap_test"."laliga_2005" from '/scratch_b/qa/ericsson/laliga/2005.json' PARSER FJsonParser()

Feature Exploration

In the generated SQL from the above example, we can see that the away_team column is a ROW type with a complex structure consisting of many sub-columns. We can convert this column into a JSON and view its contents:

In [7]:
data["competition_stage"].astype("json")
Out[7]:
123
away_score
Int
🛠
Row(away_team_gender varchar(80),away_team_group varchar(80),away_team_id int,away_team_name varchar(80),country row(id int,name
🛠
Row(competition_id int,competition_name varchar(80),country_name varchar(80))
Abc
competition_stage
Varchar
123
home_score
Int
🛠
Row(country row(id int,name varchar(80)),home_team_gender varchar(80),home_team_group varchar(80),home_team_id int,home_team_nam
📅
kick_off
Time
📅
last_updated
Date
📅
match_date
Date
123
match_id
Int
Abc
match_status
Varchar(80)
123
match_week
Int
🛠
Row(data_version date,shot_fidelity_version int,xy_fidelity_version int)
🛠
Row(season_id int,season_name varchar(80))
10{"id":1,"name":"Regular Season"}219:00:002019-12-162006-01-2268324available20
20{"id":1,"name":"Regular Season"}220:30:002019-12-162005-10-2669164available11
30{"id":1,"name":"Regular Season"}221:00:002019-12-162005-12-2068323available17
40{"id":1,"name":"Regular Season"}322:00:002019-12-162005-10-2268318available8
51{"id":1,"name":"Regular Season"}219:00:002019-12-162006-01-1568342available19
61{"id":1,"name":"Regular Season"}221:00:002020-02-272005-12-1168321available15
71{"id":1,"name":"Regular Season"}421:00:002019-12-162005-11-2768354available13
81{"id":1,"name":"Regular Season"}522:00:002019-12-162006-02-1869172available24
92{"id":1,"name":"Regular Season"}019:00:002019-12-162005-12-0468320available14
102{"id":1,"name":"Regular Season"}022:00:002019-12-162006-02-2568351available25
112{"id":1,"name":"Regular Season"}122:00:002019-12-162006-01-0768348available18
122{"id":1,"name":"Regular Season"}220:00:002019-12-162005-10-0168317available6
133{"id":1,"name":"Regular Season"}019:00:002019-12-162006-01-2968325available21
143{"id":1,"name":"Regular Season"}020:00:002019-12-162005-11-1968319available12
153{"id":1,"name":"Regular Season"}119:00:002019-12-162006-02-0568350available22
163{"id":1,"name":"Regular Season"}121:00:002019-12-162005-11-0668339available10
173{"id":1,"name":"Regular Season"}122:00:002019-12-162005-12-1768322available16
Rows: 1-17 | Columns: 14

As with a normal vDataFrame, we can easily extract the values from the sub-columns:

In [8]:
data["away_team"]["away_team_gender"]
Out[8]:
Abc
away_team_gender
Varchar(80)
1male
2male
3male
4male
5male
6male
7male
8male
9male
10male
11male
12male
13male
14male
15male
16male
17male
Rows: 17 | Column: away_team_gender | Type: Varchar(80)

We can view any nested data structure by index:

In [9]:
data["competition"]["competition_id"]
Out[9]:
123
competition_id
Integer
111
211
311
411
511
611
711
811
911
1011
1111
1211
1311
1411
1511
1611
1711
Rows: 17 | Column: competition_id | Type: Integer

These nested structures can be used to create features:

In [10]:
data["name_home"] = data["home_team"]["home_team_name"]

We can even flatten the nested structure inside a json file, either flattening the entire file or just particular columns:

In [17]:
data = vp.read_json(path = path + "laliga/2008.json",
                    table_name = "laliga_flat",
                    schema = "complex_vmap_test",
                    flatten_maps = True,)
data
The table "complex_vmap_test"."laliga_flat" has been successfully created.
Out[17]:
Abc
Vmap(368)
Abc
Vmap(378)
Abc
season.season_name
Varchar(20)
123
season.season_id
Int
123
metadata.xy_fidelity_version
Int
123
metadata.shot_fidelity_version
Int
📅
metadata.data_version
Date
123
match_week
Int
Abc
match_status
Varchar(20)
123
match_id
Int
📅
match_date
Date
📅
last_updated
Timestamp
📅
kick_off
Time
Abc
home_team.home_team_name
Varchar(34)
123
home_team.home_team_id
Int
Abc
home_team.home_team_group
Varchar(20)
Abc
home_team.home_team_gender
Varchar(20)
Abc
home_team.country.name
Varchar(20)
123
home_team.country.id
Int
123
home_score
Int
Abc
competition_stage.name
Varchar(28)
123
competition_stage.id
Int
Abc
competition.country_name
Varchar(20)
Abc
competition.competition_name
Varchar(20)
123
competition.competition_id
Int
Abc
away_team.country.name
Varchar(20)
123
away_team.country.id
Int
Abc
away_team.away_team_name
Varchar(38)
123
away_team.away_team_id
Int
Abc
away_team.away_team_group
Varchar(20)
Abc
away_team.away_team_gender
Varchar(20)
123
away_score
Int
12008/200941222000-01-011available691472008-08-312019-12-16 23:09:16.16875619:00:00Numancia444[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Barcelona217[null]male0
22008/200941222000-01-012available692152008-09-132019-12-16 23:09:16.16875620:00:00Barcelona217[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Racing Santander1217[null]male1
32008/200941222000-01-013available691382008-09-212019-12-16 23:09:16.16875621:00:00Sporting Gijón1041[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Barcelona217[null]male6
42008/200941222000-01-014available692072008-09-242019-12-16 23:09:16.16875620:00:00Barcelona217[null]maleSpain2143Regular Season1SpainLa Liga11Spain214Real Betis218[null]male2
52008/200941222000-01-015available692112008-09-272019-12-16 23:09:16.16875622:00:00Espanyol214[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Barcelona217[null]male2
62008/200941222000-01-016available692122008-10-042019-12-16 23:09:16.16875622:00:00Barcelona217[null]maleSpain2146Regular Season1SpainLa Liga11Spain214Atlético Madrid212[null]male1
72008/200941222000-01-018available692322008-10-252019-12-16 23:09:16.16875622:00:00Barcelona217[null]maleSpain2145Regular Season1SpainLa Liga11Spain214Almería403[null]male0
82008/200941222000-01-019available692352008-11-012019-12-16 23:09:16.16875620:00:00Málaga223[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Barcelona217[null]male4
92008/200941222000-01-0110available692792008-11-082019-12-16 23:09:16.16875622:00:00Barcelona217[null]maleSpain2146Regular Season1SpainLa Liga11Spain214Real Valladolid901[null]male0
102008/200941222000-01-0111available692852008-11-162019-12-16 23:09:16.16875621:00:00Recreativo Huelva1220[null]maleSpain2140Regular Season1SpainLa Liga11Spain214Barcelona217[null]male2
112008/200941222000-01-0113available691712008-11-292019-12-16 23:09:16.16875622:00:00Sevilla213[null]maleSpain2140Regular Season1SpainLa Liga11Spain214Barcelona217[null]male3
122008/200941222000-01-0114available691492008-12-062019-12-16 23:09:16.16875622:00:00Barcelona217[null]maleSpain2144Regular Season1SpainLa Liga11Spain214Valencia207[null]male0
132008/200941222000-01-0115available692252008-12-132019-12-16 23:09:16.16875622:00:00Barcelona217[null]maleSpain2142Regular Season1SpainLa Liga11Spain214Real Madrid220[null]male0
142008/200941222000-01-0116available692282008-12-212019-12-16 23:09:16.16875619:00:00Villarreal222[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Barcelona217[null]male2
152008/200941222000-01-0118available692302009-01-112019-12-16 23:09:16.16875621:00:00Osasuna422[null]maleSpain2142Regular Season1SpainLa Liga11Spain214Barcelona217[null]male3
162008/200941222000-01-0119available691832009-01-172019-12-16 23:09:16.16875620:00:00Barcelona217[null]maleSpain2145Regular Season1SpainLa Liga11Spain214Deportivo La Coruna219[null]male0
172008/200941222000-01-0120available691892009-01-242019-12-16 23:09:16.16875622:00:00Barcelona217[null]maleSpain2144Regular Season1SpainLa Liga11Spain214Numancia444[null]male1
182008/200941222000-01-0121available692092009-02-012019-12-16 23:09:16.16875617:00:00Racing Santander1217[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Barcelona217[null]male2
192008/200941222000-01-0122available692312009-02-082019-12-16 23:09:16.16875619:00:00Barcelona217[null]maleSpain2143Regular Season1SpainLa Liga11Spain214Sporting Gijón1041[null]male1
202008/200941222000-01-0123available691952009-02-142019-12-16 23:09:16.16875620:00:00Real Betis218[null]maleSpain2142Regular Season1SpainLa Liga11Spain214Barcelona217[null]male2
212008/200941222000-01-0124available691392009-02-212019-12-16 23:09:16.16875620:00:00Barcelona217[null]maleSpain2141Regular Season1SpainLa Liga11Spain214Espanyol214[null]male2
222008/200941222000-01-0125available691852009-03-012019-12-16 23:09:16.16875619:00:00Atlético Madrid212[null]maleSpain2144Regular Season1SpainLa Liga11Spain214Barcelona217[null]male3
232008/200941222000-01-0126available692162009-03-072019-12-16 23:09:16.16875620:00:00Barcelona217[null]maleSpain2142Regular Season1SpainLa Liga11Spain214Athletic Bilbao215[null]male0
242008/200941222000-01-0127available691772009-03-152019-12-16 23:09:16.16875621:03:00Almería403[null]maleSpain2140Regular Season1SpainLa Liga11Spain214Barcelona217[null]male2
252008/200941222000-01-0128available691422009-03-222019-12-16 23:09:16.16875619:00:00Barcelona217[null]maleSpain2146Regular Season1SpainLa Liga11Spain214Málaga223[null]male0
262008/200941222000-01-0129available692732009-04-042019-12-16 23:09:16.16875620:00:00Real Valladolid901[null]maleSpain2140Regular Season1SpainLa Liga11Spain214Barcelona217[null]male1
272008/200941222000-01-0130available692222009-04-112019-12-16 23:09:16.16875620:00:00Barcelona217[null]maleSpain2142Regular Season1SpainLa Liga11Spain214Recreativo Huelva1220[null]male0
282008/2009412