VerticaPy

Python API for Vertica Data Science at Scale

DBLINK in VerticaPy

Introduction

Starting with VerticaPy 0.12.0, you can work with other databases, such as PostgresQL and mySQL, using DBLINK functionality. DBLINK is a Vertica User Defined Transform Function coded in C++ that runs SQL against other databases. To setup and learn more about DBLINK in Vertica, please view the github repo.

In order to use this new functionality, we first need to install the ODBC driver and manager, as well as configure DBLINK on all nodes of the cluster. Configuration entails three files:

  • dblink.cids
  • odbc.ini
  • odbcinst.ini

For more information about requirements, see the github repo.

These files provide the host server address, username, and password, as well as the database name that we want to access. In future versions, we are planning to simplify this process and automate the creation of these files.

In the next section, let's work through an example of a database in PostgreSQL.

Connecting to an External Database

In [1]:
# Importing VerticaPy
import verticapy as vp

We first need to provide the connection information that we have set up in the Connection Identifier Database file (dblink.cids). We can select a special character symbol to identify this connection.

Let's try to set up a connection with an external PostgreSQL database, which we name "pgdb". The connection details for "pgdb", including server name, user name etc., are in the configuration files mentioned in the introduction section.

In [2]:
# Setting up a connection with a database with the alias "pgdb"
vp.set_external_connection(cid = "pgdb",
                           rowset = 500,
                           symbol = "&")

Creating a vDataFrame

We can create a vDataFrame from a table stored in an external database by setting the 'external' parameter to True. SQL can be used to fetch required data, and we can provide an identifying symbol that can be used for fetching perform queries with SQL.

In [3]:
# Creating a vDataFrame using an SQL query by setting external = True.
tab_data = vp.vDataFrame(input_relation = "airports",
                         external = True,
                         symbol = "&")
tab_data
Out[3]:
Abc
IATA_CODE
Varchar(3)
Abc
Varchar(100)
Abc
CITY
Varchar(60)
Abc
STATE
Varchar(2)
Abc
COUNTRY
Varchar(100)
🌎
LATITUDE
Numeric(30)
🌎
LONGITUDE
Numeric(30)
1ABEAllentownPAUSA40.65236-75.4404
2ABIAbileneTXUSA32.41132-99.6819
3ABQAlbuquerqueNMUSA35.04022-106.60919
4ABRAberdeenSDUSA45.44906-98.42183
5ABYAlbanyGAUSA31.53552-84.19447
6ACKNantucketMAUSA41.25305-70.06018
7ACTWacoTXUSA31.61129-97.23052
8ACVArcata/EurekaCAUSA40.97812-124.10862
9ACYAtlantic CityNJUSA39.45758-74.57717
10ADKAdakAKUSA51.87796-176.64603
11ADQKodiakAKUSA57.74997-152.49386
12AEXAlexandriaLAUSA31.32737-92.54856
13AGSAugustaGAUSA33.36996-81.9645
14AKNKing SalmonAKUSA58.6768-156.64922
15ALBAlbanyNYUSA42.74812-73.80298
16ALOWaterlooIAUSA42.55708-92.40034
17AMAAmarilloTXUSA35.21937-101.70593
18ANCAnchorageAKUSA61.17432-149.99619
19APNAlpenaMIUSA45.07807-83.56029
20ASEAspenCOUSA39.22316-106.86885
21ATLAtlantaGAUSA33.64044-84.42694
22ATWAppletonWIUSA44.25741-88.51948
23AUSAustinTXUSA30.19453-97.66987
24AVLAshevilleNCUSA35.43619-82.54181
25AVPWilkes-Barre/ScrantonPAUSA41.33815-75.72427
26AZOKalamazooMIUSA42.23488-85.55206
27BDLWindsor LocksCTUSA41.93887-72.68323
28BETBethelAKUSA60.77978-161.838
29BFLBakersfieldCAUSA35.4336-119.05677
30BGMBinghamtonNYUSA42.20848-75.97961
31BGRBangorMEUSA44.80744-68.82814
32BHMBirminghamALUSA33.56294-86.75355
33BILBillingsMTUSA45.80766-108.54286
34BISBismarckNDUSA46.77411-100.74672
35BJIBemidjiMNUSA47.50942-94.93372
36BLIBellinghamWAUSA48.79275-122.53753
37BMIBloomingtonILUSA40.47799-88.91595
38BNANashvilleTNUSA36.12448-86.67818
39BOIBoiseIDUSA43.56444-116.22278
40BOSBostonMAUSA42.36435-71.00518
41BPTBeaumont/Port ArthurTXUSA29.95083-94.02069
42BQKBrunswickGAUSA31.25903-81.46631
43BQNAguadillaPRUSA18.49486-67.12944
44BRDBrainerdMNUSA46.39786-94.13723
45BROBrownsvilleTXUSA25.90683-97.42586
46BRWBarrowAKUSA71.28545-156.766
47BTMButteMTUSA45.9548-112.49746
48BTRBaton RougeLAUSA30.53316-91.14963
49BTVBurlingtonVTUSA44.473-73.15031
50BUFBuffaloNYUSA42.94052-78.73217
51BURBurbankCAUSA34.20062-118.3585
52BWIBaltimoreMDUSA39.1754-76.6682
53BZNBozemanMTUSA45.7769-111.15301
54CAEColumbiaSCUSA33.93884-81.11954
55CAKAkronOHUSA40.91631-81.44247
56CDCCedar CityUTUSA37.70097-113.09858
57CDVCordovaAKUSA60.49183-145.47765
58CECCrescent CityCAUSA41.78016-124.23653
59CHAChattanoogaTNUSA35.03527-85.20379
60CHOCharlottesvilleVAUSA38.13864-78.45286
61CHSCharlestonSCUSA32.89865-80.04051
62CIDCedar RapidsIAUSA41.88459-91.71087
63CIUSault Ste. MarieMIUSA46.25075-84.47239
64CLDSan DiegoCAUSA33.12723-117.27873
65CLEClevelandOHUSA41.41089-81.8494
66CLLCollege StationTXUSA30.58859-96.36382
67CLTCharlotteNCUSA35.21401-80.94313
68CMHColumbusOHUSA39.99799-82.89188
69CMIChampaign/UrbanaILUSA40.03925-88.27806
70CMXHancockMIUSA47.16842-88.48906
71CNYMoabUTUSA38.75496-109.75484
72CODCodyWYUSA44.52019-109.0238
73COSColorado SpringsCOUSA38.80581-104.70025
74COUColumbiaMOUSA38.81809-92.21963
75CPRCasperWYUSA42.90836-106.46447
76CRPCorpus ChristiTXUSA27.77036-97.50122
77CRWCharlestonWVUSA38.37315-81.59319
78CSGColumbusGAUSA32.51633-84.93886
79CVGCovingtonKYUSA39.04614-84.66217
80CWAMosineeWIUSA44.77762-89.66678
81DABDaytona BeachFLUSA29.17992-81.05806
82DALDallasTXUSA32.84711-96.85177
83DAYDaytonOHUSA39.90238-84.21938
84DBQDubuqueIAUSA42.40296-90.70917
85DCAArlingtonVAUSA38.85208-77.03772
86DENDenverCOUSA39.85841-104.667
87DFWDallas-Fort WorthTXUSA32.89595-97.0372
88DHNDothanALUSA31.32134-85.44963
89DIKDickinsonNDUSA46.79739-102.80195
90DLGDillinghamAKUSA59.04541-158.50334
91DLHDuluthMNUSA46.84209-92.19365
92DRODurangoCOUSA37.15152-107.75377
93DSMDes MoinesIAUSA41.53493-93.66068
94DTWDetroitMIUSA42.21206-83.34884
95DVLDevils LakeNDUSA48.11425-98.90878
96EAUEau ClaireWIUSA44.86526-91.48507
97ECPPanama CityFLUSA[null][null]
98EGEEagleCOUSA39.64257-106.9177
99EKOElkoNVUSA40.82493-115.7917
100ELMElmiraNYUSA42.15991-76.89144
Rows: 1-100 | Columns: 7

All vDataFrame functions are available for this imported table. For example, we can get all the column names:

In [ ]:
# Get all columns of the dataset
tab_data.get_columns()

Or the column data types:

In [ ]:
# Get data types of all columns inside the dataset
tab_data.dtypes()

Or the count of the datapoints:

In [6]:
# Counting all elements inside each column
tab_data.count()
Out[6]:
count
"IATA_CODE"322.0
"AIRPORT"322.0
"CITY"322.0
"STATE"322.0
"COUNTRY"322.0
"LATITUDE"319.0
"LONGITUDE"319.0
Rows: 1-7 | Columns: 2

Note: Every time we perform these calculations or call the vDataFrame, it runs the SQL query to fetch all the data from the external database. After retrieving the entire table, the operations are computed by Vertica. In order to push the queries to a remote database, we can use the option "sql_push_ext". When we create a vDataFrame with this option activated, all the aggregations are done on the external database using SQL.

In [7]:
# Creating a vDataFrame and setting sql_push_ext to True, which tries 
# to push SQL queries to external database (where possible).
Ext_Table=vp.vDataFrame(input_relation = "airports",
                        external = True,
                        symbol = "&",
                        sql_push_ext = True)
Ext_Table
Out[7]:
Abc
IATA_CODE
Varchar(3)
Abc
Varchar(100)
Abc
CITY
Varchar(60)
Abc
STATE
Varchar(2)
Abc
COUNTRY
Varchar(100)
🌎
LATITUDE
Numeric(30)
🌎
LONGITUDE
Numeric(30)
1ABEAllentownPAUSA40.65236-75.4404
2ABIAbileneTXUSA32.41132-99.6819
3ABQAlbuquerqueNMUSA35.04022-106.60919
4ABRAberdeenSDUSA45.44906-98.42183
5ABYAlbanyGAUSA31.53552-84.19447
6ACKNantucketMAUSA41.25305-70.06018
7ACTWacoTXUSA31.61129-97.23052
8ACVArcata/EurekaCAUSA40.97812-124.10862
9ACYAtlantic CityNJUSA39.45758-74.57717
10ADKAdakAKUSA51.87796-176.64603
11ADQKodiakAKUSA57.74997-152.49386
12AEXAlexandriaLAUSA31.32737-92.54856
13AGSAugustaGAUSA33.36996-81.9645
14AKNKing SalmonAKUSA58.6768-156.64922
15ALBAlbanyNYUSA42.74812-73.80298
16ALOWaterlooIAUSA42.55708-92.40034
17AMAAmarilloTXUSA35.21937-101.70593
18ANCAnchorageAKUSA61.17432-149.99619
19APNAlpenaMIUSA45.07807-83.56029
20ASEAspenCOUSA39.22316-106.86885
21ATLAtlantaGAUSA33.64044-84.42694
22ATWAppletonWIUSA44.25741-88.51948
23AUSAustinTXUSA30.19453-97.66987
24AVLAshevilleNCUSA35.43619-82.54181
25AVPWilkes-Barre/ScrantonPAUSA41.33815-75.72427
26AZOKalamazooMIUSA42.23488-85.55206
27BDLWindsor LocksCTUSA41.93887-72.68323
28BETBethelAKUSA60.77978-161.838
29BFLBakersfieldCAUSA35.4336-119.05677
30BGMBinghamtonNYUSA42.20848-75.97961
31BGRBangorMEUSA44.80744-68.82814
32BHMBirminghamALUSA33.56294-86.75355
33BILBillingsMTUSA45.80766-108.54286
34BISBismarckNDUSA46.77411-100.74672
35BJIBemidjiMNUSA47.50942-94.93372
36BLIBellinghamWAUSA48.79275-122.53753
37BMIBloomingtonILUSA40.47799-88.91595
38BNANashvilleTNUSA36.12448-86.67818
39BOIBoiseIDUSA43.56444-116.22278
40BOSBostonMAUSA42.36435-71.00518
41BPTBeaumont/Port ArthurTXUSA29.95083-94.02069
42BQKBrunswickGAUSA31.25903-81.46631
43BQNAguadillaPRUSA18.49486-67.12944
44BRDBrainerdMNUSA46.39786-94.13723
45BROBrownsvilleTXUSA25.90683-97.42586
46BRWBarrowAKUSA71.28545-156.766
47BTMButteMTUSA45.9548-112.49746
48BTRBaton RougeLAUSA30.53316-91.14963
49BTVBurlingtonVTUSA44.473-73.15031
50BUFBuffaloNYUSA42.94052-78.73217
51BURBurbankCAUSA34.20062-118.3585
52BWIBaltimoreMDUSA39.1754-76.6682
53BZNBozemanMTUSA45.7769-111.15301
54CAEColumbiaSCUSA33.93884-81.11954
55CAKAkronOHUSA40.91631-81.44247
56CDCCedar CityUTUSA37.70097-113.09858
57CDVCordovaAKUSA60.49183-145.47765
58CECCrescent CityCAUSA41.78016-124.23653
59CHAChattanoogaTNUSA35.03527-85.20379
60CHOCharlottesvilleVAUSA38.13864-78.45286
61CHSCharlestonSCUSA32.89865-80.04051
62CIDCedar RapidsIAUSA41.88459-91.71087
63CIUSault Ste. MarieMIUSA46.25075-84.47239
64CLDSan DiegoCAUSA33.12723-117.27873
65CLEClevelandOHUSA41.41089-81.8494
66CLLCollege StationTXUSA30.58859-96.36382
67CLTCharlotteNCUSA35.21401-80.94313
68CMHColumbusOHUSA39.99799-82.89188
69CMIChampaign/UrbanaILUSA40.03925-88.27806
70CMXHancockMIUSA47.16842-88.48906
71CNYMoabUTUSA38.75496-109.75484
72CODCodyWYUSA44.52019-109.0238
73COSColorado SpringsCOUSA38.80581-104.70025
74COUColumbiaMOUSA38.81809-92.21963
75CPRCasperWYUSA42.90836-106.46447
76CRPCorpus ChristiTXUSA27.77036-97.50122
77CRWCharlestonWVUSA38.37315-81.59319
78CSGColumbusGAUSA32.51633-84.93886
79CVGCovingtonKYUSA39.04614-84.66217
80CWAMosineeWIUSA44.77762-89.66678
81DABDaytona BeachFLUSA29.17992-81.05806
82DALDallasTXUSA32.84711-96.85177
83DAYDaytonOHUSA39.90238-84.21938
84DBQDubuqueIAUSA42.40296-90.70917
85DCAArlingtonVAUSA38.85208-77.03772
86DENDenverCOUSA39.85841-104.667
87DFWDallas-Fort WorthTXUSA32.89595-97.0372
88DHNDothanALUSA31.32134-85.44963
89DIKDickinsonNDUSA46.79739-102.80195
90DLGDillinghamAKUSA59.04541-158.50334
91DLHDuluthMNUSA46.84209-92.19365
92DRODurangoCOUSA37.15152-107.75377
93DSMDes MoinesIAUSA41.53493-93.66068
94DTWDetroitMIUSA42.21206-83.34884
95DVLDevils LakeNDUSA48.11425-98.90878
96EAUEau ClaireWIUSA44.86526-91.48507
97ECPPanama CityFLUSA[null][null]
98EGEEagleCOUSA39.64257-106.9177
99EKOElkoNVUSA40.82493-115.7917
100ELMElmiraNYUSA42.15991-76.89144
Rows: 1-100 | Columns: 7

If we look at the SQL generated in background, we can see that it pushes the aggregation query to the database.

In [8]:
# Turning on SQL output to view the queries
vp.set_option("sql_on",True)

Let's look at the count query again, and see how VerticaPy is pushing it to the external database.

In [9]:
# Counting elements in each column
Ext_Table.count()

Computing the different aggregations.

  SELECT
    DBLINK(USING PARAMETERS cid='pgdb', query=' SELECT COUNT("IATA_CODE"), COUNT("AIRPORT"), COUNT("CITY"), COUNT("STATE"), COUNT("COUNTRY"), COUNT("LATITUDE"), COUNT("LONGITUDE")   FROM ( SELECT "IATA_CODE", "AIRPORT", "CITY", "STATE", "COUNTRY", "LATITUDE", "LONGITUDE"   FROM ( SELECT *   FROM airports)   VERTICAPY_SUBTABLE)   VERTICAPY_SUBTABLE LIMIT 1', rowset=500) OVER ()
Out[9]:
count
"IATA_CODE"322.0
"AIRPORT"322.0
"CITY"322.0
"STATE"322.0
"COUNTRY"322.0
"LATITUDE"319.0
"LONGITUDE"319.0
Rows: 1-7 | Columns: 2

Let's also look at the "min" method:

In [10]:
# Finding minimum in the ID column of Ext_Table
Ext_Table["LATITUDE"].min()

Computing the different aggregations.

  SELECT
    DBLINK(USING PARAMETERS cid='pgdb', query=' SELECT MIN("LATITUDE")   FROM ( SELECT "IATA_CODE", "AIRPORT", "CITY", "STATE", "COUNTRY", "LATITUDE", "LONGITUDE"   FROM ( SELECT *   FROM airports)   VERTICAPY_SUBTABLE)   VERTICAPY_SUBTABLE LIMIT 1', rowset=500) OVER ()
Out[10]:
13.48345

For the above examples, the queries were pushed to the external database.

If the function is unique to Vertica, it automatically fetches the data from the external database to compute on the Vertica server. Let's try an example with the describe function, which is a unique Vertica function.

In [11]:
# Describe the main attributes of numerical columns in the Ext_table
Ext_Table.describe()

Getting the version.

  SELECT
    /*+LABEL('utilities.version')*/ version()

Computing the descriptive statistics of all numerical columns using SUMMARIZE_NUMCOL.

  SELECT
    /*+LABEL('vDataframe.describe')*/ SUMMARIZE_NUMCOL("LATITUDE", "LONGITUDE") OVER ()  
  FROM
(
  SELECT
    "IATA_CODE",
    "AIRPORT",
    "CITY",
    "STATE",
    "COUNTRY",
    "LATITUDE",
    "LONGITUDE"  
  FROM
(
  SELECT
    DBLINK(USING PARAMETERS cid='pgdb', query=' SELECT *   FROM airports', rowset=500) OVER ())  
VERTICAPY_SUBTABLE)  
VERTICAPY_SUBTABLE
Out[11]:
count
mean
std
min
approx_25%
approx_50%
approx_75%
max
"LATITUDE"31938.98124391849538.6167355810180413.4834533.6520439.2976143.15467571.28545
"LONGITUDE"319-98.378964451410721.5234920464981-176.64603-110.839385-93.40307-82.722995-64.79856
Rows: 1-2 | Columns: 9

We can see that the data was fetched from the external database to be computed in Vertica.

(Now we can turn off SQL display).

In [12]:
# Turning off SQL display
vp.set_option("sql_on",
              False)

Using SQL Magic Cells

In [2]:
# Load extension for running SQL magic cells
%load_ext verticapy.sql

We can use magic cells to call external tables using special characters like "$$$" and "%%%". If we have multiple external databases, we can specify special characters for each.

This makes writing queries a lot more convenient and visually appealing!

Now we will try to get fetch data from our external database "pgdb, whose special character is "&".

In [14]:
%%sql
/* Getting all data from airports table which is placed in the PostgreSQL database represented by "&". */
SELECT * FROM &&& airports &&&;
Execution: 0.106s
Out[14]:
Abc
IATA_CODE
Varchar(3)
Abc
Varchar(100)
Abc
CITY
Varchar(60)
Abc
STATE
Varchar(2)
Abc
COUNTRY
Varchar(100)
🌎
LATITUDE
Numeric(30)
🌎
LONGITUDE
Numeric(30)
1ABEAllentownPAUSA40.65236-75.4404
2ABIAbileneTXUSA32.41132-99.6819
3ABQAlbuquerqueNMUSA35.04022-106.60919
4ABRAberdeenSDUSA45.44906-98.42183
5ABYAlbanyGAUSA31.53552-84.19447
6ACKNantucketMAUSA41.25305-70.06018
7ACTWacoTXUSA31.61129-97.23052
8ACVArcata/EurekaCAUSA40.97812-124.10862
9ACYAtlantic CityNJUSA39.45758-74.57717
10ADKAdakAKUSA51.87796-176.64603
11ADQKodiakAKUSA57.74997-152.49386
12AEXAlexandriaLAUSA31.32737-92.54856
13AGSAugustaGAUSA33.36996-81.9645
14AKNKing SalmonAKUSA58.6768-156.64922
15ALBAlbanyNYUSA42.74812-73.80298
16ALOWaterlooIAUSA42.55708-92.40034
17AMAAmarilloTXUSA35.21937-101.70593
18ANCAnchorageAKUSA61.17432-149.99619
19APNAlpenaMIUSA45.07807-83.56029
20ASEAspenCOUSA39.22316-106.86885
21ATLAtlantaGAUSA33.64044-84.42694
22ATWAppletonWIUSA44.25741-88.51948
23AUSAustinTXUSA30.19453-97.66987
24AVLAshevilleNCUSA35.43619-82.54181
25AVPWilkes-Barre/ScrantonPAUSA41.33815-75.72427
26AZOKalamazooMIUSA42.23488-85.55206
27BDLWindsor LocksCTUSA41.93887-72.68323
28BETBethelAKUSA60.77978-161.838
29BFLBakersfieldCAUSA35.4336-119.05677
30BGMBinghamtonNYUSA42.20848-75.97961
31BGRBangorMEUSA44.80744-68.82814
32BHMBirminghamALUSA33.56294-86.75355
33BILBillingsMTUSA45.80766-108.54286
34BISBismarckNDUSA46.77411-100.74672
35BJIBemidjiMNUSA47.50942-94.93372
36BLIBellinghamWAUSA48.79275-122.53753
37BMIBloomingtonILUSA40.47799-88.91595
38BNANashvilleTNUSA36.12448-86.67818
39BOIBoiseIDUSA43.56444-116.22278
40BOSBostonMAUSA42.36435-71.00518
41BPTBeaumont/Port ArthurTXUSA29.95083-94.02069
42BQKBrunswickGAUSA31.25903-81.46631
43BQNAguadillaPRUSA18.49486-67.12944
44BRDBrainerdMNUSA46.39786-94.13723
45BROBrownsvilleTXUSA25.90683-97.42586
46BRWBarrowAKUSA71.28545-156.766
47BTMButteMTUSA45.9548-112.49746
48BTRBaton RougeLAUSA30.53316-91.14963
49BTVBurlingtonVTUSA44.473-73.15031
50BUFBuffaloNYUSA42.94052-78.73217
51BURBurbankCAUSA34.20062-118.3585
52BWIBaltimoreMDUSA39.1754-76.6682
53BZNBozemanMTUSA45.7769-111.15301
54CAEColumbiaSCUSA33.93884-81.11954
55CAKAkronOHUSA40.91631-81.44247
56CDCCedar CityUTUSA37.70097-113.09858
57CDVCordovaAKUSA60.49183-145.47765
58CECCrescent CityCAUSA41.78016-124.23653
59CHAChattanoogaTNUSA35.03527-85.20379
60CHOCharlottesvilleVAUSA38.13864-78.45286
61CHSCharlestonSCUSA32.89865-80.04051
62CIDCedar RapidsIAUSA41.88459-91.71087
63CIUSault Ste. MarieMIUSA46.25075-84.47239
64CLDSan DiegoCAUSA33.12723-117.27873
65CLEClevelandOHUSA41.41089-81.8494
66CLLCollege StationTXUSA30.58859-96.36382
67CLTCharlotteNCUSA35.21401-80.94313
68CMHColumbusOHUSA39.99799-82.89188
69CMIChampaign/UrbanaILUSA40.03925-88.27806
70CMXHancockMIUSA47.16842-88.48906
71CNYMoabUTUSA38.75496-109.75484
72CODCodyWYUSA44.52019-109.0238
73COSColorado SpringsCOUSA38.80581-104.70025
74COUColumbiaMOUSA38.81809-92.21963
75CPRCasperWYUSA42.90836-106.46447
76CRPCorpus ChristiTXUSA27.77036-97.50122
77CRWCharlestonWVUSA38.37315-81.59319
78CSGColumbusGAUSA32.51633-84.93886
79CVGCovingtonKYUSA39.04614-84.66217
80CWAMosineeWIUSA44.77762-89.66678
81DAB