
DBLINK in VerticaPy¶
Introduction¶
Starting with VerticaPy 0.12.0, you can work with other databases, such as PostgreSQL and MySQL, using the DBLINK functionality. DBLINK is a Vertica User-Defined Transform Function, coded in C++, that runs SQL against other databases. To set up DBLINK in Vertica and learn more about it, please view the GitHub repo.
In order to use this new functionality, we first need to install the ODBC driver and manager, as well as configure DBLINK on all nodes of the cluster. Configuration entails three files:
- dblink.cids
- odbc.ini
- odbcinst.ini
For more information about the requirements, see the GitHub repo.
These files provide the host server address, username, and password, as well as the database name that we want to access. In future versions, we are planning to simplify this process and automate the creation of these files.
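As an illustration, the three files might look something like the sketch below. All names, hosts, paths, and credentials here are hypothetical placeholders, and the exact dblink.cids syntax is documented in the GitHub repo; only the odbc.ini and odbcinst.ini sections follow the standard ODBC configuration format.

```ini
; dblink.cids -- maps a connection identifier (cid) to an ODBC connection string
; (illustrative only; see the GitHub repo for the exact syntax)
pgdb: DSN=PgServer

; odbc.ini -- data source definition used by the ODBC driver manager
[PgServer]
Driver = PostgreSQL
Server = pg.example.com
Port = 5432
Database = external_db
UserName = dbuser
Password = secret

; odbcinst.ini -- tells the ODBC driver manager where the driver library lives
[PostgreSQL]
Description = PostgreSQL ODBC driver
Driver = /usr/lib64/psqlodbcw.so
```

Remember that these files must be present on all nodes of the cluster.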
In the next section, let's work through an example using a PostgreSQL database.
Connecting to an External Database¶
# Importing VerticaPy
import verticapy as vp
We first need to provide the connection information that we have set up in the Connection Identifier Database file (dblink.cids). We can select a special character symbol to identify this connection.
Let's try to set up a connection with an external PostgreSQL database, which we name "pgdb". The connection details for "pgdb", including the server name, username, etc., are in the configuration files mentioned in the Introduction section.
# Setting up a connection with a database with the alias "pgdb"
vp.set_external_connection(cid = "pgdb",
                           rowset = 500,
                           symbol = "&")
Creating a vDataFrame¶
We can create a vDataFrame from a table stored in an external database by setting the 'external' parameter to True. SQL can be used to fetch the required data, and we provide the identifying symbol that is used when performing queries against the external database.
# Creating a vDataFrame using an SQL query by setting external = True.
tab_data = vp.vDataFrame(input_relation = "airports",
                         external = True,
                         symbol = "&")
tab_data
All vDataFrame functions are available for this imported table. For example, we can get all the column names:
# Get all columns of the dataset
tab_data.get_columns()
Or the column data types:
# Get data types of all columns inside the dataset
tab_data.dtypes()
Or the count of the datapoints:
# Counting all elements inside each column
tab_data.count()
Note: Every time we perform these calculations or display the vDataFrame, an SQL query runs that fetches all the data from the external database. After retrieving the entire table, the operations are computed by Vertica. To push queries down to the remote database instead, we can use the "sql_push_ext" option. When we create a vDataFrame with this option activated, all possible aggregations are done on the external database using SQL.
# Creating a vDataFrame and setting sql_push_ext to True, which tries
# to push SQL queries to external database (where possible).
Ext_Table = vp.vDataFrame(input_relation = "airports",
                          external = True,
                          symbol = "&",
                          sql_push_ext = True)
Ext_Table
If we look at the SQL generated in the background, we can see that the aggregation query is pushed to the external database.
# Turning on SQL output to view the queries
vp.set_option("sql_on", True)
Let's look at the count query again, and see how VerticaPy is pushing it to the external database.
# Counting elements in each column
Ext_Table.count()
Let's also look at the "min" method:
# Finding the minimum of the LATITUDE column of Ext_Table
Ext_Table["LATITUDE"].min()
For the above examples, the queries were pushed to the external database.
If a function is unique to Vertica, VerticaPy automatically fetches the data from the external database and computes the result on the Vertica server. Let's try an example with the describe method, which relies on Vertica-specific functions.
# Describe the main attributes of numerical columns in the Ext_table
Ext_Table.describe()
We can see that the data was fetched from the external database to be computed in Vertica.
Now we can turn off the SQL display.
# Turning off SQL display
vp.set_option("sql_on", False)
Using SQL Magic Cells¶
# Load extension for running SQL magic cells
%load_ext verticapy.sql
We can use magic cells to call external tables using special characters like "$$$" and "%%%". If we have multiple external databases, we can specify special characters for each.
This makes writing queries a lot more convenient and visually appealing!
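For instance, suppose a second external database (a hypothetical MySQL connection with cid "mydb", also defined in dblink.cids) had been registered via vp.set_external_connection with the symbol "%". A magic cell could then address it with that symbol; the table name "routes" below is a placeholder:

```sql
%%sql
/* Hypothetical: fetching from the MySQL database registered with symbol "%". */
SELECT * FROM %%% routes %%%;
```

Each symbol identifies exactly one registered connection, so the tripled symbol tells VerticaPy which external database should run the enclosed relation.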
Now we will fetch data from our external database "pgdb", whose special character is "&".
%%sql
/* Getting all data from airports table which is placed in the PostgreSQL database represented by "&". */
SELECT * FROM &&& airports &&&;