VerticaPy
DBLINK in VerticaPy¶
Introduction¶
Starting with VerticaPy 0.12.0, you can work with other databases, such as PostgreSQL and MySQL, using the DBLINK functionality. DBLINK is a Vertica User-Defined Transform Function, coded in C++, that runs SQL against other databases. To set up DBLINK in Vertica and learn more about it, see the GitHub repo.
In order to use this new functionality, we first need to install an ODBC driver and driver manager, and configure DBLINK on all nodes of the cluster. Configuration entails three files:
- dblink.cids
- odbc.ini
- odbcinst.ini
For more information about requirements, see the GitHub repo.
These files provide the host server address, username, and password, as well as the database name that we want to access. In future versions, we are planning to simplify this process and automate the creation of these files.
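As an illustration, an odbc.ini entry for the PostgreSQL connection might look like the following sketch. The driver path, host, and credentials here are placeholders, not values from this tutorial; see the GitHub repo for the exact file formats, including dblink.cids, which maps the connection identifier to this DSN.

```
[pgdb]
Description = PostgreSQL database accessed through DBLINK
Driver = /usr/lib64/psqlodbcw.so
Servername = pg.example.com
Port = 5432
Database = flightsdb
Username = dbadmin
Password = secret
```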
In the next section, let's work through an example of a database in PostgreSQL.
Connecting to an External Database¶
# Importing VerticaPy
import verticapy as vp
We first need to provide the connection information that we have set up in the Connection Identifier Database file (dblink.cids). We can select a special character symbol to identify this connection.
Let's try to set up a connection with an external PostgreSQL database, which we name "pgdb". The connection details for "pgdb" (server name, username, etc.) are in the configuration files mentioned in the introduction.
# Setting up a connection with a database with the alias "pgdb"
vp.set_external_connection(cid = "pgdb",
                           rowset = 500,
                           symbol = "&")
Creating a vDataFrame¶
We can create a vDataFrame from a table stored in an external database by setting the 'external' parameter to True. SQL can be used to fetch the required data, and we provide the identifying symbol associated with the connection.
# Creating a vDataFrame using an SQL query by setting external = True.
tab_data = vp.vDataFrame(input_relation = "airports",
                         external = True,
                         symbol = "&")
tab_data
All vDataFrame functions are available for this imported table. For example, we can get all the column names:
# Get all columns of the dataset
tab_data.get_columns()
Or the column data types:
# Get data types of all columns inside the dataset
tab_data.dtypes()
Or the count of the datapoints:
# Counting all elements inside each column
tab_data.count()
Note: Every time we perform these calculations or display the vDataFrame, an SQL query runs to fetch all the data from the external database. After the entire table is retrieved, the operations are computed by Vertica. To push queries to the remote database instead, we can use the "sql_push_ext" option. When we create a vDataFrame with this option enabled, all the aggregations are performed on the external database using SQL.
# Creating a vDataFrame and setting sql_push_ext to True, which tries
# to push SQL queries to external database (where possible).
Ext_Table = vp.vDataFrame(input_relation = "airports",
                          external = True,
                          symbol = "&",
                          sql_push_ext = True)
Ext_Table
If we look at the SQL generated in the background, we can see that the aggregation query is pushed to the external database.
# Turning on SQL output to view the queries
vp.set_option("sql_on", True)
Let's look at the count query again, and see how VerticaPy is pushing it to the external database.
# Counting elements in each column
Ext_Table.count()
Let's also look at the "min" method:
# Finding the minimum of the LATITUDE column of Ext_Table
Ext_Table["LATITUDE"].min()
For the above examples, the queries were pushed to the external database.
If a function is unique to Vertica, VerticaPy automatically fetches the data from the external database and computes the result on the Vertica server. Let's try an example with the describe method, which is unique to Vertica.
# Describe the main attributes of numerical columns in Ext_Table
Ext_Table.describe()
We can see that the data was fetched from the external database to be computed in Vertica.
Now we can turn off the SQL display.
# Turning off SQL display
vp.set_option("sql_on", False)
Using SQL Magic Cells¶
# Load extension for running SQL magic cells
%load_ext verticapy.sql
We can use magic cells to query external tables by wrapping them in tripled special characters, such as "&&&" or "$$$". If we have multiple external databases, we can assign a distinct special character to each.
This makes writing queries a lot more convenient and visually appealing!
Now we will fetch data from our external database "pgdb", whose special character is "&".
%%sql
/* Getting all data from airports table which is placed in the PostgreSQL database represented by "&". */
SELECT * FROM &&& airports &&&;
To perform all regular queries, all we need to do is call the table with its name inside three special characters.
We'll now try out some queries:
Count the elements inside the table:
%%sql
/* Counting all elements inside the airports table in PostgreSQL. */
SELECT COUNT(*) FROM &&& airports &&&;
Find the IATA_CODE where CITY is "Allentown":
%%sql
/* Finding IATA_CODE where the CITY is "Allentown" in the airports table. */
SELECT IATA_CODE
FROM &&& airports &&&
WHERE CITY='Allentown';
Note: Any query that we write inside the "&&&" signs is also sent to the external database to be run.
So, instead of just calling the whole table, we can query it using the same special character padding.
For example, let's select all elements inside the airports table:
%%sql
/* Getting all data from the airports table in the PostgreSQL database represented by "&". */
&&& SELECT * FROM airports &&&;
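Conceptually, VerticaPy looks for spans wrapped in the tripled symbol and ships only those spans to the external database. The following toy Python sketch illustrates that extraction idea; `extract_external` is a hypothetical helper for illustration, not VerticaPy's actual implementation:

```python
import re

def extract_external(query, symbol="&"):
    # Return the query fragments wrapped in the tripled symbol,
    # e.g. "&&& airports &&&" -> " airports ".
    marker = re.escape(symbol * 3)
    return re.findall(marker + r"(.*?)" + marker, query, flags=re.S)

# A bare table reference is extracted and resolved on the external database:
print(extract_external("SELECT COUNT(*) FROM &&& airports &&&;"))
# → [' airports ']

# A whole query can be wrapped the same way:
print(extract_external("&&& SELECT * FROM airports &&&;"))
# → [' SELECT * FROM airports ']
```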
Now we'll run a search query to find a particular id:
%%sql
/* Finding IATA_CODE where the CITY is "Allentown" in the airports table. */
&&& SELECT "IATA_CODE" FROM airports WHERE "CITY"='Allentown' &&&;
We can also insert a new entry into the airports table in the PostgreSQL database represented by "&":
%%sql
/* Inserting an entry into the airports table in the PostgreSQL database represented by "&". */
&&&
INSERT INTO airports
("IATA_CODE", "AIRPORT", "CITY", "STATE", "COUNTRY", "LATITUDE", "LONGITUDE")
VALUES ('MXX' , 'Midway Airport', 'Chicago', 'IL', 'USA', 66.60, 35.00);
&&&
Connecting to Multiple Databases¶
You can connect to multiple databases and work with datasets from each.
In this example we will get:
- Airline data from PostgreSQL
- Airport data from MySQL
- Flights data from Vertica
The datasets can be found here.
Airline Data in PostgreSQL¶
We can set up a new connection in just one line by referencing the alias inside the connection files. As before, we will provide the special character symbol that is used to invoke the connection.
# Setting up a connection with a database given an alias "pgdb"
vp.set_external_connection(cid = "pgdb",
                           rowset = 500,
                           symbol = "$")
Let's look at the airline table that we have in our PostgreSQL database.
%%sql
/* Fetch all the data from the airline table in the "pgdb" database. */
SELECT * FROM $$$ airline $$$;
Airports Data in MySQL¶
We can create another new connection by providing the cid reference for our MySQL database. We'll also provide a unique special character, which is not used for any other connection.
# Setting up a connection with a database given an alias "mysql"
vp.set_external_connection(cid = "mysql",
                           rowset = 500,
                           symbol = "&")
Let's take a look at the airports table that we have in our MySQL database.
%%sql
/* Fetch all the data from the table airports in "mysql" database */
SELECT * FROM &&& airports &&&;
Flights Data in Vertica¶
We'll now read a locally stored CSV file with the flights data and materialize it in Vertica.
# Reading a CSV file and materializing it in Vertica as flight_vertica
flight_vertica = vp.read_csv('flights.csv',
                             table_name = "flight_vertica")
flight_vertica
%%sql
/* Fetch all the data from the table flight_vertica. */
SELECT * FROM flight_vertica;
Joins and Queries Across Multiple Databases¶
Now we can run queries that execute through multiple sources.
Let's try to find the TAIL_NUMBER and Departing City for all the flights by joining the two tables:
- flight_vertica (stored in Vertica)
- airports (stored in MySQL)
%%sql
/* Fetch TAIL_NUMBER and CITY after joining the flight_vertica table with the airports table in the MySQL database. */
SELECT flight_vertica.TAIL_NUMBER, airports.CITY AS Departing_City
FROM flight_vertica
INNER JOIN &&& airports &&&
ON flight_vertica.ORIGIN_AIRPORT = airports.IATA_CODE;
Let's try another query to find the TAIL_NUMBER and AIRLINE of all the flights by joining the two tables:
- flight_vertica (stored in Vertica)
- airline (stored in PostgreSQL)
%%sql
/* Fetch TAIL_NUMBER and AIRLINE after joining the flight_vertica table with the airline table in the PostgreSQL database. */
SELECT flight_vertica.TAIL_NUMBER, airline.AIRLINE
FROM flight_vertica
INNER JOIN $$$ airline $$$
ON flight_vertica.AIRLINE = airline.IATA_CODE;
We can even try queries that require multiple joins.
In the following example, we try to get the FLIGHT_NUMBER, AIRLINE, and STATE details for all the flights by joining:
- flight_vertica table (stored in Vertica)
- airline table (stored in PostgreSQL)
- airports table (stored in MySQL)
%%sql
/* Fetch FLIGHT_NUMBER, AIRLINE, and STATE after joining the flight_vertica table with two other tables from different databases. */
SELECT flight_vertica.FLIGHT_NUMBER, airline.AIRLINE, airports.STATE
FROM flight_vertica
INNER JOIN $$$ airline $$$
ON flight_vertica.AIRLINE = airline.IATA_CODE
INNER JOIN &&& airports &&&
ON flight_vertica.ORIGIN_AIRPORT = airports.IATA_CODE;
Pandas.DataFrame¶
Joins also work with a pandas.DataFrame. We can perform the same query that required multiple joins, but now including a local pandas DataFrame.
We will first read the local CSV file.
# Create a pandas DataFrame from the CSV file "passengers.csv"
import pandas as pd
passengers_pandas = pd.read_csv('passengers.csv')
passengers_pandas
We can now perform the same query involving the four tables:
- flight_vertica table (stored in Vertica)
- passengers_pandas table (pandas.DataFrame stored in-memory)
- airline table (stored in PostgreSQL)
- airports table (stored in MySQL)
%%sql
SELECT
flight_vertica.TAIL_NUMBER,
airline.AIRLINE,
airports.CITY,
:passengers_pandas.PASSENGER_COUNT
FROM flight_vertica
INNER JOIN $$$ airline $$$
ON flight_vertica.AIRLINE = airline.IATA_CODE
INNER JOIN &&& airports &&&
ON flight_vertica.ORIGIN_AIRPORT = airports.IATA_CODE
INNER JOIN :passengers_pandas
ON flight_vertica.FLIGHT_NUMBER = :passengers_pandas.FLIGHT_NUMBER;
Conclusion¶
With the combination of VerticaPy and DBLINK, we can now work with multiple datasets stored in different databases. We can work simultaneously with external tables, Vertica tables, and pandas DataFrames in a single query! There is no need to materialize tables before use because it's all taken care of in the background.
The cherry on top is the ease of use enabled by VerticaPy and its Python-like syntax.
Queries that once required page after page of code can now be written in just a few intuitive lines.
This new functionality opens up many possibilities for data querying and manipulation in Vertica.