Co-authored by James Clampffer, Deepak Majeti. – First in a series
A lot of projects require querying ORC or Parquet files, or other data that may have internal types that are a bit more complex than usual. It’s a little hard to explain what we mean when we say, “complex.”
So, this is the first in a series of blog posts on complex types. Over the course of this series, we’ll talk about everything from the high level, “What are complex data types?” and “What are the benefits of querying complex data types?” down to the technical nitty gritty of, “How do I query complex data types in a SQL analytics warehouse like Vertica?” We will also keep you updated as to which types Vertica supports, and what we’re adding next.
What are Complex Data Types?
Let’s start with a definition. Complex types are nested data structures composed of primitive data types. These data structures can also be composed of other complex types. Some examples of complex types include struct(row), array/list, map and union.
Complex types are supported by most programming languages including Python, C++ and Java. They are also supported by databases like PostgreSQL which introduced composite (struct) type in 8.0.
And, they are supported in Vertica. What we are adding is a richer, more complete support for these types, not just querying complex types in ORC and Parquet, but doing so rapidly, without converting the file to a far larger row and column format first. We are even adding the capability to store data in efficient complex types in Vertica’s own ROS storage for even faster processing.
Examples of Complex Types
Array – Array is a homogenous collection of elements of the same data type. These can be int, float, or any kind of string (string, char, varchar, etc.) or another complex type. The Array data type is quite similar to Java’s arraylist which is a strongly typed list but also lets you nest the lists to create arrays.
Example of an array could be PhoneNumbers where each number is defined as a string data type. All elements in this array would have the same data type, ie. string.
Map – Map data types are key-value pairs. This type associates a key with a value. For example, a city could be associated with a zipcode, or an address with a phone number.
In Vertica, the data types for the key and the value in a Map do not have to be the same. The keys can only be made of the base types (not a complex type). Map values can be any type that Array supports. For example, the key could be a string and the value could be an int.
Array and Map complex types are also called “collections.”
Struct – Struct is a composite data type that can contain other complex and simple data types. The data types in a struct do not need to be the same (as they do in arrays.) This flexibility allows users to combine multiple data types together under a single name. While many other member variables of similar or different data types can be part of a struct, all these variables need to be defined at the time the struct is created.
An example of struct below shows a SimpleCustomer struct which contains the name (varchar) and id (int) of the customer. Additional fields can similarly be added at the time of struct definition.
Nested Complex Types
Just when you thought things were complex enough, you can also create complex types by nesting other complex types.
For example, we might want to create a struct with all the webevents tracked by customers. For storing the phone numbers of the customers, we can create an ARRAY PhoneNumbers. Now, we can use this array of phone numbers that we have when creating another complex data type to store customer records. Instead of recreating the array, we can instead define the type of “contact” in Customer struct as type PhoneNumbers which is based on the Array data type.
Subsequently, when creating a webEvents struct, we can use basic data types like int and timestamp and define customer record based on the struct Customer we created in the previous step. We also defined another variable http_requests based on a complex data type (MAP) that we already created.
Now, you see why they’re called complex types. Yet, you can also see how using nested complex types like this actually simplifies code and saves time.
Be sure to check out the next post in this series:
2 – What Are the Benefits of Using Complex Data Types in SQL?
And don’t miss the on-demand webinar: Data Preparation for Machine Learning at Scale
Quick Tip: Changing the Data Type of a Column in an External Table
1. Benefits of Machine Learning in a Database
2: Calculate a Correlation Matrix – An In-Database Machine Learning Data Exploration Post
Quick Tip: Exporting to Parquet
Announcing Vertica Version 9.3 – Ride the Winds of Change
Vertica Version 10 Launches Today