Vertica and Apache DataSketches: Technical Exploration

About this Document

The goal of this exploration is to connect Vertica and Apache DataSketches. DataSketches provides machine learning functions called theta functions that are imported in Vertica to enhance the ML functionality of Vertica.

This exploration provides instructions for tightly integrating theta analytical functions into Vertica. To begin, you need to build a DataSketches library in Vertica and then create a table to use these theta functions for your data.

DataSketches Overview

DataSketches is an open source software library of algorithms called sketches that can process massive datasets as a stream and provide fast approximations, real-time analysis, with proven error bounds.

Test Environment

  • Vertica DataSketches (GitHub link)

  • CMake 3.14+

  • GCC+ (C++ compiler)

  • Linux RHEL 8.3

  • Vertica Analytical Database 12.0.1

Building DataSketches Library for Vertica

DataSketches repository contains UDxs, the source code of which is developed in C++. It was created by the Analytics Infrastructure teams at Criteo. You need to compile the source code and build the so file.

Following are steps to build the DataSketches source code:

  1. Open RHEL 8.3 machine and setup Vertica 12.0.1 on it. Follow the Setup instructions in the Vertica documentation.

  2. Install CMake 3.14+ and C++ compiler on Vertica 12.0.1 machine.

  3. Download Vertica DataSketches source code from GitHub.

  4. Go to the directory that contains the source code of Vertica-DataSketches.

  5. Execute the following commands to build Vertica DataSketches library for Vertica.

    <Base_Loc>/vertica-datasketch-master>mkdir build
    <Base_Loc>//vertica-datasketch-master> cd build
    <Base_Loc>//vertica-datasketch-master/build> cmake ../SOURCE
    <Base_Loc>//vertica-datasketch-master/build>make
  6. The vertica library is now built in the build folder.

Sample Query for Using Theta Functions

Create a table that will be used by theta functions. In the following example, we are creating a table named theta on which we will use theta functions:

Create table test (id varchar, name varchar);
Insert into test values ('a', '1st value');
Insert into test values ('b', '2nd value');
Insert into test values ('c', '3rd value');
Insert into test values ('d', '4th value');
commit;
select theta_sketch_create(id) id, name into theta from test group by name;

Following are sample queries demonstrating how to use each function of the theta library:

  1. theta_sketch_create > select theta_sketch_create(id) id, name into theta from test group by name;

  2. theta_sketch_a_not_b > select theta_sketch_a_not_b(id, id), name from theta;

  3. theta_sketch_get_estimate > select theta_sketch_get_estimate(id), name from theta;

  4. theta_sketch_intersection > select theta_sketch_intersection(id, id), name from theta;

  5. theta_sketch_intersection_agg > select theta_sketch_intersection_agg(id), name from theta group by name;

  6. theta_sketch_union > select theta_sketch_union(id, id), name from theta;

  7. theta_sketch_union_agg > select theta_sketch_union_agg(id), name from theta group by name;

For More Information