Doing In-Database Machine Learning 1 – Why Would You Do That?

Posted April 22, 2019 by Waqas Dhillon, Product Manager - Machine Learning, Vertica

Co-authored with Paige Roberts.

Many modern scalable analytical databases, like Vertica, let you do machine learning from end to end, right in the database, rather than first moving and transforming the data into something like a Spark dataframe or a Python data structure. Whenever people hear about this capability, they have two questions. The first is, “Why would I want to do machine learning inside a database?” The second, once they’ve seen the advantages, is “How do I do machine learning inside a database?” This blog series focuses on the “how to do it” aspect, but for this introductory post, let’s take a quick look at the “why.”



Some of the advantages of in-database machine learning will benefit you no matter what analytical database you choose. Nearly every modern, large-scale database has seen this customer need and addressed it in some way. Vertica has focused hard in the last few years on improving your in-database machine learning experience, so it has some unique advantages beyond those of other databases. I’ll separate out the advantages you gain by doing ML inside any good MPP OLAP database, and then highlight the advantages Vertica specifically brings to the table.

How Does In-Database Machine Learning Benefit You?

Generally, the advantages fall into four categories:
  • Infrastructure
  • Speed and Scalability
  • Concurrency
  • Ease of Use

For line-of-business people, these advantages mean you get measurable results from machine learning projects faster. Easy prototyping of predictive models greatly shortens the gap between ideation and practical application. New ideas can be implemented and put into action, and your business gets incremental ROI quickly and continuously, with no waiting for months to see any return.

For data scientists and data analysts, the advantages are less overhead, less time, and greater ease of use. Most of the work is done through a basic SQL interface that you are likely already familiar with. Managing models is easier, as are calling, training, testing, and the other parts of your workflow. In addition, you save the CPU and I/O hit and the time overhead of moving data from one format to another before you can use it. The biggest advantage comes at the end of the process: moving a model from development to production is far simpler when you develop it in the same database that will use it in production.

For database administrators and architects, there is a wide variety of advantages. Security and simplicity of architecture are possibly the biggest benefits. Databases, especially Vertica, have intense levels of built-in security: compaction, encryption, role-based access, etc. In addition, because so much work can be done in a single location, data management architecture can be simplified. Fewer tools need to interface with each other to make things work smoothly.

And one more thing: because databases have built-in resource isolation, concurrency levels can be far higher than with other machine learning platforms. This benefits everyone in every role by allowing far more people to put the database to work. Teams can work simultaneously on different problems without bumping into each other and bogging down performance.

Let’s drill down just a little into each of the general advantage categories I mentioned above.

Infrastructure

Advantages you’ll get from doing ML inside any scalable analytical database:
  • Reduction in hardware and software management requirements
  • Built-in security
  • Production deployment ready – faster time to value

Advantages Vertica brings to the table:
  • Compression, encryption, late materialization – 1/3 the hardware, faster data prep, even more secure
  • Flexibility
    • Enterprise mode – on premises or any Cloud
    • Eon Mode with elastic compute capability on Cloud
  • Open Integration with other software needed to complete solutions – ETL, data transportation like Kafka, data visualization – both open source and proprietary
  • Data Type agnosticism

Speed and Scalability

Advantages you’ll get from doing ML inside any scalable analytical database:
  • MPP scale-out architecture
  • No data movement across systems
  • No down-sampling – increased accuracy
  • Fast data preparation
  • Fast model iteration

Advantages Vertica brings to the table:
  • In-memory processing and auto disk spill – even faster data processing and response, scaling beyond the amount of memory available
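
To make the "no data movement" point a little more concrete, here is a minimal sketch in Python. It is not Vertica code: it assumes SQLite (from Python's standard library) as a stand-in database, and the `readings` table, its `x` and `y` columns, and the `fit_in_database` helper are hypothetical names made up for illustration. The idea is that the database computes the sums needed for a simple least-squares line over every row, so only five numbers leave the database instead of the whole table, and there is no need to down-sample.

```python
import sqlite3


def fit_in_database(conn):
    """Fit y = slope * x + intercept from SQL aggregates only.

    Assumes a hypothetical table readings(x, y). The rows never leave
    the database; only the five aggregate values are fetched.
    """
    n, sx, sy, sxy, sxx = conn.execute(
        """
        SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x)
        FROM readings
        """
    ).fetchone()
    denominator = n * sxx - sx * sx
    if denominator == 0:
        raise ValueError("x has no variance; a line cannot be fitted")
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / denominator
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Stand-in database with synthetic data that follows y = 2x + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(1000)],
)

slope, intercept = fit_in_database(conn)
print(slope, intercept)
```

An MPP database would typically compute the same kind of aggregates in parallel across its nodes; the sketch only shows the shape of the idea, not how any particular database implements its ML functions.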

Concurrency

Advantages you’ll get from doing ML inside any scalable analytical database:
  • Natural segmentation of user environments and privileges
  • Resource isolation for multiple user sessions

Advantages Vertica brings to the table:
  • Configurable memory usage for machine learning functions

Ease of Use

Advantages you’ll get from doing ML inside any scalable analytical database:
  • Manage and deploy machine learning models using simple SQL calls
  • Integrate ML functions with other tools via the same SQL interface

Advantages Vertica brings to the table:
  • Ability to extend the functionality with custom Python, R, C++ or Java functions and algorithms
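
As a rough illustration of what "manage and deploy models using simple SQL calls" can look like, here is a small Python sketch. Again, this is not Vertica's API: SQLite stands in for the database, and the `models` and `new_readings` tables, the `usage_v1` model name, and the `score` helper are all hypothetical. The point is only that when a model lives in the database next to the data, scoring new rows is one SQL statement that any SQL-speaking tool can issue.

```python
import sqlite3

# Stand-in database: one table holds model parameters, one holds new data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE models (name TEXT PRIMARY KEY, slope REAL, intercept REAL);
    CREATE TABLE new_readings (id INTEGER PRIMARY KEY, x REAL);
    INSERT INTO models VALUES ('usage_v1', 2.0, 1.0);
    INSERT INTO new_readings (id, x) VALUES (1, 0.0), (2, 1.5), (3, 10.0);
    """
)


def score(conn, model_name):
    """Score every row of new_readings with the named linear model.

    The prediction is computed inside the database; only the
    (id, prediction) pairs are returned to the caller.
    """
    return conn.execute(
        """
        SELECT r.id, m.slope * r.x + m.intercept AS prediction
        FROM new_readings AS r
        CROSS JOIN models AS m
        WHERE m.name = ?
        ORDER BY r.id
        """,
        (model_name,),
    ).fetchall()


predictions = score(conn, "usage_v1")
print(predictions)
```

Swapping in a newer model in this sketch is just a matter of inserting another row into `models` and passing a different name, which hints at why model management gets simpler when everything sits behind one SQL interface.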

How Do You Get Those Advantages?

As you can see, once the numerous advantages are laid out on the table, it’s pretty clear that this is something you would want. The rest of this blog series will focus on how to make that happen.

We’ll use a data science workflow as a framework, so each post should give you an idea of how one step in that workflow is done inside a database. Most of it will be applicable to any database, but some of it will be Vertica-specific because that’s where our expertise lies. Here’s the outline:

  1. Why would you do that?
  2. Data exploration and data preparation for machine learning
  3. Data preparation and time-series analytics
  4. Model training and prediction
  5. Model evaluation
  6. Machine learning model management in Vertica
  7. Deploying models into production

Now that you have an introduction to the benefits of in-database machine learning, we look forward to showing you how to get those benefits for yourself.



In the meantime, if machine learning is your area of interest, here are some other things you might be interested in:

  • Machine Learning page on Vertica.com
  • Vertica’s In-Database Random Forest
  • Open Source Vertica-Python Client
  • IOT Smart Metering
  • Vertica ML-Python Library
  • How to Code Vertica UDx
  • Make Data Analysis Easier with Dimensionality Reduction