Do you need a database or a query engine?

This blog post was authored by Steve Sarsfield.

As we travel through life, we are constantly assessing our choices. Should you eat that salad, or opt for the burger? Should you marry your partner or seek greener pastures elsewhere? All of us do these assessments in both our personal and business lives. However, it may have never occurred to you in your moments of private introspection to consider: Do I need a database or a query engine?

Of course, query engine decisions may not be as impactful as marriage, but like marriage, you can live happily ever after or become committed to a relationship that won’t work in the long run. It’s a confusing landscape because often query engines are sold as databases. Databases appear to be significantly more costly than query engines while being more capable. And vendors tend to blur the line between query engines and databases. Without careful thought, you may find yourself married to the wrong solution with promises to deliver analytics that you cannot meet.

Exploring the Differences

What makes a database? A database is a place where you can load and store data in the most optimal way for my queries. You move data into a database because of the optimizations and because someone is holding your team to a service level agreement on the amount of data stored, how fast questions are answered and how many concurrent users are asking their questions about the data. In a database, you might need ACID compliance; if you ask for a change, you want to be reasonable sure the change happens even if a hardware error or other random error occurs. Vertica is a database, as are other solutions like MySQL, Oracle, Microsoft SQL and many others.

On the other hand, a query engine is a piece of software that you can bring to the data to query it. In this case, you don’t really want to move the data. Maybe the data is too big and will take too long to move. Instead, you’d like to bring a query engine directly to the nodes or cloud that’s holding it and use the resources in place. In querying the data with a query engine, you are less concerned about optimizations and more interested seeing what’s possible. You may want to explore the data that is outside the constraints of service level agreements. Since some other process is probably writing the data, things like ACID compliance are less important. Vertica for SQL on Hadoop is a query engine as are solutions like Apache Drill, Cloudera Impala and even Apache Spark.

Use Case Differences

You need a query engine when you have a lot of data stored and need to bring analytics to it. Companies will frequently store data in Amazon S3 or Hadoop without knowing the value of much of it. They may peel off portions of the data to their database, usually a data warehouse, to perform analytics. You can also point multiple query engines at the data, but the engines won’t necessarily run in isolation, nor will you automatically have workload management that can handle fast running queries and long running queries without bumping into each other.

You need a database when your database needs a new home that can deliver compliance with standards for SQL, ACID compliance and where backup and restore are part of the system. A database provides advanced methods for optimization, so if you need faster analytics. Most importantly, you store data in a database when you’re expecting it to meet service level agreements on analytics. In other words, if you have to run x number of reports in x number of minutes, it’s a database that will get you there.

Technical Differences

Let’s look at some technical inner workings of a database and a query engine.

Technical Capability Database Query Engine
Transactional Isolation A database manages the transactions that it is ask to do and has built-in features that ensure no transactions is forgotten. A database is aware that other transactions may be coming and has a strategy for queuing and managing actions A query engine works in isolation. You may point several query engines at the same data. If you point too many queries at the same data, some of them may fail.
Data Optimization A database prefers to store structured data to know exactly where the data is and how to access it. By storing structured data, it can better optimize data access to boost performance. It can also compress data more efficiently to lower the storage footprint. Query engines usually use something like HIVE to define structure around the data. Because data of dissimilar types are often mixed together, chance of optimization is less.
Unstructured Data Query Most modern databases have ways to handle unstructured data, but it’s mostly about adding structure to the data blobs. Schema on read is a popular feature in modern databases, for example, allowing it to read in poly/semi-structured data like JSON. Most query engines were design to sit on top of Hadoop and perform analytics on unstructured data.
Query Volume and workload management A database is designed to handle many users concurrently asking many questions of the data. Databases have workload management built in. Resource pools can be set up so that the queries that need to be fast run in one pool, while less critical queries can be allocated to pools with fewer resources. For query engines, workload management is often done at the cluster level. Although technologies like Ambari, YARN and Mesos exist, they are still immature and may be difficult to implement to manage workloads.
Data Assurance DB Responsibility User Responsibility
Concurrency Intrinsic Limited
ACID Yes No
Performance Optimized User Dependent

Clarifying Your Use Case

Database technologies have been around for decades, while the use of query engines is relatively new. Before you embark on your next analytics project, make a careful assessment about which one you need. In the case of Vertica, we recognize that there are times for a query engine and times for database. That’s why we have Vertica Enterprise Edition, a product that acts like a database because it is one. It can natively stores data with all the assurances of database. On the other hand, with support for querying data like ORC, Parquet and S3 data in-place, Vertica Enterprise has the capability to act as either a query engine or a database. For a pure query engine, we offer Vertica for SQL on Apache Hadoop. Our query engine uses some of the same code base of Vertica to answer SQL queries, but runs completely installed in the Hadoop cluster as a query engine.