Welcome to the Flex Zone – Dynamic Schemas

Posted January 15, 2014 by Ben Vandiver

Enter the Flex Zone

Much of the data analytics we perform occurs with data whose schema changes over time. Many organizations have developers who define the record types that are collected and analysts who comb through these records looking for important patterns. In today’s fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analyst’s consumption. Typically, relational databases have been a significant impediment: schema changes are costly and the database data ingest logic is brittle. Instead of record changes being at the need of the developer, they involve careful coordination with database administrators against a backdrop of tweaking production systems. The Flexible table feature in Vertica 7.0 “Crane” seeks to address this challenge by allowing the table schema to vary from record to record. No database administrator involvement is necessary with day-to-day schema changes. The DBA need not get involved, unless performance tuning is necessary.

A Web-driven database

For our example, we will consider a simple web service that serves a form and stores the results. While simplistic, this isn’t far off from serious web services, mobile games, smart meters, etc. There are two source files for this example, a web server flex-server.py and an HTML file form.html (store them in the same directory).

Getting Started

First, we need to create a table to store the form data. With Vertica Flex tables, we don’t even need to know what the structure of the data is upfront:
CREATE FLEX TABLE webreqs();The web server, implemented in Python, is quite simple: it serves a form and stores the responses as JSON records.
class FlexServer(BaseHTTPServer.BaseHTTPRequestHandler):
def do_GET(self):
url = urlparse.urlparse(self.path)
# form submission?
if url.path == '/data':
v.send(json.dumps({"time":str(datetime.datetime.now()),
;"src":self.client_address[0],
#"header":dict(self.headers.items()), # uncomment to get header info
"data":urlparse.parse_qs(url.query)})+'n')
self.send_response(302)
self.send_header('Location',urlparse.urljoin(self.path,'/'))
else:
self.send_response(200)
self.send_header('Content-type','text/html')
self.end_headers()
with open("form.html","r") as src:
self.wfile.write(src.read())
returnv = SendToVertica()
from BaseHTTPServer import HTTPServer
server = HTTPServer(('localhost',8080),FlexServer)
server.serve_forever()
For brevity, we’ve left off the SendToVertica class, which streams the data into Vertica – it’s also quite simple and the full text is in flex-server.py.

The HTML file contains the form – it’s broken out into a separate file to make editing it easier. The initial form merely asks for the submitter’s name.

Enter Info

Name:

Start the server (python flex-server.py) and point your browser at localhost:8080 – you should see something like:

flex-dyn-schema-blog-scrnshot-300x129

Submit the form a couple of times. Then look at the server output – it might look something like this:
{"src": "127.0.0.1", "data": {"name": ["Ben"]}, "time": "2013-12-24 01:40:35.475723"}
{"src": "127.0.0.1", "data": {"name": ["Amy"]}, "time": "2013-12-24 01:40:39.293608"}
Vertica makes it easy to load this data into Flex tables – no table schema definition necessary and there’s already a parser for JSON data. The flex-server is running the following statement in order to load the data:
copy webreqs from stdin parser fjsonparser(flatten_arrays=true);The flatten_arrays parameter may be new to you – we need it here because HTML form attributes can be many valued, so there’s an extra level of array we need to flatten out of the input.

Now we can write a query against the data as if the JSON features were columns in the table:
select "data.name.0" from webreqs;
data.name.0
-------------
Amy
Ben
(2 rows)select compute_flextable_keys_and_build_view('webreqs');
compute_flextable_keys_and_build_view
----------------------------------------------------------------------------------------------------
Please see public.webreqs_keys for updated keys
The view public.webreqs_view is ready for querying
(1 row)select * from webreqs_view;
data.name.0 |     src   | time
-------------------+-------------------------------+----------------------------------------------------------------------------------------------
Ben | 127.0.0.1 | 2013-12-24 01:40:35.475723
Amy | 127.0.0.1 | 2013-12-24 01:40:39.293608
(2 rows)

It’s Flex Time

But now for the main event: so far, loading the data was easy – but what if the developer decides the form will also have an additional vote field. What needs to change? Well, let’s change the HTML file to include the new form field:

Enter Info

Name:

Vote:

Submit the form a couple more times and look at the output – now there’s an additional key in the JSON!
{"src": "127.0.0.1", "data": {"vote": ["yes"], "name": ["Jen"]}, "time": "2013-12-24 01:40:57.785534"}
{"src": "127.0.0.1", "data": {"vote": ["no"], "name": ["Ian"]}, "time": "2013-12-24 01:41:01.867649"}
What database changes are necessary? None! The new fields are immediately available to query:
select sum(case when "data.vote.0" = 'yes' then 1 else 0 end) as yeses, count("data.vote.0") as total from webreqs;
yeses | total
-------+-------
1 | 2
(1 row)
But what about the entries submitted before the “vote” field existed?
select "data.name.0","data.vote.0" from webreqs;
data.name.0 | data.vote.0
-------------+-------------
Jen | yes
Ben |
Amy |
Ian | no
(4 rows)
For fields that don’t exist in a particular record, you get SQL null. SQL operators do the “right thing” when they encounter nulls, such as count and sum above.

If you believe the schema has changed, you can update your keys and views to reflect the new reality by re-running the schematization tools:
select compute_flextable_keys_and_build_view('webreqs');
compute_flextable_keys_and_build_view
----------------------------------------------------------------------------------------------------
Please see public.webreqs_keys for updated keys
The view public.webreqs_view is ready for querying
(1 row)select * from webreqs_view;
data.name.0 |    src    | time                       | data.vote.0
--------------------+---------------------------------+-----------------------------------------------------------------------------------+-------------------------------------------
Jen | 127.0.0.1 | 2013-12-24 01:40:57.785534 | yes
Ben | 127.0.0.1 | 2013-12-24 01:40:35.475723 |
Amy | 127.0.0.1 | 2013-12-24 01:40:39.293608 |
Ian | 127.0.0.1 | 2013-12-24 01:41:01.867649 | no
(4 rows)

Note that you are not required to recompute the view to access the new fields, only to access the new fields through the view.

Try it yourself – edit the HTML form to have new fields and see how easy it is to query the resulting data in Vertica! You can also uncomment headers line in the server to get access to the header information (think cookies, sessionization, etc).

Tuning for Performance

While flexible tables are great for exploration, we recommend that you do some tuning before moving into production. Luckily, tuning is easy and still preserves the flexible nature of the tables. The central mechanism is “promotion” of one or more columns to real table columns. To promote, select commonly used columns, particularly those used in joins or predicates and add them to your flex table. You can do this by using ALTER TABLE … ADD COLUMN … DEFAULT. For example, the time and vote columns might be of particular interest:
ALTER TABLE webreqs ADD COLUMN "time" TIMESTAMP NOT NULL DEFAULT time::timestamp;
ALTER TABLE webreqs ADD COLUMN "data.vote.0" VARCHAR(10) DEFAULT "data.vote.0"::varchar;
As part of adding the column, Vertica populates the column with any same-name data. Future loads will also load data into the newly added columns. Note that we have selected specific data types for the new columns. New data that doesn’t match the declared data type will result in nulls getting stored (or errors if the column is declared NOT NULL).

Don’t want to run ADD COLUMN yourself? No problem, because there’s a helper function to do this step for you:
select materialize_flextable_columns('webreqs');The following columns were added to the table public.webreqs:
data.name.0
src
time
data.vote.0
For more details, run the following query:
SELECT * FROM v_catalog.materialize_flextable_columns_results WHERE table_schema = 'public' and table_name = 'webreqs';
There’s nothing special about these newly added columns – you could have done the same thing upfront:
CREATE FLEX TABLE webreqs (
"time" TIMESTAMP NOT NULL,
"data.vote.0" VARCHAR(10)
);
So if you know a little about your data and how it will be used, you can jump directly to the optimized approach.

Vertica’s Database Designer exists to optimize the physical design (storage layout) and it works great on promoted columns in flexible tables. A typical usage model is explore, promote, and optimize. A key thing to remember is that changes to the schema do not require promotion and optimization. If you add fields that dramatically change your query workload, then some additional performance optimization may be necessary.

Flexible tables from Vertica should dramatically simplify the process of adapting your schema to the needs of your changing business.