Archive for January, 2010

Five Tools for High Velocity Analytics

In this post we take a look at the tools required to achieve “high velocity” analytics.  What are the technologies that are important for high velocity analytics and the defining characteristics of these technologies?

  • Low Latency Data Source – Starting at the front end, the gating requirement for embedding analytics in your business is a real time data source.
  • Pipelined [ET]L – Whether data is extracted or transformed, it has to be loaded in a pipeline as data arrives, not in batches at the end of the day.
  • Complex Event Processing – The only way to respond to real time events is on the wire.  CEP systems look at every record before it is stored and can respond to events as they happen.
  • Real Time Analytic Database – Once data makes it to a storage system you can analyze it in context with your historical data. Concurrent load and query is the critical backbone of high velocity analytics.
  • Flexible Business Intelligence – Front end tools are designed for business users and they have to be as flexible as an analyst is creative.

We assume you’re reading this post because you have a lot of incoming data or you are expecting it. You will need to have tools in place to get this data into your system as quickly as possible. If your inbound data consists of log files, you can use tools like Scribe to capture these logs in real time from your web servers. This tool was developed by the team at Facebook to centralize all of their log data and is now free and open source. There’s a great introductory article on the High Scalability blog.

If your key business data comes from an online transaction processing (OLTP) system, you first need to make sure you have a fast OLTP system handling inbound transactions. This can be anything from a general purpose database, perhaps sharded for scalability, to an optimized OLTP-specific DBMS. As with log files, the key attribute you need to identify is transaction latency with large volumes of data; just having a lot of data is not sufficient. If you can handle a million users but it takes an hour to process their orders, then that hour is going to be your bottleneck. The same is true when you need to modify your pricing or ad campaigns or to identify up-sell opportunities. That hour of lag is going to be the bottleneck to building analytics into your business and increasing your profitability. The time between a user executing a transaction, such as purchasing an item or changing their subscription, and when you can act on that transaction needs to be minutes if not seconds.

To achieve this low latency interface you can use a change data capture tool or an OLTP solution with a direct extract/load (EL) to an analytic database, or you can structure your application to log the result of the transaction directly to your analytic database. You can also use tools such as Scribe to log this transaction just as you would any other application log. You may also have a custom data feed, such as a financial tick feed from Thomson Reuters, Bloomberg B-PIPE or QuantHouse QuantFEED, or a feed from your operational network provider.
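As a rough illustration of loading "as data arrives, not in batches at the end of the day," the sketch below groups an incoming record stream into small batches that are flushed when they reach a size limit or have waited too long. The function name `micro_batches` and its parameters are hypothetical, and the loader you hand each batch to (a bulk insert or COPY-style call) is left out because it depends on your database.

```python
import time
from typing import Callable, Iterable, Iterator, List


def micro_batches(
    records: Iterable,
    max_size: int = 100,
    max_wait_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List]:
    """Yield small batches from a record stream (hypothetical helper).

    A batch is flushed when it holds max_size records, or when max_wait_s
    has passed since its first record. The age check only runs when a new
    record arrives, so a stalled stream will not flush on its own.
    """
    batch: List = []
    deadline = 0.0
    for record in records:
        if not batch:
            # First record of a new batch starts the wait timer.
            deadline = clock() + max_wait_s
        batch.append(record)
        if len(batch) >= max_size or clock() >= deadline:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left when the stream ends.
        yield batch
```

Each yielded batch would be passed to whatever bulk-load call your analytic database provides. Keeping `max_size` and `max_wait_s` small trades some load efficiency for lower latency.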

In between the transaction data source and your analytic database you may need to respond to events as they occur “on the wire.” This is where you employ a complex event processing engine to handle on-the-wire detection, automate common responses and flag important events. CEP systems typically operate by running data through a pre-defined query that accumulates and modifies state, triggering behaviors when a certain threshold is met. For example, a CEP system can be used to keep a count of errors for various data sources and raise an alert if any of them exceeds your maximum SLA threshold.
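The error-count example can be sketched in a few lines of Python. This is a toy illustration of the accumulate-state-and-trigger pattern, not the API of any particular CEP product; the class name `ErrorRateMonitor`, the event fields `source` and `status`, and the `on_alert` callback are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


class ErrorRateMonitor:
    """Counts errors per data source and alerts once a limit is exceeded."""

    def __init__(self, max_errors: int, on_alert: Callable[[str, int], None]):
        self.max_errors = max_errors          # assumed SLA threshold
        self.on_alert = on_alert              # hypothetical alert callback
        self.error_counts: Dict[str, int] = defaultdict(int)
        self.alerted: Set[str] = set()

    def process(self, event: dict) -> None:
        """Inspect one event as it passes through, before it is stored."""
        if event.get("status") != "error":
            return
        source = event["source"]
        self.error_counts[source] += 1
        over_limit = self.error_counts[source] > self.max_errors
        if over_limit and source not in self.alerted:
            # Fire once per source so a noisy source does not flood alerts.
            self.alerted.add(source)
            self.on_alert(source, self.error_counts[source])
```

A real engine would usually count errors over a sliding time window instead of forever, and would express the rule as a declarative query. The shape is the same: per-key state, updated on every record, with a trigger on a threshold.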

The hub for your business processes is the analytic database. This database is different from the general purpose database you use for accounting, and it may even be different from the enterprise data warehouse where you log and report across your business activities. Your analytic database must be able to accept incoming data 24×7, give you access to that data with low latency (within minutes if not seconds), and scale out as your data volumes grow.

The analytic database collects real time data as it is streamed in and stores it for some defined period of time. Since the amount of historical data to keep is defined by business requirements, the analytic database must scale out to handle as much historical data as necessary. Similarly, as requirements for faster analysis on more data grow, the database must scale to handle more users and faster queries.

Since data is flowing in non-stop, the analytic database must have robust features to support trickle load, concurrent load and query, and non-stop high availability. If you have to pause queries in order to load, or if any parts of the system need to be restarted in order to load data after a failure, you risk having downtime in a critical component at peak load. These scenarios are easy to test by simulating high rates of loads and queries while pulling the plug (figuratively or literally) on random components.
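A test of this kind can start from a very small harness. The sketch below runs a load function and a query function at the same time on separate threads and records per-call latency and any failures. The name `run_concurrent` and its arguments are hypothetical; `load_fn` and `query_fn` stand in for whatever calls your own database client makes.

```python
import threading
import time
from typing import Callable, Dict


def run_concurrent(
    load_fn: Callable[[int], object],
    query_fn: Callable[[int], object],
    n_loads: int,
    n_queries: int,
) -> Dict[str, list]:
    """Run loads and queries side by side; collect latencies and errors."""
    results: Dict[str, list] = {"load": [], "query": [], "errors": []}
    lock = threading.Lock()

    def worker(kind: str, fn: Callable[[int], object], count: int) -> None:
        for i in range(count):
            start = time.perf_counter()
            try:
                fn(i)
            except Exception as exc:
                # Record the failure and keep going, as a client would retry.
                with lock:
                    results["errors"].append((kind, i, repr(exc)))
            else:
                with lock:
                    results[kind].append(time.perf_counter() - start)

    threads = [
        threading.Thread(target=worker, args=("load", load_fn, n_loads)),
        threading.Thread(target=worker, args=("query", query_fn, n_queries)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Comparing the query latencies with and without the load thread running, and again while a node is being taken down, gives a first indication of whether loads block queries or failures interrupt loading.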

Finally, as business users get access to more data in real time, the type of analysis changes. With the flexibility to iteratively explore real time data, user demand for additional information and different views grows. The front end tools must handle dynamic visualizations to accommodate these requirements. Classic BI tools such as MicroStrategy, Cognos and Business Objects, as well as new cutting edge tools from Tableau, JasperSoft and Pentaho, are modernizing the front end for real time BI. The key features to look for are flexibility of schema and simplicity of abstraction. When adding new data to the system, it should be easy (if not automatic) to incorporate it into the tool, and the mapping between what the business user clicks on and what the database schema defines should be as simple as possible. Keep in mind the development time between adding a data source to the analytic DBMS and giving your business users access to that data.

The key takeaway of this tutorial is that the latency between your transaction processing systems, data capture tools, complex event processing, real time analytic DBMS and your business analytics tool will define the speed at which your business can react to changes and ultimately your flexibility to adapt. The lower the latency in your toolset, the higher the velocity with which you can operate and the more effectively you will be able to compete. There is no doubt today that the winning players in every market are the most adaptable and flexible companies.
