Vertica

Archive for the ‘in-database analytics’ Category

Can Vertica Climb a Tree?


The answer is YES if it is the right kind of tree. Here "tree" refers to a common data structure consisting of parent-child hierarchical relationships, such as an org chart. Traditionally this kind of hierarchical structure can be modeled and stored in tables, but it is usually not simple to navigate and use in a relational database (RDBMS). Some other RDBMSs (e.g. Oracle) have a built-in CONNECT BY clause that can be used to find the level of a given node and navigate the tree. However, if you take a close look at its syntax, you will realize that it is quite complicated and not at all easy to understand or use.

For a complex hierarchical tree with 10+ levels and a large number of nodes, any meaningful business question that requires joins to the fact tables, plus aggregation and filtering on multiple levels, will result in SQL statements that look extremely unwieldy and can perform poorly. The reason is that this kind of procedural logic may internally scan the same tree multiple times, wasting precious machine resources. This approach also flies in the face of some basic SQL principles: simple, intuitive and declarative. Another major issue is integration with third-party BI reporting tools, which often do not recognize vendor-specific variants such as CONNECT BY.

Other implementations include ANSI SQL's recursive syntax using WITH and UNION ALL, special graph-based algorithms, and the enumerated-path technique. These solutions tend to follow an algorithmic approach and, as such, they can be long on theory but short on practical application.
Since SQL derives its tremendous power and popularity from its declarative nature, specifying clearly WHAT you want to get out of an RDBMS but not HOW to get it, a fair question to ask is: is there a simple and intuitive approach to modeling and navigating this kind of hierarchical (recursive) data structure in an RDBMS? Thankfully the answer is yes.

In the following example, I will discuss a design that focuses on "flattening" out this kind of hierarchical parent-child relationship in a special way. The output is a wide, sparsely populated table with extra columns that hold the node IDs at the various levels of the tree; the number of these extra columns depends on the depth of the tree. For simplicity, I will use one table with one hierarchy as an example. The same design principles can be applied to tables with multiple hierarchies embedded in them. The following is a detailed outline of how this can be done in a program/script (a SQL sketch follows the outline):

  1. Capture the (parent, child) pairs in a table (table_source).
  2. Identify the root node by following specific business rules and store this info in a new temp_table_1.
    Example: parent_id=id.
  3. Next find the 1st level of nodes and store them in a temp_table_2. Join condition:
    temp_table_1.id=table_source.parent_id.
  4. Continue down the tree one level at a time: to build temp_table_(N+1), store the next level of nodes
    using the join condition table_source.parent_id = temp_table_N.id.
  5. Stop at the max level (Mlevel), the level at which no node has any children (the leaf nodes).
  6. Create a flattened table, table_flat, by adding in total (Mlevel+1) columns named LEVEL,
    LEVEL_1_ID, …, LEVEL_Mlevel_ID.
  7. A SQL insert statement can be generated to join all these temp tables together to load
    into the final flat table: table_flat.

  8. When there are multiple hierarchies in one table, the above procedures can be repeated for each
    hierarchy to arrive at a flattened table in the end.
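To make the outline concrete, here is a minimal SQL sketch of steps 2 through 7 for a three-level tree, assuming the source table has columns parent_id and node_id and that the root satisfies the business rule parent_id = node_id. A deeper tree simply repeats the per-level query, and the final load needs one UNION ALL branch per level (only the deepest branch is shown here):

-- Step 2: root node(s), identified by the business rule parent_id = node_id
SELECT node_id INTO temp_table_1
FROM table_source WHERE parent_id = node_id;

-- Steps 3-4: walk down one level at a time (repeat until a level comes back empty)
SELECT s.parent_id, s.node_id INTO temp_table_2
FROM table_source s JOIN temp_table_1 p ON s.parent_id = p.node_id
WHERE s.node_id <> s.parent_id;   -- exclude the self-referencing root row

SELECT s.parent_id, s.node_id INTO temp_table_3
FROM table_source s JOIN temp_table_2 p ON s.parent_id = p.node_id;

-- Steps 6-7: load the flattened table (LEVEL=3 branch shown; add one branch per level)
SELECT 3 AS level,
       t1.node_id AS level_1_id,
       t2.node_id AS level_2_id,
       t3.node_id AS level_3_id,
       t3.node_id AS node_id
INTO table_flat
FROM temp_table_3 t3
JOIN temp_table_2 t2 ON t3.parent_id = t2.node_id
JOIN temp_table_1 t1 ON t2.parent_id = t1.node_id;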

 

This design is general and is not specific to any particular RDBMS architecture, row, column or hybrid. However, the physical implementation of this design naturally favors columnar databases such as Vertica. Why? The flattened table is usually wide, and its many extra columns tend to be sparsely populated, so they can be stored very efficiently in compressed format in Vertica. Another advantage is that when only a small subset of these columns appears in the select clause of a SQL statement, Vertica's columnar nature means the remaining columns (no matter how many there are) introduce no performance overhead. This is as close to a "free lunch" as you can get in an RDBMS.

Let’s consider the following simple hierarchical tree structure:

Vertica Tree diagram

There are four levels and the root node has an ID of 1. Each node is assumed to have one and only one parent (except for the root node) and each parent node may have zero to many child nodes. The above structure can be loaded into a table (hier_tab) having two columns: Parent_ID and Node_ID, which represent all the (parent, child) pairs in the above hierarchical tree:

Chart 1

It is possible to develop a script to “flatten” out this table by starting from the root node, going down the tree recursively one level at a time and stopping when there is no data left (i.e. reaching the max level or depth of the tree). The final output is a new table (hier_tab_flat):

Chart 2

What's so special about this "flattened" table? First, it has the same key (Node_ID) as the original table. Second, it has several extra columns named LEVEL_N_ID; the number of these columns equals the maximum number of levels (4 in this case), plus one extra LEVEL column. Third, for each node there is a row that includes the IDs of all of its parents up to the root (LEVEL=1) as well as the node itself, representing a path from that node all the way up to the root level. The power of this new "flattened" table is that it encodes all the hierarchical tree information contained in the original table. Questions such as finding the level of a node, or finding all the nodes below a given node, can be translated into relatively simple SQL statements by applying predicates to the proper columns.

Example 1: Find all the nodes that are at LEVEL=3.

Select Node_ID From hier_tab_flat Where LEVEL=3;

Example 2: Find all the nodes that are below node=88063633.

This requires two logical steps (which can be handled in a front-end application to generate the proper SQL).

Step 2.1. Find the LEVEL of node= 88063633 (which is 3).

Select LEVEL From hier_tab_flat Where Node_ID=88063633;

Step 2.2. Apply a predicate to the column LEVEL_3_ID:

Select Node_ID From hier_tab_flat Where LEVEL_3_ID=88063633;

Complex business conditions such as finding all the nodes belonging to node=214231509 but excluding the nodes that are headed by node=88063633 can now be translated into the following SQL:

Select Node_ID
From hier_tab_flat
Where LEVEL_2_ID=214231509
And LEVEL_3_ID <> 88063633;

By invoking the script that flattens one hierarchy repeatedly, you can also flatten a table with multiple hierarchies using the same design. With this flattened table in your Vertica tool box, you can climb up and down any hierarchical tree using nothing but SQL.

Po Hong is a senior pre-sales engineer in HP Vertica's Corporate Systems Engineering (CSE) group, with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.

Comparing Pattern Mining on a Billion Records with HP Vertica and Hadoop

Pattern mining can help analysts discover hidden structures in data. Pattern mining has many applications—from retail and marketing to security management. For example, from a supermarket data set, you may be able to predict whether customers who buy Lay’s potato chips are likely to buy a certain brand of beer. Similarly, from network log data, you may determine groups of Web sites that are visited together or perform event analysis for security enforcement. In this blog post, we will show you how the HP Vertica Analytics Platform can efficiently find frequent patterns in very large data sets.

A pattern mining algorithm

Frequent patterns are items that occur often in a data set. After finding frequent patterns, analysts can use methods such as association rule mining to discover rules in the data. A classic example of an association (from Wikipedia) is that customers who buy diapers also tend to buy beer from a supermarket. While there are many frequent pattern mining algorithms in literature, we will use the FP-growth algorithm. FP-growth is considered efficient as it performs fewer database scans and does not require candidate set generation [1].

Instead of describing FP-growth in detail, we list the main steps from a practitioner’s perspective. We need to perform the following steps to obtain frequent patterns using FP-growth:

  1. Create transactions of items
  2. Count occurrence of item sets
  3. Sort item sets according to their occurrence
  4. Remove infrequent items
  5. Scan DB and build FP-tree
  6. Recursively grow frequent item sets

Let’s use an example to illustrate these steps. We will assume that our data set is a Web proxy log from a corporate network that, among other things, has IP address and Web sites visited as fields. Our goal is to find patterns such as Web sites that are visited together. After step 1, we obtain a set of transaction items shown in Table 1. Each transaction lists the Web sites visited from each IP address. After steps 2 and 3, we get Table 2 that has items sorted by their frequencies. Assuming that an item is considered frequent only if it occurs more than three times, then in step 4 we will discard cnn and yahoo from the table. In step 5 we use the pruned table to create an FP-tree (Figure 1). Finally, in step 6 we grow frequent patterns. The final output is shown in Table 3. The output, for example, shows that many users tend to visit both the Web sites of HP and Amazon.

Table 1: Sites Visited

Table 2: Sorted Items

Figure 1: FP-tree

Table 3: Final output of frequent patterns

Parallel pattern mining on the HP Vertica Analytics Platform

Despite the efficiency of the FP-growth algorithm, a single-threaded, sequential implementation can take a very long time on large data sets. Fortunately, we can rewrite the algorithm using SQL and HP Vertica user-defined functions (UDFs), and let the HP Vertica Analytics Platform parallelize the implementation. The main issue to resolve is how to map the algorithm to SQL statements and then remove dependencies between UDFs so that they can run independently and in parallel. Below are the statements that we used in the HP Vertica Analytics Platform. Let's assume that we are still working with the Web proxy log example introduced earlier.

  1. Create transaction of items
    • SELECT DISTINCT srcIP, hostname INTO uniqueSipHn FROM networkLog;
  2. Count frequency of occurrence of each host name
    • SELECT hostname, COUNT(*) AS frequency INTO hnCnt FROM uniqueSipHn GROUP BY hostname;
  3. List host names visited by each IP and also the frequency of each host name.
    • SELECT a.srcIP, b.hostName, b.frequency into sipHnCnt FROM uniqueSipHn a INNER JOIN hnCnt b ON a.hostName=b.hostName;
  4. Build conditional transactions. Assume an item is frequent if it occurs more than 20,000 times.
    • SELECT t1.hostName, t1.srcIP, t2.hostName AS condItem INTO condTr FROM sipHnCnt t1 JOIN sipHnCnt t2 ON (t1.srcIP=t2.srcIP) AND (t1.frequency>20000 AND t2.frequency>20000) AND ((t2.frequency>t1.frequency) OR (t2.frequency=t1.frequency AND t2.hostName>t1.hostName));
  5. Generate patterns in parallel using UDF.
    • SELECT FPGrowth(srcIP, condItem, 20000) OVER(PARTITION BY hostName ORDER BY srcIP) INTO frequentItems FROM condTr;

The real test: a billion records, and, of course, Hadoop

Now that we know how to implement parallel frequent pattern mining in the HP Vertica Analytics Platform, let's see how the implementation performs on a large data set. Our input data is a few days' worth of Web proxy logs. The log file is 330 GB in size and has a billion records, each with 22 fields. For comparison, we use Mahout's implementation of parallel frequent pattern mining (Cloudera Hadoop 2.0 and mahout-0.7). We wrote a MapReduce program to create transactions from the log (step 1 of the algorithm). Our test bed consists of 12 HP ProLiant servers, each with 12 cores, 96 GB RAM, and 128 GB SSD.

Figure 2 depicts our results. On 4 servers, the HP Vertica Analytics Platform can complete the end-to-end pattern mining in fewer than 140 seconds. Hadoop takes 1,250 seconds (20 minutes)—approximately 9x more time than the HP Vertica Analytics Platform. As we increase the number of servers to 12, both the HP Vertica Analytics Platform and Hadoop take less time to complete. However, unlike Hadoop, the HP Vertica Analytics Platform has close to linear scaling for this setup.

Are you searching for patterns in your data set? Want a fast and easy-to-use data analytics platform? Evaluate the HP Vertica Community Edition today.


[1] Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. SIGMOD 2000.

Top 4 Considerations When Evaluating a Data Analytics Platform

From fraud detection to clickstream analytics to simply building better products or delivering a more optimal customer experience, Big Data use cases are abounding with analytics at the core.

With a solid business or use case in place, the next step that organizations typically take is to investigate and evaluate the appropriate set of analytics technology from which to accomplish their analysis, often starting with a data analytics platform. But what are the requirements from which to base your evaluation?

The Winter Corporation, the large-scale data experts, just finalized an in-depth white paper (The HP Vertica Analytics Platform: Large Scale Use and Advanced Analytics) that reflects the results and findings through evaluation, independent research, customer and employee interviews, and documentation review.

Intended for a more technical audience, this white paper focuses on key evaluation criteria that your organization can use as a guide as you conduct your own evaluation.

 

 

Winter Corporation identified these key feature areas as critical for any data analytics platform:

1. Architecture
• Column store architecture
• Shared nothing parallelism
• Cluster size and elasticity
• Smart K-Safety based availability
• Hybrid storage model
• Multiple database isolation modes
• Both bulk load and trickle feed

2. Performance
• Extensive data compression and data encoding
• Read-optimized storage
• Highly parallel operation
• Storage of multiple projections
• Automatic physical database design

3. Generally Useful and Noteworthy Features for Large-Scale Use
• Export-import
• Backup/restore
• Workload analyzer
• Workload management
• Role-based security

4. Extensions for Advanced Analytics
• SQL extensions
• Built-in functions
• User-defined extensions
• Flexibility in accessing and analyzing all data (structured, semistructured, or unstructured)

Finally, once you have evaluated and confirmed that the data analytics platform meets your feature and technology requirements, you want to hear from other organizations that have deployed large-scale analytics initiatives in real-world environments.

The white paper concludes with a write-up on how Zynga, a social game services company with more than 240 million users of its online games, stores the actions of every player in every game — about 6 TB per day of data — in near-real time in the HP Vertica Analytics Platform. No matter where in the world a game event occurs, the data can be retrieved via a report or query from the central HP Vertica database no more than five minutes later.

Vertica Moneyball and ‘R’. The perfect team!

Back in April, Colin's blog post, "Moneyball – not just for baseball anymore," was a good example of how statistics can be used to make better decisions on and off the baseball field. New measures can be created to better understand a player's real contribution to a team. For instance, most baseball fans are familiar with the popular earned run average (ERA) measure for pitchers, but a newer one that is becoming more popular is called WHIP (Walks plus Hits per Inning Pitched).

Here is how Wikipedia describes WHIP: While earned run average (ERA) measures the runs a pitcher gives up, WHIP more directly measures a pitcher’s effectiveness against the batters faced. It is calculated by adding the number of walks and hits allowed and dividing this sum by the number of innings pitched; therefore, the lower a pitcher’s WHIP, the better his performance.   Listed below is the calculation for WHIP.

 WHIP = (Walks + Hits)/ Innings Pitched.

Dashboards can be built to demonstrate these new kinds of measures, or key performance indicators (KPIs), and to show how they can be used across a wider audience to provide more insight on teams and players.

Some of the other measures needed to accurately determine a person’s contribution to the team can only be implemented using a statistical package such as ‘R’.  Typically implementing a statistical package in an organization is not a trivial task for the following reasons:

1.)    Specialized domain expertise required – statistics requires a new skill set to understand and use properly.

2.)    Data access – data must be imported into and exported out of the statistical package.

3.)    Performance – many statistical algorithms are compute intensive.

This article demonstrates how Vertica 6 handles the first two items above; a follow-up article will show how Vertica 6 "Presto" adds some distinct performance capabilities related to its 'R' integration.

While it is true that understanding statistics can be challenging without proper training, having a group who fully understands the algorithms collaborate with the business domain experts ensures a proper implementation. Implementing these new algorithms in the database allows your organization to leverage powerful statistics in its daily business analysis and reduces time to market, because the algorithms can now be treated like any other "standard" database function. The possibility for error is also reduced, because complex Extraction, Transformation and Load (ETL) steps are no longer required to move the data into and out of the statistical package. The entire process is streamlined, so any BI or ETL tool in the organization can leverage the new capability as well, because it now lives in the database.

So let’s put on our favorite baseball cap, in my case a Tiger cap, and take a closer look at how using ‘R’ can enhance our understanding of our favorite baseball teams and players.

As indicated before, "Moneyball" enlightened the baseball world with many new "measures" that are now almost common parlance among baseball fans. The scenario for this example: a team might want to ensure they are paying their pitchers appropriately based on performance, or they might be interested in finding more talented pitchers for their team. Once these pitchers are identified, I want to group them into "like" clusters based on our key performance indicators (KPIs). The two KPIs I have decided to use are the WHIP calculation described above and another one called IPouts, which is simply the number of outs pitched.

Listed below is a simple query showing results for last year’s top pitchers sorted on the new measure called WHIP.
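The query itself is plain SQL; here is a minimal sketch, assuming a Lahman-style pitching table (hypothetically named pitching) with columns playerid, yearid, bb (walks), h (hits), ipouts (outs pitched) and era, and an assumed qualification cut-off of 486 outs (162 innings):

SELECT playerid,
       (bb + h) / (ipouts / 3.0) AS whip,   -- innings pitched = ipouts / 3
       ipouts,
       era
FROM pitching
WHERE yearid = 2011
  AND ipouts >= 486                         -- assumed cut-off for "qualified" pitchers
ORDER BY whip ASC
LIMIT 10;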

You can see very clearly why Justin Verlander was the MVP and Cy Young award winner last year. His WHIP and IPouts were the best, and he was third in ERA. All of the measures provided so far can be implemented with standard SQL. The next thing I want to do is group these pitchers into clusters based on my two measures of WHIP and IPouts. To do this I used the new Vertica integration with the statistical package 'R' to implement a clustering algorithm called KMeans. In my case I want 3 clusters of the 91 pitchers from 2011 that qualified. The cluster assignment column, called Geo.cluster in my output, was provided by the integration of 'R' in Vertica.
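Invoking the R-based clustering looks like any other Vertica analytic function call. Here is a minimal sketch, assuming a hypothetical R transform function named kmeans_cluster (installed as described in the blog post linked at the end of this article) that takes the player id, the two KPIs and the desired number of clusters, and returns one row per player with its cluster assignment:

SELECT kmeans_cluster(playerid, whip, ipouts, 3)   -- hypothetical R UDx; k = 3 clusters
OVER ()
FROM pitcher_whip_2011;                            -- assumed table holding the 91 qualified pitchers

Because the function runs inside the database, the BI or query tool issuing this statement never has to touch 'R' directly.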

You can see that even in the top 10 we have pitchers in all 3 of our clusters. Keep in mind that lower numbers for WHIP and ERA are better, while higher values for IPouts are better. Looking at the list, I now have some better insight on the players, and I can focus on cluster 3 players and possibly some players from cluster 2. A histogram showing WHIP on the X axis for all 91 of our 2011 pitchers can also be produced, and you can include these charts and graphs in your dashboards as well.

Other database competitors can also claim ‘R’ integration, but Vertica’s implementation provides better value to your organization because of its simplicity and performance.  Other vendors take an ‘R’ centric approach, which means your users have to know ‘R’ and use the ‘R’ suite of programming tools.  Vertica’s implementation is a more ‘data’ centric approach that shields the users from having to know and use the ‘R’ language.  Users can continue to use their favorite BI or query tool and now have access to ‘R’ capability.

This article demonstrated how statistics can be used to build new measures to provide more insight on a particular situation.  This kind of analysis can also be applied in your organization to help with detecting fraud etc.

Stay tuned for future posts that will give you more detail on how KMeans and other statistical functions such as PageRank were implemented in Vertica using 'R'. Go Tigers!

For more details on how to implement R in Vertica, please refer to the following blog post: http://www.vertica.com/2012/10/02/how-to-implement-r-in-vertica/

The Right Tool for the Job: Using Hadoop with Vertica for Big Data Analytics

by Mingsheng Hong, Vertica Product Marketing Engineer

I have an entrepreneur friend who used to carry a butter knife around.  He claimed this “almighty” tool was the only one he ever needed!  While the butter knife does serve a wide range of purposes (especially with a stretch of the imagination), in practice it doesn’t always yield optimal results.  For example, as a screwdriver, it may work for common screws, but certainly not a Phillips (unless you push down very hard and hope not to strip the screw).  As a hammer, you may be able to drive finishing nails, but your success and mileage may vary.  As a pry bar, well, I think you get my point!  Clearly one tool isn’t sufficient for all purposes – a good toolbox includes various tools each fulfilling a specific purpose.

When it comes to Big Data Analytics, Hadoop (as a platform) has received an incredible amount of attention. Some highlights include: a scalable architecture based on commodity hardware, flexible programming language support, and a strong open source community committed to its on-going development. However, Hadoop is not without limitations: due to its batch-oriented nature, Hadoop alone cannot be deployed as a real-time analytics solution. Its highly technical, low-level programming interface makes it extremely flexible and friendly to developers, but not optimal for business analysts. And in an enterprise business intelligence environment, Hadoop's limited integration with existing BI tools makes people scratch their heads trying to figure out how to fit it into their environment.

As Hadoop has continued to gain traction in the market and (in my opinion) moved beyond the peak of the hype cycle, it is becoming clear that to maximize its effectiveness, one should leverage Hadoop in conjunction with other business intelligence platforms and tools.  Best practices are emerging regarding the choice of such companions, as well as how to leverage each component in a joint deployment.

Among the various BI platforms and tools, Vertica has proved an excellent choice. Many of its customers have successfully leveraged the joint deployment of Hadoop and Vertica to tackle BI challenges in algorithmic trading, web analytics, and countless other industry verticals.

What makes the joint deployment so effective, and what are the common use cases?

First, both platforms have a lot in common:

  • Purpose-built from scratch for Big Data transformation and analytics
  • Leverage MPP architecture to scale out with commodity hardware, capable of managing TBs through PBs of data
  • Native HA support with low administration overhead

In the Big Data space crowded with existing and emerging solutions, the above architectural elements have been accepted as must-haves for any solution to deliver scalability, cost effectiveness and ease of use.  Both platforms have obtained strong market traction in the last few years, with customer success stories from a wide range of industry verticals.

While agreeing on things can be pleasant, it is the following key differences that make Hadoop and Vertica complement each other when addressing Big Data challenges:

Aspect / Feature: Hadoop vs. Vertica

Interface and extensibility
  • Hadoop: The map-reduce programming interface is designed for developers. The platform is acclaimed for its multi-language support as well as the ready-made analytic library packages supplied by a strong community.
  • Vertica: The interface complies with BI industry standards (SQL, ODBC, JDBC, etc.), which enables both technologists and business analysts to leverage Vertica in their analytic use cases. Vertica's 5.0 analytics SDK enables users to plug custom analytic logic into the platform, with in-process and parallel execution. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.

Tool chain / ecosystem
  • Hadoop: Hadoop and HDFS integrate well with many other open source tools. Integration with existing BI tools is emerging.
  • Vertica: Vertica integrates with BI tools because of its standards-compliant interface. Through Vertica's Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.

Storage management
  • Hadoop: Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for load balancing, but the data segmentation scheme is opaque to end users and cannot be tweaked to optimize for analytic jobs.
  • Vertica: Vertica's columnar compression often achieves a 10:1 compression ratio. A typical Vertica deployment replicates data once for HA, and the two replicas can have different physical layouts in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.

Runtime optimization
  • Hadoop: Because HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be re-segmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
  • Vertica: The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch-oriented data processing.

Auto tuning
  • Hadoop: Map-reduce programs use procedural languages (Java, Python, etc.), which give developers fine-grained control of the analytic logic, but also require that the developers optimize the jobs carefully in their programs.
  • Vertica: The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language and are automatically optimized by the Vertica columnar optimizer.

 

After working with a number of customers on joint Hadoop and Vertica deployments, we have identified a number of best practices that combine the power of both platforms. As an example, Hadoop is ideal for the initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data. Sessionizing clickstreams, Monte Carlo analysis and web-scale graph analytics are some such examples. For those analytic features supported by both platforms, we have observed significant performance advantages in Vertica, due to the key architectural differences between the two platforms described above.

Finally, by leveraging Vertica’s Hadoop connector, users can easily move data between the two platforms.  Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS.  A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.   As another example, when an analytic job retrieves input data from the Vertica storage, its initial stages of computation, often consisting of filter, join and aggregation, should be conducted in Vertica for optimal performance.  The intermediate result can then be fed into a map-reduce job for further processing, such as building a decision tree or some other machine learning model.

Big Data with Hadoop and Vertica – OSCON ‘11

The recent OSCON '11 was filled with exciting technology and best-practice discussions on Big Data, Java and many other subjects. There I had an opportunity to deliver a talk to the open source community on the subject of this post. In a subsequent talk, my colleagues Steve Watt and Glenn Gebhart presented a compelling demo to illustrate the power of combining Hadoop and Vertica to analyze unstructured and structured data. We were delighted at the feedback that both talks received, from follow-up conversations in person as well as on Twitter. This interview captured the gist of the numerous conversations we had with other OSCON attendees about Vertica's real-time analytics capabilities and its underlying technology.

Monte Carlo Analysis Using the Vertica Analytics Platform: Part 1

This post is part of a new series of blogs on harnessing the power of the Vertica Analytics Platform to solve problems that, at first glance, do not appear to be database-centric.

Introduction

The main ingredient in any Monte Carlo method is, as the name suggests, some process akin to going to the casino and repeatedly throwing the dice to see if the odds are in your favor or stacked towards the house.  While this description is simple, the true beauty lies in the application of brute force; instead of applying a complex analysis to the rules and physics of a game, just go play it a number of times and see what happens.

Suppose we have a model we wish to analyze.

The Monte Carlo method of analysis says to characterize the model by repeated trials.

Databases, especially those geared toward massively parallel analytics, are surprisingly good platforms for implementing Monte Carlo techniques.  Conversely, the Monte Carlo method is a great tool to have on hand for analyzing data, particularly in situations where brute force seems more appealing than over-thinking things.

A Simple Monte Carlo Example

A software engineering manager wants to know the odds of a release shipping in three months. There are several projects in the release; whichever one of them takes the longest will determine the total time until ship.

Project A will take 2-4 months (if the estimate is uniformly distributed, that's a 50/50 chance of an on-time completion); in parallel is an initiative with several end products, Project B and Project C (which share infrastructure). Project B is expected to take 2-3 months, with the infrastructure required for Project C being completed halfway through Project B, followed by 1-2 months of additional work on Project C. Project B will get done on time, but it's not immediately clear what the odds are of finishing Project C on schedule, or of shipping the release as a whole. We can either break out the math books to calculate the final probability (a strategy that decreases in appeal as schedule complexity increases), or do a Monte Carlo analysis.

The steps are simple:

1.     Generate random values for the time taken to implement Project A, B, and C.

2.     Apply the formula GREATEST(A, B, B/2 + C)

3.     Count the number of times  this is less than or equal to 3, as a proportion of total trials.

In SQL:
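Here is a minimal sketch of that query, assuming Vertica's RANDOM() function (uniformly distributed between 0 and 1) and a helper table trials that simply holds 1,000 rows:

SELECT AVG(CASE WHEN GREATEST(a, b, b/2 + c) <= 3 THEN 1.0 ELSE 0.0 END) AS p_on_time
FROM (
    SELECT 2 + 2*RANDOM() AS a,   -- Project A: 2-4 months
           2 + RANDOM()   AS b,   -- Project B: 2-3 months
           1 + RANDOM()   AS c    -- Project C: 1-2 months
    FROM trials                   -- assumed helper table with 1,000 rows
) t;

Scaling up is just a matter of running more trials (a bigger helper table); nothing else in the query changes.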

I ran this a few times and got values ranging from 34-41%. Clearly, 1,000 samples are not enough to get a solid read on this model, but that is easy enough to fix by increasing the number of trials.

With a million samples, the answer is consistently 37.4-37.5%.  Certainly, this percentage range is far more precise than the input data.

Why Bother Using a Database?

For the simple example above, using a database may run afoul of the maxim, “when you have a hammer, everything looks like a nail.”  I could have written this in C, Java, perl, or maybe even Z80 assembly language.  If I wanted parallelism, I could have used MapReduce.  Or, I could have used an off-the-shelf project management tool, which may have resorted to the same technique, but it would probably have made a prettier project plan graphic.

That said, there are a few appealing characteristics when using the Vertica Analytics Platform:

  1. The declarative nature of SQL makes this pretty easy to express.  It only took a couple of minutes to code and debug.
  2. I get scalability and parallelism for free.  The Vertica Analytics Platform will assign work to many processors or nodes, where each node will run some trials and record the results, and then the results will be aggregated.
  3. It ran extremely fast; in the blink of an eye on my laptop.

However, the real power here lies in the fact that the data to be analyzed may already be in a database. We want to bring the computation to the data, as long as this is easy and efficient.

As a concrete example, let's say we have a database that contains, among other things, all the stock trades that happened over a certain period of time (day, hour, minute, second, whatever). We can create some kind of model that predicts the upcoming price change percentages, based partly on past performance (which is known from the database data) and partly on future events (random variables such as market forces) which ultimately dictate the predicted valuation and uncertainty. We can then use Monte Carlo techniques to run the model for a number of input scenarios, and then rank the stocks based on the predicted performance and uncertainty.

In order to arrive at SQL that is simple enough to present here, let’s use a very simple process and model:

  1. The model of the expected price changes will be Ax² + By + C, where A, B, and C are computed from the historical data using in-database analytics, and x and y are random inputs representing potential future market forces, uniformly distributed between 0 and 1. The manner in which A, B, and C are derived is some sort of secret sauce in this "get rich quick" scheme to beat the market. Of course x and y are the random inputs subject to Monte Carlo "roll the dice" analysis.
  2. The secret formula has already been computed, giving us a "model" table associating each stock symbol with its A, B, and C values.

  3. We want the top 5 stocks, based on the output of the model (expected percentage change).  We’ll compute the standard deviation as well, so that the analyst can disregard any stocks at his or her discretion, based on a gut feel of which stocks are too volatile.

Given the simplifications, here is the SQL that runs the model (we will save the output in case we need it for additional analysis later):
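A minimal sketch of that step, assuming the model table has columns (symbol, a, b, c) and reusing a trials helper table to supply one (x, y) pair per trial:

SELECT m.symbol,
       m.a * r.x * r.x + m.b * r.y + m.c AS predicted_change
INTO model_output
FROM model m
CROSS JOIN (SELECT RANDOM() AS x, RANDOM() AS y FROM trials) r;   -- x, y uniform on [0, 1]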

Then, gathering the top 5 picks is also easy to express in SQL:
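A sketch of that ranking query, computing the average predicted change and its standard deviation per symbol:

SELECT symbol,
       AVG(predicted_change)    AS expected_change,
       STDDEV(predicted_change) AS uncertainty
FROM model_output
GROUP BY symbol
ORDER BY expected_change DESC
LIMIT 5;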

This produces the ranked list we hoped to see: the top 5 symbols along with their expected change and uncertainty.

While this is a simple example, there is no reason to stop here. We can use any secret formula, and a much more complicated model, without changing the overall approach. Or, if we want to change the overall approach, we can incorporate performance assessment and adaptation into the process.

In our example, this would entail comparing the predictions of the model to the actual data acquired by the database as the stock market plays out.  In-database analytics can be used to assess the quality of the model and make adjustments to improve future results.

Stay Tuned for Part 2…

While the examples presented here may suffice as food for thought, there is much more to be written about the use of in-database analytics to build models and measure their effectiveness.  Also, there are several Vertica-specific features and tips to leverage in performing Monte Carlo analysis.  These will be discussed in an upcoming post.

References

http://en.wikipedia.org/wiki/Monte_Carlo_Method

http://westclintech.com/Blog/tabid/132/EntryID/28/Default.aspx

Sessionize with Style – Part 1

The Vertica Approach

Sessionization is a common analytic operation in clickstream analysis. Given an input clickstream table, where each row records a webpage click made by a particular user (or IP address), the sessionization operation identifies each user's web browsing sessions from the recorded clicks by grouping the clicks from each user based on the time intervals between them. Conceptually, if two clicks from the same user are made too far apart in time (as defined by a time-out threshold), they are treated as coming from two browsing sessions.

Here is an example input clickstream table with a simplified schema. Ignore the output column session_id for now.

user_id   timestamp   URL                                                           session_id
U0        15:00:00    http://online.wsj.com/home-page                               0
U0        15:00:25    http://online.wsj.com/public/page/news-world-business.html    0
U0        15:00:45    http://online.wsj.com/articles/capital_journal                0
U0        15:01:45    http://www.xkcd.com/                                          1

The standard semantics of sessionization takes a single input parameter: the time-out threshold, which is a constant time-interval value; an example time-out threshold is 30 seconds. Sessionization performs its computation on two columns of the input clickstream table, the user_id and the timestamp of the click. The output session_id column produced by sessionization (using that 30-second threshold) is shown in the above table.

Vertica’s Sessionization Support

Sessionization in Vertica is built on top of the event-based window function CONDITIONAL_TRUE_EVENT (CTE for short). Recall the semantics of CTE with an input Boolean expression P: CTE(P) is evaluated once per input row and defines a new window starting at the current row whenever P evaluates to true for that row. For example, given a sequence of values <1, 2, 3, 4> for column X, CTE(X > 2) assigns these rows a sequence of window IDs <0, 0, 1, 2>. Also recall that the expression P in CTE can access column values in the current row as well as in previous rows. For example, CTE(X > LAG(X)) defines a new window whenever the value of column X in the current row is greater than X in the previous row.
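As a minimal illustration of the first example, assuming a table t with an ordering column id and the value column X:

SELECT X,
       CONDITIONAL_TRUE_EVENT(X > 2) OVER (ORDER BY id) AS window_id
FROM t;

For X = <1, 2, 3, 4> (in id order), the window_id column comes back as <0, 0, 1, 2>.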

Despite its powerful semantics, the run-time complexity of CTE is at the level of the simplest SQL:1999 analytic functions such as RANK and ROW_NUMBER: it takes only a single pass over the sorted data, while retaining a minimal amount of state in the computation.

Thanks to CTE, sessionization with its standard semantics can be expressed in Vertica as follows.

SELECT user_id, timestamp,
       CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds')
       OVER (PARTITION BY user_id ORDER BY timestamp) AS session_id
FROM clickstream;

Beyond the Standard Semantics of Sessionization

One limitation of the standard semantics of sessionization is that the time-out threshold is a constant value. However, different users may have different styles and preferences for internet browsing, and therefore the same time-out threshold may not accurately identify sessions for all users.

For example, say user A is a slower web surfer than the average user, perhaps because A is multi-tasking heavily. Say that if an average user does not perform a page click in a particular web domain D within 30 seconds, it indicates the end of a session. For user A, however, the typical interval between two clicks in the same domain is 1 minute, as she is busy tweeting, listening to music, and harvesting in Farmville at the same time. A better solution is to adaptively determine the session time-out threshold of user A based on her recent browsing behavior (e.g. the average time interval between two consecutive clicks over the last 10 clicks A has performed). This allows the clickstream analyst to customize the time-out threshold for different users.

For example, to adaptively compute the time-out threshold for a user based on her last 10 clicks, with a "fudge factor" of 3 seconds, we can use the following CTE expression: CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > (LAG(timestamp, 1) - LAG(timestamp, 11)) / 10 + INTERVAL '3 seconds'). The fudge factor can be a multiplicative factor instead of an additive one; for example, it can be 110% of the average time interval of the last 10 clicks.
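Plugged into the same query shape as the standard version, the adaptive threshold looks roughly like this (a sketch, using the 10-click look-back and 3-second fudge factor assumed above):

SELECT user_id, timestamp,
       CONDITIONAL_TRUE_EVENT(
           timestamp - LAG(timestamp) >
           (LAG(timestamp, 1) - LAG(timestamp, 11)) / 10 + INTERVAL '3 seconds'
       ) OVER (PARTITION BY user_id ORDER BY timestamp) AS session_id
FROM clickstream;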

Another sessionization use case involving a more sophisticated time-out threshold is to use different threshold values based on other factors, such as the time of day or the nature of the website being browsed. For example, the time-out threshold for Wall Street Journal Online should be higher than for the xkcd webcomic, as WSJ articles take longer to read on average than xkcd comic strips.

Conclusion

The Vertica approach to sessionization enjoys multiple benefits: ease of use, strong expressive power, and highly efficient, scalable execution. Wondering how some alternative approaches stack up against Vertica's (hint: they won't)? That's what we will answer in a future post.
