This post is part of a new series of blogs on harnessing the power of the Vertica Analytics Platform to solve problems that, at first glance, do not appear to be database-centric.
The main ingredient in any Monte Carlo method is, as the name suggests, some process akin to going to the casino and repeatedly throwing the dice to see if the odds are in your favor or stacked towards the house. While this description is simple, the true beauty lies in the application of brute force; instead of applying a complex analysis to the rules and physics of a game, just go play it a number of times and see what happens.
Illustrated graphically, suppose we have a model to analyze:
The Monte Carlo method of analysis says to characterize the model by repeated trials:
Databases, especially those geared toward massively parallel analytics, are surprisingly good platforms for implementing Monte Carlo techniques. Conversely, the Monte Carlo method is a great tool to have on hand for analyzing data, particularly in situations where brute force seems more appealing than over-thinking things.
A Simple Monte Carlo Example
A software engineering manager wants to know the odds of a release shipping in three months. There are several projects in the release; whichever one of them takes the longest will determine the total time until ship, as seen in the following figure:
Project A will take 2-4 months (if the estimate is uniformly distributed, that’s a 50/50 chance of an on-time completion); in parallel is an initiative that has several end products, Project B and Project C (that share infrastructure). Project B is expected to take 2-3 months, with the infrastructure required for Project C being completed half way through Project B, followed by 1-2 months of additional work on Project C. Project B will get done on time, but it’s not immediately clear what the odds are for finishing Project C on schedule, or shipping the release as a whole. We can either break out the math books to calculate the final probability (a strategy that decreases in appeal as schedule complexity increases), or do a Monte Carlo analysis.
The steps are simple:
1. Generate random values for the time taken to implement Project A, B, and C.
2. Apply the formula GREATEST(A, B, B/2 + C)
3. Count the number of times this is less than or equal to 3, as a proportion of total trials.
I ran this a few times and got values ranging from 34-41%. Clearly, 1000 samples is not enough to get a solid read on this model, but that is easy enough to fix:
With a million samples, the answer is consistently 37.4-37.5%. Certainly, this percentage range is far more precise than the input data.
Why Bother Using a Database?
For the simple example above, using a database may run afoul of the maxim, “when you have a hammer, everything looks like a nail.” I could have written this in C, Java, perl, or maybe even Z80 assembly language. If I wanted parallelism, I could have used MapReduce. Or, I could have used an off-the-shelf project management tool, which may have resorted to the same technique, but it would probably have made a prettier project plan graphic.
That said, there are a few appealing characteristics when using the Vertica Analytics Platform:
- The declarative nature of SQL makes this pretty easy to express. It only took a couple of minutes to code and debug.
- I get scalability and parallelism for free. The Vertica Analytics Platform will assign work to many processors or nodes, where each node will run some trials and record the results, and then the results will be aggregated.
- It ran extremely fast; in the blink of an eye on my laptop.
However, the real power here lies in the fact that the data to be analyzed may already be in a database. We want to bring the computation to the data, as long as this is easy and efficient:
As a concrete example, lets say we have a database that contains, among other things, all the stock trades that happened over a certain period of time (day, hour, minute, second, whatever). We can create some kind of model that predicts the upcoming price change percentages, based partly on past performance (which is known from database data, in orange), and partly on future events (random variables such as market forces) which ultimately dictate the predicted valuation and uncertainty. We can then use Monte Carlo techniques and run the model for a number of input scenarios, and then rank the stocks based on the predicted performance and uncertainty.
In order to arrive at SQL that is simple enough to present here, let’s use a very simple process and model:
- The model of the expected price changes will be Ax2+By+C, where A, B, and C are computed from the historical data using in-database analytics, and x and y are random inputs representing potential future market forces, uniformly distributed between 0 and 1. The manner in which A, B, and C are derived is some sort of secret sauce in this “get rich quick” scheme to beat the market. Of course x and y are the random inputs subject to Monte Carlo “roll the dice” analysis.
- The secret formula has already been computed, giving us a “model” table associating each stock symbol with its A, B, and C values:
- We want the top 5 stocks, based on the output of the model (expected percentage change). We’ll compute the standard deviation as well, so that the analyst can disregard any stocks at his or her discretion, based on a gut feel of which stocks are too volatile.
Given the simplifications, here is the SQL that runs the model (we will save the output in case we need it for additional analysis later):
Then, gathering the top 5 picks is also easy to express in SQL:
This produces the results we hoped to see:
While this is a simple example, there is no reason to stop here. We can use any secret formula, and a much more complicated model, without changing the overall approach. Or, if we want to change the overall approach, we can incorporate performance assessment and adaptation into the process:
In our example, this would entail comparing the predictions of the model to the actual data acquired by the database as the stock market plays out. In-database analytics can be used to assess the quality of the model and make adjustments to improve future results.
Stay Tuned for Part 2…
While the examples presented here may suffice as food for thought, there is much more to be written about the use of in-database analytics to build models and measure their effectiveness. Also, there are several Vertica-specific features and tips to leverage in performing Monte Carlo analysis. These will be discussed in an upcoming post.