
Author Archive

Big Data is Changing Software and (Product) Development as We Know It

I am often asked about “Big Data”, its use cases, real-world business value and how it will transform various products, services and markets.  This is one of my favorite topics, and I am fortunate in that I get to spend significant amounts of time with our amazing customers and partners who teach me a lot.  I am actually writing this from a plane after a few recent customer meetings that inspired me to share a point of view.

“Big Data” is already having, and will continue to have, the most impact in products and services where information about usage, experience and behavior can be captured in a manner that is accepted by, yet not disruptive to, the consumer of that product or service.  Data warehousing has been around for a long time with regard to retail transactions and purchasing behavior, but usage and experience measurement hasn’t had an equivalent repository.  It now does, and I believe this will lead to an exponential jump in the quality and variety of products and services delivered to consumers.  In fact, this will not only improve existing solutions, but it will spawn entirely new products and services in industries as diverse as entertainment and medical treatment.

While the notion of experience analysis has been around for a long time through various manual observation efforts, focus groups, and survey methods, the results have been fragmented, small, and analyzed in what I’ll call a “basic” manner.  Thanks to technology advancements and the resulting cost shifts, massive near real-time “feedback” collection can now be done through automation and sensor technology.  While the prospect of having this information delights any product manager and merchandiser, the challenge of capturing, storing, and analyzing information at this scale is still foreign to many.


There is one community that is embracing this feedback fire hose with greater ease and speed than most: software developers.  Vertica has several ISV customers who are leaving “breadcrumbs” in their code to collect usage information that can be anonymously transferred back to headquarters for very specific feedback on how users of the software are interfacing with it.  Their users agree to this data collection and sharing, and the ISVs ensure that it has no impact on the operational performance of their software.

These “breadcrumbs” can measure how long someone spends on a screen, which buttons they clicked to get there, how successful they were, and so on.  For instance, good development organizations analyze how long it should take a user to get from one place to another, that is, navigation within and between screens.  If an ISV’s software is the track, these breadcrumbs are the laser measurement for precise timing.

Vertica is an ideal platform to store and analyze this information.  Using Vertica’s advanced analytic and pattern matching capabilities, correlations of usage patterns can be identified, and developers can patch, redesign, or document accordingly to deliver a better experience to end users.  For example, you could quite easily determine that users who spent three minutes on one screen, clicked a certain button, spent less than a minute on the resulting screen, and then quit might not be happy with their experience compared with users who started in the same place but stayed online longer.  Further analysis could then determine “why” through more traditional interview techniques to improve the experience.
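To make that concrete, here is a minimal sketch of the kind of SQL an analyst might run against such breadcrumb data.  The screen_visits table and its columns are hypothetical, and the exact query would depend on how an ISV models its usage events:

    -- Sketch only: table and column names are hypothetical.
    -- Find sessions that lingered three minutes or more on the setup
    -- screen, clicked "Next", spent under a minute on the following
    -- screen, and then quit the application.
    SELECT s1.user_id, s1.session_id
    FROM   screen_visits s1
    JOIN   screen_visits s2
           ON  s2.session_id = s1.session_id
           AND s2.visit_seq  = s1.visit_seq + 1
    WHERE  s1.screen_name = 'setup'
      AND  s1.exit_event  = 'clicked_next'
      AND  s1.exited_at - s1.entered_at >= INTERVAL '3 minutes'
      AND  s2.exited_at - s2.entered_at <  INTERVAL '1 minute'
      AND  s2.exit_event  = 'quit';

Comparing the count of such sessions against sessions that continued normally gives product managers a first cut at where the experience breaks down.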

Why are software developers so eager to embrace this as the early adopters?  Well, one reason is that it gives them direct feedback on their work, without having to get the sometimes editorialized version from sales, support, management and, yes, even product managers!  Traditionally, most feedback to this community is sparse at best, with highly anecdotal sentiment mixed in.  This method can augment that sentiment (which should still be captured through sales, support, and product management, by the way) with very complete data sets.  The product managers at these customers actually love this capability, and many of them are directly interacting with and analyzing the raw data collected.

Software developers also have the ability to make and control their own sensors, which is pretty cool when you think about it.  The savvy developer is able to create these listening points at various places in the code.  Savvy developers and product managers are spending time on these breadcrumbs because, while they require more work (just as good quality assurance does), the payback is huge and can ultimately save a lot of time.  Recently I visited one of our customers that develops enterprise software; they are piloting a project in this area that already has 8 billion rows of this type of information.  Now that’s bigger than a breadbox!

This capability is not limited to SaaS vendors (although they certainly have more control and an easier time collecting the data).  Our online gaming customers are at the forefront, but we see all ISVs getting into this.  There is so much we can learn from software developers.  What is especially exciting is seeing how physical sensors are being used in everything from automobiles to jet engines and even refrigerators to deliver the same type of feedback.  There is no question that the sensor economy is upon us.  In the end, this will lead to better products and services for you and me, the consumer, which is a good thing.

Moneyball – Not Just for Baseball Anymore

Spring is in the air, Major League Baseball is now underway here in North America, and thoughts of Michael Lewis’ fantastic book and film, “Moneyball,” come to mind.  The plot captures how Billy Beane (played by Brad Pitt) leverages an extreme data analyst/quant to fundamentally change baseball strategy and scouting after 100 years of tradition.  The unorthodox, data-driven strategy ran counter to the traditional approach.  Not surprisingly, Billy Beane was questioned until, ultimately, the strategy proved successful.  Now every team in the league, including our Boston Red Sox, is deploying a variant of this approach.  I see the exact same thing happening in just about every industry when it comes to the race for better insight and competitive advantage through extreme information and analytics.  The struggle now, of course, is finding the expert quants, analysts, managers, and solution providers who understand how to make it happen.

At Hewlett-Packard, I get to witness and enable real-world moneyball every day in a variety of global industries.  I see how savvy organizations are creating SWAT teams of business leaders, statisticians, and IT to leverage extreme information and platforms like Vertica in ways that fundamentally alter markets and business dynamics.

In business school I was lucky enough to take Frances Frei’s course “Managing Service Operations”.  The course and her recent best-selling book, “Uncommon Service: How to Win by Putting Customers at the Core of Your Business,” investigate organizations’ efforts to diagnose and improve service experiences.  Interestingly, Frances was way ahead of her time and forced us to crunch numbers with statistical programs, combining fundamental business information with detailed historical data for true forensics and root-cause analysis.  She stressed the importance of math and data analysis.  We were careful never to rely solely on data or theory, but rather to bring all of the information together to make the best-informed decisions we could.  In the current Big Data era, this can be taken to a whole new level, and every company must work this way from the top down.

In addition to the baseball season starting, we know that “April showers bring May flowers”.  The equivalent in our industry is that for the past several years, so many organizations have been “showered” with data.  The “flowers,” of course, bloom when those same organizations are able to monetize the information to create better products and services and shareholder value.  Modern technologies and comprehensive solution providers like Hewlett-Packard can help organizations drastically reduce the cost and increase the efficacy of analytics by provisioning comprehensive offerings of hardware, software, and services.  Organizations are now able to cost-effectively take disparate sources of extreme information, both structured and unstructured, and seamlessly combine them for constant ad hoc analysis.  This can lead to fundamentally better decisions and value creation.  Spring is an exciting time of year.  Let the insights bloom!

Colin Mahony
VP & GM
Vertica, An HP Company

Announcing the Vertica Community Edition

by Colin Mahony, VP of Products & Business Development
and Shilpa Lawande, VP of Engineering

Vertica has had an amazing journey since it was founded in 2005.  We’ve built a great product, a great team, and an incredibly strong and loyal customer base and partner ecosystem.  When we first started, no one had even heard of a column store; today, ‘Big Data Analytics’ is taking the industry by storm.  Every day we see companies – big and small – in industries from retail to gaming becoming more data-driven and doing amazing things with the help of analytics.  We feel proud and humbled to see the transformative impact the Vertica Analytics Platform has had on our customers’ businesses, and we believe the time has come to open up our technology to a wider Big Data community.

Today, we are truly excited to announce the Vertica Community Edition beta program!  The Vertica Community Edition will offer many of the same features as the enterprise edition of the Vertica Analytics Platform to anyone who wants to discover the power of Vertica.  As part of the Community Edition beta announcement, we are also developing a new MyVertica community portal, which will provide a platform for Vertica users and partners to interact and share knowledge and code with the entire Vertica user community.

Vertica has always been a customer-driven company and we couldn’t have built Vertica without ideas, feedback and guidance from our customers and partners. We hope that the Vertica community will play a similar role going forward – sharing ideas and best practices and providing candid feedback about the product and how it can be made richer and simpler to use.  The MyVertica community portal will feature product downloads, forums, documentation,  training materials, FAQs and best practice guides. We will also be maintaining a GitHub code repository where community users will be able to share code samples, user-defined extensions built using our SDK, adapters to 3rd party products, and more. We hope that with the Community Edition, we take a small step towards our vision of democratizing data and making data and analytics accessible to all!

To register for the Vertica Community Edition beta program, simply visit www.vertica.com/community and complete the registration form.  The beta program will be limited initially, but full availability of the Vertica Community Edition software is expected by the end of the year.

On behalf of Vertica and HP, we are excited to contribute something back to the Vertica Community.  We sincerely invite you to join and contribute, and we can’t wait to see the many cool things you will do with Big Data and Vertica!

Shilpa & Colin

The Power of Projections – Part 3

By Colin Mahony and Shilpa Lawande

Part III – Comparing and Contrasting Projections to Materialized Views and Indexes

In Part I and Part II of this post, we introduced you to Vertica’s projections, and described how easy it is to interface with them directly via SQL or through our Database Designer™ tool.  We will now end this series by comparing and contrasting Vertica’s projections with traditional indexes and materialized views.

Row-store databases often use B-tree indexes as a performance enhancement.  B-tree indexes are designed for highly concurrent single-record inserts and updates, e.g., an OLTP scenario.  Most data warehouse practitioners would agree that index rebuilds after a batch load are preferable to taking the hit of maintaining them record by record, given the logging overhead.  Bitmap indexes are designed for bulk loads and are better than B-trees for data warehousing, but they are only suitable for low-cardinality columns and a certain class of queries.  Even though you have these indexes to help find data, you still have to go to the base table to get the actual data, which brings with it all the disadvantages of a row store.

In a highly simplified view, you can think of a Vertica projection as a single-level, densely packed, clustered index that stores the actual data values, is never updated in place, and has no logging.  Any “maintenance,” such as merging sorted chunks or purging deleted records, is done as an automatic background activity, not in the path of real-time loads.  So yes, projections are a type of native index if you will, but they are very different from traditional indexes like bitmaps and B-trees.

Vertica also offers a unique feature known as “pre-join projections”.  Pre-join projections denormalize tables at the physical layer, under the covers, providing a significant performance advantage over joining tables at run time.  They automatically store the results of a join ahead of time, yet the logical schema is maintained – again, flexibility of the storage structure without having to rewrite your ETL or application.  Vertica can get away with this because it excels at sparse data storage and, in particular, is not penalized for null values or for wide fact tables.  Since Vertica does not charge extra for additional projections, this is a great way to reap the benefits of denormalization without the need to purchase a larger-capacity license.
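To give a feel for this, here is a rough sketch of a pre-join projection definition.  The store_sales fact table and date_dim dimension are hypothetical, the exact syntax varies by Vertica version, and pre-join projections generally require primary-key/foreign-key constraints between the joined tables:

    -- Sketch only: table and column names are hypothetical.
    -- The projection stores fact rows pre-joined to the date dimension,
    -- so queries that filter on calendar attributes avoid the join at
    -- run time while the logical schema stays fully normalized.
    CREATE PROJECTION store_sales_by_month
    AS SELECT s.sale_id,
              s.store_id,
              s.amount,
              d.calendar_year,
              d.calendar_month
       FROM store_sales s JOIN date_dim d
            ON s.date_key = d.date_key
       ORDER BY d.calendar_year, d.calendar_month;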

So to sum up, here’s how Vertica projections stack up versus materialized views and conventional indexes.

Vertica’s Projections

  • Are primary storage – no base tables are required
  • Can be segmented, partitioned, sorted, compressed and encoded to suit your needs
  • Have a simple physical design
  • Are efficient to load and maintain
  • Are versatile – they can support any data model
  • Allow you to work with the detailed data
  • Provide near-real-time, low data latency
  • Combine high availability with special optimizations for query performance

Traditional Materialized Views

  • Are secondary storage
  • Are rigid: practically limited to specific columns and query needs; more columns = more I/O
  • Use aggregation, losing valuable detail
  • Are mostly batch updated
  • Provide high data latency

Traditional Indexes

  • Are secondary storage pointing to base table data
  • Support at most one clustered index – tough to scale out
  • Require complex design choices
  • Are expensive to update
  • Provide high data latency
That’s pretty much all there is to it.  Whether you are running ad-hoc queries or canned operational-BI workloads, you will find projections to be a very powerful backbone for getting the job done!

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™

The Power of Projections – Part 2

By Colin Mahony and Shilpa Lawande

Part II – Understanding the Simplicity of Projections and the Vertica Database Designer™

In Part I of this post, we introduced you to the simple concept of Vertica’s projections.  Now that you have an understanding of what they are, we wanted to go into more detail on how users interface with them, and introduce you to Vertica’s unique Database Designer tool.

For each table in the database, Vertica requires a minimum of one projection, called a “superprojection”.  A superprojection is a projection for a single table that contains all the columns and rows in the table.  Although the data may be the same as in a traditional base table, it has the advantages of segmentation (spreading the data evenly across the nodes in the cluster), sorting, and encoding (compressing the size of the data on a per-column basis).  This leads to a significant footprint reduction as well as load and query performance enhancements.  To give you a sense of the impact that Vertica’s projections have on size, most Vertica customers see at least a 50% reduction in footprint thanks to our compression, and that includes the high-availability copy and, on average, 3-5 projections.  Contrast this with traditional row-store databases ballooning upwards of 5x their original size, and that is a 10:1 difference in Vertica’s favor.

To get your database up and running quickly, Vertica automatically creates a default superprojection for each table created through the CREATE TABLE and CREATE TEMPORARY TABLE statements.  This means that if database admins and users never want to know about a projection, they don’t have to – Vertica handles it under the covers.  To further illustrate this point, users can simply supply projection parameters such as sort order, encodings, segmentation, high availability, and partitioning as part of the CREATE TABLE statement, never interfacing directly with a projection under the hood.
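As a rough sketch of what that looks like (the table and column names are hypothetical, and the exact clause syntax varies by Vertica version):

    -- Sketch only: names are hypothetical; check your version's
    -- documentation for the exact clause syntax.
    CREATE TABLE clicks (
        user_id     INT,
        screen_name VARCHAR(64) ENCODING RLE,  -- per-column encoding hint
        click_time  TIMESTAMP
    )
    ORDER BY click_time, user_id                  -- sort order of the superprojection
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE 1  -- spread data across nodes, keep an HA copy
    PARTITION BY EXTRACT(year FROM click_time) * 100 + EXTRACT(month FROM click_time);

Vertica takes care of creating and maintaining the underlying superprojection from this single statement.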

By creating a superprojection for each table in the database, Vertica ensures that all SQL queries can be answered.  Default superprojections alone will do far better than a row store; however, by themselves they may not fully realize Vertica’s performance potential.  Vertica recommends that you start with the default projections and then use Vertica’s nifty Database Designer™ to optimize your database.  Database Designer creates new projections that optimize your database based on its data statistics and the queries you use.  Database Designer:

1. Analyzes your logical schema, sample data, and sample queries (optional).
2. Creates a physical schema design (projections) in the form of a SQL script that can be deployed automatically or manually.
3. Can be used by anyone without specialized database knowledge (even business users can run Database Designer).
4. Can be run and re-run anytime for additional optimization without stopping the database.

Designs created by the Database Designer provide exceptional query performance. The Database Designer uses sophisticated strategies to provide excellent ad-hoc query performance while using disk space efficiently. Of course, a proficient human may do even better than the Database Designer with more intimate knowledge of the data and the use-case – a small minority of our customers prefer to do manual projection design and can usually get a good feel for it after working with the product for a few weeks.

We’ve heard people ask whether Vertica needs a projection for each query, which it absolutely does not!  Typically our customers use 3-5 projections, and several use only the single superprojection.  A typical customer would have the superprojection along with a few smaller projections (often comprising only a few columns each).  Unlike MVs and indexes, projections are cheap to maintain during load, and thanks to Vertica’s compression the resulting data size tends to be 5-25x smaller than the base data.  Depending on your data latency needs (seconds to minutes) and storage availability, you can choose to add more projections to further optimize the database.  Also important to note is that Vertica does not charge extra for projections, regardless of how many are deployed.  So whether a customer has 1 or 50 projections, their license fees are the same – entirely based on raw data.
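As a sketch of what one of those smaller, query-specific projections might look like (the names are hypothetical):

    -- Sketch only: names are hypothetical.
    -- A narrow projection covering just the columns a frequent report
    -- touches, sorted to match its predicates.
    CREATE PROJECTION clicks_by_screen (screen_name, click_time, user_id)
    AS SELECT screen_name, click_time, user_id
       FROM clicks
       ORDER BY screen_name, click_time
       SEGMENTED BY HASH(user_id) ALL NODES;

A newly added projection typically needs to be refreshed once to pick up data that was loaded before it existed; from then on it is maintained automatically with every load.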

As you can see, projections are very easy to work with, and if you are a business analyst who doesn’t know SQL/DDL, that’s okay, we created a tool that designs, deploys and optimizes the database automatically for you.  Our objective from day one has always been to enable customers to ask more questions and get faster answers from their data without having to constantly tune the underlying database.  Part III of this post goes into more detail on projections versus indexes and materialized views.

Read the rest of the 3-part series…

The Power of Projections – Part 1: Understanding Projections and What They Do
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

The Power of Projections – Part 1

By Colin Mahony and Shilpa Lawande

Part I: Understanding Projections and What They Do

Many of us here at Vertica have been amazed and frankly flattered at how much FUD our competitors are putting out there regarding Vertica’s “projections”.  Having heard some incredibly inaccurate statements about them, we’ve decided to clarify what they are, how and why we have them, and the advantages they bring.  Actually, projections are a pivotal component of our platform, and a major area of differentiation from the competition.  Most importantly, Vertica’s customers love the benefits projections bring! In an effort to provide you with as much detail as possible, this blog is broken up into three posts with Parts II and III being more technical.

First, some background.  In traditional database architectures, data is primarily stored in tables.  Additionally, secondary tuning structures such as indexes and materialized views are created for improved query performance.  Secondary structures like MVs and indexes have drawbacks – for instance, they are expensive to maintain during data load (more detail on this in Part III).  Hence, best practices often require rebuilding them during nightly batch windows, which precludes real-time analytics.  Also, it isn’t uncommon to find data warehouse implementations that balloon to 3-6x base table size due to these structures.  As a result, customers are often forced to remove valuable detailed data and replace it with aggregated data to solve this problem.  However, you can’t monetize what you’ve lost!

Vertica created a superior solution by optimizing around performance, storage footprint, flexibility and simplicity.  We removed the trade-off between performance and data size by using projections as the linchpin of our purpose-built architecture.  Physical storage consists of optimized collections of table columns, which we call “projections”.  In the traditional sense, Vertica has no raw uncompressed base tables, no materialized views, and no indexes.  As a result there are no complex choices – everything is a projection!  Of course, your logical schema (we support any) remains the same as with any other database, so importing data is a cinch.  Furthermore, you still work with standard SQL/DDL (i.e., CREATE TABLE statements, etc.).  The magic of projections, and of Vertica, is what we do under the covers for you with the physical storage objects.  We provide the same benefits as indexes without all of the baggage.  We also provide a tool, the Database Designer (more on this in Part II), to create projections automatically.

Projections store data in formats that optimize query execution.  They share one similarity with materialized views: they store data sets on disk rather than computing them each time they are used in a query (i.e., physical storage).  However, projections aren’t aggregated; they store every row in a table, i.e., the full atomic detail.  The data sets are automatically refreshed whenever data values are inserted, appended, or changed – again, all of this happens beneath the covers without user intervention, unlike materialized views.  Projections provide the following benefits:

  • Projections are transparent to end users and SQL.  The Vertica query optimizer automatically picks the best projections to use for any query (see the sketch after this list).
  • Projections allow for the sorting of data in any order (even if different from the source tables).  This enhances query performance and compression.
  • Projections deliver high availability optimized for performance, since the redundant copies of data are always actively used in analytics.  We have the ability to automatically store the redundant copy using a different sort order.  This provides the same benefits as a secondary index in a more efficient manner.
  • Projections do not require a batch update window.  Data is automatically available upon load.
  • Projections are dynamic and can be added or changed on the fly without stopping the database.
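To see that transparency in practice, here is a minimal sketch (the clicks table is hypothetical).  Queries are written against the logical table; the optimizer’s projection choice only shows up if you ask for the plan:

    -- Sketch only: the table is hypothetical.  The query references the
    -- logical table; Vertica picks the best projection behind the scenes.
    SELECT screen_name, COUNT(*) AS visits
    FROM clicks
    WHERE click_time >= '2011-01-01'
    GROUP BY screen_name;

    -- Prefixing a query with EXPLAIN shows the plan, including which
    -- projection the optimizer chose.
    EXPLAIN SELECT screen_name, COUNT(*) FROM clicks GROUP BY screen_name;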

In summary, Vertica’s projections represent collections of columns (okay so it is a table!), but are optimized for analytics at the physical storage structure level and are not constrained by the logical schema.  This allows for much more freedom and optimization without having to change the actual schema that certain applications are built upon.

Hopefully this gave you an overview of what projections are and how they work.  Please read Part II and Part III of this post to drill down into projections even further.

Read the rest of the 3-part series…

The Power of Projections – Part 2: Understanding the Simplicity of Projections and the Vertica Database Designer™
The Power of Projections – Part 3: Comparing and Contrasting Projections to Materialized Views and Indexes

Vertica at the Boulder BI Brain Trust (BBBT)

Back on December 17th, Shilpa Lawande (Vertica’s VP of Engineering), Sam Madden (MIT Professor and Technical Advisor for Vertica) and I had the pleasure of spending the day in Boulder, Colorado as part of the Boulder BI Brain Trust series, “a gathering of leading, local, BI consultants and experts who attend 1/2-day presentations from interesting and innovative BI vendors.”

As part of our half-day briefing, Claudia Imhoff, Ph.D., interviewed Shilpa, Sam, and me about Vertica and the Vertica Analytics Platform.  In this 15-minute recording, we covered:

  • Vertica’s impressive growth
  • Our column-oriented architecture
  • The Four “C’s” of Vertica
  • Our customers and how they monetize their data
  • Vertica’s ease of use through the Database Designer and connectors
  • How Vertica is looking towards the future

We had a great time with the BBBT, and they provided a ton of insight that helped validate our product and business vision. If you would like to listen to the 15-minute podcast, please visit the link below:

Boulder BI Brain Trust (BBBT) podcast featuring Vertica Systems.

You can also get a feel for how the briefing progressed by searching for the #BBBT hashtag on Twitter.  A number of the analysts commented on the briefing throughout the day.
