Watch this video to see how you can use AWS CloudFormation templates to quickly create your HP Vertica database cluster on Amazon Web Services using the Vertica AMI.
Download the template used in this video here.
As an online retailer, you face many challenges. You don’t have a physical outlet for customers to browse, nor a floor manager to recommend certain products or answer questions your customers might have. However, e-retailers make up for that by letting customers look through an entire store’s inventory before making a purchase, see suggestions based on their past browsing history, and, in many cases, reach an online customer service rep in just a few clicks. But how can e-retailers become just as personal, if not more so, than their brick-and-mortar counterparts? For more and more e-retailers, the answer is Big Data analytics and, more specifically, the HP Vertica Analytics Platform.
E-retailer Snapfish is a textbook example of using Big Data analytics with HP Vertica to revamp an online business. Every online company these days collects data, some more than others, yet few know what to do with their data or have an analytics system in place to capitalize on it. Prior to Vertica, Snapfish had much of the data they needed right in front of them, but couldn’t access it or act on it fast enough.
To create an online environment that is as personal as a physical store, e-retailers have to understand their customers’ goals, see what they do, and pick up on social cues in real time, just as one would in a face-to-face conversation. That’s why Snapfish chose HP Vertica.
Blazing-Fast Speed for Immediate Buyer Insight
Following the implementation of Vertica, Snapfish saw an immediate turnaround. They can now finish queries up to 400 times faster and send out email promotions and offers that capitalize on near real-time trends. Snapfish can see where customers on the site are going, what items they prefer, and which overall trends are taking shape, all on the very same day instead of months down the line (it doesn’t help to know what kind of Christmas card customers prefer when it’s March, does it?).
One of the more fascinating aspects of implementing HP Vertica, one that makes customer interaction seem more human, is, perhaps ironically, the use of “machine learning techniques.” As I discussed in my last blog post about Vertica and CI, Vertica works very well with R, the open-source language used by many data scientists for predictive analytics and data mining. Machine learning is the brains behind suggesting items you may not have looked at but that are related to ones you have. For example, if you are shopping for a DSLR camera online, the site might suggest an extra battery, or recommend either an external flash or a shoulder mount, depending on whether you gravitated toward photo or video equipment during your browsing. It is Vertica’s tight integration with R that allows these techniques to be put into action swiftly and effectively.
Vertica enabled Snapfish to react quickly to customers’ preferences and patterns, allowing them to market themselves accordingly, boost revenue, and increase customer retention. As a highlight, Snapfish DBAs can now perform look-up queries in under 200 milliseconds, enabling them to recommend products to customers much more effectively than before, once again adding to that personal, in-store feel that’s so important. Snapfish can now see where customers on the site are going, what items they prefer, and which overall trends are taking shape, just as if they were running a brick-and-mortar store. Big Data analytics are clearly critical to tailoring online customer experiences to be more personal, and the faster you can execute, the more valuable it is for both sides.
To learn more, click here to view our recent webinar with e-retailer Etsy, where we discuss how they use HP Vertica to their advantage. Both e-retailers Etsy and One Kings Lane will be represented at our upcoming BDC in Boston to speak about their success stories with HP Vertica. Visit our registration page to learn more if you’re interested.
Today, we announced “Dragline,” the code name for the latest release of the HP Vertica Analytics Platform. Focused on the strategic value of all data to every organization, “Dragline” includes a range of industrial-strength features befitting its code name for serious Big Data initiatives.
Our data sheet provides detailed feature descriptions on this release’s full range of capabilities and benefits, so we’ll just focus on three top features that are sure to capture your attention (after all, they came highly requested from our growing customer base).
By the way, learn about these and all of the new “Dragline” features in our upcoming Webinar.
Speed is a given … but what about in every possible direction? “Project Maverick” has multiple elements, including fast, self-service analytics about discrete individuals or devices. Live Aggregate Projections, a key new feature of “Project Maverick,” accelerates these live lookups, frequent or highly concurrent queries on an individual customer basis, by up to 10x. It does this by computing metrics on the data as it arrives, delivering targeted and personalized analytics without programming accelerator layers.
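To make this concrete, here is a minimal sketch of a live aggregate projection; the table and column names are hypothetical, and the exact options depend on your schema and release:

CREATE TABLE clicks (user_id INT, page_url VARCHAR(2048), click_time TIMESTAMP);

-- Pre-aggregate click counts per user as rows are loaded, so per-customer
-- lookups read the stored aggregate instead of scanning raw click data.
CREATE PROJECTION clicks_agg (user_id, click_count)
AS SELECT user_id, COUNT(*) AS click_count FROM clicks GROUP BY user_id;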
For telecommunications companies, as well as utilities and energy providers that are pursuing smart metering initiatives to differentiate themselves from competitors after deregulation, this capability is hot. With Live Aggregate Projections, these providers can deliver smart metering reports that educate customers on their consumption rates vis-à-vis their neighbors, promoting a greener planet and establishing a tighter relationship with their provider. Interested in learning how HP Vertica easily chews through the enormous volume and scale of smart meter and Internet of Things data? Check out our newly published white paper, including detailed benchmarks.
Another major advancement of “Project Maverick” is Dynamic Mixed Workload Management. Commonly found in many data warehouse technologies and requested by some of our largest customers, this powerful new feature identifies and adapts to varying query complexities — simple and ad-hoc queries as well as long-running advanced queries — and dynamically assigns the appropriate amount of resources to meet the needs of all data consumers.
So, now, you can preserve your precious hardware and system resources, while pleasing even your most demanding internal and external constituents – from analysts who seek ad-hoc insights to data scientists who manage complex analytics to business executives who need customized views or dashboards on Monday mornings.
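One way this plays out in practice is with resource pools that hand queries exceeding a runtime cap off to a secondary pool rather than letting them crowd out everyone else. A hedged sketch, with hypothetical pool names and sizes (the available options depend on your release):

-- Long-running analytics get their own pool with bounded concurrency.
CREATE RESOURCE POOL long_pool MEMORYSIZE '8G' MAXCONCURRENCY 4;
-- Ad-hoc queries start here; anything exceeding the cap cascades to long_pool.
CREATE RESOURCE POOL quick_pool RUNTIMECAP '30 seconds' CASCADE TO long_pool;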
Hadoop is commonly used as a data lake to store bulk data. That’s why, with each release, we offer tighter integration that delivers the most open SQL on Hadoop. “Dragline” builds on that strong heritage by helping you manage multiple storage tiers for cost-effective analytics. You can perform advanced SQL queries on bulk data stored in HDFS using the HP Vertica Analytics Platform without moving the data or using any connectors. You can then move data into HP Vertica when your organization requires faster performance for in-depth analytics.
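As a rough illustration, here is a hedged sketch of querying delimited files in place on HDFS through an external table; the path, table, and columns are hypothetical, and the supported URL schemes and file formats vary by release and Hadoop distribution:

-- Define an external table over files that remain in HDFS; nothing is copied in.
CREATE EXTERNAL TABLE web_logs (log_time TIMESTAMP, user_id INT, url VARCHAR(2048))
AS COPY FROM 'hdfs:///data/weblogs/*' DELIMITER '|';

-- Standard SQL then runs directly against the HDFS-resident data.
SELECT user_id, COUNT(*) FROM web_logs GROUP BY user_id ORDER BY 2 DESC LIMIT 10;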
HP Vertica supports all of the major Hadoop distributions, but the power of MapR and HP Vertica on the same cluster is something special. Join our upcoming Webinar with MapR to get the full details behind this solution and to build your business case for SQL on Hadoop.
Try HP Vertica 7 today and stay tuned for more blog posts and materials to support this release.
Welcome back to another edition of our “Meet the Team” feature! This week we sat down with our inside sales manager, Patrick Day, and talked about everything from what it’s like to manage the badass sales team we have here in Cambridge, MA, to ballroom dancing.
When I’m on a flight sitting next to someone and we’re making polite conversation, the question often comes up: “What do you do?” In these situations, I have to assess whether the person works in the IT industry or is otherwise familiar with the lingo. If not, my stock response is “I fix databases.” This usually warrants a polite nod, and then we both go back to sleep. This over-simplified explanation generally suffices, but in truth it is wholly inadequate. The truth of the matter is that my job is to ensure that databases don’t get broken in the first place; more specifically, an HP Vertica database. But our clients have different, complex goals in mind, and they sometimes configure their systems incorrectly for the kind of work they’re doing. I’m constantly looking for ways to empower clients to spot problems before they become bigger problems.
That’s why I’m a huge fan of the HP Vertica Workload Analyzer (also referred to as the WLA). The WLA is a great tool in the war against problems. Its goal in life (if it has one) is to find and report on problems, 20 specific ones to be exact. If you are using Management Console, the WLA runs once per day and produces a report that indicates actions to take, like tables whose statistics should be updated or queries that aren’t performing adequately. The report (actually just entries in a table) appears in the Management Console in a window called “Tuning Recommendations.”
But you don’t need Management Console to take advantage of the Workload Analyzer – you can run it manually by running the following command:
SELECT ANALYZE_WORKLOAD('');
Interestingly, even though it’s called “Workload Analyzer”, the actual command is ANALYZE_WORKLOAD. You could toss this into CRON and run it multiple times per day if you’d like, but it’s probably not necessary.
The output of this command comes back to your screen: it reports on all the various “problems” it is programmed to identify and returns them. It also creates entries in two separate tables, V_MONITOR.TUNING_RECOMMENDATIONS and TUNING_RECOMMENDATION_DETAILS.
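You can also query those tables directly to see what the WLA found; a simple sketch (column names may vary slightly between releases):

SELECT tuning_description, tuning_command, tuning_cost
FROM v_monitor.tuning_recommendations;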
As I mentioned earlier, the WLA runs 20 queries to look for problems. These queries are called rules. WLA rules can be disabled or enabled, and many of them have parameters that can be modified to increase or decrease the scope of the rule. The rules have clever names like sys_r1 and sys_r13. That’s ok though; the WLA doesn’t allow you to run rules directly. Below, I’ll break down the most useful rules and how they can be modified for maximum effect.
sys_r1, sys_r5, and sys_r6: Stale Statistics and Spill Events
sys_r1 looks for tables that have stale or missing statistics and recommends updating statistics on those tables. sys_r5 and sys_r6 find queries that have reported JOIN_SPILLED or GROUP_BY_SPILLED events, which indicate that no projection design exists that would let those queries run at maximum efficiency. These three rules generate the most typical output I see at client sites.
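The remedy sys_r1 suggests is typically just a statistics refresh on the affected table; for example (the table name here is hypothetical):

SELECT ANALYZE_STATISTICS('public.orders');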
sys_r16 and sys_r20: System Utilization
sys_r16 and sys_r20 are rules that check for system utilization. sys_r16 checks for a CPU utilization that exceeds a 95% threshold for more than 20 minutes. sys_r20 looks for memory utilization of 99% for more than 10 minutes.
Many of the other rules cover very specific edge-case scenarios and are unlikely to fire at most client sites; we hope to cover those in future blogs. For now, let’s stick to this handful and talk about how we can modify these rules and add our own!
What you may not know about the WLA is that you can modify certain rules. That is, you can change some parameter settings, such as the threshold values a rule uses to find problems. Take rules sys_r16 and sys_r20, which look for CPU and memory utilization, respectively. The default thresholds might be too high for your liking, but that’s ok; you can adjust them:
dbadmin=> ALTER TUNING RULE sys_r16 SET USAGE_THRESHOLD_PCT=80;
ALTER TUNING RULE
dbadmin=> ALTER TUNING RULE sys_r16 SET DURATION_MIN=10;
ALTER TUNING RULE
Now this rule will fire under less extreme circumstances: instead of looking for CPU usage of 95% for 20 minutes, it will report on CPU usage of 80% for 10 minutes. All of the tuning rule parameters (where applicable) are defined in another system table called VS_TUNING_RULE_PARAMETERS. You can use that table to determine which parameters are available for which rules, and tweak them as you see fit.
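For instance, a quick way to see which knobs exist (a sketch; VS_ tables are internal, so the exact columns may vary by release):

dbadmin=> SELECT tuning_rule, parameter, current_value
          FROM vs_tuning_rule_parameters
          ORDER BY tuning_rule, parameter;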
Furthermore, you can disable or enable rules with the following commands:
dbadmin=> ALTER TUNING RULE sys_r16 disable;
ALTER TUNING RULE
dbadmin=> ALTER TUNING RULE sys_r16 enable;
ALTER TUNING RULE
Creating your own rules
The coolest part of the WLA is that you can actually create your own rules. The first step is to write your rule: a regular SQL query whose purpose is to find a problem. The example I include here is not currently a rule, but it would make a great addition: finding too many delete vectors. The rule has a threshold defined as the percentage of rows covered by delete vectors relative to the total rows in a projection; when that percentage exceeds 20%, the rule fires.
CREATE TUNING RULE dv_rule (DELETE_VECTORS, PURGE, dv_threshold=20) AS
SELECT CURRENT_TIME AS time
     -- Human-readable description of what was observed
     , 'Delete vectors account for ' || (round(dv.deleted_row_count / ps.row_count, 2) * 100)::char(4)
       || '% of rows for projection ' || ps.projection_name AS observation_description
     , ps.anchor_table_id AS table_id
     , ps.anchor_table_name AS table_name
     , s.schema_id AS table_schema_id
     , ps.projection_schema AS table_schema
     , NULL AS transaction_id
     , NULL AS statement_id
     -- The threshold parameter, echoed back for the report
     , (SELECT current_value || '%' FROM VS_TUNING_RULE_PARAMETERS
         WHERE tuning_rule = 'dv_rule' AND parameter = 'dv_threshold')
     , 'run purge operations on tables with delete vectors' AS tuning_description
     , 'SELECT PURGE_PROJECTION(''' || ps.projection_name || ''');' AS tuning_command
     , 'LOW' AS tuning_cost
FROM delete_vectors dv
JOIN projection_storage ps ON dv.projection_name = ps.projection_name
JOIN schemata s ON s.schema_name = dv.schema_name
WHERE ps.row_count > 0
  -- Fire only when the delete-vector ratio exceeds the dv_threshold parameter
  AND dv.deleted_row_count / ps.row_count >
      (SELECT current_value::numeric / 100 FROM VS_TUNING_RULE_PARAMETERS
        WHERE tuning_rule = 'dv_rule'
          AND parameter = 'dv_threshold');
The first line sets up my rule. I give it a name, “dv_rule”, and define some parameters. The first two are like labels; they are arbitrary. Technically, these are referred to as the “observation type” and the “tuning type,” and you can see examples by looking in the TUNING_RECOMMENDATION_DETAILS table. All remaining parameters are optional; here, I’ve defined my 20% threshold for delete vectors. The query’s SELECT list follows the order of the columns in the TUNING_RECOMMENDATION_DETAILS table; if you compare the SELECT list with the columns in that table, you’ll note the similarities.
My query references the VS_TUNING_RULE_PARAMETERS table to get the percentage threshold. This creates a bit of a chicken-and-egg problem: the query won’t work on its own, because the parameter doesn’t exist in that table until the rule has been created. So the first thing you have to do is create the rule, which also creates the parameter. Then you can modify the rule as needed. If you need to drop the rule, you can do so with the following command:
dbadmin=> DROP TUNING RULE dv_rule ;
DROP TUNING RULE
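Once the rule exists, its threshold behaves like any built-in rule parameter. For example, assuming the dv_rule above has been created, you could relax the threshold to 30%:

dbadmin=> ALTER TUNING RULE dv_rule SET dv_threshold=30;
ALTER TUNING RULE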
Because you might go through this process several times, I recommend putting your work into a file so you can reference it easily.
And that’s it! That’s everything you need to know in order to create and manage WLA rules. Not only can you use these rules to find problems in your HP Vertica cluster, you could use them to look for issues in your own data.
I’d like to continue the discussion in the community forums: come up with your own rules and post them. Who knows? We might include the good ones in a future HP Vertica release!
I have just come back from a business trip to China, where I visited several large Chinese telecom customers to talk about the recent big Vertica win at Facebook. Two questions these customers kept asking me were: What is the future of MPP databases? Will Hadoop become the one database that rules the whole analytic space?
These seemed like odd questions, considering that Facebook, one of the juggernauts of the open source community in general and the Hadoop world in particular, recently picked Vertica as the anchoring database to satisfy its ever-increasing analytical demands and has since put the biggest Vertica cluster (with ~300 nodes and effective data storage of 6+ PB) into production. If a Hadoop powerhouse and the inventor of Hive (the most popular SQL-on-Hadoop database) like Facebook, with its teams of brilliant programmers and boundless resources, still thinks it needs an MPP database like Vertica in its “Big Data” technology stack for the foreseeable future, that sends a clear and strong message. Obviously Facebook thinks the answer to both questions is no, or at least not so fast. In the meantime, Facebook will continue to use Hive, HBase, and other Hadoop technologies for the tasks they are good at: ETL, handling unstructured data, and conducting complex data-mining types of deep analysis.
So why does Facebook think it needs an MPP database? Facebook has been running an EDW (Oracle Exadata, ~50 TB) for some time but feels that the existing EDW is running out of steam because it cannot keep up with rapid data growth, especially as the mobile platform becomes more and more popular. Facebook wanted to take advantage of an established commercial MPP database, and its main reasons for doing so can be summarized as follows: lower cost, a robust ecosystem, improved data security, and better scalability and performance.
Facebook invited four MPP vendors (including Vertica) to participate in two rounds of competitive POCs before declaring Vertica the ultimate winner, on the basis of Vertica’s low TCO, ease of management, and superior ad-hoc query performance.
Many SQL-on-Hadoop offerings have emerged in the last couple of years, both open source and proprietary, including but not limited to Hive, Hadapt, Citus, Impala, Stinger, and Apache Drill. Though their effort to make Hadoop more SQL-friendly is welcome, my general impression is that they are still a long way from closing the performance gap with the popular MPP databases in the marketplace (e.g., Vertica). Depending on your perspective, you may even argue that the gap is not narrowing at any pace that foretells its closing any time soon.
There is strong reason for me to believe that the SQL-on-Hadoop camp may have overestimated the effectiveness of bolting open source SQL optimizers (e.g., PostgreSQL’s) onto HDFS and severely underestimated the effort and time it takes to produce an enterprise-quality MPP database, whose core optimizer and execution engine technology requires years of intensive real-world use to mature, hundreds (if not thousands) of customers to validate, and millions of cases to test and train. This is certainly more about practice than theory or concept. Query optimization is fundamentally a software problem, and there is a limit to what any “brute force” hardware-based approach can do. To echo and rephrase what the authors of “MapReduce and Parallel Databases: Friends or Foes?” said, smart software (like MPP databases) is still a good idea in the age of Hadoop and “Big Data,” and there is plenty of room and opportunity for MPP databases to thrive for a long time to come.
Po Hong is a senior pre-sales engineer in HP Vertica’s Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.