This blog post was authored by Anh Le. Introduction As the number of features in your data set grows, it becomes harder to work with. Visualizing 2D or 3D data is straightforward, but for higher dimensions you can only select a subset of two or three features to plot at a time, or turn to […]
This blog post was authored by Ginger Ni. The precision-recall curve is a measure for evaluating binary classifiers. It is a basic measure derived from the confusion matrix. In Vertica 9.1, we provide a new machine learning evaluation function PRC() for calculating precision and recall values from the results of binary classifiers. Along with the […]
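The teaser above mentions the PRC() evaluation function. A minimal sketch of how it might be called, assuming a hypothetical table `preds` with a 0/1 label column `obs` and a score column `prob` (parameter names per the Vertica ML function documentation):

```sql
-- Hypothetical table preds(obs INT, prob FLOAT): true labels and classifier scores.
-- PRC() returns (decision_boundary, recall, precision) rows across thresholds.
SELECT PRC(obs, prob USING PARAMETERS num_bins = 100, f1_score = true) OVER ()
FROM preds;
```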
This blog post was authored by Soniya Shah. Machine learning and data science have the potential to transform businesses because of their ability to deliver non-obvious, valuable insights from massive amounts of data. However, many data scientists’ workflows are hindered by computational constraints, especially when working with very large data sets. While most real-world data […]
This blog post was authored by Vincent Xu. Vertica 9.0 is out and here is the updated Vertica machine learning cheat sheet. Vertica 9.0 introduces a slew of new machine learning features including one-hot encoding, Lasso regression, cross validation, model import/export, and many more. See the cheat sheet for examples of how to use the […]
This blog post was authored by Soniya Shah. Analytic functions handle complex analysis and reporting tasks. Here are some example use cases for Vertica analytic functions: • Rank the longest standing customers in a particular state • Calculate the moving average of retail volume over a specific time • Find the highest score among all […]
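The use cases listed above map directly onto window functions. A sketch of the first two, with hypothetical table and column names (`customers`, `retail_sales`):

```sql
-- Rank the longest-standing customers within each state
SELECT state, customer_name,
       RANK() OVER (PARTITION BY state ORDER BY signup_date) AS tenure_rank
FROM customers;

-- 7-day moving average of retail volume
SELECT sale_date,
       AVG(volume) OVER (ORDER BY sale_date
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg
FROM retail_sales;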
This blog post was authored by Soniya Shah. Time series analytics is a powerful Vertica tool that evaluates the values of a given set of variables over time and groups those values into a window based on a time interval for analysis and aggregation. Time series analytics is useful when you want to analyze discrete […]
This blog post was authored by Sarah Lemaire. On Tuesday, August 22, The Boston Vertica User Group hosted a late-summer Meetup to talk to attendees about compute engines and data mart applications, and the advantages and disadvantages of both solutions. In the cozy rustic-industrial atmosphere of Commonwealth Market and Restaurant, decorated with recycled wood pallets, […]
This blog post was authored by Ginger Ni. Like any natural disaster, hurricanes can leave behind extensive damage to life and property. The question asked by NGOs, government agencies, and insurance companies is, “How can we predict the locations where a storm will inflict the most damage?” Modern spatial analysis enables us to predict the […]
This blog post was authored by Ginger Ni. Counting Distinct Values Data cardinality is a commonly used statistic in data analysis. Vertica has the exact COUNT(DISTINCT) function to count distinct values in a data set, but the function does not scale well for extremely large data sets. When exploring large data sets, speed is critical. […]
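The trade-off described above is between the exact function and its approximate counterpart. A sketch, assuming a hypothetical `clicks` table:

```sql
-- Exact, but does not scale well on very large data sets:
SELECT COUNT(DISTINCT user_id) FROM clicks;

-- Approximate, much faster at scale, with a small bounded error:
SELECT APPROXIMATE_COUNT_DISTINCT(user_id) FROM clicks;
```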
This blog post was authored by Soniya Shah. Vertica 8.1.1 continues the fast-paced development of machine learning. In this release, we introduce the highly-requested random forest algorithm. We added support for SVM to include SVM for regression, in addition to the existing SVM for classification algorithm. L2 regularization was added to both the linear […]
This blog post was authored by Ginger Ni. Median and percentile functions are commonly used statistical functions. They are also used in other sophisticated data analysis algorithms, such as the robust z-score normalization function. Vertica has exact MEDIAN and PERCENTILE_CONT functions, but these functions do not scale well for extremely large data sets, because […]
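For very large inputs, the approximate counterparts described above can stand in for the exact functions. A sketch with a hypothetical `requests(latency_ms)` table:

```sql
-- Approximate aggregates scale to very large data sets:
SELECT APPROXIMATE_MEDIAN(latency_ms) AS median_latency,
       APPROXIMATE_PERCENTILE(latency_ms USING PARAMETERS percentile = 0.95) AS p95_latency
FROM requests;
```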
This blog post was authored by Steve Sarsfield. There is a new feature in analytical databases that seems to be all the rage, particularly in cloud data warehouses: autoscaling. Autoscaling’s promise is that if you have a particularly hard analytical workload, autoscaling will spin up new storage and compute to get the job done. […]
This blog post was authored by Steve Sarsfield. Crowd-sourced reviews are becoming more and more important in our lives. When you’re thinking about going to a new job, you check out Glassdoor. If you’re heading out to dinner, you check out Yelp. When buying online, the reviews on Amazon are not only informative, but sometimes hilarious. […]
While Henry Ford did not in fact develop or even patent the modern assembly line (that credit goes to Ransom E. Olds), he relied heavily on the process for automobile production.
An earlier blog covered the first edition of directed queries, which appeared in Vertica 7.2. Each release since then has enhanced directed queries functionality.
When learning new database applications, a good place to start is with some compelling, real-world data. It’s not necessarily so easy to find.
Event series occur in tables with a time column, most typically a TIMESTAMP data type. In Vertica, you perform an event series join to analyze two series in different tables when their measurement intervals don’t align, such as with mismatched timestamps.
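An event series join pads the mismatched timestamps by interpolating values from the other table. A sketch, assuming hypothetical `bids` and `asks` tables keyed by symbol and timestamp:

```sql
-- For each row, pull the most recent row from the other table
-- at or before its timestamp (INTERPOLATE PREVIOUS VALUE).
SELECT b.ts, b.symbol, b.bid, a.ask
FROM bids b
FULL OUTER JOIN asks a
  ON b.symbol = a.symbol
 AND b.ts INTERPOLATE PREVIOUS VALUE a.ts;
```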
Watch this video to learn more about the Vertica Machine Learning for Predictive Analytics features new in 7.2.
Vertica’s event series pattern matching functionality lets you identify events that occur in specific patterns. In this blog, we’ll introduce you to pattern matching’s key features.
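Pattern matching is expressed with the MATCH clause. A sketch against a hypothetical `clickstream` table (event names, columns, and the pattern are illustrative):

```sql
-- Find sessions where an off-site entry is followed by on-site browsing
-- and ends in a purchase.
SELECT uid, ts, referrer,
       event_name(), match_id(), pattern_id()
FROM clickstream
MATCH (
  PARTITION BY uid ORDER BY ts
  DEFINE
    Entry    AS referrer NOT ILIKE '%mysite.com%',
    Onsite   AS referrer ILIKE '%mysite.com%',
    Purchase AS page = 'checkout'
  PATTERN P AS (Entry Onsite* Purchase)
);
```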
New in Vertica 7.2.2 is the Machine Learning for Predictive Analytics package. This analytics package allows you to use built-in machine learning algorithms on data in your Vertica database. Machine learning algorithms are extremely valuable in data analytics because, as their name suggests, they can learn from your data and provide information about deductive and […]
Every so often we hear about the seemingly confusing nature of SQL functions that return the current time. But what is current? Is it the start time of a transaction or statement? Is it the time returned by the system clock? The answer is: all of these, depending on which function you call.
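The three notions of “current” the paragraph above distinguishes each have their own function:

```sql
SELECT CURRENT_TIMESTAMP,        -- time at the start of the current transaction
       STATEMENT_TIMESTAMP(),    -- time at the start of the current statement
       CLOCK_TIMESTAMP();        -- the system clock; can change mid-statement
```

Run inside one transaction, the first value stays fixed across statements, the second changes per statement, and the third changes on every call.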
When you submit a query to Vertica, you want it to execute as quickly and efficiently as possible. The query optimizer creates a plan that is designed to do just that. The directives in the query plan determine your query’s run-time performance and resource consumption, but the properties of your projections and the system parameters also impact the query’s performance.
Read how we used User-Defined-Loads to track the habits of the red-tailed hawk!
As a Vertica user, you know that joins combine records from two or more tables. But sometimes, you need to develop complex joins. Vertica supports many different kinds of joins that perform different functions based on your needs.
Today’s organizations need to be able to measure the effectiveness of online ads and marketing campaigns. In your particular organization, you may want to measure how effectively your ads drive unique visitors to your website. Or you may want to see if your ads drive repeat visits from the same user over a specific period of time.
During the summer of 2015, I participated in an internship program with Vertica. Most interns assisted in software development, but my primary goal was to use Vertica, Vertica Place, and HP Distributed R to address an ecological problem.
If your organization deals with low latency, high concurrency applications and queries, you can benefit from having as few nodes as possible involved in each query.
More than once I have worked with customers who need to update a superprojection or create a new projection for a large fact table. It seems like a simple and easy process: just create a projection and perform a refresh. However, refreshing projections for large fact tables can produce unwanted complications. In this blog, we’ll discuss these complications and how they can be remedied.
We talk a lot about database security and how you can protect your sensitive data from outside threats. But what about internal, unintentional data corruption? What if the data you are trying to analyze or manipulate is simultaneously being manipulated by another transaction? A scenario such as this could lead to data loss and inconsistency. In some cases, this can be as bad as an external threat. This is where locks come into the picture.
Time series analytics is a little-known but very powerful Vertica tool. In Vertica, the TIMESERIES clause and time series aggregate functions normalize data into time slices. Then they interpolate missing values to fill in the gaps. Time series analytics is useful when you want to analyze discrete data collected over time, such as stock market trades and performance, but find that there are gaps in your collected data.
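The TIMESERIES clause described above might be used like this, against a hypothetical `trades` table (table and column names are illustrative):

```sql
-- Normalize irregular trades into 1-minute slices, interpolating
-- a value for slices with no trade.
SELECT slice_time, symbol,
       TS_FIRST_VALUE(bid, 'linear') AS bid
FROM trades
TIMESERIES slice_time AS '1 minute'
    OVER (PARTITION BY symbol ORDER BY trade_ts);
```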
The Vertica Database Designer is a tool that analyzes a logical schema definition, sample queries, and sample data, and creates a physical schema (projections) in the form of a SQL script that you deploy automatically or manually. The result is a database optimized for query performance and data compression.
Last week, I was at the 2015 Conference on Innovative Data Systems Research (CIDR), held at the beautiful Asilomar Conference Grounds. The picture above shows one of the many gorgeous views you won’t see when you watch other people do PowerPoint presentations. One Vertica user at the conference said he saw a “range join” in a query plan, and wondered what it is and why it is so fast.
Welcome to another installment of our Top Tech Support Questions Answered blog series. In our first blog, we discussed ways to optimize your database for deletes. In this installment, we’ll talk about optimizing projections.
An earlier post introduced join operations, including hash joins. The other join operator, merge join, can help when a hash join spills to disk and wastes resources. In this post, we optimize a query for merge join using subqueries and specific projections, and test the result.
Over the last month or so, this series has discussed how organizations often deal with a missed big data opportunity in ways that closely resemble the grieving process, and how that process maps to the commonly understood five stages of grief: denial, anger, bargaining, depression, and acceptance. This is the last entry in the series; it focuses on how an organization can move forward effectively with a big data project.
Continuing the five part series which explores how organizations coping with big data often go through a process that closely resembles grief, this segment addresses the point at which the organization finally grasps the reality of big data and realizes the magnitude of the opportunity and challenge, and gets depressed about the reality of it.
Continuing the five part series about the stages of big-data grief that organizations experience, this segment focuses on the first time organizations explore the reality of the challenges and opportunities presented by big data and start to work their way forward, with bargaining.
Continuing this five part series focused on how organizations frequently go through the five stages of grief when confronting big data challenges, this post will focus on the second stage: anger.
Automatic physical database design is a challenging task. Different customers have different requirements and expectations, bounded by their resource constraints. To deal with these challenges in Vertica, we adopt a customizable approach by allowing users to tailor their designs for specific scenarios and applications. To meet different customer requirements, any physical database design tool should allow its users to trade off query performance and storage footprint for different applications.
Modern databases are often required to process many different kinds of workloads, ranging from short/tactical queries, to medium complexity ad-hoc queries, to long-running batch ETL jobs, to extremely complex data mining jobs (see my previous blog on workload classification for more information). DBAs must ensure that all concurrent workloads, along with their respective Service Level Agreements (SLAs), can co-exist well with each other while maximizing a system’s overall performance.
My father passed away recently, and so I’ve found myself in the midst of a cycle of grief. And, in thinking about good blog topics, I realized that many of the organizations I’ve worked with over the years have gone through something very much like grief as they’ve come to confront big data challenges, and the stages they go through even map pretty cleanly to the five stages of grief! So this series was born.
In previous installments of this series, I de-bunked some of the more common myths around big data analytics. In this final installment, I’ll address one of the most pervasive and costly myths: that there exists an easy button that organizations can press to automagically solve their big data problems. I’ll provide some insights as to how this myth has come about, and recommend strategies for dealing with the real challenges inherent in big data analytics.
The Dragline release of Vertica offers an exciting new feature that is unique in the world of big data analytics platforms. We now offer Live Aggregate projections as part of the platform. The impact is that you can really fly through certain types of big data analytics that typically grind down any analytics system.
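A live aggregate projection pre-aggregates data as it is loaded, so matching queries read the stored aggregates instead of scanning raw rows. A sketch, with a hypothetical `clicks` table:

```sql
-- Counts are maintained incrementally at load time; a query like
-- SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id
-- can then be answered from the projection.
CREATE PROJECTION clicks_by_user_agg AS
    SELECT user_id, COUNT(*) AS n_clicks
    FROM clicks
    GROUP BY user_id;
```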
In this part of the de-mythification series, I’ll address another common misconception in the big data marketplace: that there exists a single piece of technology that will solve all big data problems. Whereas the first two entries in this series focused on market needs, this will focus more on the vendor side of things in terms of how big data has driven technology development, and give some practical guidance on how an organization can better align their needs with their technology purchases.
In this, the second of the multi-part “de-mythification” series, I’ll address another common misconception in the Big Data marketplace today: that there are only two types of data an enterprise must deal with for Big Data analytics, structured and unstructured, and that unstructured data is somehow structure-free.
In the first of this multi-part series, I’ll address one of the most common myths my colleagues and I have to confront in the Big Data marketplace today: the notion of “real-time” data visibility. Whether it’s real-time analytics or real-time data, the same misconception always seems to come up. So I figured I’d address this, define what “real-time” really means, and provide readers some advice on how to approach this topic in a productive way.
ROLLUP is a very common Online Analytic Processing (OLAP) function and is part of ANSI SQL. Many customers use ROLLUP to write reports that automatically perform sub-total aggregations across multiple dimensions at different levels in one SQL query.
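The multi-level sub-totals described above come from a single GROUP BY. A sketch, with a hypothetical `sales` table:

```sql
SELECT region, product, SUM(revenue) AS revenue
FROM sales
GROUP BY ROLLUP (region, product)
ORDER BY region, product;
-- Returns one row per (region, product), a subtotal per region
-- (product IS NULL), and a grand total (both NULL).
```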
When I’m on a flight sitting next to someone, and we’re making polite conversation, often the question comes up: “What do you do?” In these situations, I have to assess whether the person works in the IT industry or is otherwise familiar with the lingo. If not, my stock response is “I fix databases.” This usually warrants a polite nod, and then we both go back to sleep. This over-simplified explanation generally suffices, but in truth, it is wholly inadequate. The truth of the matter is that my job is to ensure that databases don’t get broken in the first place; more specifically, a Vertica database. But because our clients have different, complex goals in mind, they sometimes configure their systems incorrectly for the kind of work they’re doing. I’m constantly looking for ways to empower clients to spot problems before they become bigger ones.
The answer is YES, if it is the right kind of tree. Here “tree” refers to a common data structure that consists of parent-child hierarchical relationships, such as an org chart. Traditionally, this kind of hierarchical data structure can be modeled and stored in tables, but it is usually not simple to navigate and use in a relational database (RDBMS). Some other RDBMSs (for example, Oracle) have a built-in CONNECT BY clause that can be used to find the level of a given node and navigate the tree. However, if you take a close look at its syntax, you will realize that it is quite complicated and not at all easy to understand or use.
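For contrast with CONNECT BY, here is how the same org-chart traversal looks in standard SQL with a recursive common table expression (shown purely as an illustration of the problem, not necessarily the approach this post takes; table and column names are hypothetical, and support for recursive CTEs varies by database and version):

```sql
-- Walk an employees(emp_id, mgr_id, name) table from the root down,
-- tracking each node's depth in the tree.
WITH RECURSIVE org AS (
    SELECT emp_id, mgr_id, name, 1 AS lvl
    FROM employees
    WHERE mgr_id IS NULL            -- the root of the tree
  UNION ALL
    SELECT e.emp_id, e.mgr_id, e.name, o.lvl + 1
    FROM employees e
    JOIN org o ON e.mgr_id = o.emp_id
)
SELECT * FROM org;
```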
With Vertica’s latest release (Vertica 7 Crane”), we introduced Vertica Flex Zone, based on the patent-pending flex tables technology, which dynamically adapt to whatever schema is present in the data. Flex tables offer dramatic usability improvements over regular tables. In this post, we take a look under the hood and show how flex tables are similar to regular Vertica tables, with a little pinch of magic thrown in.
In December, HP released version 7 of the Vertica analytics platform, which includes, among other features, Vertica Flex Zone (Flex Zone). Flex Zone enables you to quickly and easily load, explore, and analyze some forms of semi-structured data. It eliminates the need to define or apply coding-intensive schemas before the data is loaded for exploration.
Here at Vertica, we had to solve a technical challenge that many of you might be facing: data analysis from legacy products.
Much of the data analytics we perform occurs with data whose schema changes over time. Many organizations have developers who define the record types that are collected and analysts who comb through these records looking for important patterns. In today’s fast-changing business landscape, agility is the key to success: developers need to continually define new metrics for their analysts’ consumption.
With our Vertica 7 release, we announced Vertica Flex Zone, a new product offering that simplifies the way that you consume and then explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this blog post, our first “Flex Zone Friday” post, let’s look at how you can use Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as the sample data type.
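Loading schemaless data into a flex table takes only a couple of statements. A sketch, with a hypothetical file path (keys in the JSON become queryable virtual columns):

```sql
CREATE FLEX TABLE tweets();

-- Hypothetical path to a file of one JSON tweet per line
COPY tweets FROM '/path/to/tweets.json' PARSER fjsonparser();

-- Nested keys are addressed with quoted dot notation
SELECT "user.screen_name", "text" FROM tweets LIMIT 5;
```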