
A development or quality assurance team often times requests access to a sub-set of production data. One way to do that would be to make use of the LIMIT clause.
Example:
dbadmin=> SELECT COUNT(*) FROM big_number_table;
COUNT
------------
1000000000
(1 row)
dbadmin=> SELECT 0.05*1000000000 "5_percent";
5_percent
-------------
50000000.00
(1 row)
dbadmin=> CREATE TABLE big_number_table_5_pct AS SELECT * FROM big_number_table LIMIT 50000000;
CREATE TABLE
Time: First fetch (0 rows): 13057.251 ms. All rows formatted: 13057.281 ms
The problem with that method is the team won’t get a very good sampling of the data. That is, they’ll be getting the first 5% of the rows that Vertica retrieves from the table.In order to provide a true sampling of the entire data set, Vertica provides the TABLESAMPLE clause.
Example:
dbadmin=> CREATE TABLE big_number_table_5_better_pct AS SELECT * FROM big_number_table TABLESAMPLE(25);
CREATE TABLE
Time: First fetch (0 rows): 93910.763 ms. All rows formatted: 93910.795 ms
It takes a little longer to build the subset of data, but it will be an accurate representation of the production data!Have fun!