
Searching the contents of a sizeable CHAR, VARCHAR, LONG VARCHAR, VARBINARY, or LONG VARBINARY field within a table to locate a specific keyword can be quite time consuming. Especially when dealing in Big Data.
Fortunately, Vertica includes a text indexing feature which allows you to query that data with mind blowing results!
Example:
dbadmin=> \d big_varchar_table
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------------------+-------------+----------------+-------+---------+----------+-------------+-------------
public | big_varchar_table | pk | int | 8 | | t | t |
public | big_varchar_table | the_varchar | varchar(25000) | 25000 | | f | f |
(2 rows)
dbadmin=> \timing
Timing is on.
dbadmin=> SELECT COUNT(*) FROM big_varchar_table;
COUNT
------------
1000000000
(1 row)
Time: First fetch (1 row): 315.768 ms. All rows formatted: 315.819 ms
dbadmin=> SELECT COUNT(*) FROM big_varchar_table WHERE the_varchar LIKE '%GAEMO%';
COUNT
-------
703
(1 row)
Time: First fetch (1 row): 82770.310 ms. All rows formatted: 82770.419 ms
That took about 1 minute and 40 seconds.Now let’s try a text index.
dbadmin=> CREATE TEXT INDEX big_varchar_table_text_index ON big_varchar_table (pk, the_varchar);
CREATE INDEX
Time: First fetch (0 rows): 3432163.531 ms. All rows formatted: 3432163.531 ms
dbadmin=> SELECT COUNT(*) FROM big_varchar_table_text_index WHERE token = v_txtindex.StemmerCaseInsensitive('GAEMO');
COUNT
-------
703
(1 row)
Time: First fetch (1 row): 94.237 ms. All rows formatted: 94.296 ms
Wow, querying the text index takes less than one second! Now I can join the text index back to the source table to grab the complete contents of its text field…
dbadmin=> SELECT pk, substr(the_varchar, 1, 35) "Small portion to fit on screen"
dbadmin-> FROM big_varchar_table
dbadmin-> WHERE pk IN (SELECT doc_id
dbadmin(> FROM big_varchar_table_text_index
dbadmin(> WHERE token = v_txtindex.StemmerCaseInsensitive('GAEMO'))
dbadmin-> LIMIT 1;
pk | Small portion to fit on screen
---------+-------------------------------------
4408169 | IGCUY OOWNS GAEMO DRJQK NOVXN DUBVY
(1 row)
Time: First fetch (1 row): 300.099 ms. All rows formatted: 300.174 ms
Have fun!