Finding Patterns in Stock Prices With Vertica

Posted January 21, 2020 by Bryan Herger, Vertica Big Data Solution Architect at Micro Focus


Vertica can ingest data from many sources and enable SQL-based preparation and analytics.

Here is another real-world example: let’s use Vertica to find patterns in a public data set of stock and Exchange-Traded Fund (ETF) prices from Kaggle. I’ll use the following data source in this post: https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
This data source provides several thousand CSV files, one per stock or ETF, each listing daily volume and daily Open, High, Low, and Close prices over several years.

Let’s define a table for the data:
CREATE TABLE kaggle.stocketf(Symbol varchar(20), StockETF varchar(2), Date date, Open float, High float, Low float, Close float, Volume int, OpenInt int);

We can import the data using a bash script that iterates over the files and loads each one into Vertica with a COPY statement. The filename is mapped to the Symbol field, and the StockETF field is set to “S” for stocks or “E” for ETFs. Here is the loop for the ETF data files:

for filename in ETFs/*.txt; do
  symbol=`basename $filename .us.txt`
  vsql -U dbadmin -w XXX -c "COPY kaggle.stocketf (Symbol AS '$symbol', StockETF AS 'E', Date, Open, High, Low, Close, Volume, OpenInt) FROM LOCAL '$filename' DELIMITER E',' DIRECT;"
done

A quick data analysis:
Let’s look at basic statistics for the data set, finding some 17.4 million daily reports across 8,507 symbols:

dbadmin=> select count(*) from kaggle.stocketf;
  count
----------
 17453243

dbadmin=> select count(distinct symbol) from kaggle.stocketf;
 count
-------
  8507

Surely there’s something interesting here! Let’s look at the Dow Jones Transportation Average components (as of Jan 2020: 'ALK','AAL','CAR','CHRW','CSX','DAL','EXPD','FDX','JBHT','JBLU','KSU','KEX','LSTR','MATX','NSC','R','LUV','UNP','UAL','UPS').

Let’s look for correlations in daily price moves among index members by calculating a table with daily percent change:

create table kaggle.djta as
select distinct date,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='aal' and i.date=e.date) as aal,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='alk' and i.date=e.date) as alk,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='car' and i.date=e.date) as car,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='chrw' and i.date=e.date) as chrw,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='csx' and i.date=e.date) as csx,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='dal' and i.date=e.date) as dal,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='expd' and i.date=e.date) as expd,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='fdx' and i.date=e.date) as fdx,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='jbht' and i.date=e.date) as jbht,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='jblu' and i.date=e.date) as jblu,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='ksu' and i.date=e.date) as ksu,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='kex' and i.date=e.date) as kex,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='lstr' and i.date=e.date) as lstr,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='matx' and i.date=e.date) as matx,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='nsc' and i.date=e.date) as nsc,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='r' and i.date=e.date) as r,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='luv' and i.date=e.date) as luv,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='unp' and i.date=e.date) as unp,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='ual' and i.date=e.date) as ual,
 (select (close-open)/open from kaggle.stocketf i where i.symbol='ups' and i.date=e.date) as ups
from kaggle.stocketf e
where e.symbol in ('aal','alk','car','chrw','csx','dal','expd','fdx','jbht','jblu','ksu','kex','lstr','matx','nsc','r','luv','unp','ual','ups')
  and e.date between '2014-01-01' and '2016-12-31'
order by e.date;
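To illustrate what this query produces, here is a minimal pure-Python sketch of the same pivot: compute the daily percent change (close - open) / open per symbol, then arrange one row per date with one column per symbol. The sample rows below are made up for illustration only.

```python
# Sketch of the SQL pivot above: percent change per symbol, one row per date.
# Sample (symbol, date, open, close) rows are made up for illustration.
rows = [
    ("aal", "2016-01-04", 40.0, 41.0),
    ("dal", "2016-01-04", 50.0, 49.0),
    ("aal", "2016-01-05", 41.0, 41.0),
    ("dal", "2016-01-05", 49.0, 50.0),
]

symbols = sorted({r[0] for r in rows})
pivot = {}  # date -> {symbol: daily percent change}
for symbol, date, open_, close in rows:
    pivot.setdefault(date, {})[symbol] = (close - open_) / open_

for date in sorted(pivot):
    print(date, [round(pivot[date].get(s, float("nan")), 4) for s in symbols])
```

The SQL version does the same thing with one correlated subquery per symbol, which keeps everything inside the database and avoids exporting 17 million rows to a client.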

Let’s look for correlations in the price moves with Vertica’s built-in correlation matrix function:

create table kaggle.djta_corr as
select corr_matrix(aal,alk,car,chrw,csx,dal,expd,fdx,jbht,jblu,ksu,kex,lstr,matx,nsc,r,luv,unp,ual,ups) OVER()
from kaggle.djta;
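CORR_MATRIX computes the pairwise Pearson correlation coefficient for every pair of input columns. For readers who want to see the math, here is a small pure-Python sketch of that calculation; the two sample series are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two made-up daily percent-change series.
aal = [0.01, -0.02, 0.03, 0.00]
dal = [0.02, -0.01, 0.02, 0.01]
print(pearson(aal, aal))  # self-correlation: 1.0 (up to float rounding)
print(pearson(aal, dal))  # close to 1.0 means the stocks move together
```

Running this in the database scales the same pairwise computation across all 20 columns at once, producing one output row per column pair.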

Sort the table by correlation score with min/max on symbol to eliminate duplicates:
select min(variable_name_1), max(variable_name_2), corr_value
from kaggle.djta_corr
group by corr_value
order by corr_value desc;

(The first row, with correlation exactly 1, is omitted here: it's an artifact of self-correlation. No, AAL and UPS don't correlate exactly!)

 min  | max  |    corr_value
------+------+-------------------
 aal  | dal  | 0.788507725090297
 dal  | ual  | 0.772113325934412
 aal  | ual  | 0.770889743228666
 dal  | luv  | 0.767670222664251
 csx  | nsc  | 0.725832748356349
 csx  | unp  | 0.723114433968166
 aal  | luv  | 0.711196406250434
 alk  | dal  | 0.693781699858637
 luv  | ual  | 0.690804824968984
 ksu  | unp  | 0.683294816096574
 fdx  | ups  | 0.681333650640936
 jblu | luv  | 0.678632940503586
 dal  | jblu | 0.672436463181798
 alk  | luv  | 0.672078624124824
 alk  | ual  | 0.664194579441607
 nsc  | unp  | 0.655098682802784
 ksu  | nsc  | 0.651077523925447
 csx  | ksu  | 0.64166243939123
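The min/max trick works because the correlation matrix is symmetric: each pair appears twice, as (aal, dal) and (dal, aal), with the same value, so grouping by corr_value and taking min/max of the names collapses each mirrored pair into one row. A quick sketch of the idea, with made-up values:

```python
# A correlation matrix is symmetric, so each pair appears twice with the
# same value. Grouping by value and taking min/max of the symbol names
# collapses each mirrored pair into one row. Values are made up.
pairs = [
    ("aal", "dal", 0.7885),
    ("dal", "aal", 0.7885),
    ("csx", "nsc", 0.7258),
    ("nsc", "csx", 0.7258),
]

deduped = {}  # corr_value -> (min symbol, max symbol, corr_value)
for a, b, value in pairs:
    deduped[value] = (min(a, b), max(a, b), value)

for row in sorted(deduped.values(), key=lambda r: -r[2]):
    print(row)
```

One caveat: grouping by corr_value assumes no two distinct pairs share exactly the same score, which is a safe bet with 15-digit floating-point correlations but not guaranteed in general.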

From this list, we observe that:
Airline stocks (AAL,ALK,DAL,JBLU,UAL,LUV) have a high correlation score;
Railroad stocks (CSX,KSU,NSC,UNP) have a high correlation score;
Shippers (FDX,UPS) also have a high correlation.
With Vertica, we can run data preparation, normalization, and analytics entirely in SQL, making the results accessible to many client applications and to the many users already familiar with SQL-based tools and scripts.
How much more can we learn about stock and ETF trading patterns beyond this short exercise? Stay tuned, or get your own copy of the dataset and Vertica Community Edition to find out!

Helpful Links:
Vertica CE download: https://www.vertica.com/download/vertica/community-edition/
COPY data into Vertica: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
Correlation matrix: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/CORR_MATRIX.htm
Kaggle dataset used in this post: https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs

Have fun!