VerticaPy

Python API for Vertica Data Science at Scale

The Virtual DataFrame

The Virtual DataFrame is the main object and star of the library and acts as the perfect transition between small and Big Data. The principle is quite simple: As Vertica is a powerful columnar massive parallel processing (MPP) database with many built-in functions, we want it to do as much of the computation work as possible.

Indeed, columnar orientation allows for high compression, and its structure inherently avoids unncessary parsing when retrieving data. MPP allows to parallelize our computations accross the different nodes.

The best way to take advantage of your data is by simply keeping it in your Vertica database, rather than within the limitations of working memory. VerticaPy pushes all computation to your Vertica database before aggregating the final result, so you can get the best of both worlds: Vertica's power and Python's flexibility.



With Python, it's easy to add abstractions, and the vDataFrame acts as the primary abstraction layer. Simple but powerful, it'll help any user through the data science life cycle.

Creating the vDataFrame

There are two main ways to create a vDataFrame.

The first is to create one directly using an existing relation.

In [1]:
from verticapy.datasets import load_titanic
load_titanic() # Loading the titanic dataset in Vertica

import verticapy as vp
vp.vDataFrame("public.titanic")
Out[1]:
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

We can also create one using a customized relation.

In [2]:
vp.vDataFrame(sql = "SELECT pclass, AVG(survived) AS survived FROM titanic GROUP BY 1")
Out[2]:
123
pclass
Integer
123
survived
Float
110.612179487179487
220.416988416988417
330.227752639517345
Rows: 1-3 | Columns: 2

In-memory vs. In-database Loading and Processing

First, let's load the expedia dataset in Vertica.

In [4]:
vp.read_csv("data/expedia.csv", schema = "public", parse_nrows = 2000)
Out[4]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
12013-01-07 00:00:022423505703[null]4618990092013-03-142013-03-1521166930125021241
22013-01-07 00:00:062366174211775713.6206137960092013-01-192013-01-2610188211036173058
32013-01-07 00:00:0611320515514703795.729811285750092013-01-192013-01-2210125064601250123091
42013-01-07 00:00:093716976141949[null]10804760192013-05-292013-06-05201763530125067510
52013-01-07 00:00:173716976141949[null]10188950002013-09-082013-09-102012721560225064559
62013-01-07 00:00:173716976141949[null]10804760192013-05-292013-06-05201763530125067510
72013-01-07 00:00:182366462187672696.50037837251092013-07-292013-08-04331885510225021348
82013-01-07 00:00:233716976141949[null]10804760192013-05-292013-06-05201763530125067510
92013-01-07 00:00:2823662947976511.01211979680092013-01-162013-01-18101581130225035051
102013-01-07 00:00:28236635647794948.18615933750092013-05-092013-05-10402276330267019012
112013-01-07 00:00:2924234919105[null]11748190092013-04-252013-04-2820114908101388103275
122013-01-07 00:00:3324235053819[null]2335340012013-01-082013-01-091112672960161051248
132013-01-07 00:00:3923119027731[null]5190860092013-01-092013-01-111011510303399104382
142013-01-07 00:00:42231631213476[null]1767090092013-01-192013-01-20201235076026701959
152013-01-07 00:00:4824234919105[null]11748190092013-04-252013-04-2820114908103388103296
162013-01-07 00:00:502366442197445315.414111735040092013-01-202013-01-261118739111614442
172013-01-07 00:00:5124235053819[null]2335340012013-01-082013-01-091112672960261051261
182013-01-07 00:00:552366459438051768.1618369470192013-04-132013-04-183211225760245115095
192013-01-07 00:00:592323484924[null]10465580092013-02-152013-02-17101824410425064368
202013-01-07 00:00:59236617416634382.62038864360012013-02-152013-02-20201825010125062845
212013-01-07 00:01:102423505703[null]2077690012013-04-292013-05-012011856910231828326
222013-01-07 00:01:182321564651733149.14119015731092013-01-172013-01-20201221196024812858
232013-01-07 00:01:34236635647794948.18615933750092013-05-092013-05-10201276330367019012
242013-01-07 00:01:36236617467352404.024411035720032013-02-112013-02-13211251696022506744
252013-01-07 00:01:4523662947976509.484811979680092013-01-162013-01-18101581130425035076
262013-01-07 00:01:5324234919105[null]11748190092013-04-252013-04-282011490810338810325
272013-01-07 00:02:022366462147032605.09716143220012013-01-082013-01-152011309430125021241
282013-01-07 00:02:0234320515541977155.8357257530052013-01-112013-01-122018288111219839910
292013-01-07 00:02:202321564651733149.23439015731092013-01-172013-01-20201221196014812836
302013-01-07 00:02:2134320513513892118.08717996630052013-01-082013-01-1031125946611219837073
312013-01-07 00:02:2411320515514703796.328111285750092013-01-192013-01-221012506460125012304
322013-01-07 00:02:281314634718105555.79969703960092013-10-092013-10-162018279102250123070
332013-01-07 00:02:331132051551229154.31127568700002013-01-082013-01-0921126385601219811317
342013-01-07 00:02:362366348479971692.39118567620122013-03-242013-04-0160311634101435161640
352013-01-07 00:02:432366462147032605.40936143220012013-01-082013-01-152011309430125021240
362013-01-07 00:02:432426634631371384.018211620591022013-01-112013-01-1431114985101250124199
372013-01-07 00:02:45236635622202145.544611504960032013-01-182013-01-202011227160125066391
382013-01-07 00:02:462423505703[null]50610032013-01-212013-01-22111874610161052910
392013-01-07 00:02:592426634631371386.405511620591022013-01-112013-01-1431114985101250124191
402013-01-07 00:03:09236617442881379.3402807370092013-03-202013-03-2320147429301250123018
412013-01-07 00:03:19231631213476[null]1767090092013-01-192013-01-20201235076016701949
422013-01-07 00:03:232423505703[null]486357000[null][null]201804330125067683
432013-01-07 00:03:27171133251503[null]2286870192013-02-182013-02-2322288211016173082
442013-01-07 00:03:332321564651733149.24959015731092013-01-172013-01-20201221196014812867
452013-01-07 00:03:3523704714566[null]5980760092013-01-122013-01-13613822010231824646
462013-01-07 00:03:372426634631371383.601411620591022013-01-112013-01-1431114985101250124194
472013-01-07 00:03:412366174211775713.6394137960092013-01-192013-01-2610188211016173067
482013-01-07 00:03:521132051551229154.2167568700002013-01-082013-01-0921118111012198113189
492013-01-07 00:03:5911320515514703796.308211285750092013-01-192013-01-221012506460225012305
502013-01-07 00:04:02236635647794948.18615933750092013-05-092013-05-10201276331267019012
512013-01-07 00:04:142366348488621093.72183680840032013-03-042013-03-084014404530225070198
522013-01-07 00:04:172423505703[null]4863570002013-01-242013-01-28101804330425067683
532013-01-07 00:04:24231116714716[null]10027370192013-01-172013-01-2220146810134815364
542013-01-07 00:04:251314634718105557.0739703960092013-10-092013-10-162018279101250123068
552013-01-07 00:04:29236617442881379.3402807370092013-03-202013-03-2340247429301250123018
562013-01-07 00:04:312366348488621093.72183680840032013-03-012013-03-054014404530125070198
572013-01-07 00:04:3211320515553078989.65255646640192013-03-172013-03-21421825010525062845
582013-01-07 00:04:332335031800[null]1920040092013-02-242013-02-281112223860267726
592013-01-07 00:04:33231034538784[null]7533941192013-02-092013-02-17231826810125068231
602013-01-07 00:04:372366467161591510.87141034490192013-03-012013-03-0520187911114811065
612013-01-07 00:04:402426634631371390.732611620591022013-01-112013-01-1431114985101250124140
622013-01-07 00:04:422366348488621093.72183680840032013-03-022013-03-064014404530125070198
632013-01-07 00:04:4434320515514703995.19543328531092013-03-242013-03-282211220660325062879
642013-01-07 00:04:512366462187672688.26257837251092013-07-292013-08-0433188551012502136
652013-01-07 00:04:553716976141949[null]9761180132013-02-152013-02-15211101410139216892
662013-01-07 00:04:582366348488621093.72183680840032013-03-042013-03-064014404530325070198
672013-01-07 00:05:042335031800[null]1920040092013-02-242013-02-282112223860167726
682013-01-07 00:05:063716976141949[null]10804760192013-05-292013-06-05201763530125067583
692013-01-07 00:05:133716976141949[null]10804760192013-05-292013-06-05201763530125067555
702013-01-07 00:05:311711332046168[null]1245610032013-03-092013-03-1342182531026701948
712013-01-07 00:05:322319820854488[null]4253400092013-02-102013-02-1111146810134815364
722013-01-07 00:05:3424235022013[null]4872650002013-05-232013-05-2741166930125021296
732013-01-07 00:05:393716976141949[null]10188950002013-09-082013-09-102012721560325064591
742013-01-07 00:05:422366348488621093.72183680840032013-03-042013-03-074014404530525070198
752013-01-07 00:05:4724266174262320.28632129000012013-01-062013-01-0710112269601250123037
762013-01-07 00:05:561132051551229154.31127568700002013-01-082013-01-092111811105219811317
772013-01-07 00:05:572423488158[null]2461770002013-01-312013-02-052011544101342122938
782013-01-07 00:06:01236617436222411.2046584290132013-07-032013-07-0720166930225021233
792013-01-07 00:06:072366174211775712.9457137960092013-01-192013-01-2610188211016173067
802013-01-07 00:06:112366174274372439.532111158040192013-02-022013-02-1120188551012502130
812013-01-07 00:06:112321564651733149.49349015731092013-01-182013-01-20201221196014812862
822013-01-07 00:06:232366142174402299.53838716630092013-01-112013-01-12101825010325062851
832013-01-07 00:06:3324235022013[null]4872650002013-05-232013-05-2741166930125021283
842013-01-07 00:06:352426634631371384.299911620591022013-01-112013-01-1431114985101250124140
852013-01-07 00:06:4011320538550121[null]7624230092013-05-072013-05-12211874510162042768
862013-01-07 00:06:422323484924[null]3998921012013-01-102013-01-132011268250138223046
872013-01-07 00:06:51236617436222411.0806584290132013-07-032013-07-0720166930125021283
882013-01-07 00:06:512423505703[null]182500122013-03-312013-04-022018282101312623257
892013-01-07 00:07:01236617436222411.1283584290132013-07-032013-07-0720166930125021291
902013-01-07 00:07:02236618917017149.95122040661192013-02-012013-02-041012580060225069695
912013-01-07 00:07:0311320515514703796.308211285750092013-01-192013-01-221012506461125012305
922013-01-07 00:07:08371696486514[null]10848150192013-03-182013-04-01201826810525068223
932013-01-07 00:07:1224234840722[null]11817110032013-02-102013-02-1320188191033168124297
942013-01-07 00:07:1734320515514703994.40143328531092013-03-242013-03-28221122066052506281
952013-01-07 00:07:232366174211775712.5501137960092013-01-192013-01-2610188211036173043
962013-01-07 00:07:25236617436222411.2683584290132013-07-032013-07-072016693012502125
972013-01-07 00:07:29171133251503[null]2286870192013-02-182013-02-2322288211026173022
982013-01-07 00:07:3024234910457[null]11948800002013-01-302013-02-02101202256053182465
992013-01-07 00:07:302423505703[null]4863570002013-01-242013-01-28211804330125067618
1002013-01-07 00:07:3624235022013[null]4872650002013-05-232013-05-2741166930325021269
Rows: 1-100 | Columns: 24

To understand the main difference between loading data into memory and loading data into a Vertica database, let's create a vDataFrame using an existing relation.

In [5]:
import time
start_time = time.time()
expedia = vp.vDataFrame("public.expedia")
print("elapsed time = {}".format(time.time() - start_time))
elapsed time = 0.07027411460876465

It took less than a second to create a vDataFrame. This dataset comes in at 6GB, which is very expensive for a personal machine, so we store the data entirely in Vertica and nothing is loaded into memory.

Let's compare this to loading the data into memory with pandas. You can try to load the entire dataset in your computer if you have at least 8GB of memory.

In [6]:
import pandas as pd

L_nrows = [10000, 100000, 1000000, 2000000, 5000000, 10000000, 20000000]
L_time = []
for nrows in L_nrows:
    start_time = time.time()
    expedia_df = pd.read_csv("data/expedia.csv", nrows = nrows)
    elapsed_time = time.time() - start_time
    L_time.append(elapsed_time)
    print("nrows = {}; elapsed time = {}".format(nrows, elapsed_time))
nrows = 10000; elapsed time = 0.05047297477722168
nrows = 100000; elapsed time = 0.39075613021850586
nrows = 1000000; elapsed time = 4.192815780639648
nrows = 2000000; elapsed time = 8.545155763626099
nrows = 5000000; elapsed time = 25.91479206085205
nrows = 10000000; elapsed time = 62.76401090621948
nrows = 20000000; elapsed time = 136.46023797988892

Loading data into pandas is quite fast when the data volume is low (less than some MB), but as the size of the dataset increases, it can become exponentially more expensive.

In [7]:
import matplotlib.pyplot as plt
plt.plot(L_nrows, L_time)
plt.show()

Performance will also drastically decrease.

In [9]:
start_time = time.time()
expedia_df.corr()
print("elapsed time = 119.78299331665039".format(time.time() - start_time))
elapsed time = 119.78299331665039

We're only using a little bit more than half of the dataset and it took almost 2 minutes to compute the correlation matrix.

Let's compute the entire correlation matrix using the vDataFrame.

In [11]:
start_time = time.time()
expedia.corr(show = False)
print("elapsed time = 88.62625098228455".format(time.time() - start_time))
elapsed time = 88.62625098228455

It took almost 1 minute and 30 seconds on one single Community Edition Vertica cluster without using any Vertica-specific features that might increase performance, like creating projections and or compression features.

VerticaPy caches the computed aggregations. With this cache available, we can repeat the correlation matrix computation almost instantaneously.

In [12]:
start_time = time.time()
expedia.corr(show = False)
print("elapsed time = {}".format(time.time() - start_time))
elapsed time = 0.4171907901763916

If needed, the cache can be deactivated.

Let's look at the memory usage for less than half of the dataset: Pandas is taking more than 3.6 GB.

In [13]:
expedia_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 24 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   date_time                  object 
 1   site_name                  int64  
 2   posa_continent             int64  
 3   user_location_country      int64  
 4   user_location_region       int64  
 5   user_location_city         int64  
 6   orig_destination_distance  float64
 7   user_id                    int64  
 8   is_mobile                  int64  
 9   is_package                 int64  
 10  channel                    int64  
 11  srch_ci                    object 
 12  srch_co                    object 
 13  srch_adults_cnt            int64  
 14  srch_children_cnt          int64  
 15  srch_rm_cnt                int64  
 16  srch_destination_id        int64  
 17  srch_destination_type_id   int64  
 18  is_booking                 int64  
 19  cnt                        int64  
 20  hotel_continent            int64  
 21  hotel_country              int64  
 22  hotel_market               int64  
 23  hotel_cluster              int64  
dtypes: float64(1), int64(20), object(3)
memory usage: 3.6+ GB

Let's compare that to the total memory usage of the vDataFrame; but instead of loading half the dataset, we'll load the entire dataset: less than 44KB!

The vDataFrame remembers the user's modifications to the data, but never loads the data itself into memory.

In [14]:
expedia.memory_usage()
Out[14]:
value
object692
"date_time"1775
"site_name"1775
"posa_continent"1780
"user_location_country"1787
"user_location_region"1786
"user_location_city"1784
"orig_destination_distance"1791
"user_id"1773
"is_mobile"1775
"is_package"1776
"channel"1773
"srch_ci"1773
"srch_co"1773
"srch_adults_cnt"1781
"srch_children_cnt"1783
"srch_rm_cnt"1777
"srch_destination_id"1785
"srch_destination_type_id"1790
"is_booking"1776
"cnt"1769
"hotel_continent"1781
"hotel_country"1779
"hotel_market"1778
"hotel_cluster"1779
total43391
Rows: 1-26 | Columns: 2

We can see a clear difference. With VerticaPy, we can take advantage of Vertica's structure and scalability and run fast queries without ever loading the data into memory. In-memory processing is limited by many factors which lead to downsampling most of the time.

The Structure of the vDataFrame

A vDataFrame is composed of columns called vColumns. You can see these with the 'get_columns' method.

In [15]:
expedia.get_columns()
Out[15]:
['"date_time"',
 '"site_name"',
 '"posa_continent"',
 '"user_location_country"',
 '"user_location_region"',
 '"user_location_city"',
 '"orig_destination_distance"',
 '"user_id"',
 '"is_mobile"',
 '"is_package"',
 '"channel"',
 '"srch_ci"',
 '"srch_co"',
 '"srch_adults_cnt"',
 '"srch_children_cnt"',
 '"srch_rm_cnt"',
 '"srch_destination_id"',
 '"srch_destination_type_id"',
 '"is_booking"',
 '"cnt"',
 '"hotel_continent"',
 '"hotel_country"',
 '"hotel_market"',
 '"hotel_cluster"']

To access a vColumn, simply write its name between between square brackets.

In [16]:
expedia["is_booking"]
Out[16]:
123
is_booking
Integer
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
161
170
180
190
200
210
220
230
240
250
260
270
281
290
301
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
501
510
520
530
540
550
560
570
580
590
601
610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
911
920
930
940
950
960
970
980
990
1000
Rows: 1-100 of 37670293 | Column: is_booking | Type: Integer

VerticaPy is smart enough to not recompute an aggregation that it's already computed.

In [17]:
expedia["is_booking"].describe()
Out[17]:
value
name"is_booking"
dtypeint
unique2.0
count37670293.0
034669600
13000693
Rows: 1-6 | Columns: 2

Each vColumn has its own catalog to save user modifications.

For example, we previously computed some aggregations for the column 'is_booking'. Let's look at the catalog of the vColumn.

In [18]:
expedia["is_booking"].catalog
Out[18]:
{'approx_unique': 2,
 'biserial': {},
 'count': 37670293,
 'cov': {},
 'cramer': {},
 'kendall': {},
 'pearson': {'"channel"': 0.0244378703859355,
  '"cnt"': -0.112906513841061,
  '"hotel_cluster"': -0.021548065656294,
  '"hotel_continent"': -0.0261279829743103,
  '"hotel_country"': -0.00394808105853647,
  '"hotel_market"': 0.0119576115952104,
  '"is_booking"': 1.0,
  '"is_mobile"': -0.0307536571387009,
  '"is_package"': -0.0763467336220978,
  '"orig_destination_distance"': -0.037732062561249,
  '"posa_continent"': 0.00972436717006096,
  '"site_name"': -0.0103791330144224,
  '"srch_adults_cnt"': -0.0490450028206689,
  '"srch_children_cnt"': -0.0222719630493467,
  '"srch_destination_id"': 0.0249567593167076,
  '"srch_destination_type_id"': 0.0404728725370959,
  '"srch_rm_cnt"': 0.0108395233625253,
  '"user_id"': 0.00182006552587162,
  '"user_location_city"': 0.00222760096637636,
  '"user_location_country"': 0.00752614887257535,
  '"user_location_region"': 0.00635166380501374},
 'regr_avgx': {},
 'regr_avgy': {},
 'regr_count': {},
 'regr_intercept': {},
 'regr_r2': {},
 'regr_slope': {},
 'regr_sxx': {},
 'regr_sxy': {},
 'regr_syy': {},
 'spearman': {},
 'spearmand': {}}

It will save the most important aggregations to avoid recomputation. The catalog will be updated whenever we make major changes to our data.

We can also view the vDataFrame's backend SQL code generation by setting 'sql_on' with the 'set_option' function.

In [19]:
vp.set_option("sql_on", True)
expedia["cnt"].describe()

Computing the different aggregations.

  SELECT
    APPROXIMATE_COUNT_DISTINCT("cnt")  
  FROM
"public"."expedia" LIMIT 1

Computing the descriptive statistics of all numerical columns using SUMMARIZE_NUMCOL.

  SELECT
    SUMMARIZE_NUMCOL("cnt") OVER ()  
  FROM
"public"."expedia"
Out[19]:
value
name"cnt"
dtypeint
unique103.0
count37670293
mean1.48338392271081
std1.21977557865576
min1.0
approx_25%1.0
approx_50%1.0
approx_75%2.0
max269.0
Rows: 1-11 | Columns: 2

You can also display the elapsed time of the different queries. For example, let's compute the correlation matrix of the vDataFrame.

Note: In order to display matplotlib graphics in Jupyter, you'll need to use the '%matplotlib inline' command the first time you decide to draw a graphic.

In [21]:
expedia = vp.vDataFrame("public.expedia") # creating a new vDataFrame to delete the catalog 
%matplotlib inline
expedia.corr()
Execution: 0.025s

Computing the pearson Corr Matrix.

  SELECT
    CORR_MATRIX("site_name", "posa_continent", "user_location_country", "user_location_region", "user_location_city", "orig_destination_distance", "user_id", "is_mobile", "is_package", "channel", "srch_adults_cnt", "srch_children_cnt", "srch_rm_cnt", "srch_destination_id", "srch_destination_type_id", "is_booking", "cnt", "hotel_continent", "hotel_country", "hotel_market", "hotel_cluster") OVER ()  
  FROM
"public"."expedia"
Execution: 103.668s
Out[21]:
"site_name"
"posa_continent"
"user_location_country"
"user_location_region"
"user_location_city"
"orig_destination_distance"
"user_id"
"is_mobile"
"is_package"
"channel"
"srch_adults_cnt"
"srch_children_cnt"
"srch_rm_cnt"
"srch_destination_id"
"srch_destination_type_id"
"is_booking"
"cnt"
"hotel_continent"
"hotel_country"
"hotel_market"
"hotel_cluster"
"site_name"1.0-0.6348166121436840.1664132212657090.12821437947825-0.016053963327060.03032478108460350.0240712207009264-0.007103805830062830.0518134532878163-0.0297380109487778-0.009045122175538-0.03223408373322960.01566116032719160.0290203455279926-0.0106931925782493-0.01037913301442240.02002836086509520.2027563292288970.261030775499846-0.0670971656075757-0.0224084561384892
"posa_continent"-0.6348166121436841.00.175935856883486-0.02850191326854850.03882178289470870.0480512845900701-0.01115551883079580.0162988210265805-0.094096913382640.09203252613019410.007996765881348410.0337225436643851-0.0316002379055981-0.01243634715931410.0425902130095240.00972436717006096-0.0148476525499434-0.333604963254112-0.1563979954738710.05057099521092480.0149381746836798
"user_location_country"0.1664132212657090.1759358568834861.00.05536354274660720.1234946267784970.0477198072121744-0.02294058147376160.00361121977714446-0.02532075943052240.1060401615373850.0372858870458130.0369011876687484-0.0002119819474623540.009075177253896020.03055584169311220.007526148872575350.00276083147077441-0.06725256171896140.09464892750320330.0182121272175058-0.0104772970912144
"user_location_region"0.12821437947825-0.02850191326854850.05536354274660721.00.1297564955597830.139158216738426-0.001497678393316990.01668011971254340.03805884400146740.0006032195056277450.009084130704715680.01234970709083880.0002709867422972280.02210991688297710.009860284546138930.00635166380501374-0.008889486972185620.0459288486415407-0.0545003468243970.04498743927338020.0074534610007365
"user_location_city"-0.016053963327060.03882178289470870.1234946267784970.1297564955597831.00.0127803340681775-0.006873459912914950.001440171706179160.01386466156915440.0264944739046070.008191533353516790.00904155620877023-6.816773064322e-050.002158661037280490.0005084964212388390.00222760096637636-0.001201345167046450.00535640996061662-0.009396294478005580.009446412165345460.00083053274838471
"orig_destination_distance"0.03032478108460350.04805128459007010.04771980721217440.1391582167384260.01278033406817751.00.0155015141284054-0.05495334495815660.03886404117843720.00295070722383832-0.0240177422409684-0.0613962725746934-0.00759779409994088-0.0273266777830876-0.039254647380722-0.0377320625612490.009996389083101380.4148283226016520.254986628696838-0.08768185971499420.00726002985218034
"user_id"0.0240712207009264-0.0111555188307958-0.0229405814737616-0.00149767839331699-0.006873459912914950.01550151412840541.0-0.00574972633365601-0.0118955887406791-0.00204060626454281-0.00453751724480532-0.0002214843344027380.001217887230714950.00301200307349740.004548904322459440.00182006552587162-0.0003802867139462270.003775015816363040.00949570839283134-0.004097007192123580.00105157644341496
"is_mobile"-0.007103805830062830.01629882102658050.003611219777144460.01668011971254340.00144017170617916-0.0549533449581566-0.005749726333656011.00.0541784707255951-0.03263897428290660.01958707822758990.0204365745420044-0.0223911112026215-0.00808136825631841-0.0179307464937316-0.03075365713870090.00494003348150226-0.0211207197358126-0.02571753047167850.007820162807232610.00841153438654652
"is_package"0.0518134532878163-0.09409691338264-0.02532075943052240.03805884400146740.01386466156915440.0388640411784372-0.01189558874067910.05417847072559511.0-0.0102508833763389-0.0231097799709034-0.0335062467550592-0.0390976480309395-0.14754569415511-0.228205454125721-0.07634673362209780.1269286856347610.111825686970504-0.0389092304457683-0.01578018754793760.0387325456425916
"channel"-0.02973801094877780.09203252613019410.1060401615373850.0006032195056277450.0264944739046070.00295070722383832-0.00204060626454281-0.0326389742829066-0.01025088337633891.0-0.02251597384322850.00653297509986320.006843024502901170.001718635408834330.02714705443043240.0244378703859355-0.0119804027217825-0.0212967280428668-0.001918422059844180.00502573302084590.000707020681183821
"srch_adults_cnt"-0.0090451221755380.007996765881348410.0372858870458130.009084130704715680.00819153335351679-0.0240177422409684-0.004537517244805320.0195870782275899-0.0231097799709034-0.02251597384322851.00.1091399810630060.5119841182717840.00414866532620076-0.010341811696591-0.04904500282066890.0143207899461151-0.0163055290691414-0.01968087639742010.006384723527640880.0123092101846642
"srch_children_cnt"-0.03223408373322960.03372254366438510.03690118766874840.01234970709083880.00904155620877023-0.0613962725746934-0.0002214843344027380.0204365745420044-0.03350624675505920.00653297509986320.1091399810630061.00.0894186825993112-0.00838418933162597-0.00843304285144073-0.02227196304934670.0178175541861587-0.0580978049729952-0.0441895231798710.001894229856935250.0162605522871085
"srch_rm_cnt"0.0156611603271916-0.0316002379055981-0.0002119819474623540.000270986742297228-6.816773064322e-05-0.007597794099940880.00121788723071495-0.0223911112026215-0.03909764803093950.006843024502901170.5119841182717840.08941868259931121.00.0123774892475750.0112112100599250.0108395233625253-0.0001918458312365490.01733479976701460.00831746161170434-0.000250246328321126-0.00595406145125926
"srch_destination_id"0.0290203455279926-0.01243634715931410.009075177253896020.02210991688297710.00215866103728049-0.02732667778308760.0030120030734974-0.00808136825631841-0.147545694155110.001718635408834330.00414866532620076-0.008384189331625970.0123774892475751.00.4376650756713560.0249567593167076-0.02131270157823160.0276967773099850.05383529013982960.0839537551580626-0.0117117007471161
"srch_destination_type_id"-0.01069319257824930.0425902130095240.03055584169311220.009860284546138930.000508496421238839-0.0392546473807220.00454890432245944-0.0179307464937316-0.2282054541257210.0271470544304324-0.010341811696591-0.008433042851440730.0112112100599250.4376650756713561.00.0404728725370959-0.0255622234293765-0.0378828342389403-0.02233014674005040.0324633908894349-0.0328496822620841
"is_booking"-0.01037913301442240.009724367170060960.007526148872575350.006351663805013740.00222760096637636-0.0377320625612490.00182006552587162-0.0307536571387009-0.07634673362209780.0244378703859355-0.0490450028206689-0.02227196304934670.01083952336252530.02495675931670760.04047287253709591.0-0.112906513841061-0.0261279829743103-0.003948081058536470.0119576115952104-0.021548065656294
"cnt"0.0200283608650952-0.01484765254994340.00276083147077441-0.00888948697218562-0.001201345167046450.00999638908310138-0.0003802867139462270.004940033481502260.126928685634761-0.01198040272178250.01432078994611510.0178175541861587-0.000191845831236549-0.0213127015782316-0.0255622234293765-0.1129065138410611.00.0176651279788286-0.00109795422989866-0.01124531182801970.00294386623308423
"hotel_continent"0.202756329228897-0.333604963254112-0.06725256171896140.04592884864154070.005356409960616620.4148283226016520.00377501581636304-0.02112071973581260.111825686970504-0.0212967280428668-0.0163055290691414-0.05809780497299520.01733479976701460.027696777309985-0.0378828342389403-0.02612798297431030.01766512797882861.00.298871980714726-0.0925976257775776-0.0139632952967878
"hotel_country"0.261030775499846-0.1563979954738710.0946489275032033-0.054500346824397-0.009396294478005580.2549866286968380.00949570839283134-0.0257175304716785-0.0389092304457683-0.00191842205984418-0.0196808763974201-0.0441895231798710.008317461611704340.0538352901398296-0.0223301467400504-0.00394808105853647-0.001097954229898660.2988719807147261.00.0211167593299163-0.0242885512464789
"hotel_market"-0.06709716560757570.05057099521092480.01821212721750580.04498743927338020.00944641216534546-0.0876818597149942-0.004097007192123580.00782016280723261-0.01578018754793760.00502573302084590.006384723527640880.00189422985693525-0.0002502463283211260.08395375515806260.03246339088943490.0119576115952104-0.0112453118280197-0.09259762577757760.02111675932991631.00.0342053589585268
"hotel_cluster"-0.02240845613848920.0149381746836798-0.01047729709121440.00745346100073650.000830532748384710.007260029852180340.001051576443414960.008411534386546520.03873254564259160.0007070206811838210.01230921018466420.0162605522871085-0.00595406145125926-0.0117117007471161-0.0328496822620841-0.0215480656562940.00294386623308423-0.0139632952967878-0.02428855124647890.03420535895852681.0
Rows: 1-21 | Columns: 22

All heavy computations are pushed to Vertica, and each aggregation is saved to each vColumn's catalog. If we call the 'corr' method again, it'll only take a couple seconds (time needed to draw the graphic).

In [22]:
start_time = time.time()
expedia.corr()
print("elapsed time = {}".format(time.time() - start_time))
elapsed time = 0.6716580390930176

We can turn off the different functions to display the elapsed time and the SQL code generation.

In [24]:
vp.set_option("sql_on", False)
vp.set_option("time_on", False)

You can access the current vDataFrame relation with the 'current_relation' method.

In [25]:
print(expedia.current_relation())
"public"."expedia"

Since we're working with SQL code generation, this relation will change according to the user's modifications. For example, let's impute the missing values of the vColumn 'orig_destination_distance' by its average and drop the vColumn 'is_package'.

In [26]:
expedia["orig_destination_distance"].fillna(method = "avg")
expedia["is_package"].drop()
print(expedia.current_relation())
13525001 elements were filled.
(
   SELECT
     "date_time",
     "site_name",
     "posa_continent",
     "user_location_country",
     "user_location_region",
     "user_location_city",
     COALESCE("orig_destination_distance", 1970.0900267207) AS "orig_destination_distance",
     "user_id",
     "is_mobile",
     "channel",
     "srch_ci",
     "srch_co",
     "srch_adults_cnt",
     "srch_children_cnt",
     "srch_rm_cnt",
     "srch_destination_id",
     "srch_destination_type_id",
     "is_booking",
     "cnt",
     "hotel_continent",
     "hotel_country",
     "hotel_market",
     "hotel_cluster" 
   FROM
 (
   SELECT
     "date_time",
     "site_name",
     "posa_continent",
     "user_location_country",
     "user_location_region",
     "user_location_city",
     "orig_destination_distance",
     "user_id",
     "is_mobile",
     "channel",
     "srch_ci",
     "srch_co",
     "srch_adults_cnt",
     "srch_children_cnt",
     "srch_rm_cnt",
     "srch_destination_id",
     "srch_destination_type_id",
     "is_booking",
     "cnt",
     "hotel_continent",
     "hotel_country",
     "hotel_market",
     "hotel_cluster" 
   FROM
 "public"."expedia") 
VERTICAPY_SUBTABLE) 
VERTICAPY_SUBTABLE

Notice how our dropping the vColumn 'is_package' simply removes it from the SELECT statement in our SQL query. Similarly, imputing a vColumn translates to using the 'COALESCE' SQL function.

vDataFrame Attributes and Management

As we saw, the vDataFrame has many attributes and methods. vDataFrames have two types of attributes:

  • Virtual Columns
  • Main Attributes (columns, main_relation ...)

The vDataFrame's main attributes are stored in the _VERTICAPYVARIABLES dictionary.

In [27]:
expedia._VERTICAPY_VARIABLES_
Out[27]:
{'allcols_ind': 24,
 'columns': ['"date_time"',
  '"site_name"',
  '"posa_continent"',
  '"user_location_country"',
  '"user_location_region"',
  '"user_location_city"',
  '"orig_destination_distance"',
  '"user_id"',
  '"is_mobile"',
  '"channel"',
  '"srch_ci"',
  '"srch_co"',
  '"srch_adults_cnt"',
  '"srch_children_cnt"',
  '"srch_rm_cnt"',
  '"srch_destination_id"',
  '"srch_destination_type_id"',
  '"is_booking"',
  '"cnt"',
  '"hotel_continent"',
  '"hotel_country"',
  '"hotel_market"',
  '"hotel_cluster"'],
 'count': 37670293,
 'exclude_columns': [],
 'history': ['{Wed May  4 20:35:03 2022} [Fillna]: 13525001 "orig_destination_distance" missing values were  filled.',
  '{Wed May  4 20:35:03 2022} [Drop]: vColumn "is_package" was deleted from the vDataFrame.'],
 'input_relation': 'expedia',
 'main_relation': '"public"."expedia"',
 'order_by': {},
 'saving': [],
 'schema': 'public',
 'where': []}

You should never change these attributes manually.

vDataFrame Data Types

The vDataFrame uses the data types of its vColumns. Computing an histogram for a numerical data type is not the same as computing one other for a categorical data type. The vDataFrame identifies four main categories:

  • int: integers are treated like categorical data types when their cardinality is low and will be considered numeric otherwise
  • float: numerics
  • date: date-like data types
  • text: categorical data types

Other data types may automatically treated as categorical. You can examine these different data types using the 'dtypes' method.

In [28]:
expedia.dtypes()
Out[28]:
dtype
"date_time"timestamp
"site_name"int
"posa_continent"int
"user_location_country"int
"user_location_region"int
"user_location_city"int
"orig_destination_distance"float
"user_id"int
"is_mobile"int
"channel"int
"srch_ci"date
"srch_co"date
"srch_adults_cnt"int
"srch_children_cnt"int
"srch_rm_cnt"int
"srch_destination_id"int
"srch_destination_type_id"int
"is_booking"int
"cnt"int
"hotel_continent"int
"hotel_country"int
"hotel_market"int
"hotel_cluster"int
Rows: 1-23 | Columns: 2

You can perform conversions with the 'astype' method.

In [29]:
expedia["hotel_market"].astype("varchar")
expedia["hotel_market"].ctype()
Out[29]:
'varchar'

You can also get the vColumn category using the 'category' method.

In [30]:
expedia["hotel_market"].category()
Out[30]:
'text'

Exporting / Saving / Loading a vDataFrame

The functions 'save' and 'load' allow the user to save and load their vDataFrame structure.

In [31]:
expedia.save()
expedia.filter("is_booking = 1")
34669600 elements were filtered.
Out[31]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Float
123
user_id
Int
123
is_mobile
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
Abc
hotel_market
Varchar
123
hotel_cluster
Int
12013-01-07 00:00:502366442197445315.41411173504092013-01-202013-01-261118739111614442
22013-01-07 00:02:0234320515541977155.835725753052013-01-112013-01-122018288111219839910
32013-01-07 00:02:2134320513513892118.0871799663052013-01-082013-01-1031125946611219837073
42013-01-07 00:04:02236635647794948.1861593375092013-05-092013-05-10201276331267019012
52013-01-07 00:04:372366467161591510.8714103449092013-03-012013-03-0520187911114811065
62013-01-07 00:07:0311320515514703796.30821128575092013-01-192013-01-221012506461125012305
72013-01-07 00:11:47113205354557631027.0272715634002013-02-172013-02-24201826811125068210
82013-01-07 00:14:242366447507361976.68361100037092013-02-012013-02-031014420181125067555
92013-01-07 00:16:0124235057031970.09002672071153606042013-05-022013-05-032111977961125021284
102013-01-07 00:16:5317113320516341970.0900267207759265092013-02-012013-02-04201874011161051243
112013-01-07 00:17:0824235195271970.09002672071193156042013-01-132013-01-14101499211131107210
122013-01-07 00:18:10131634803827611447.14451012948092013-02-082013-02-09111821711103413962
132013-01-07 00:18:59242350538191970.0900267207233534012013-01-082013-01-091112672961161051275
142013-01-07 00:23:3024235057031970.0900267207233286112013-05-252013-05-27101874511162042797
152013-01-07 00:28:05236633714241183.1422409569102013-01-132013-01-14101406031125069815
162013-01-07 00:28:561711332515031970.0900267207228687092013-02-182013-02-2322288211116173022
172013-01-07 00:29:048477462349210465.10871170845012013-05-162013-05-201012694361125074350
182013-01-07 00:29:352366246336720.69461097089092013-01-112013-01-12101149786112506364
192013-01-07 00:29:5613185464136061970.0900267207728889042013-01-072013-01-0820114643115203146130
202013-01-07 00:32:001711332073251970.0900267207940442092013-01-172013-01-2010188231113014478
212013-01-07 00:35:1724235057031970.0900267207233286112013-05-252013-05-27101874511162042792
222013-01-07 00:36:1724223168422961970.090026720710153192013-01-312013-02-0110182211113994782
232013-01-07 00:39:5623149290331161970.0900267207473260092013-01-242013-01-262012304811134815242
242013-01-07 00:41:0224235195271970.0900267207259991012013-05-052013-05-0610123507611670199
252013-01-07 00:41:37231190277311970.0900267207519086092013-01-092013-01-111011510311399104382
262013-01-07 00:43:49236617459382.6539563376092013-01-072013-01-09201827811125036848
272013-01-07 00:46:19236631137484196.4225334885092013-01-072013-01-09201826711125067555
282013-01-07 00:48:558477462349210913.54461170845012013-05-282013-05-301011227561125041491
292013-01-07 00:49:31236634857574823.93451104564092013-02-032013-02-0510166931125021233
302013-01-07 00:52:1224235057031970.0900267207486892032013-02-102013-02-12211874011161051258
312013-01-07 00:54:06230356374461970.0900267207729835092013-01-292013-02-01201807061134815311
322013-01-07 00:59:4537169668229911970.09002672071189249092013-02-112013-02-13201874511162042711
332013-01-07 00:59:5624235057031970.0900267207486892032013-02-082013-02-10211874611161052943
342013-01-07 01:03:032366351365084577.652885627012013-02-152013-02-171113339061161051243
352013-01-07 01:06:5924235057031970.0900267207486892032013-02-132013-02-16211878811167729
362013-01-07 01:07:4337169639562931970.09002672071084371002013-02-282013-03-012014790041167030915
372013-01-07 01:07:508477462349210524.19511170845012013-06-012013-06-031012660161125055028
382013-01-07 01:11:05231332033921970.09002672071127970082013-01-082013-01-1021182531116701997
392013-01-07 01:14:192335177335432650.4725119414092013-01-172013-01-1910118788111520325361
402013-01-07 01:17:042369923489391970.0900267207337690092013-01-112013-01-131011291651161073697
412013-01-07 01:18:47242345357471970.09002672071166542002013-01-142013-01-191011223361125067555
422013-01-07 01:25:10235512404481970.0900267207627909092013-01-102013-01-1120121660111610518143
432013-01-07 01:25:5624235057031970.0900267207486892032013-02-132013-02-16211878811167722
442013-01-07 01:26:2124216751112611970.0900267207210009092013-01-092013-01-126031181511135170130
452013-01-07 01:26:272346172561530.248889738092013-01-072013-01-08201874311161442459
462013-01-07 01:26:3723229407163731970.0900267207217610092013-01-082013-01-111011201411125064415
472013-01-07 01:26:4923202269252931970.0900267207913278092013-01-092013-01-10211286106116222455
482013-01-07 01:27:19236617414752342.98991043452092013-03-012013-03-022018279111250123083
492013-01-07 01:28:332375144524671970.0900267207945993092013-01-122013-01-1720146811134815315
502013-01-07 01:30:2023191376335811970.0900267207325869092013-01-082013-01-1110187971116144145059
512013-01-07 01:31:5324235057031970.09002672071155502192013-02-062013-02-07201874111161441382
522013-01-07 01:35:4911320513527655171.416174242042013-01-072013-01-081018228111219837139
532013-01-07 01:35:5324235057031970.0900267207487272092013-01-292013-02-03101822011131824615
542013-01-07 01:46:042366174326286936.5268526624012013-08-182013-08-243222373111162230120
552013-01-07 01:48:4123198208544881970.0900267207848741102013-01-072013-01-113014568761134815346
562013-01-07 01:49:12237047145661970.0900267207756106092013-01-112013-01-16101202256113182463
572013-01-07 01:50:4923139529254819.1125881839092013-01-082013-01-0920182901116153846
582013-01-07 01:55:102366174241032299.5898104938092013-01-192013-01-23401829111125019116
592013-01-07 01:55:3624235057031970.09002672071181832112013-02-102013-02-121018282111312623215
602013-01-07 01:56:452335177335432652.7368119414092013-01-192013-01-2110118788111520325364
612013-01-07 02:00:352366246506612757.8656283970002013-05-262013-05-28101825011125062888
622013-01-07 02:01:532323168422961970.0900267207167267022013-03-102013-03-112014132911131044146
632013-01-07 02:02:37231332033921970.0900267207167923012013-01-242013-01-26211215886116144249
642013-01-07 02:05:192335177335431787.217119414092013-01-212013-01-2210182641115108156867
652013-01-07 02:06:052424643028602110.2393259068092013-01-192013-01-20301208786116144135382
662013-01-07 02:09:572323513536321970.0900267207627910092013-01-172013-01-18101156555113114238
672013-01-07 02:10:0311320536730789822.9619160582002013-01-272013-01-2920126023611219839733
682013-01-07 02:11:3723229249560901970.0900267207621828092013-01-182013-01-2130213441113162171063
692013-01-07 02:14:2423155241460011970.090026720771089092013-01-142013-01-17101108611154380320
702013-01-07 02:17:27242350505471970.09002672071181832012013-03-202013-03-211018282111312623215
712013-01-07 02:22:032366174188705581.1726826240092013-03-242013-03-302218747111310610768
722013-01-07 02:25:342313182096296.7651111771192013-01-122013-01-13232874611161052938
732013-01-07 02:29:26242364417901970.09002672071161551012013-02-102013-02-121018279111250123023
742013-01-07 02:34:532316213341701970.0900267207439290092013-01-262013-01-28201152025115194155562
752013-01-07 02:35:1024235057031970.09002672071166532012013-01-212013-01-23101882211131309129
762013-01-07 02:40:022322958186331970.0900267207822913092013-01-082013-01-10201234361113162115036
772013-01-07 02:49:052366174464325359.5651193021092013-05-242013-05-2712143199111310610622
782013-01-07 02:53:05237047145661970.09002672071157092112013-01-082013-01-13202540161131828357
792013-01-07 03:01:2323154196381401970.0900267207283158092013-01-072013-01-0910123076111425153059
802013-01-07 03:02:0234320535466781363.9733565041092013-01-272013-01-311011162861125066673
812013-01-07 03:04:582362135430301970.0900267207116070092013-01-072013-01-081018253111670199
822013-01-07 03:06:43113205385454942368.07831136837092013-03-292013-04-0110119002311425153085
832013-01-07 03:13:05234842459571970.0900267207922172032013-01-102013-01-1110119091116180152536
842013-01-07 03:13:51239371153621970.0900267207171221092013-01-072013-01-0810123756611622179457
852013-01-07 03:14:242346172561535794.7986512802092013-02-032013-02-07302882211131309164
862013-01-07 03:17:0623139529254806.861275766092013-01-102013-01-13201186336116682759
872013-01-07 03:18:542322839270631970.09002672071177767022013-02-282013-03-0421182201113182469
882013-01-07 03:19:342366174316865439.98889926092013-01-092013-01-14101149071110875878
892013-01-07 03:20:59230203542261970.0900267207540541092013-01-152013-01-161011223661125067633
902013-01-07 03:26:292366348488621100.7225961860092013-02-142013-02-20222113731114128145569
912013-01-07 03:32:1023198208544881970.0900267207438844012013-01-162013-01-171112270411131046364
922013-01-07 03:33:232366348488623.4778869522092013-01-142013-01-151011223461125067541
932013-01-07 03:36:5623198208544881970.0900267207438844012013-01-162013-01-171112270411131046364
942013-01-07 03:37:0823206169402511970.0900267207348841092013-01-072013-01-081118224111613527829
952013-01-07 03:39:532366363123461709.967191258092013-01-152013-01-161011885311219839316
962013-01-07 04:00:06230203542261970.0900267207540541092013-01-162013-01-1711112267611250123049
972013-01-07 04:02:37236614920498913.8834980042002013-01-312013-02-021011193911125069448
982013-01-07 04:07:27234842459571970.0900267207823750032013-01-082013-01-11111525331125069972
992013-01-07 04:08:5823190133273831970.0900267207857247092013-01-252013-01-29201843961161441325
1002013-01-07 04:09:16230317561368077.8877438592092013-01-152013-01-161015012311250110550
Rows: 1-100 of 3000693 | Columns: 23

In this example, we filtered some data and want to go back to the previous structure.

In [33]:
expedia = expedia.load()
print(expedia.shape())
(37670293, 23)

Don't forget to use the help function when you need more information about the different functions!

In [34]:
help(expedia.load)
Help on method load in module verticapy.vdataframe:

load(offset:int=-1) method of verticapy.vdataframe.vDataFrame instance
    ---------------------------------------------------------------------------
    Loads a previous structure of the vDataFrame. 
    
    Parameters
    ----------
    offset: int, optional
        offset of the saving. Example: -1 to load the last saving.
    
    Returns
    -------
    vDataFrame
        vDataFrame of the loading.
    
    See Also
    --------
    vDataFrame.save : Saves the current vDataFrame structure.

The vDataFrame works the same way as a view. However, nothing is stored in the database unless you do it explicitly with the 'to_db' method.

You can save the vDataFrame's final relation using the 'to_db' method. If you want to save the result into a table, be sure to look at the expected disk usage of exporting the vDataFrame.

In [35]:
expedia.expected_store_usage(unit = "Gb")

Out[35]:
expected_size (Gb)
max_size (Gb)
type
"date_time"7.450580596923828e-090.2806655541062355timestamp
"site_name"0.0470501743257045750.2806655541062355int
"posa_continent"0.035083194263279440.2806655541062355int
"user_location_country"0.074533438310027120.2806655541062355int
"user_location_region"0.099749343469738960.2806655541062355int
"user_location_city"0.167342474684119220.2806655541062355int
"orig_destination_distance"7.450580596923828e-090.2806655541062355float
"user_id"0.213396676816046240.2806655541062355int
"is_mobile"0.035083194263279440.2806655541062355int
"channel"0.035088971257209780.2806655541062355int
"srch_ci"7.450580596923828e-090.28031475841999054date
"srch_co"7.450580596923828e-090.28031475096940994date
"srch_adults_cnt"0.035083194263279440.2806655541062355int
"srch_children_cnt"0.035083194263279440.2806655541062355int
"srch_rm_cnt"0.035083194263279440.2806655541062355int
"srch_destination_id"0.156296298839151860.2806655541062355int
"srch_destination_type_id"0.035083194263279440.2806655541062355int
"is_booking"0.035083194263279440.2806655541062355int
"cnt"0.035189012065529820.2806655541062355int
"hotel_continent"0.035083194263279440.2806655541062355int
"hotel_country"0.07826952729374170.2806655541062355int
"hotel_market"0.105387241579592232.806655541062355varchar
"hotel_cluster"0.066515024751424790.2806655541062355int
separator0.80691346805542710.8069134680554271
header3.4831464290618896e-073.4831464290618896e-07
rawsize2.1663975836709149.787509948946536
Rows: 1-26 | Columns: 4

After we decide that we have the space to store the vDataFrame, we can store it in our database.

In [36]:
expedia.to_db("public.expedia_clean",
              relation_type = "table")
Out[36]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Float
123
user_id
Int
123
is_mobile
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
Abc
hotel_market
Varchar
123
hotel_cluster
Int
12013-01-07 00:00:0224235057031970.0900267207461899092013-03-142013-03-1521166930125021241
22013-01-07 00:00:062366174211775713.620613796092013-01-192013-01-2610188211036173058
32013-01-07 00:00:0611320515514703795.72981128575092013-01-192013-01-2210125064601250123091
42013-01-07 00:00:0937169761419491970.09002672071080476092013-05-292013-06-05201763530125067510
52013-01-07 00:00:1737169761419491970.09002672071018895002013-09-082013-09-102012721560225064559
62013-01-07 00:00:1737169761419491970.09002672071080476092013-05-292013-06-05201763530125067510
72013-01-07 00:00:182366462187672696.5003783725192013-07-292013-08-04331885510225021348
82013-01-07 00:00:2337169761419491970.09002672071080476092013-05-292013-06-05201763530125067510
92013-01-07 00:00:2823662947976511.0121197968092013-01-162013-01-18101581130225035051
102013-01-07 00:00:28236635647794948.1861593375092013-05-092013-05-10402276330267019012
112013-01-07 00:00:29242349191051970.09002672071174819092013-04-252013-04-2820114908101388103275
122013-01-07 00:00:33242350538191970.0900267207233534012013-01-082013-01-091112672960161051248
132013-01-07 00:00:39231190277311970.0900267207519086092013-01-092013-01-111011510303399104382
142013-01-07 00:00:422316312134761970.0900267207176709092013-01-192013-01-20201235076026701959
152013-01-07 00:00:48242349191051970.09002672071174819092013-04-252013-04-2820114908103388103296
162013-01-07 00:00:502366442197445315.41411173504092013-01-202013-01-261118739111614442
172013-01-07 00:00:51242350538191970.0900267207233534012013-01-082013-01-091112672960261051261
182013-01-07 00:00:552366459438051768.161836947092013-04-132013-04-183211225760245115095
192013-01-07 00:00:5923234849241970.09002672071046558092013-02-152013-02-17101824410425064368
202013-01-07 00:00:59236617416634382.6203886436012013-02-152013-02-20201825010125062845
212013-01-07 00:01:1024235057031970.0900267207207769012013-04-292013-05-012011856910231828326
222013-01-07 00:01:182321564651733149.1411901573192013-01-172013-01-20201221196024812858
232013-01-07 00:01:34236635647794948.1861593375092013-05-092013-05-10201276330367019012
242013-01-07 00:01:36236617467352404.02441103572032013-02-112013-02-13211251696022506744
252013-01-07 00:01:4523662947976509.48481197968092013-01-162013-01-18101581130425035076
262013-01-07 00:01:53242349191051970.09002672071174819092013-04-252013-04-282011490810338810325
272013-01-07 00:02:022366462147032605.0971614322012013-01-082013-01-152011309430125021241
282013-01-07 00:02:0234320515541977155.835725753052013-01-112013-01-122018288111219839910
292013-01-07 00:02:202321564651733149.2343901573192013-01-172013-01-20201221196014812836
302013-01-07 00:02:2134320513513892118.0871799663052013-01-082013-01-1031125946611219837073
312013-01-07 00:02:2411320515514703796.32811128575092013-01-192013-01-221012506460125012304
322013-01-07 00:02:281314634718105555.7996970396092013-10-092013-10-162018279102250123070
332013-01-07 00:02:331132051551229154.3112756870002013-01-082013-01-0921126385601219811317
342013-01-07 00:02:362366348479971692.3911856762022013-03-242013-04-0160311634101435161640
352013-01-07 00:02:432366462147032605.4093614322012013-01-082013-01-152011309430125021240
362013-01-07 00:02:432426634631371384.01821162059122013-01-112013-01-1431114985101250124199
372013-01-07 00:02:45236635622202145.54461150496032013-01-182013-01-202011227160125066391
382013-01-07 00:02:4624235057031970.09002672075061032013-01-212013-01-22111874610161052910
392013-01-07 00:02:592426634631371386.40551162059122013-01-112013-01-1431114985101250124191
402013-01-07 00:03:09236617442881379.340280737092013-03-202013-03-2320147429301250123018
412013-01-07 00:03:192316312134761970.0900267207176709092013-01-192013-01-20201235076016701949
422013-01-07 00:03:2324235057031970.090026720748635700[null][null]201804330125067683
432013-01-07 00:03:271711332515031970.0900267207228687092013-02-182013-02-2322288211016173082
442013-01-07 00:03:332321564651733149.2495901573192013-01-172013-01-20201221196014812867
452013-01-07 00:03:35237047145661970.0900267207598076092013-01-122013-01-13613822010231824646
462013-01-07 00:03:372426634631371383.60141162059122013-01-112013-01-1431114985101250124194
472013-01-07 00:03:412366174211775713.639413796092013-01-192013-01-2610188211016173067
482013-01-07 00:03:521132051551229154.216756870002013-01-082013-01-0921118111012198113189
492013-01-07 00:03:5911320515514703796.30821128575092013-01-192013-01-221012506460225012305
502013-01-07 00:04:02236635647794948.1861593375092013-05-092013-05-10201276331267019012
512013-01-07 00:04:142366348488621093.7218368084032013-03-042013-03-084014404530225070198
522013-01-07 00:04:1724235057031970.0900267207486357002013-01-242013-01-28101804330425067683
532013-01-07 00:04:242311167147161970.09002672071002737092013-01-172013-01-2220146810134815364
542013-01-07 00:04:251314634718105557.073970396092013-10-092013-10-162018279101250123068
552013-01-07 00:04:29236617442881379.340280737092013-03-202013-03-2340247429301250123018
562013-01-07 00:04:312366348488621093.7218368084032013-03-012013-03-054014404530125070198
572013-01-07 00:04:3211320515553078989.6525564664092013-03-172013-03-21421825010525062845
582013-01-07 00:04:3323350318001970.0900267207192004092013-02-242013-02-281112223860267726
592013-01-07 00:04:332310345387841970.0900267207753394192013-02-092013-02-17231826810125068231
602013-01-07 00:04:372366467161591510.8714103449092013-03-012013-03-0520187911114811065
612013-01-07 00:04:402426634631371390.73261162059122013-01-112013-01-1431114985101250124140
622013-01-07 00:04:422366348488621093.7218368084032013-03-022013-03-064014404530125070198
632013-01-07 00:04:4434320515514703995.1954332853192013-03-242013-03-282211220660325062879
642013-01-07 00:04:512366462187672688.2625783725192013-07-292013-08-0433188551012502136
652013-01-07 00:04:5537169761419491970.0900267207976118032013-02-152013-02-15211101410139216892
662013-01-07 00:04:582366348488621093.7218368084032013-03-042013-03-064014404530325070198
672013-01-07 00:05:0423350318001970.0900267207192004092013-02-242013-02-282112223860167726
682013-01-07 00:05:0637169761419491970.09002672071080476092013-05-292013-06-05201763530125067583
692013-01-07 00:05:1337169761419491970.09002672071080476092013-05-292013-06-05201763530125067555
702013-01-07 00:05:3117113320461681970.0900267207124561032013-03-092013-03-1342182531026701948
712013-01-07 00:05:3223198208544881970.0900267207425340092013-02-102013-02-1111146810134815364
722013-01-07 00:05:34242350220131970.0900267207487265002013-05-232013-05-2741166930125021296
732013-01-07 00:05:3937169761419491970.09002672071018895002013-09-082013-09-102012721560325064591
742013-01-07 00:05:422366348488621093.7218368084032013-03-042013-03-074014404530525070198
752013-01-07 00:05:4724266174262320.2863212900012013-01-062013-01-0710112269601250123037
762013-01-07 00:05:561132051551229154.3112756870002013-01-082013-01-092111811105219811317
772013-01-07 00:05:5724234881581970.0900267207246177002013-01-312013-02-052011544101342122938
782013-01-07 00:06:01236617436222411.204658429032013-07-032013-07-0720166930225021233
792013-01-07 00:06:072366174211775712.945713796092013-01-192013-01-2610188211016173067
802013-01-07 00:06:112366174274372439.53211115804092013-02-022013-02-1120188551012502130
812013-01-07 00:06:112321564651733149.4934901573192013-01-182013-01-20201221196014812862
822013-01-07 00:06:232366142174402299.5383871663092013-01-112013-01-12101825010325062851
832013-01-07 00:06:33242350220131970.0900267207487265002013-05-232013-05-2741166930125021283
842013-01-07 00:06:352426634631371384.29991162059122013-01-112013-01-1431114985101250124140
852013-01-07 00:06:40113205385501211970.0900267207762423092013-05-072013-05-12211874510162042768
862013-01-07 00:06:4223234849241970.0900267207399892112013-01-102013-01-132011268250138223046
872013-01-07 00:06:51236617436222411.080658429032013-07-032013-07-0720166930125021283
882013-01-07 00:06:5124235057031970.090026720718250022013-03-312013-04-022018282101312623257
892013-01-07 00:07:01236617436222411.128358429032013-07-032013-07-0720166930125021291
902013-01-07 00:07:02236618917017149.9512204066192013-02-012013-02-041012580060225069695
912013-01-07 00:07:0311320515514703796.30821128575092013-01-192013-01-221012506461125012305
922013-01-07 00:07:083716964865141970.09002672071084815092013-03-182013-04-01201826810525068223
932013-01-07 00:07:12242348407221970.09002672071181711032013-02-102013-02-1320188191033168124297
942013-01-07 00:07:1734320515514703994.4014332853192013-03-242013-03-28221122066052506281
952013-01-07 00:07:232366174211775712.550113796092013-01-192013-01-2610188211036173043
962013-01-07 00:07:25236617436222411.268358429032013-07-032013-07-072016693012502125
972013-01-07 00:07:291711332515031970.0900267207228687092013-02-182013-02-2322288211026173022
982013-01-07 00:07:30242349104571970.09002672071194880002013-01-302013-02-02101202256053182465
992013-01-07 00:07:3024235057031970.0900267207486357002013-01-242013-01-28211804330125067618
1002013-01-07 00:07:36242350220131970.0900267207487265002013-05-232013-05-2741166930325021269
Rows: 1-100 of 37670293 | Columns: 23