The Virtual DataFrame#

The Virtual DataFrame (vDataFrame) is the core object of the VerticaPy library. Leveraging the power of Vertica and the flexibility of Python, the vDataFrame is a Python object that lets you manipulate the data representation in a Vertica database without modifying the underlying data. The data represented by a vDataFrame remains in the Vertica database, bypassing the limitations of working memory. When a vDataFrame is created or altered, VerticaPy formulates the operation as an SQL query and pushes the computation to the Vertica database, harnessing Vertica’s massive parallel processing and in-built functions. Vertica then aggregates and returns the result to VerticaPy. In essence, vDataFrames behave similar to views in the Vertica database.

For more information about Vertica’s performance advantages, including its columnar orientation and parallelization across nodes, see the Vertica documentation.

In the following tutorial, we will introduce the basic functionality of the vDataFrame and then explore the ways in which they utilize in-database processing to enhance performance.

Creating vDataFrames#

First, run the load_titanic() function to ingest into Vertica a dataset with information about titanic passengers:

[2]:
from verticapy.datasets import load_titanic

# By default, the table is created in the `public` schema
load_titanic()
[2]:
123
pclass
Integer
123
survived
Integer
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(8)
123
sibsp
Integer
123
parch
Integer
Abc
ticket
Varchar(36)
123
fare
Numeric(12)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Integer
Abc
Varchar(100)
110male47.010PC 17757227.525C62 C64C[null]124
210male24.001PC 17558247.5208B58 B60C[null][null]
310male25.0001390526.0[null]C[null]148
410male42.00011048926.55D22S[null][null]
510male45.00011305026.55B38S[null][null]
610male46.010W.E.P. 573461.175E31S[null][null]
710male64.01419950263.0C23 C25 C27S[null][null]
810male[null]0011379642.4[null]S[null][null]
910male32.500113503211.5C132C[null]45
1010male55.00011378730.5C30S[null][null]
1110male37.001PC 1759629.7C118C[null][null]
1210male64.00069326.0[null]S[null]263
1310male56.00011379226.55[null]S[null][null]
1410male24.0101369560.0C31S[null][null]
1511male0.9212113781151.55C22 C26S11[null]
1611female18.010PC 17757227.525C62 C64C4[null]
1711male80.0002704230.0A23SB[null]
1811female42.000PC 17757227.525[null]C4[null]
1911male25.0101196791.0792B49C7[null]
2011female45.000PC 17608262.375[null]C4[null]
2111female22.00111350555.0E33S6[null]
2211female[null]001777027.7208[null]C5[null]
2311female22.000113781151.55[null]S11[null]
2411female64.002PC 1775683.1583E45C14[null]
2511female54.0103694778.2667D20C4[null]
2611male43.0101776527.7208D40C5[null]
2711female22.0021356849.5B39C5[null]
2811male23.001PC 1775963.3583D10 D12C7[null]
2911female35.01011378952.0[null]S8[null]
3011female[null]101746451.8625D21S8[null]
3111male42.0101175352.5542D19S5[null]
3211female45.0101175352.5542D19S5[null]
3311female16.001PC 1759239.4D28S9[null]
3411female21.0001350277.9583D9S10[null]
3511male36.000PC 1747326.2875E25S7[null]
3611male[null]00F.C. 1299825.7417[null]C7[null]
3711female33.000PC 1761327.7208A11C11[null]
3811female48.013PC 17608262.375B57 B59 B63 B66C4[null]
3911female23.0102122882.2667B45S7[null]
4011female31.00236928164.8667C7S8[null]
4120male57.00024434613.0[null]S[null][null]
4220female29.010SC/AH 2903726.0[null]S[null][null]
4320male29.000W./C. 1426310.5[null]S[null][null]
4420female30.00023724913.0[null]S[null][null]
4520male24.00024872613.5[null]S[null]297
4620male30.00025064613.0[null]S[null]305
4720male52.00025064713.0[null]S[null]19
4820male44.0102670726.0[null]S[null][null]
4920male57.00021953312.35[null]Q[null][null]
5020male32.00023721613.5[null]S[null]209
5120male70.000C.A. 2458010.5[null]S[null][null]
5220male54.0002901114.0[null]S[null][null]
5320male16.000S.O./P.P. 310.5[null]S[null][null]
5420male62.0002402769.6875[null]Q[null][null]
5520male27.000SC/PARIS 216815.0333[null]C[null][null]
5620male36.000C.A. 1724810.5[null]S[null][null]
5720female27.0101166821.0[null]S[null][null]
5820male27.01022841426.0[null]S[null]293
5920male66.000C.A. 2457910.5[null]S[null][null]
6021female48.002C.A. 3311236.75[null]S14[null]
6121female30.010SC/PARIS 214813.8583[null]C12[null]
6221female34.00024388013.0[null]S12[null]
6321female24.00225064914.5[null]S4[null]
6421female48.01222084565.0[null]S9[null]
6521female3.012SC/Paris 212341.5792[null]C14[null]
6621male3.01123008026.0F2SD[null]
6721female2.0112636026.0[null]S11[null]
6821female18.00225065213.0[null]S16[null]
6930male30.000C 70767.25[null]S[null]72
7030female40.01075469.475[null]S[null][null]
7130male35.0003734508.05[null]S[null][null]
7230female18.001269114.4542[null]C[null][null]
7330male[null]0026647.225[null]C[null][null]
7430female32.01136484915.5[null]Q[null][null]
7530male22.0003500457.7958[null]S[null][null]
7630male35.0003645128.05[null]S[null][null]
7730male[null]10268914.4583[null]C[null][null]
7830female[null]10268914.4583[null]C[null][null]
7930male31.0003350977.75[null]Q[null][null]
8030female22.000755210.5167[null]S[null][null]
8130male0.330234708014.4[null]S[null][null]
8230male34.01134708014.4[null]S[null]197
8330female28.01134708014.4[null]S[null][null]
8430male25.0003492037.8958[null]S[null][null]
8530male25.0003492507.8958[null]S[null][null]
8630male[null]003492387.8958[null]S[null][null]
8730male[null]003492257.8958[null]S[null][null]
8830male22.000A/5 211727.25[null]S[null][null]
8930male[null]0026747.225[null]C[null][null]
9030male[null]0026317.225[null]C[null][null]
9130male40.500C.A. 621215.1[null]S[null]187
9230male40.5003672327.75[null]Q[null]68
9330male18.0003500367.7958[null]S[null][null]
9430female[null]003648597.75[null]Q[null][null]
9530male[null]003492547.8958[null]C[null][null]
9630male40.016CA 214446.9[null]S[null][null]
9730male51.000214408.05[null]S[null][null]
9830male26.0103500257.8542[null]S[null][null]
9930female[null]003826497.75[null]Q[null][null]
10030male[null]003492207.8958[null]S[null][null]
Rows: 1-100 | Columns: 14

You can create a vDataFrame from either an existing relation or a customized relation.

To create a vDataFrame using an existing relation, in this case the Titanic dataset, provide the name of the dataset:

[3]:
import verticapy as vp

vp.vDataFrame("public.titanic")
[3]:
123
pclass
Integer
123
survived
Integer
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(8)
123
sibsp
Integer
123
parch
Integer
Abc
ticket
Varchar(36)
123
fare
Numeric(12)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Integer
Abc
Varchar(100)
110male47.010PC 17757227.525C62 C64C[null]124
210male24.001PC 17558247.5208B58 B60C[null][null]
310male25.0001390526.0[null]C[null]148
410male42.00011048926.55D22S[null][null]
510male45.00011305026.55B38S[null][null]
610male46.010W.E.P. 573461.175E31S[null][null]
710male64.01419950263.0C23 C25 C27S[null][null]
810male[null]0011379642.4[null]S[null][null]
910male32.500113503211.5C132C[null]45
1010male55.00011378730.5C30S[null][null]
1110male37.001PC 1759629.7C118C[null][null]
1210male64.00069326.0[null]S[null]263
1310male56.00011379226.55[null]S[null][null]
1410male24.0101369560.0C31S[null][null]
1511male0.9212113781151.55C22 C26S11[null]
1611female18.010PC 17757227.525C62 C64C4[null]
1711male80.0002704230.0A23SB[null]
1811female42.000PC 17757227.525[null]C4[null]
1911male25.0101196791.0792B49C7[null]
2011female45.000PC 17608262.375[null]C4[null]
2111female22.00111350555.0E33S6[null]
2211female[null]001777027.7208[null]C5[null]
2311female22.000113781151.55[null]S11[null]
2411female64.002PC 1775683.1583E45C14[null]
2511female54.0103694778.2667D20C4[null]
2611male43.0101776527.7208D40C5[null]
2711female22.0021356849.5B39C5[null]
2811male23.001PC 1775963.3583D10 D12C7[null]
2911female35.01011378952.0[null]S8[null]
3011female[null]101746451.8625D21S8[null]
3111male42.0101175352.5542D19S5[null]
3211female45.0101175352.5542D19S5[null]
3311female16.001PC 1759239.4D28S9[null]
3411female21.0001350277.9583D9S10[null]
3511male36.000PC 1747326.2875E25S7[null]
3611male[null]00F.C. 1299825.7417[null]C7[null]
3711female33.000PC 1761327.7208A11C11[null]
3811female48.013PC 17608262.375B57 B59 B63 B66C4[null]
3911female23.0102122882.2667B45S7[null]
4011female31.00236928164.8667C7S8[null]
4120male57.00024434613.0[null]S[null][null]
4220female29.010SC/AH 2903726.0[null]S[null][null]
4320male29.000W./C. 1426310.5[null]S[null][null]
4420female30.00023724913.0[null]S[null][null]
4520male24.00024872613.5[null]S[null]297
4620male30.00025064613.0[null]S[null]305
4720male52.00025064713.0[null]S[null]19
4820male44.0102670726.0[null]S[null][null]
4920male57.00021953312.35[null]Q[null][null]
5020male32.00023721613.5[null]S[null]209
5120male70.000C.A. 2458010.5[null]S[null][null]
5220male54.0002901114.0[null]S[null][null]
5320male16.000S.O./P.P. 310.5[null]S[null][null]
5420male62.0002402769.6875[null]Q[null][null]
5520male27.000SC/PARIS 216815.0333[null]C[null][null]
5620male36.000C.A. 1724810.5[null]S[null][null]
5720female27.0101166821.0[null]S[null][null]
5820male27.01022841426.0[null]S[null]293
5920male66.000C.A. 2457910.5[null]S[null][null]
6021female48.002C.A. 3311236.75[null]S14[null]
6121female30.010SC/PARIS 214813.8583[null]C12[null]
6221female34.00024388013.0[null]S12[null]
6321female24.00225064914.5[null]S4[null]
6421female48.01222084565.0[null]S9[null]
6521female3.012SC/Paris 212341.5792[null]C14[null]
6621male3.01123008026.0F2SD[null]
6721female2.0112636026.0[null]S11[null]
6821female18.00225065213.0[null]S16[null]
6930male30.000C 70767.25[null]S[null]72
7030female40.01075469.475[null]S[null][null]
7130male35.0003734508.05[null]S[null][null]
7230female18.001269114.4542[null]C[null][null]
7330male[null]0026647.225[null]C[null][null]
7430female32.01136484915.5[null]Q[null][null]
7530male22.0003500457.7958[null]S[null][null]
7630male35.0003645128.05[null]S[null][null]
7730male[null]10268914.4583[null]C[null][null]
7830female[null]10268914.4583[null]C[null][null]
7930male31.0003350977.75[null]Q[null][null]
8030female22.000755210.5167[null]S[null][null]
8130male0.330234708014.4[null]S[null][null]
8230male34.01134708014.4[null]S[null]197
8330female28.01134708014.4[null]S[null][null]
8430male25.0003492037.8958[null]S[null][null]
8530male25.0003492507.8958[null]S[null][null]
8630male[null]003492387.8958[null]S[null][null]
8730male[null]003492257.8958[null]S[null][null]
8830male22.000A/5 211727.25[null]S[null][null]
8930male[null]0026747.225[null]C[null][null]
9030male[null]0026317.225[null]C[null][null]
9130male40.500C.A. 621215.1[null]S[null]187
9230male40.5003672327.75[null]Q[null]68
9330male18.0003500367.7958[null]S[null][null]
9430female[null]003648597.75[null]Q[null][null]
9530male[null]003492547.8958[null]C[null][null]
9630male40.016CA 214446.9[null]S[null][null]
9730male51.000214408.05[null]S[null][null]
9830male26.0103500257.8542[null]S[null][null]
9930female[null]003826497.75[null]Q[null][null]
10030male[null]003492207.8958[null]S[null][null]
Rows: 1-100 | Columns: 14

To create a vDataFrame using a customized relation, specify the SQL query for that relation as the argument:

[4]:
vp.vDataFrame("SELECT pclass, AVG(survived) AS survived FROM titanic GROUP BY 1")
[4]:
123
pclass
Integer
123
survived
Float(22)
120.416988416988417
210.612179487179487
330.227752639517345
Rows: 1-3 | Columns: 2

For more examples of creating vDataFrames, see the vDataFrame reference page.

Loading and processing in-memory vs. in-database#

The following examples demonstrate the performance advantages of loading and processing data in-database versus in-memory.

First, we download the Expedia dataset from Kaggle and then load it into Vertica:

NOTE: You might need to change the value of the path agrument for the read_csv function, depending on where your data file is located.

[ ]:
vp.read_csv("train.csv", schema = "public", parse_nrows = 20000000)

Once the data is loaded into the Vertica database, we can create a vDataFrame using the relation that contains the Expedia dataset:

[6]:
import time

start_time = time.time()
expedia = vp.vDataFrame("public.train")
print("elapsed time = {}".format(time.time() - start_time))
display(expedia)
elapsed time = 0.15794777870178223
📅
date_time
Timestamp(29)
123
site_name
Integer
123
posa_continent
Integer
123
user_location_country
Integer
123
user_location_region
Integer
123
user_location_city
Integer
123
orig_destination_distance
Numeric(14)
123
user_id
Integer
123
is_mobile
Integer
123
is_package
Integer
123
channel
Integer
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Integer
123
srch_children_cnt
Integer
123
srch_rm_cnt
Integer
123
srch_destination_id
Integer
123
srch_destination_type_id
Integer
123
is_booking
Integer
123
cnt
Integer
123
hotel_continent
Integer
123
hotel_country
Integer
123
hotel_market
Integer
123
hotel_cluster
Integer
12013-01-07 00:00:28236635647794948.18615933750092013-05-092013-05-10402276330267019012
22013-01-07 00:01:34236635647794948.18615933750092013-05-092013-05-10201276330367019012
32013-01-07 00:01:4523662947976509.484811979680092013-01-162013-01-18101581130425035076
42013-01-07 00:01:5324234919105[null]11748190092013-04-252013-04-282011490810338810325
52013-01-07 00:02:2411320515514703796.328111285750092013-01-192013-01-221012506460125012304
62013-01-07 00:02:432366462147032605.40936143220012013-01-082013-01-152011309430125021240
72013-01-07 00:02:432426634631371384.018211620591022013-01-112013-01-1431114985101250124199
82013-01-07 00:03:3523704714566[null]5980760092013-01-122013-01-13613822010231824646
92013-01-07 00:03:412366174211775713.6394137960092013-01-192013-01-2610188211016173067
102013-01-07 00:04:332335031800[null]1920040092013-02-242013-02-281112223860267726
112013-01-07 00:05:042335031800[null]1920040092013-02-242013-02-282112223860167726
122013-01-07 00:05:063716976141949[null]10804760192013-05-292013-06-05201763530125067583
132013-01-07 00:05:393716976141949[null]10188950002013-09-082013-09-102012721560325064591
142013-01-07 00:06:352426634631371384.299911620591022013-01-112013-01-1431114985101250124140
152013-01-07 00:07:1734320515514703994.40143328531092013-03-242013-03-28221122066052506281
162013-01-07 00:09:1911320515553078990.60275646640192013-03-172013-03-18221825010425062849
172013-01-07 00:09:3023662947976509.840411979680092013-01-162013-01-18101581130325035057
182013-01-07 00:11:00236617421316229.3426313550002013-01-182013-01-21211336130125035189
192013-01-07 00:11:272366174149875992.45218936270192013-05-222013-05-302018788101677211
202013-01-07 00:11:57236617438899341.9762533271092013-01-072013-01-102212437460325062732
212013-01-07 00:14:1924235053819[null]2335340012013-01-082013-01-091112672960261051275
222013-01-07 00:14:20131634803827611447.144510129480092013-02-082013-02-09111821710103413962
232013-01-07 00:14:472366246506612756.35932839700002013-05-232013-05-30201825010125062854
242013-01-07 00:17:0523937115362[null]5689880092013-01-262013-01-302113011302610782258
252013-01-07 00:17:2534320515514703995.79233328531092013-03-242013-03-28221122066012506281
262013-01-07 00:18:38236617442881379.1427807370092013-03-212013-03-242018279103250123094
272013-01-07 00:18:5234320515514703994.65853328531092013-03-242013-03-282211220660125062879
282013-01-07 00:20:542314929033116[null]4732600092013-01-242013-01-262014090360134815229
292013-01-07 00:23:131711332046168[null]1245610032013-03-092013-03-1322120796601670190425
302013-01-07 00:23:422319820854488[null]683860092013-03-062013-03-0721120118601354485
312013-01-07 00:23:57371696486514[null]10848150192013-03-182013-04-01201826810225068251
322013-01-07 00:24:113716957432474[null]10848160092013-02-242013-02-263012460660331518020
332013-01-07 00:27:363716976141949[null]9761180132013-02-152013-02-192111014101139216892
342013-01-07 00:28:012319820854488[null]683860092013-03-062013-03-0721120115606354489
352013-01-07 00:30:342366184117406875.696210425470072013-02-022013-02-061117883033015008
362013-01-07 00:32:168477462349210465.108711708450012013-05-162013-05-20101821810125074350
372013-01-07 00:32:442423505703[null]2332861012013-05-252013-05-27101874510162042792
382013-01-07 00:34:17236987925671[null]8660320092013-02-192013-02-2320118774101399123967
392013-01-07 00:35:192366189170172194.94412040661192013-01-312013-02-11101827210125065915
402013-01-07 00:35:293716957432474[null]10848160092013-02-242013-02-26301246066013151809
412013-01-07 00:37:342366174211775711.6309137960092013-01-192013-01-2611188211016173062
422013-01-07 00:37:473716963956293[null]10843710002013-02-282013-03-012014790040267030915
432013-01-07 00:37:492366348488621093.47153680840132013-03-042013-03-084014404530225070170
442013-01-07 00:38:4224237136693[null]11682970002013-01-272013-01-28101880610161052264
452013-01-07 00:40:012366189170172191.57592040661192013-01-312013-02-11101827210125065950
462013-01-07 00:41:022423519527[null]2599910012013-05-052013-05-0610123507611670199
472013-01-07 00:41:26236644845508762.662911591300192013-04-172013-04-25201826810525068295
482013-01-07 00:42:232366348488621093.5613680840132013-03-042013-03-0840144045301125070141
492013-01-07 00:43:262314929033116[null]6921100192013-02-222013-02-2820117251023182149353
502013-01-07 00:44:482423488158[null]11560050092013-02-092013-02-14201202256013182465
512013-01-07 00:45:25236617426259352.430811143390092013-02-152013-02-172011217560125036613
522013-01-07 00:46:182366258425001643.0566164470092013-02-022013-02-033112921140125016303
532013-01-07 00:49:2711320538550121[null]7624230092013-05-072013-05-122111224160662042758
542013-01-07 00:50:0323661744258645.21798449830092013-02-162013-02-17201883510225035734
552013-01-07 00:51:4823114515028[null]1860701092013-01-072013-01-08201124115056105298
562013-01-07 00:53:551716348038276139.1247583051102013-01-112013-02-1110122076603646146383
572013-01-07 00:54:092366448298721283.202211186120192013-01-132013-01-1720187911014811052
582013-01-07 00:54:2724234825460[null]193850092013-08-252013-08-26221783103359836
592013-01-07 00:55:04236644845508768.59311591300192013-04-172013-04-25201826810125068291
602013-01-07 00:55:052366189170172188.56182040661192013-01-312013-02-11101827210125065942
612013-01-07 00:56:0724234825460[null]193850092013-08-252013-08-26221783104359830
622013-01-07 00:58:132366174194571646.97476082310022013-02-212013-02-25201824310125040918
632013-01-07 00:59:04343205135276553493.5323149290092013-02-102013-02-1610112839508451150934
642013-01-07 00:59:152314929033116[null]6921100192013-02-222013-02-272012122210631527327
652013-01-07 00:59:4423937115362[null]5689880092013-01-302013-02-022112131760161078223
662013-01-07 00:59:562423505703[null]4868920032013-02-082013-02-10211874611161052943
672013-01-07 01:01:08232316842296[null]753860182013-02-112013-02-192011490810238810325
682013-01-07 01:02:082311595717195[null]6482310092013-01-242013-01-25211822010131824646
692013-01-07 01:02:252426984422970[null]11538200002013-01-152013-01-1820182531016701950
702013-01-07 01:04:352366174262322568.6493618970102013-01-112013-01-11101826710125067510
712013-01-07 01:04:563716963956293[null]10843710002013-02-282013-03-022014790040267030915
722013-01-07 01:07:3724238322548[null]11546670092013-08-042013-08-0610182451016222458
732013-01-07 01:10:222423488158[null]11560050092013-02-092013-02-142012022560431824662
742013-01-07 01:11:45236644853078828.1394682910192013-03-072013-03-10201826010625070170
752013-01-07 01:14:192335177335432650.47251194140092013-01-172013-01-1910118788111520325361
762013-01-07 01:18:14841333028810[null]10147290112014-01-032014-01-04211874510162042764
772013-01-07 01:18:152366174180377445.04783851720092013-01-222013-01-2410113747401063125895
782013-01-07 01:20:432322940716373[null]2176100092013-01-082013-01-101011201410225064415
792013-01-07 01:26:372322940716373[null]2176100092013-01-082013-01-111011201411125064415
802013-01-07 01:27:19236617414752342.989910434520092013-03-012013-03-022018279111250123083
812013-01-07 01:31:052366174362422331.52573950561092013-05-162013-05-182011222660225070131
822013-01-07 01:35:492426984422970[null]11538200002013-01-152013-01-1710182531016701973
832013-01-07 01:35:542319820854488[null]8487411002013-01-072013-01-113014568760234815346
842013-01-07 01:37:5823237823278[null]8007690092013-01-112013-01-132018282102312623244
852013-01-07 01:38:032366348488621093.60263680840132013-03-042013-03-084014404530225070184
862013-01-07 01:38:2623704714566[null]7561060092013-01-112013-01-17101202256033182463
872013-01-07 01:47:423716617459385.09924581290002013-01-072013-01-111112496460225036847
882013-01-07 01:48:1623139529254916.40859011620092013-01-162013-01-182011952010167715446
892013-01-07 01:51:263716976141949[null]10848301092013-02-142013-02-152014221910138232967
902013-01-07 01:51:4123704714566[null]1322310092013-02-252013-03-01303408224013998861
912013-01-07 01:53:3524236422203[null]2553110012013-02-242013-02-28101874710131061073
922013-01-07 01:58:262423505703[null]47690042013-01-212013-01-22101874110261441347
932013-01-07 01:58:432366246506612757.86562839700002013-05-262013-05-28101825010125062888
942013-01-07 02:01:5923237823278[null]8007690092013-01-112013-01-132018282101312623264
952013-01-07 02:04:192366174147525442.28079149840132013-02-092013-03-10201122136016701946
962013-01-07 02:04:542322940716373[null]3428970092013-01-142013-01-16101829110225019159
972013-01-07 02:07:3523937115362[null]7770270092013-02-162013-02-2010146810634815342
982013-01-07 02:09:002366246284912377.83483550070042013-01-172013-01-183012418960125036048
992013-01-07 02:09:57232351353632[null]6279100092013-01-172013-01-18101156555113114238
1002013-01-07 02:13:173431331454307[null]9743160052013-02-012013-02-0320130596016204178622
Rows: 1-100 | Columns: 24

The vDataFrame was created in about a second. All the data—about 4GB—is stored in Vertica, requiring no in-memory data loading.

Now, to compare the above result with in-memory loading, we load about half the dataset into pandas:

NOTE: This process is expensive on local machines, so avoid running the following code if your computer has less than 2GB of memory.

[7]:
import pandas as pd

L_nrows = [10000, 100000, 1000000, 2000000, 5000000, 10000000, 20000000]
L_time = []
for nrows in L_nrows:
    start_time = time.time()
    expedia_df = pd.read_csv("train.csv", nrows = nrows)
    elapsed_time = time.time() - start_time
    L_time.append(elapsed_time)
    print("nrows = {}; elapsed time = {}".format(nrows, elapsed_time))
nrows = 10000; elapsed time = 0.03764772415161133
nrows = 100000; elapsed time = 0.19388937950134277
nrows = 1000000; elapsed time = 1.8165876865386963
nrows = 2000000; elapsed time = 3.613091230392456
nrows = 5000000; elapsed time = 8.884109735488892
nrows = 10000000; elapsed time = 19.819123029708862
nrows = 20000000; elapsed time = 62.796350717544556

It took about a minute to load half the dataset into memory. Compared with the one second required to create the vDataFrame on the entire dataset in-database, in-memory loading is magnitudes more expensive. Loading data into pandas is quite fast when the data volume is low (less than some MB), but as the size of the dataset increases, the load time can become exponentially more expensive, as seen in the following plot:

[8]:
import matplotlib.pyplot as plt
plt.plot(L_nrows, L_time)
plt.show()
../../../_images/notebooks_introduction_vdf_index_20_0.png

Even after the data is loaded into memory, the performance is very slow. The following example removes non-numeric columns from the dataset, then computes a correlation matrix:

[9]:
columns_to_drop = ['date_time', 'srch_ci', 'srch_co']
expedia_df = expedia_df.drop(columns_to_drop, axis=1)
[10]:
start_time = time.time()
expedia_df.corr()
print(f"elapsed time = {time.time() - start_time}")
elapsed time = 20.517507791519165

The operation took about 20 seconds. Let’s compare the performance in-database using a vDataFrame to compute the correlation matrix of the entire dataset:

[11]:
# Remove non-numeric columns
expedia.drop(columns = ['date_time', 'srch_ci', 'srch_co'])
[11]:
123
site_name
Integer
123
posa_continent
Integer
123
user_location_country
Integer
123
user_location_region
Integer
123
user_location_city
Integer
123
orig_destination_distance
Numeric(14)
123
user_id
Integer
123
is_mobile
Integer
123
is_package
Integer
123
channel
Integer
123
srch_adults_cnt
Integer
123
srch_children_cnt
Integer
123
srch_rm_cnt
Integer
123
srch_destination_id
Integer
123
srch_destination_type_id
Integer
123
is_booking
Integer
123
cnt
Integer
123
hotel_continent
Integer
123
hotel_country
Integer
123
hotel_market
Integer
123
hotel_cluster
Integer
12366448509471892.118310820270093011281150145717282
22423505703[null]2299160041018744101310610885
3236644218617[null]1080633019111822310125069098
42366448248482270.5665445852009101827810125036868
54036631448683831.188777545201500187451116204275
6236644850947790.861765691601121114840111250162990
7236643514812[null]1120833019001180461016105180843
837119815317956[null]69518300910182641015108156838
92311595717195[null]932702009101268102399103360
10236634848862206.70248298230011011791840225019169
113432053542531511.680152257400921112951502219839716
12236975132273[null]44517011101279126016702335
13236633534369[null]40711301120187911024811034
142366348479972235.26383709210192011220660425062845
15343205354335672360.1619506148119201114391014163150365
162315419638140[null]9971190091012286560149320865
1723133203392[null]1115968009201216621016105183567
18236643514812[null]1120833019001180461016105180822
1911320538546963101.47262704760091118270102219839042
202319815317956[null]1037184001201880810161692864
2134320535421728218.94213860200920126022602219839768
222366337217071203.36845744000320128514602250145755
2323662203781496.4838765850092012610541125067825
2411320535420788[null]90142900420112951501219839719
2523133203392[null]1115968009201216621016105183544
26236631142961162.54758879660091014532660125067521
27236644232693[null]504767009101827810125036821
2823697501393[null]573693009201821310166827529
292366174150816190.791235251600921153130161051222
30236634812318[null]1895550092311136110125059118
31236631445853999.6031678175019101826610125041195
32231279053139[null]769993009201822010131824658
33236622015928403.52421196850194021192310248020484
342366348488621098.05571156720091011185310325070269
352366318173296555.139570083200910120324601312623258
362366318232087127.5052179440001101199506013998858
37236622020865853.792551044700110111983026127128167
3823662204948550.28510143370002011131910125068132
392366448317991262.833985412019311119381034812665
402366348488622234.392669435600120182501112506281
41236633760851538.8677111591701920187911014811065
422423505703[null]1176551002101484230125021390
43113205411507612.031487090700910112939501219839283
44236646714721106.77183637720093111218960125063747
4511387595152[null]663969112101825010125062879
462366348488625019.262255366009101163125016208141867
47236635632467[null]158074012201119381034812689
4823664425875105.37054566260002211245350225040694
492423505703[null]1176551002101484230125021372
5023182416321005926.9012260006009201874010161051225
5123482314308[null]163518009202826710225067556
522366174150816191.022735251600921153130161051286
532366348488623461.7653776382009201122106116701925
542366462171122245.6622331320092013559740125050277
552366226423001431.30945702350192011143910104163150365
56236634812318[null]18955500923124581601250137891
5723182363530343009.7962565982009421826810125068291
58236622015257332.51324499900091014254960125070191
59231139529254890.9952811835019221535333016701998
60236625846871958.2962511368119201114391024163150365
612366442761957.7993108941011020188161014163193523
6223662584687762.92221147729009201879310125067773
63231332046962[null]845483009201219056016135155748
642366447902367.7459802983009201825010125062888
6511320533034157733.0887864640152018287101219839755
6623662209486147.740910750920092211185310225070291
673711674748840[null]52980800920188181013998820
682366351163133052.679390956500910151210125071142
692366442353905029.2231459308019201874510162042795
702366311545251198.96081105418009201424830125070396
712366337413809.5963568989009201584330125073783
722319839431116[null]1002738119211826710225067598
7323664353352799.58751126510001011217760125035095
742366226423001745.4772329081012402825010125062879
7523663517929105.24994688920091012731560125052750
76236631432759237.914751555009201827410125068433
772366318434811770.99641020601102211825010125062879
7811320535425315444.9894762590022011218960125063795
79343205135387491210.0346772734015201825010325062888
80236633367371496.23731117614019201210596024811096
812323484924[null]5910270091011226760125012305
8223661742623232.6626361244001201883510125035773
83236643516159204.34535615000001011217760125035013
8423662584687754.90281147729009201879310125067728
85343205339109353225.5579497468015201825010125062890
86236614338121268.28172732650092011222860125041198
87403663517929412.906687563300911125750611250150732
882323484924[null]59102700920112267601250123095
89343205135142911846.35592948580093021282050225064448
902366348479973813.34571093118000201880010361441559
91236619624285221.612611193000920182251014124193938
9211387595152[null]663969112101825010125062851
93343205339109353223.2475497468015201825010125062879
94236622020861098.0998562773019201826710125067537
952366348479971638.6578899428104201176010145119784
962366442196053.015831401007201430021012505309
97236628828562103.27388230121092012674060225019239
98236631147791657.8031536444009147114391014163150352
992315419638140[null]5900670082015651014811310
10023663119547177.289910347270093014562660425067583
Rows: 1-100 | Columns: 21
[12]:
start_time = time.time()
expedia.corr(show = False)
print(f"elapsed time = {time.time() - start_time}")
elapsed time = 23.69867491722107

In just over 20 seconds, VerticaPy computes the correlation matrix on the entire dataset. Compared to the 20 seconds it took to compute the correlation matrix on half the dataset in-memory, this is huge improvement. Using other Vertica-specific features, such as projections, we could further enhance this performance.

VerticaPy also caches the computed aggregations. With this cache available, we can repeat the correlation matrix computation almost instantaneously:

NOTE: If necessary, you can deactivate the cache by calling the set_option() function with the cache parameter set to False.

[13]:
start_time = time.time()
expedia.corr(show = False)
print(f"elapsed time = {time.time() - start_time}")
100%|██████████| 20/20 [00:00<00:00, 800.07it/s]
elapsed time = 0.31166625022888184

Memory usage in-memory vs. in-database#

Now, we will examine how the memory usage compares between in-memory and in-database.

First, use the pandas info() method to explore the DataFrame’s memory usage:

[14]:
expedia_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 21 columns):
 #   Column                     Dtype
---  ------                     -----
 0   site_name                  int64
 1   posa_continent             int64
 2   user_location_country      int64
 3   user_location_region       int64
 4   user_location_city         int64
 5   orig_destination_distance  float64
 6   user_id                    int64
 7   is_mobile                  int64
 8   is_package                 int64
 9   channel                    int64
 10  srch_adults_cnt            int64
 11  srch_children_cnt          int64
 12  srch_rm_cnt                int64
 13  srch_destination_id        int64
 14  srch_destination_type_id   int64
 15  is_booking                 int64
 16  cnt                        int64
 17  hotel_continent            int64
 18  hotel_country              int64
 19  hotel_market               int64
 20  hotel_cluster              int64
dtypes: float64(1), int64(20)
memory usage: 3.1 GB

The DataFrame is using 3.1GB to store half the Expedia dataset. Use the memory_usage() method to list the expedia vDataFrame’s memory usage:

[15]:
expedia.memory_usage()
[15]:
value
object1039
"site_name"1727
"posa_continent"1732
"user_location_country"1739
"user_location_region"1738
"user_location_city"1736
"orig_destination_distance"1743
"user_id"1725
"is_mobile"1727
"is_package"1728
"channel"1725
"srch_adults_cnt"1733
"srch_children_cnt"1735
"srch_rm_cnt"1729
"srch_destination_id"1737
"srch_destination_type_id"1742
"is_booking"1728
"cnt"1721
"hotel_continent"1733
"hotel_country"1731
"hotel_market"1730
"hotel_cluster"1731
total37409
Rows: 1-23 | Columns: 2

The vDataFrame only uses about 37KB! By storing the data in the Vertica database, and only recording the user’s data modifications in memory, the memory usage is reduced to a minimum.

With VerticaPy, we can take advantage of Vertica’s structure and scalability, providing fast queries without ever loading the data into memory. In the above examples, we’ve seen that in-memory processing is much more expensive in both computation and memory usage. This often leads to the decesion to downsample the data, which sacrfices the possibility of further data insights.

The vDataFrame structure#

Now that we’ve seen the performance and memory benefits of the vDataFrame, let’s dig into some of the underlying structures and methods that produce these great results.

vDataFrames are composed of columns called vDataColumns. To view all vDataColumns in a vDataFrame, use the get_columns() method:

[16]:
expedia.get_columns()
[16]:
['"site_name"',
 '"posa_continent"',
 '"user_location_country"',
 '"user_location_region"',
 '"user_location_city"',
 '"orig_destination_distance"',
 '"user_id"',
 '"is_mobile"',
 '"is_package"',
 '"channel"',
 '"srch_adults_cnt"',
 '"srch_children_cnt"',
 '"srch_rm_cnt"',
 '"srch_destination_id"',
 '"srch_destination_type_id"',
 '"is_booking"',
 '"cnt"',
 '"hotel_continent"',
 '"hotel_country"',
 '"hotel_market"',
 '"hotel_cluster"']

To access a vDataColumn, specify the column name in square brackets, for example:

NOTE: VerticaPy saves computed aggregations to avoid unncessary recomputations.

[17]:
expedia["is_booking"].describe()
[17]:
value
name"is_booking"
dtypeint
unique2.0
count37670293.0
034669600
13000693
Rows: 1-6 | Columns: 2

Each vDataColumn has its own catalog to save user modifications. In the previous example, we computed some aggregations for the is_booking column. Let’s look at the catalog for that vDataColumn:

[18]:
expedia["is_booking"]._catalog
[18]:
{'cov': {},
 'pearson': {'"site_name"': -0.0103791330144224,
  '"posa_continent"': 0.00972436717006096,
  '"user_location_country"': 0.00752614887257535,
  '"user_location_region"': 0.00635166380501374,
  '"user_location_city"': 0.00222760096536968,
  '"orig_destination_distance"': -0.0377320625613809,
  '"user_id"': 0.00182006552590588,
  '"is_mobile"': -0.0307536571387009,
  '"is_package"': -0.0763467336220978,
  '"channel"': 0.0244378703859355,
  '"srch_adults_cnt"': -0.0490450028206689,
  '"srch_children_cnt"': -0.0222719630493467,
  '"srch_rm_cnt"': 0.0108395233625253,
  '"srch_destination_id"': 0.0249567593022902,
  '"srch_destination_type_id"': 0.0404728725370959,
  '"is_booking"': 1.0,
  '"cnt"': -0.112906513841061,
  '"hotel_continent"': -0.0261279829743103,
  '"hotel_country"': -0.00394808105853647,
  '"hotel_market"': 0.0119576115952104,
  '"hotel_cluster"': -0.021548065656294},
 'spearman': {},
 'spearmand': {},
 'kendall': {},
 'cramer': {},
 'biserial': {},
 'regr_avgx': {},
 'regr_avgy': {},
 'regr_count': {},
 'regr_intercept': {},
 'regr_r2': {},
 'regr_slope': {},
 'regr_sxx': {},
 'regr_sxy': {},
 'regr_syy': {},
 'approx_unique': 2,
 'count': 37670293}

The catalog is updated whenever major changes are made to the data.

We can also view the vDataFrame’s backend SQL code generation by setting the sql_on parameter to True with the set_option() function:

[19]:
vp.set_option("sql_on", True)
expedia["cnt"].describe()

Computing the different aggregations.

  SELECT
    /*+LABEL('vDataframe.aggregate')*/ APPROXIMATE_COUNT_DISTINCT("cnt")  
  FROM
(  
  SELECT
    "site_name",
    "posa_continent",
    "user_location_country",
    "user_location_region",
    "user_location_city",
    "orig_destination_distance",
    "user_id",
    "is_mobile",
    "is_package",
    "channel",
    "srch_adults_cnt",
    "srch_children_cnt",
    "srch_rm_cnt",
    "srch_destination_id",
    "srch_destination_type_id",
    "is_booking",
    "cnt",
    "hotel_continent",
    "hotel_country",
    "hotel_market",
    "hotel_cluster"  
  FROM
"public"."train")  
VERTICAPY_SUBTABLE  
  LIMIT 1

Computing the descriptive statistics of all numerical columns using SUMMARIZE_NUMCOL.

  SELECT
    /*+LABEL('vDataframe.describe')*/ SUMMARIZE_NUMCOL("cnt") OVER ()  
  FROM
(  
  SELECT
    "site_name",
    "posa_continent",
    "user_location_country",
    "user_location_region",
    "user_location_city",
    "orig_destination_distance",
    "user_id",
    "is_mobile",
    "is_package",
    "channel",
    "srch_adults_cnt",
    "srch_children_cnt",
    "srch_rm_cnt",
    "srch_destination_id",
    "srch_destination_type_id",
    "is_booking",
    "cnt",
    "hotel_continent",
    "hotel_country",
    "hotel_market",
    "hotel_cluster"  
  FROM
"public"."train")  
VERTICAPY_SUBTABLE
[19]:
value
name"cnt"
dtypeint
unique103.0
count37670293
mean1.48338392271067
std1.21977557865588
min1.0
approx_25%1.0
approx_50%1.0
approx_75%2.0
max269.0
Rows: 1-11 | Columns: 2

To control whether each query outputs its elasped time, use the time_on parameter of the set_option() function:

NOTE: To display matplotlib graphics in Jupyter, you must use the ‘%matplotlib inline’ command the first time you draw a graphic.

[20]:
expedia = vp.vDataFrame("public.train") # creating a new vDataFrame to delete the catalog
vp.set_option("time_on", True)
expedia.corr()

Checking if the table is a flextable.

  SELECT
    is_flextable  
  FROM
v_catalog.tables  
  WHERE table_name = 'train' AND table_schema = 'public' AND is_flextable  
  LIMIT 1

Getting the data types.

  SELECT
    /*+LABEL('get_data_types')*/ column_name,
    data_type  
  FROM
((  
  SELECT
    column_name,
    data_type,
    ordinal_position  
  FROM
columns  
  WHERE table_name = 'train' AND table_schema = 'public') UNION (  
  SELECT
    column_name,
    data_type,
    ordinal_position  
  FROM
view_columns  
  WHERE table_name = 'train' AND table_schema = 'public')) x  
  ORDER BY ordinal_position

Computing the pearson Corr Matrix.

  SELECT
    /*+LABEL('vDataframe._aggregate_matrix')*/ CORR_MATRIX("site_name", "posa_continent", "user_location_country", "user_location_region", "user_location_city", "orig_destination_distance", "user_id", "is_mobile", "is_package", "channel", "srch_adults_cnt", "srch_children_cnt", "srch_rm_cnt", "srch_destination_id", "srch_destination_type_id", "is_booking", "cnt", "hotel_continent", "hotel_country", "hotel_market", "hotel_cluster") OVER ()  
  FROM
"public"."train"
Execution: 23.258s

Data type cannot be displayed: application/vnd.plotly.v1+json

The aggregation’s for each vDataColumn are saved to its catalog. If we again call the corr() method, it’ll complete in a couple seconds—the time needed to draw the graphic—because the aggregations have already been computed and saved during the last call:

[22]:
start_time = time.time()
expedia.corr()
print("elapsed time = {}".format(time.time() - start_time))

Computing the pearson Corr Matrix.

  SELECT
    /*+LABEL('vDataframe._aggregate_matrix')*/ CORR_MATRIX("site_name", "posa_continent", "user_location_country", "user_location_region", "user_location_city", "orig_destination_distance", "user_id", "is_mobile", "is_package", "channel", "srch_adults_cnt", "srch_children_cnt", "srch_rm_cnt", "srch_destination_id", "srch_destination_type_id", "is_booking", "cnt", "hotel_continent", "hotel_country", "hotel_market", "hotel_cluster") OVER ()  
  FROM
"public"."train"
Execution: 23.608s
elapsed time = 23.79699158668518

To turn off the elapsed time and the SQL code generation options:

[23]:
vp.set_option("sql_on", False)
vp.set_option("time_on", False)

You can obtain the current vDataFrame relation with the current_relation() method:

[24]:
print(expedia.current_relation())
"public"."train"

The generated SQL for the relation changes according to the user’s modifications. For example, if we impute the missing values of the orig_destination_distance vDataColumn by its average and then drop the is_package vDataColumn, these changes are reflected in the relation:

[25]:
expedia["orig_destination_distance"].fillna(method = "avg")
expedia["is_package"].drop()
print(expedia.current_relation())
13525001 elements were filled.
(
   SELECT
     "date_time",
     "site_name",
     "posa_continent",
     "user_location_country",
     "user_location_region",
     "user_location_city",
     COALESCE("orig_destination_distance", 1970.0900267207) AS "orig_destination_distance",
     "user_id",
     "is_mobile",
     "channel",
     "srch_ci",
     "srch_co",
     "srch_adults_cnt",
     "srch_children_cnt",
     "srch_rm_cnt",
     "srch_destination_id",
     "srch_destination_type_id",
     "is_booking",
     "cnt",
     "hotel_continent",
     "hotel_country",
     "hotel_market",
     "hotel_cluster"
   FROM
 (

   SELECT

                    "date_time",
     "site_name",
     "posa_continent",
     "user_location_country",
     "user_location_region",
     "user_location_city",
     "orig_destination_distance",
     "user_id",
     "is_mobile",
     "channel",
     "srch_ci",
     "srch_co",
     "srch_adults_cnt",
     "srch_children_cnt",
     "srch_rm_cnt",
     "srch_destination_id",
     "srch_destination_type_id",
     "is_booking",
     "cnt",
     "hotel_continent",
     "hotel_country",
     "hotel_market",
     "hotel_cluster"

   FROM
 "public"."train")
VERTICAPY_SUBTABLE)
VERTICAPY_SUBTABLE

Notice that the is_package column has been removed from the SELECT statement and the orig_destination_distance is now using a COALESCE SQL function.

vDataFrame attributes and management#

The vDataFrame has many attributes and methods, some of which were demonstrated in the above examples. vDataFrames have two types of attributes:

  • Virtual Columns (vDataColumn)

  • Main attributes (columns, main_relation …)

The vDataFrame’s main attributes are stored in the _vars dictionary:

NOTE: You should never change these attributes manually.

[26]:
expedia._vars
[26]:
{'allcols_ind': 24,
 'count': 37670293,
 'clean_query': True,
 'exclude_columns': [],
 'history': ['{Wed Jan 10 14:39:26 2024} [Fillna]: 13525001 "orig_destination_distance" missing values were  filled.',
  '{Wed Jan 10 14:39:26 2024} [Drop]: vDataColumn "is_package" was deleted from the vDataFrame.'],
 'isflex': False,
 'max_columns': -1,
 'max_rows': -1,
 'order_by': {},
 'saving': [],
 'sql_push_ext': False,
 'sql_magic_result': 0,
 'symbol': '$',
 'where': [],
 'has_dpnames': False,
 'columns': ['"date_time"',
  '"site_name"',
  '"posa_continent"',
  '"user_location_country"',
  '"user_location_region"',
  '"user_location_city"',
  '"orig_destination_distance"',
  '"user_id"',
  '"is_mobile"',
  '"channel"',
  '"srch_ci"',
  '"srch_co"',
  '"srch_adults_cnt"',
  '"srch_children_cnt"',
  '"srch_rm_cnt"',
  '"srch_destination_id"',
  '"srch_destination_type_id"',
  '"is_booking"',
  '"cnt"',
  '"hotel_continent"',
  '"hotel_country"',
  '"hotel_market"',
  '"hotel_cluster"'],
 'main_relation': '"public"."train"'}

vDataFrame data types#

vDataFrames use the data types of its vDataColumns. The behavior of some vDataFrame methods depend on the data type of the columns. For example, computing a histogram for a numerical data type is not the same as computing a histogram for a categorical data type.

The vDataFrame identifies four main data types: - int: integers are treated like categorical data types when their cardinality is low; otherwise, they are considered numeric - float: numeric data types - date: date-like data types (including timestamp) - text: categorical data types

Data types not included in the above list are automatically treated as categorical. You can examine the data types of the vDataColumns in a vDataFrame using the dtypes() method:

[27]:
expedia.dtypes()
[27]:
dtype
"date_time"timestamp
"site_name"int
"posa_continent"int
"user_location_country"int
"user_location_region"int
"user_location_city"int
"orig_destination_distance"float
"user_id"int
"is_mobile"int
"channel"int
"srch_ci"date
"srch_co"date
"srch_adults_cnt"int
"srch_children_cnt"int
"srch_rm_cnt"int
"srch_destination_id"int
"srch_destination_type_id"int
"is_booking"int
"cnt"int
"hotel_continent"int
"hotel_country"int
"hotel_market"int
"hotel_cluster"int
Rows: 1-23 | Columns: 2

To convert the data type of a vDataColumn, use the astype() method:

[28]:
expedia["hotel_market"].astype("varchar")
expedia["hotel_market"].ctype()
[28]:
'varchar'

To view the category of a specific vDataColumn, specify the vDataColumn and use the category() method:

[29]:
expedia["hotel_market"].category()
[29]:
'text'

Exporting, saving, and loading vDataFrames#

The save() and load() functions allow you to save and load vDataFrames:

[30]:
expedia.save()
expedia.filter("is_booking = 1")
34669600 elements were filtered.
[30]:
📅
date_time
Timestamp(29)
123
site_name
Integer
123
posa_continent
Integer
123
user_location_country
Integer
123
user_location_region
Integer
123
user_location_city
Integer
123
orig_destination_distance
Numeric(19)
123
user_id
Integer
123
is_mobile
Integer
123
channel
Integer
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Integer
123
srch_children_cnt
Integer
123
srch_rm_cnt
Integer
123
srch_destination_id
Integer
123
srch_destination_type_id
Integer
123
is_booking
Integer
123
cnt
Integer
123
hotel_continent
Integer
123
hotel_country
Integer
Abc
hotel_market
Varchar(20)
123
hotel_cluster
Integer
12013-01-07 00:41:0224235195271970.0900267207259991012013-05-052013-05-0610123507611670199
22013-01-07 00:59:5624235057031970.0900267207486892032013-02-082013-02-10211874611161052943
32013-01-07 01:14:192335177335432650.4725119414092013-01-172013-01-1910118788111520325361
42013-01-07 01:26:3723229407163731970.0900267207217610092013-01-082013-01-111011201411125064415
52013-01-07 01:27:19236617414752342.98991043452092013-03-012013-03-022018279111250123083
62013-01-07 02:09:572323513536321970.0900267207627910092013-01-172013-01-18101156555113114238
72013-01-07 03:26:292366348488621100.7225961860092013-02-142013-02-20222113731114128145569
82013-01-07 04:17:26230317561367872.7567438592092013-01-232013-01-25211827311125066094
92013-01-07 04:39:20235070677361970.09002672071093463012013-03-292013-03-3020195231138981167
102013-01-07 04:52:59236625445890143.30241164465002013-09-142013-09-151052767111125042450
112013-01-07 05:00:13230317561364490.0419438592092013-01-122013-01-1311123507611670196
122013-01-07 05:13:3223190133273831970.0900267207857247092013-01-292013-02-0121127103116144101320
132013-01-07 05:14:372366189454841970.09002672071186146092013-01-142013-01-152022533561125019168
142013-01-07 05:57:58236622056137211.5138111728092013-01-252013-01-262011509311125067931
152013-01-07 06:15:012366348488621.0905106903002013-02-042013-02-07111826711125067537
162013-01-07 06:15:19304195991226481970.0900267207992548092013-01-162013-01-19201222586110344399
172013-01-07 06:15:423434634730537108.3081871047092013-01-082013-01-0910121909611613528167
182013-01-07 06:31:552366335535421866.5732229324092013-01-152013-01-222019171111411820087
192013-01-07 06:32:5523661962428593.9359144229022013-03-062013-03-08101823011125063732
202013-01-07 06:37:352323168422961970.0900267207433110092013-01-082013-01-13101714631125066348
212013-01-07 06:43:12113205330121632091.0945422818002013-02-092013-02-19211121886114811073
222013-01-07 06:47:382366337474071970.0900267207328666092013-03-022013-03-0531187401116105127
232013-01-07 06:53:42242350220131970.09002672074166012013-01-152013-01-1720288181113998882
242013-01-07 06:57:0123663114779245.5375798321092013-01-112013-01-1420126227611219838455
252013-01-07 07:00:35113205385342741970.0900267207446411042013-01-102013-01-14202326231121989377
262013-01-07 07:12:36236617414752468.83971043452002013-02-252013-02-262018833111250101293
272013-01-07 07:18:532317918305701970.0900267207801328092013-01-162013-01-172112696411125041016
282013-01-07 07:29:5623662933480799.6924135775092013-01-212013-01-241111236451125064721
292013-01-07 07:32:03238039165631970.0900267207609648022013-01-252013-01-28101821311166827583
302013-01-07 07:32:5423662584687826.8573391512092013-02-122013-02-16101823911125040793
312013-01-07 07:37:48113205354489513517.9097319264042013-04-142013-04-1620112731511617077336
322013-01-07 07:38:25113205385469630.43821069720042013-01-182013-01-191018262111219838495
332013-01-07 07:39:17847782423566451.4598965461092013-02-142013-02-15202225536110639622
342013-01-07 07:45:04236626030771396.0527555580092013-02-212013-02-24101148111112505127
352013-01-07 07:48:532313326494651970.0900267207665776092013-01-102013-01-11221226166116204145275
362013-01-07 07:51:35113205385342741970.0900267207446411042013-01-142013-01-1510112942511219838450
372013-01-07 08:01:502366174168786482.22771076786092013-03-022013-03-0911182171110344385
382013-01-07 08:04:472366442761479.376119642092013-02-042013-02-0710112269611250123050
392013-01-07 08:05:51236622025112176.807185301192013-02-132013-02-162011220661125062845
402013-01-07 08:08:0137169959201861970.090026720710762092013-04-052013-04-1741288181113998846
412013-01-07 08:11:4923662589875411.3601766627092013-01-102013-01-123012674061125019228
422013-01-07 08:14:58236618411878826.1879448154192013-01-082013-01-10221126035112503657
432013-01-07 08:15:1023663319189225.419118167092013-01-172013-01-18101119286112501917
442013-01-07 08:21:2323662204948647.0217656567092013-01-082013-01-091011201611125074194
452013-01-07 08:22:05113205354334521284.8882716411092013-02-072013-02-08111826011125070177
462013-01-07 08:24:50236632191011397.561982641012013-04-282013-05-02201825011125062879
472013-01-07 08:29:49236631533371106.0414466806092013-02-012013-02-02402952411125056142
482013-01-07 08:49:22236618454612110.11671053407092013-01-072013-01-081011245251125065628
492013-01-07 08:50:1223663517929966.7501554890092013-03-012013-03-05201826011125070117
502013-01-07 08:51:572366363260062381.5825988395022013-01-232013-01-280011213211125036648
512013-01-07 08:52:2223664484441797.0171887225092013-01-182013-01-20212113581112505706
522013-01-07 09:07:472366348184874.7465183665092013-01-112013-01-131012479961125060972
532013-01-07 09:19:4137169900134391970.0900267207749006142013-01-072013-01-101012210311670191530
542013-01-07 09:23:31236635612482618.1144752237022013-04-192013-04-252011217561125036631
552013-01-07 09:26:44236625820843889.6633657617002013-01-132013-01-162011219661125065691
562013-01-07 09:27:5323661842795194.2886814964092013-01-092013-01-102211219661125065628
572013-01-07 09:30:172346157348681.92421088771092013-01-092013-01-1120184376116144415
582013-01-07 09:30:32236625653274290.8555958039092013-01-182013-01-20211827611125066148
592013-01-07 09:32:2423661745592636.42967544092013-01-252013-01-262212412361125036331
602013-01-07 09:33:552366346313715582.4818561643002013-07-302013-08-0321187391116144478
612013-01-07 09:35:102366174259511558.7211538076092013-01-152013-01-161012675661125041418
622013-01-07 09:39:5223662942078598.8608564188092013-01-132013-01-171014375311250105891
632013-01-07 09:43:552366348488621584.7565372001092013-04-042013-04-0720182771112504124
642013-01-07 09:43:572366435304318176.5169751319092013-01-252013-01-26101317131131716164
652013-01-07 09:46:45236617451219209.92071047580092013-01-092013-01-10201825011125062854
662013-01-07 09:56:112366348337051970.0900267207403847002013-01-122013-01-132012462961125042042
672013-01-07 09:57:1237169596244101354.78922482002013-02-122013-02-1640220147111611124446
682013-01-07 10:02:422366363165031970.090026720772073012013-01-102013-01-12101113731114128145526
692013-01-07 10:10:5317113312470831970.0900267207705176002013-02-012013-02-041018279111250123070
702013-01-07 10:11:032366346292172.3981560722092013-01-252013-01-262012329361125063354
712013-01-07 10:13:072366220397301970.0900267207813085092013-06-092013-06-11211195351112505935
722013-01-07 10:14:462366226423008385.6368426784092013-03-262013-03-272011168361131716198
732013-01-07 10:17:44236617427251334.5682656573092013-02-072013-02-10201153131125036533
742013-01-07 10:18:472366184131111871.5249807452092013-02-062013-02-071012019231125052928
752013-01-07 10:19:0337169682524361970.0900267207125562002013-01-212013-01-2310187971116144145059
762013-01-07 10:20:0634320513527655185.131936657092013-02-212013-02-2320125944611219837010
772013-01-07 10:21:08236644246407157.744335913092013-01-082013-01-09111821811125074348
782013-01-07 10:23:132312790531391970.0900267207345926092013-03-012013-03-122114420181125067632
792013-01-07 10:26:4123663631234681.4298858744092013-01-102013-01-111014563161125067541
802013-01-07 10:28:302366184539805647.1654844887092013-01-102013-01-15101126605114124194636
812013-01-07 10:32:26113205354432011041.7216748599092013-02-012013-02-022014174341125058213
822013-01-07 10:39:12236631443382777.47711100704092013-01-072013-01-08211826811125068275
832013-01-07 10:39:502313448312891970.0900267207766457092013-04-132013-04-1640288081116169289
842013-01-07 10:40:472366448473572322.2325728749092013-01-202013-01-26101828111125066349
852013-01-07 10:48:522366174520101970.09002672071016574092013-01-112013-01-121012497761125036517
862013-01-07 10:49:0223661741749426.4261696583092013-02-042013-02-071013547311250123010
872013-01-07 10:49:09236619624283833.6279551478092013-06-212013-06-23312874511162042797
882013-01-07 10:49:5734320531210553812.0063106944092013-03-232013-03-2922111629111219837068
892013-01-07 10:51:362366220430261272.3001640828092013-02-102013-02-1320212257611451150934
902013-01-07 10:51:39236622020862179.5787417637092013-01-282013-01-31211825011125062879
912013-01-07 11:04:5834320541154864744.8963324697092013-03-042013-03-0520126414612219839910
922013-01-07 11:05:13236617446432134.0731339047012013-01-072013-01-10201242966112503553
932013-01-07 11:09:5613146244453711970.0900267207686091092013-05-242013-05-25211239881116105183785
942013-01-07 11:12:372366348227721970.09002672071115906092013-01-072013-01-081012581961125069747
952013-01-07 11:19:002366184220121970.0900267207418305092013-01-262013-01-2811180523113104100346
962013-01-07 11:20:032366258425003962.4226351105092013-02-122013-02-1320122869611493208536
972013-01-07 11:21:50236622037814338.04811040971092013-02-032013-02-051011848111167030977
982013-01-07 11:23:48236618479761970.0900267207808937092013-01-212013-01-231012562561125056291
992013-01-07 11:28:201132051552209144.9115537131092013-01-152013-01-16101263781112198167143
1002013-01-07 11:29:521132051551395128.4059677922092013-01-092013-01-102012352961125043513
Rows: 1-100 of 3000693 | Columns: 23

To return a vDataFrame to a previously saved structure, use the load() function:

[31]:
expedia = expedia.load()
print(expedia.shape())
(37670293, 23)

Tip: For more information about a function, use the help() function

Because vDataFrames are views of data stored in the connected Vertica database, any modifications made to the vDataFrame are not reflected in the underlying data in the database. To save a vDataFrame’s relation to the database, use the to_db() method.

It’s good practice to examine the expected disk usage of the vDataFrame before exporting it to the database:

[33]:
expedia.expected_store_usage(unit = "Gb")
100%|██████████| 2/2 [00:03<00:00,  1.74s/it]
[33]:
expected_size (Gb)
max_size (Gb)
type
"date_time"7.450580596923828e-090.2806655541062355timestamp
"site_name"0.0470501743257045750.2806655541062355int
"posa_continent"0.035083194263279440.2806655541062355int
"user_location_country"0.074533438310027120.2806655541062355int
"user_location_region"0.099749343469738960.2806655541062355int
"user_location_city"0.167342474684119220.2806655541062355int
"orig_destination_distance"7.450580596923828e-090.2806655541062355float
"user_id"0.213396676816046240.2806655541062355int
"is_mobile"0.035083194263279440.2806655541062355int
"channel"0.035088971257209780.2806655541062355int
"srch_ci"7.450580596923828e-090.28031475841999054date
"srch_co"7.450580596923828e-090.28031475096940994date
"srch_adults_cnt"0.035083194263279440.2806655541062355int
"srch_children_cnt"0.035083194263279440.2806655541062355int
"srch_rm_cnt"0.035083194263279440.2806655541062355int
"srch_destination_id"0.156296298839151860.2806655541062355int
"srch_destination_type_id"0.035083194263279440.2806655541062355int
"is_booking"0.035083194263279440.2806655541062355int
"cnt"0.035189012065529820.2806655541062355int
"hotel_continent"0.035083194263279440.2806655541062355int
"hotel_country"0.07826952729374170.2806655541062355int
"hotel_market"0.105387241579592232.806655541062355varchar
"hotel_cluster"0.066515024751424790.2806655541062355int
separator0.80691346805542710.8069134680554271
header3.4831464290618896e-073.4831464290618896e-07
rawsize2.1663975836709149.787509948946536
Rows: 1-26 | Columns: 4

If you decide that there is sufficient space to store the vDataFrame in the database, run the to_db() method:

[ ]:
expedia.to_db("public.expedia_clean",
              relation_type = "table")