Booking

This example uses the 'Expedia' dataset to predict, based on site activity, whether a user is likely to make a booking. You can download the Jupyter Notebook of the study here and the the dataset here.

  • cnt: Number of similar events in the context of the same user session.
  • user_location_city: The ID of the city in which the customer is located.
  • is_package: 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise.
  • user_id: ID of the user
  • srch_children_cnt: The number of (extra occupancy) children specified in the hotel room.
  • channel: marketing ID of a marketing channel.
  • hotel_cluster: ID of a hotel cluster.
  • srch_destination_id: ID of the destination where the hotel search was performed.
  • is_mobile: 1 if the user is on a mobile device, 0 otherwise.
  • srch_adults_cnt: The number of adults specified in the hotel room.
  • user_location_country: The ID of the country in which the customer is located.
  • srch_destination_type_id: ID of the destination where the hotel search was performed.
  • srch_rm_cnt: The number of hotel rooms specified in the search.
  • posa_continent: ID of the continent associated with the site_name.
  • srch_ci: Check-in date.
  • user_location_region: The ID of the region in which the customer is located.
  • hotel_country: Hotel's country.
  • srch_co: Check-out date.
  • is_booking: 1 if a booking, 0 if a click.
  • orig_destination_distance: Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated.
  • hotel_continent: Hotel continent.
  • site_name: ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...).

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset.

In [3]:
expedia = vp.read_csv('data/expedia.csv', parse_nrows=1000)
expedia.head(5)
Out[3]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
12013-01-07 00:00:022423505703[null]4618990092013-03-142013-03-1521166930125021241
22013-01-07 00:00:062366174211775713.6206137960092013-01-192013-01-2610188211036173058
32013-01-07 00:00:0611320515514703795.729811285750092013-01-192013-01-2210125064601250123091
42013-01-07 00:00:093716976141949[null]10804760192013-05-292013-06-05201763530125067510
52013-01-07 00:00:173716976141949[null]10188950002013-09-082013-09-102012721560225064559
Rows: 1-5 | Columns: 24

Data Exploration and Preparation

Sessionization is the process of gathering clicks for a certain period of time. We usually consider that after 30 minutes of inactivity, the user session ends (date_time - lag(date_time) > 30 minutes). For these kinds of use cases, aggregating sessions with meaningful statistics is the key for making accurate predictions.

We start by using the sessionize' method to create the variable 'session_id. We can then use this variable to aggregate the data.

In [4]:
expedia.sessionize(ts = "date_time", 
                   by = ["user_id"], 
                   session_threshold = "30 minutes", 
                   name = "session_id")
Out[4]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
123
session_id
Integer
12014-09-03 08:18:412366174904553.605901022014-09-062014-09-07221246936012501241190
22014-12-03 11:28:11236617442538343.786601022014-12-292014-12-3021182791012501230981
32014-12-03 11:28:58236617442538343.990101022014-12-292014-12-3021182791012501230401
42014-12-03 11:30:21236617442538343.796801022014-12-292014-12-3021182791012501230561
52014-12-03 11:32:43236617442538342.216301022014-12-292014-12-303218279102250358721
62014-12-06 01:38:262366174904588.148901022014-12-192014-12-202218279101250358282
72014-04-25 16:14:282366174374495405.488911022014-11-072014-11-09311628814014982052200
82014-04-25 16:14:462366174374495405.488911022014-11-072014-11-09512628814014982052200
92014-04-25 16:15:112366174374495405.488911022014-11-072014-11-09312628814014982052200
102014-04-25 16:15:502366174374495405.740911022014-11-072014-11-09312628814014982052600
112014-01-27 12:56:00236631125671[null]20192014-04-082014-04-1220111621102250703960
122014-01-27 13:00:31236631125671[null]20192014-04-082014-04-1220111621101250703720
132014-01-27 13:04:55236631125671[null]20192014-04-082014-04-1220111621101250703910
142014-01-27 13:06:31236631125671[null]20192014-04-082014-04-1220111621101250703910
152014-01-27 13:14:21236631125671[null]20092014-04-082014-04-1220111621102250703910
162014-01-27 13:18:43236631125671[null]20192014-04-082014-04-1220111621101250703940
172014-01-27 13:22:39236631125671[null]20192014-04-082014-04-1220111621101250703210
182014-01-27 13:39:30236631125671[null]20192014-04-082014-04-1120111621102250703960
192014-01-27 13:54:07236631125671[null]20192014-04-082014-04-1120111621102250703940
202014-01-28 12:20:02236631125671[null]20192014-04-082014-04-1120112008103250686951
212014-01-28 12:27:47236631125671[null]20192014-04-082014-04-1120112008101250686411
222014-01-28 12:28:31236631125671[null]20192014-04-082014-04-1120112008101250686731
232014-01-28 12:31:18236631125671[null]20192014-04-082014-04-1120112008101250686681
242014-01-28 12:32:35236631125671[null]20192014-04-082014-04-1120112008101250686911
252014-01-28 12:34:21236631125671[null]20192014-04-082014-04-1120112008101250686911
262014-01-28 12:37:49236631125671[null]20192014-04-082014-04-1120112008102250686181
272014-01-28 12:44:16236631125671[null]20092014-04-082014-04-1120112008101250686181
282014-01-28 17:06:02236631125671[null]20122014-04-082014-04-112018250104250628182
292014-01-28 17:11:35236631125671[null]20122014-04-082014-04-11201825010125062812
302014-07-31 07:10:392366321516421432.965630092015-02-072015-02-103222858660125070550
312014-09-08 09:56:2123664675222142.746730092014-10-112014-10-122018261101250646181
322014-09-08 09:58:5023664675222141.045430092014-10-112014-10-12201826110125064621
332014-09-09 10:49:4823664675222143.299830092015-02-012015-02-0222127699601250646482
342014-09-22 13:19:0223664675222141.127330032015-02-012015-02-0220127699601250646943
352014-09-22 13:22:2523664675222141.127330032015-02-012015-02-023018261101250646943
362014-09-30 12:56:5123664675222142.746730092014-10-112014-10-122018261101250646184
372014-10-20 11:19:2123664675222144.408430092015-02-012015-02-0222127704601250646475
382014-08-14 18:11:08236625845545110.468240092014-10-182014-10-202018283101250414210
392014-08-29 16:53:112366258455451318.415240072015-01-302015-02-036018250101250628711
402014-08-29 16:54:162366258455451318.817740072015-01-302015-02-036018250101250628711
412014-08-29 16:55:212366258455451319.002240072015-01-302015-02-036018250101250628381
422014-08-29 16:58:412366258455451318.138440072015-01-302015-02-032018250101250628341
432014-08-29 16:59:412366258455451321.061440072015-01-302015-02-032018250101250628711
442014-08-29 17:00:332366258455451317.399940072015-01-302015-02-032018250101250628181
452014-12-07 18:18:2623662584687978.186840012015-03-182015-03-2222111624601250666732
462014-12-07 18:19:1023662584687978.369140012015-03-182015-03-2222111624601250666732
472014-12-07 18:25:2023662584687149.62540012015-01-022015-01-0422144751012501343152
482014-01-26 23:56:46236617424103960.42501102014-03-212014-03-25111828110125066390
492014-01-27 00:00:31236617424103960.42501102014-03-222014-03-26111828110325066390
502014-01-27 00:03:44236617424103960.1149501102014-03-222014-03-262118281101250663680
512014-01-27 00:05:12236617424103960.5072501102014-03-222014-03-262118281101250663180
522014-01-27 00:06:18236617424103960.7861501102014-03-222014-03-261118281101250663420
532014-01-27 00:08:48236617424103960.0054501102014-03-222014-03-261118281101250663700
542014-01-27 00:10:17236617424103960.001501102014-03-222014-03-2611182811012250663410
552014-01-27 00:36:54236617424103960.001501102014-03-222014-03-261118281111250663410
562014-01-27 01:17:09236617424103960.4250122014-03-222014-03-26111828110125066391
572014-07-30 17:13:07236635651532836.968361092014-12-032014-12-0623112175601250366280
582014-08-28 11:54:122366246497952574.207561022014-12-032014-12-0620112175601250366131
592014-08-28 12:03:062366246497952573.166761022014-12-032014-12-0620112175601250366281
602014-08-29 16:25:062366246506612579.843361022014-12-032014-12-0620112175601250366282
612014-08-29 16:26:142366246506612577.550961022014-12-032014-12-0620112175601250366912
622014-08-29 16:27:072366246506612577.692661022014-12-032014-12-0620112175601250366132
632014-09-01 23:56:272366246506612577.239961022014-12-032014-12-0620112175602250366703
642014-09-01 23:58:042366246506612577.692661022014-12-032014-12-0620112175601250366133
652014-09-11 19:32:52236635622202756.903161022014-12-082014-12-11201825010125062814
662014-10-10 12:25:062366246506612759.777860092014-12-082014-12-11201825010125062815
672014-11-03 01:01:07236646249272876.151661022014-12-082014-12-112018250101250628546
682014-11-03 11:09:562366246506612755.818260002014-12-082014-12-092018250101250628547
692014-11-04 20:26:42236617424103225.671761092014-12-082014-12-092018250101250628548
702014-11-21 08:16:012366246506612609.687460002014-12-112014-12-1220112264601250368519
712014-11-21 08:18:122366246506612610.281660002014-12-112014-12-1220112264601250368479
722014-11-21 08:31:092366246506612610.280860002014-12-112014-12-1220112264601250368519
732014-11-24 20:36:282366246506612609.904360002014-12-112014-12-12221122646022503681710
742014-11-24 20:36:382366246506612610.40160002014-12-112014-12-12221122646012503681610
752014-11-24 20:37:162366246506612610.418160002014-12-112014-12-1222112264601250368410
762014-11-24 21:00:402366246506612607.752760002014-12-112014-12-12221525123012503681610
772014-11-24 21:01:012366246506612607.738360002014-12-112014-12-12221525123012503683910
782014-08-27 09:45:422366318316465378.691870022014-10-312014-11-0120179210562081480640
792014-08-27 09:51:322366318316465378.691870022015-01-292015-02-0820179210262081480640
802014-08-27 09:58:022366318316465378.691870022015-01-302015-02-0520179210262081480640
812014-08-27 09:58:552366318316465375.904570022014-10-312014-11-0120179210162081480860
822014-08-27 10:01:092366318316465380.631970022014-10-312014-11-0120179210162081480640
832014-08-27 10:03:532366318316465380.631970122015-01-302015-02-0520179210162081480640
842014-09-10 12:20:392366318316463936.106270022014-10-122014-10-13101226166016204145251
852014-09-24 11:23:592366318878830.259170092014-12-042014-12-072011953510225059352
862014-09-24 11:36:082366318878830.259170092014-12-042014-12-072011953511125059352
872014-09-24 11:57:492366318878830.259170192014-12-042014-12-072011953510125059352
882014-10-06 18:28:52236631833705826.944370002014-12-042014-12-0720119535101250593913
892014-10-06 18:32:36236631833705826.999470002014-12-042014-12-072011953510125059353
902014-10-29 05:57:4423663188031862.759970022014-11-112014-11-1220119535102250593324
912014-10-29 05:58:4423663188031862.733770022014-11-112014-11-122011953510125059374
922014-10-31 07:14:0323663188031863.078870142014-12-042014-12-0720119535106250593905
932014-11-06 10:14:59236631841478[null]70122014-12-042014-12-0720119535101250593326
942014-11-06 10:44:50236631841478[null]70122014-12-042014-12-0720119535101250593286
952014-11-06 10:45:59236631841478[null]70122014-12-042014-12-0720119535101250593426
962014-11-06 10:47:25236631841478[null]70122014-12-042014-12-0720119535101250593766
972014-11-06 11:26:14236631841478[null]70022014-12-042014-12-0720119535101250593337
982014-11-06 11:27:14236631841478[null]70122014-12-042014-12-0720119535101250593337
992014-11-06 11:27:47236631841478[null]70122014-12-042014-12-0720119535102250593327
1002014-11-07 11:01:05236631841478[null]70192014-12-042014-12-0720119535102250593738
Rows: 1-100 | Columns: 25

The duration of the trip should also influence/be indicative of the user's behavior on the site, so we'll take that into account.

In [5]:
expedia["trip_duration"] = expedia["srch_co"] - expedia["srch_ci"]

If a user looks at the same hotel several times, then it might mean that they're looking to book that hotel during the session.

In [6]:
expedia.analytic('mode', 
                 columns = "hotel_cluster", 
                 by = ["user_id",
                       "session_id"], 
                 name = "mode_hotel_cluster",
                 add_count = True)
Out[6]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
123
session_id
Integer
123
trip_duration
Integer
123
mode_hotel_cluster
Integer
123
mode_hotel_cluster_count
Integer
12014-09-03 08:18:412366174904553.605901022014-09-062014-09-072212469360125012411901191
22014-12-03 11:32:43236617442538342.216301022014-12-292014-12-3032182791022503587211721
32014-12-03 11:30:21236617442538343.796801022014-12-292014-12-30211827910125012305611721
42014-12-03 11:28:58236617442538343.990101022014-12-292014-12-30211827910125012304011721
52014-12-03 11:28:11236617442538343.786601022014-12-292014-12-30211827910125012309811721
62014-12-06 01:38:262366174904588.148901022014-12-192014-12-2022182791012503582821281
72014-04-25 16:14:462366174374495405.488911022014-11-072014-11-095126288140149820522002203
82014-04-25 16:15:112366174374495405.488911022014-11-072014-11-093126288140149820522002203
92014-04-25 16:15:502366174374495405.740911022014-11-072014-11-093126288140149820526002203
102014-04-25 16:14:282366174374495405.488911022014-11-072014-11-093116288140149820522002203
112014-01-27 13:06:31236631125671[null]20192014-04-082014-04-12201116211012507039104913
122014-01-27 12:56:00236631125671[null]20192014-04-082014-04-12201116211022507039604913
132014-01-27 13:18:43236631125671[null]20192014-04-082014-04-12201116211012507039404913
142014-01-27 13:14:21236631125671[null]20092014-04-082014-04-12201116211022507039104913
152014-01-27 13:39:30236631125671[null]20192014-04-082014-04-11201116211022507039603913
162014-01-27 13:54:07236631125671[null]20192014-04-082014-04-11201116211022507039403913
172014-01-27 13:04:55236631125671[null]20192014-04-082014-04-12201116211012507039104913
182014-01-27 13:00:31236631125671[null]20192014-04-082014-04-12201116211012507037204913
192014-01-27 13:22:39236631125671[null]20192014-04-082014-04-12201116211012507032104913
202014-01-28 12:37:49236631125671[null]20192014-04-082014-04-11201120081022506861813182
212014-01-28 12:34:21236631125671[null]20192014-04-082014-04-11201120081012506869113182
222014-01-28 12:32:35236631125671[null]20192014-04-082014-04-11201120081012506869113182
232014-01-28 12:28:31236631125671[null]20192014-04-082014-04-11201120081012506867313182
242014-01-28 12:31:18236631125671[null]20192014-04-082014-04-11201120081012506866813182
252014-01-28 12:27:47236631125671[null]20192014-04-082014-04-11201120081012506864113182
262014-01-28 12:20:02236631125671[null]20192014-04-082014-04-11201120081032506869513182
272014-01-28 12:44:16236631125671[null]20092014-04-082014-04-11201120081012506861813182
282014-01-28 17:06:02236631125671[null]20122014-04-082014-04-1120182501042506281823181
292014-01-28 17:11:35236631125671[null]20122014-04-082014-04-112018250101250628123181
302014-07-31 07:10:392366321516421432.965630092015-02-072015-02-103222858660125070550351
312014-09-08 09:58:5023664675222141.045430092014-10-112014-10-12201826110125064621121
322014-09-08 09:56:2123664675222142.746730092014-10-112014-10-122018261101250646181121
332014-09-09 10:49:4823664675222143.299830092015-02-012015-02-02221276996012506464821481
342014-09-22 13:22:2523664675222141.127330032015-02-012015-02-0230182611012506469431942
352014-09-22 13:19:0223664675222141.127330032015-02-012015-02-02201276996012506469431942
362014-09-30 12:56:5123664675222142.746730092014-10-112014-10-1220182611012506461841181
372014-10-20 11:19:2123664675222144.408430092015-02-012015-02-02221277046012506464751471
382014-08-14 18:11:08236625845545110.468240092014-10-182014-10-2020182831012504142102211
392014-08-29 16:59:412366258455451321.061440072015-01-302015-02-0320182501012506287114713
402014-08-29 16:54:162366258455451318.817740072015-01-302015-02-0360182501012506287114713
412014-08-29 17:00:332366258455451317.399940072015-01-302015-02-0320182501012506281814713
422014-08-29 16:53:112366258455451318.415240072015-01-302015-02-0360182501012506287114713
432014-08-29 16:55:212366258455451319.002240072015-01-302015-02-0360182501012506283814713
442014-08-29 16:58:412366258455451318.138440072015-01-302015-02-0320182501012506283414713
452014-12-07 18:19:1023662584687978.369140012015-03-182015-03-22221116246012506667324732
462014-12-07 18:18:2623662584687978.186840012015-03-182015-03-22221116246012506667324732
472014-12-07 18:25:2023662584687149.62540012015-01-022015-01-04221447510125013431522732
482014-01-27 00:00:31236617424103960.42501102014-03-222014-03-26111828110325066390492
492014-01-27 00:36:54236617424103960.001501102014-03-222014-03-261118281111250663410492
502014-01-27 00:08:48236617424103960.0054501102014-03-222014-03-261118281101250663700492
512014-01-27 00:06:18236617424103960.7861501102014-03-222014-03-261118281101250663420492
522014-01-27 00:10:17236617424103960.001501102014-03-222014-03-2611182811012250663410492
532014-01-27 00:05:12236617424103960.5072501102014-03-222014-03-262118281101250663180492
542014-01-27 00:03:44236617424103960.1149501102014-03-222014-03-262118281101250663680492
552014-01-26 23:56:46236617424103960.42501102014-03-212014-03-25111828110125066390492
562014-01-27 01:17:09236617424103960.4250122014-03-222014-03-26111828110125066391491
572014-07-30 17:13:07236635651532836.968361092014-12-032014-12-06231121756012503662803281
582014-08-28 11:54:122366246497952574.207561022014-12-032014-12-06201121756012503661313131
592014-08-28 12:03:062366246497952573.166761022014-12-032014-12-06201121756012503662813131
602014-08-29 16:27:072366246506612577.692661022014-12-032014-12-06201121756012503661323131
612014-08-29 16:26:142366246506612577.550961022014-12-032014-12-06201121756012503669123131
622014-08-29 16:25:062366246506612579.843361022014-12-032014-12-06201121756012503662823131
632014-09-01 23:58:042366246506612577.692661022014-12-032014-12-06201121756012503661333131
642014-09-01 23:56:272366246506612577.239961022014-12-032014-12-06201121756022503667033131
652014-09-11 19:32:52236635622202756.903161022014-12-082014-12-11201825010125062814311
662014-10-10 12:25:062366246506612759.777860092014-12-082014-12-11201825010125062815311
672014-11-03 01:01:07236646249272876.151661022014-12-082014-12-1120182501012506285463541
682014-11-03 11:09:562366246506612755.818260002014-12-082014-12-0920182501012506285471541
692014-11-04 20:26:42236617424103225.671761092014-12-082014-12-0920182501012506285481541
702014-11-21 08:16:012366246506612609.687460002014-12-112014-12-12201122646012503685191512
712014-11-21 08:31:092366246506612610.280860002014-12-112014-12-12201122646012503685191512
722014-11-21 08:18:122366246506612610.281660002014-12-112014-12-12201122646012503684791512
732014-11-24 21:00:402366246506612607.752760002014-12-112014-12-122215251230125036816101162
742014-11-24 20:37:162366246506612610.418160002014-12-112014-12-12221122646012503684101162
752014-11-24 21:01:012366246506612607.738360002014-12-112014-12-122215251230125036839101162
762014-11-24 20:36:282366246506612609.904360002014-12-112014-12-122211226460225036817101162
772014-11-24 20:36:382366246506612610.40160002014-12-112014-12-122211226460125036816101162
782014-08-27 09:58:022366318316465378.691870022015-01-302015-02-05201792102620814806406645
792014-08-27 10:01:092366318316465380.631970022014-10-312014-11-01201792101620814806401645
802014-08-27 10:03:532366318316465380.631970122015-01-302015-02-05201792101620814806406645
812014-08-27 09:51:322366318316465378.691870022015-01-292015-02-082017921026208148064010645
822014-08-27 09:45:422366318316465378.691870022014-10-312014-11-01201792105620814806401645
832014-08-27 09:58:552366318316465375.904570022014-10-312014-11-01201792101620814808601645
842014-09-10 12:20:392366318316463936.106270022014-10-122014-10-13101226166016204145251151
852014-09-24 11:23:592366318878830.259170092014-12-042014-12-072011953510225059352353
862014-09-24 11:57:492366318878830.259170192014-12-042014-12-072011953510125059352353
872014-09-24 11:36:082366318878830.259170092014-12-042014-12-072011953511125059352353
882014-10-06 18:28:52236631833705826.944370002014-12-042014-12-07201195351012505939133911
892014-10-06 18:32:36236631833705826.999470002014-12-042014-12-0720119535101250593533911
902014-10-29 05:57:4423663188031862.759970022014-11-112014-11-12201195351022505933241321
912014-10-29 05:58:4423663188031862.733770022014-11-112014-11-1220119535101250593741321
922014-10-31 07:14:0323663188031863.078870142014-12-042014-12-07201195351062505939053901
932014-11-06 10:47:25236631841478[null]70122014-12-042014-12-07201195351012505937663761
942014-11-06 10:45:59236631841478[null]70122014-12-042014-12-07201195351012505934263761
952014-11-06 10:44:50236631841478[null]70122014-12-042014-12-07201195351012505932863761
962014-11-06 10:14:59236631841478[null]70122014-12-042014-12-07201195351012505933263761
972014-11-06 11:26:14236631841478[null]70022014-12-042014-12-07201195351012505933373332
982014-11-06 11:27:14236631841478[null]70122014-12-042014-12-07201195351012505933373332
992014-11-06 11:27:47236631841478[null]70122014-12-042014-12-07201195351022505933273332
1002014-11-07 11:03:11236631841478[null]70192014-12-042014-12-07201195351012505933283321
Rows: 1-100 | Columns: 28

We can now aggregate the session and get some useful statistics out of it:

  • end_session_date_time : Date and time when the session ends.
  • session_duration : Session duration.
  • is_booking : 1 if the user booked during the session, 0 otherwise.
  • trip_duration : Trip duration.
  • orig_destination_distance : Average of the physical distances between the hotels and the customer.
  • srch_family_cnt : The number of people specified in the hotel room.
In [7]:
import verticapy.stats as st
expedia = expedia.groupby(
      columns = ["user_id",
                 "session_id", 
                 "mode_hotel_cluster_count"], 
      expr = [st.max(expedia["date_time"])._as("end_session_date_time"),
              ((st.max(expedia["date_time"]) - st.min(expedia["date_time"])) / st.interval("1 second"))._as(
                  "session_duration"),
              st.max(expedia["is_booking"])._as("is_booking"),
              st.avg(expedia["trip_duration"])._as("trip_duration"),
              st.avg(expedia["orig_destination_distance"])._as("avg_distance"),
              st.sum(expedia["cnt"])._as("nb_click_session"),
              st.median(expedia["srch_children_cnt"] + expedia["srch_adults_cnt"])._as("srch_family_cnt")])

Let's look at the missing values.

In [8]:
expedia.count_percent()
Out[8]:
count
percent
"user_id"12242331.0100.0
"session_id"12242331.0100.0
"mode_hotel_cluster_count"12242331.0100.0
"end_session_date_time"12242331.0100.0
"session_duration"12242331.0100.0
"is_booking"12242331.0100.0
"nb_click_session"12242331.0100.0
"srch_family_cnt"12242331.0100.0
"trip_duration"12226336.099.869
"avg_distance"7966565.065.074
Rows: 1-10 | Columns: 3

Let's impute the missing values for avg_distance and trip_duration.

In [9]:
expedia["avg_distance" ].fillna(method = "avg")
expedia["trip_duration"].fillna(method = "avg")
4275766 elements were filled.
15995 elements were filled.
Out[9]:
123
user_id
Integer
100%
123
session_id
Integer
100%
123
mode_hotel_cluster_count
Integer
100%
📅
end_session_date_time
Datetime
100%
123
session_duration
Float
100%
123
is_booking
Integer
100%
123
trip_duration
Float
100%
123
avg_distance
Float
100%
123
nb_click_session
Integer
100%
123
srch_family_cnt
Float
100%
10012014-09-03 08:18:410.001.053.605914.0
20112014-12-03 11:32:43272.001.0343.4474553.0
30212014-12-06 01:38:260.001.088.148914.0
41032014-04-25 16:15:5082.002.05405.551944.0
52032014-01-27 13:54:073487.003.777777777777781847.26115380414132.0
62122014-01-28 12:44:161454.003.01847.26115380414112.0
72212014-01-28 17:11:35333.003.01847.2611538041452.0
83012014-07-31 07:10:390.003.01432.965615.0
93112014-09-08 09:58:50149.001.0141.8960522.0
103212014-09-09 10:49:480.001.0143.299814.0
113322014-09-22 13:22:25203.001.0141.127322.5
123412014-09-30 12:56:510.001.0142.746712.0
133512014-10-20 11:19:210.001.0144.408414.0
144012014-08-14 18:11:080.002.0110.468212.0
154132014-08-29 17:00:33442.004.01318.805864.0
164222014-12-07 18:25:20414.003.33333333333333702.060334.0
175022014-01-27 00:36:542408.014.0960.28195212.0
185112014-01-27 01:17:090.004.0960.4212.0
196012014-07-30 17:13:070.003.0836.968315.0
206112014-08-28 12:03:06534.003.02573.687122.0
216212014-08-29 16:27:07121.003.02578.3622666666732.0
226312014-09-01 23:58:0497.003.02577.4662532.0
236412014-09-11 19:32:520.003.0756.903112.0
246512014-10-10 12:25:060.003.02759.777812.0
256612014-11-03 01:01:070.003.0876.151612.0
266712014-11-03 11:09:560.001.02755.818212.0
276812014-11-04 20:26:420.001.0225.671712.0
286922014-11-21 08:31:09908.001.02610.0832666666732.0
2961022014-11-24 21:01:011473.001.02609.2428864.0
307052014-08-27 10:03:531091.004.166666666666675378.87395122.0
317112014-09-10 12:20:390.001.03936.106211.0
327232014-09-24 11:57:492030.013.0830.259142.0
337312014-10-06 18:32:36224.003.0826.9718522.0
347412014-10-29 05:58:4460.001.0862.746832.0
357512014-10-31 07:14:030.003.0863.078862.0
367612014-11-06 10:47:251946.003.01847.2611538041442.0
377722014-11-06 11:27:4793.003.01847.2611538041442.0
387812014-11-07 11:03:11126.003.01847.2611538041432.0
397912014-11-10 16:27:130.005.01847.2611538041412.0
4071012014-11-12 11:40:2843.001.01847.2611538041422.0
4171112014-11-20 07:56:48126.007.05457.3971522.0
4271212014-11-20 09:02:530.003.0866.334912.0
4371312014-12-01 15:07:490.003.0862.332912.0
4471412014-12-18 11:04:141587.002.0616.597522.0
4571512014-12-18 12:16:290.002.0628.253812.0
4671612014-12-18 13:19:090.002.0628.253812.0
478012014-02-08 19:20:310.003.01658.195911.0
488112014-03-26 10:39:480.003.0450.407312.0
498232014-03-26 12:19:101519.003.0453.8837562.0
508312014-04-06 23:13:49530.001.0453.416552.0
518412014-04-19 12:30:500.002.01318.273211.0
528512014-04-19 13:11:530.002.01318.118511.0
538612014-04-21 09:53:490.003.01318.118511.0
548712014-08-13 20:53:481364.002.01067.9173522.0
558832014-08-16 21:43:25532.009.3333333333333310.566566666666732.0
568912014-08-18 06:11:310.007.07.484312.0
5781022014-09-11 21:39:10108.002.020.698322.0
5881122014-09-14 02:44:58331.011.018.406932.0
599022014-01-26 23:45:072596.005.888888888888892353.924591.0
609112014-01-27 00:36:290.014.02354.144711.0
6110012014-08-11 13:03:57203.003.02958.1604523.0
6210112014-08-13 10:58:330.006.01804.076613.0
6310212014-08-25 08:51:520.005.01830.004622.0
6410312014-08-26 06:25:350.005.01830.004612.0
6510412014-09-15 11:25:120.006.01847.2611538041412.0
6610512014-10-14 08:21:100.007.01847.2611538041415.0
6711032014-01-27 00:55:261878.004.666666666666671995.5654104.0
68111222014-02-15 01:15:302897.005.173913043478261987.6269173913364.0
6911212014-02-15 02:40:520.001.01995.70214.0
7011352014-02-15 04:30:132823.005.833333333333333086.0918666666764.0
7111412014-02-15 05:27:520.006.01987.604134.0
7211532014-02-28 00:23:19707.001.06128.47568333333104.0
7311652014-07-23 01:13:021444.011.83333333333333367.251373.5
7411752014-10-04 06:17:31201.002.0367.487573.0
7511812014-10-06 01:05:530.001.0254.303113.0
7611912014-10-06 02:28:28123.001.0217.39223333333334.0
77111052014-10-13 03:02:342764.011.5200.137025143.0
78111112014-10-24 13:01:210.001.0217.435614.0
79111212014-10-24 22:46:570.001.0217.435613.0
80111332014-10-25 05:00:481685.011.0215.8254133.5
8112012014-08-11 07:46:590.004.02234.264132.0
8212122014-08-11 08:24:33141.014.02234.264122.0
8313042014-07-09 12:04:531547.014.251847.2611538041481.0
8413112014-07-10 05:42:460.006.01847.2611538041471.0
8513222014-07-11 06:55:20141.006.01847.26115380414121.0
8613312014-07-11 07:40:330.006.01847.2611538041411.0
8713432014-07-14 09:36:15592.012.666666666666671847.2611538041432.0
8813512014-07-15 09:01:430.003.01847.2611538041422.0
8913612014-07-16 07:02:120.000.01847.2611538041412.0
9013742014-07-17 13:33:44216.002.51847.2611538041452.0
9113812014-07-17 20:53:380.003.01847.2611538041412.0
9213912014-07-17 22:16:180.003.01847.2611538041412.0
93131012014-07-20 15:27:200.003.01847.2611538041412.0
94131112014-07-21 07:26:260.003.01847.2611538041412.0
95131222014-07-22 06:47:2269.003.01847.2611538041421.5
96131322014-08-18 13:14:50825.004.666666666666674880.02743333333141.0
97131432014-08-18 15:05:221818.003.04881.464162.0
98131512014-08-19 10:23:12648.003.04880.41435102.0
99131612014-08-19 14:14:09855.003.04925.120122.0
100131722014-08-21 04:07:522210.002.81847.26115380414102.0
Rows: 1-100 of 12242331 | Columns: 10

We can then look at the links between the variables. We will use Spearman's rank correleation coefficient to get all the monotonic relationships.

In [10]:
%matplotlib inline
expedia.corr(method = "spearman")
Out[10]:
"user_id"
"session_id"
"mode_hotel_cluster_count"
"session_duration"
"is_booking"
"trip_duration"
"avg_distance"
"nb_click_session"
"srch_family_cnt"
"user_id"1.00.01253728397696910.000585488528099556-0.0005415978773046150.00168318615437918-0.001722943323589230.00722625140700152-0.00130288969225954-0.00145643651427301
"session_id"0.01253728397696911.0-0.0362824718948203-0.0456184717754819-0.05790978073407280.02062253488811920.0424225822484464-0.0614509659970033-0.0267334768989548
"mode_hotel_cluster_count"0.000585488528099556-0.03628247189482031.00.6931495838087710.496196271565582-0.0229387237074547-0.001187241744772810.6542376371189150.0430835831263844
"session_duration"-0.000541597877304615-0.04561847177548190.6931495838087711.00.3318666314641220.01252538554063240.01885420879826990.8349609773403270.0545771115310364
"is_booking"0.00168318615437918-0.05790978073407280.4961962715655820.3318666314641221.0-0.191911129719119-0.06600726780690690.200140132943157-0.0646466731033034
"trip_duration"-0.001722943323589230.0206225348881192-0.02293872370745470.0125253855406324-0.1919111297191191.00.2883265879591070.04940042190099550.0716986783665811
"avg_distance"0.007226251407001520.0424225822484464-0.001187241744772810.0188542087982699-0.06600726780690690.2883265879591071.00.0322018254928665-0.0205418178720803
"nb_click_session"-0.00130288969225954-0.06145096599700330.6542376371189150.8349609773403270.2001401329431570.04940042190099550.03220182549286651.00.0753040443807521
"srch_family_cnt"-0.00145643651427301-0.02673347689895480.04308358312638440.0545771115310364-0.06464667310330340.0716986783665811-0.02054181787208030.07530404438075211.0
Rows: 1-9 | Columns: 10

We can see huge links between some of the variables (mode_hotel_cluster_count and session_duration) and our response variable (is_booking). A logistic regression would work well in this case because the response and predictors have a monotonic relationship.

Machine Learning

Let's create our logistic regression model.

In [11]:
from verticapy.learn.linear_model import LogisticRegression
model_logit = LogisticRegression("expedia_logit", 
                                 max_iter = 1000, 
                                 solver = 'BFGS')
model_logit.fit(expedia, 
                ["avg_distance", 
                 "session_duration",
                 "nb_click_session",
                 "mode_hotel_cluster_count",
                 "session_id",
                 "srch_family_cnt",
                 "trip_duration"], 
                 "is_booking")
Out[11]:

=======
details
=======
       predictor        |coefficient|std_err | z_value  |p_value 
------------------------+-----------+--------+----------+--------
       Intercept        | -1.59409  | 0.00275|-579.61668| 0.00000
      avg_distance      | -0.00004  | 0.00000|-81.38289 | 0.00000
    session_duration    |  0.00056  | 0.00000|415.75575 | 0.00000
    nb_click_session    | -0.16990  | 0.00031|-543.77330| 0.00000
mode_hotel_cluster_count|  1.13834  | 0.00125|910.63579 | 0.00000
       session_id       | -0.00583  | 0.00005|-124.80774| 0.00000
    srch_family_cnt     | -0.20765  | 0.00087|-238.93988| 0.00000
     trip_duration      | -0.24810  | 0.00050|-497.93583| 0.00000


==============
regularization
==============
type| lambda 
----+--------
none| 1.00000


===========
call_string
===========
logistic_reg('public.expedia_logit', '"public"."_verticapy_tmp_view_dbadmin_150246_4175658595_"', '"is_booking"', '"avg_distance", "session_duration", "nb_click_session", "mode_hotel_cluster_count", "session_id", "srch_family_cnt", "trip_duration"'
USING PARAMETERS optimizer='bfgs', epsilon=1e-06, max_iterations=1000, regularization='none', lambda=1, alpha=0.5)

===============
Additional Info
===============
       Name       | Value  
------------------+--------
 iteration_count  |   56   
rejected_row_count|   0    
accepted_row_count|12242331

None of our coefficients are rejected (pvalue = 0). Let's look at their importance.

In [12]:
model_logit.features_importance()
Out[12]:
importance
sign
mode_hotel_cluster_count42.121
trip_duration42.09-1
nb_click_session12.1-1
session_duration2.641
srch_family_cnt0.65-1
session_id0.31-1
avg_distance0.09-1
Rows: 1-7 | Columns: 3

It looks like there are two main predictors: mode_hotel_cluster_count and trip_duration. According to our model, users likely to make a booking during a particular session will tend to:

  • look at the same hotel many times
  • look for a shorter trip duration
  • not click as much (spend more time at the same web page).

Let's add our prediction to the vDataFrame.

In [13]:
model_logit.predict_proba(expedia, 
                          name = "booking_prob_logit",
                          pos_label = 1)
Out[13]:
123
user_id
Integer
123
session_id
Integer
123
mode_hotel_cluster_count
Integer
📅
end_session_date_time
Datetime
123
session_duration
Float
123
is_booking
Integer
123
trip_duration
Float
123
avg_distance
Float
123
nb_click_session
Integer
123
srch_family_cnt
Float
123
booking_prob_logit
Float
10012014-09-03 08:18:410.001.053.605914.00.153605784760313
20112014-12-03 11:32:43272.001.0343.4474553.00.114769694559958
30212014-12-06 01:38:260.001.088.148914.00.151909565323304
41032014-04-25 16:15:5082.002.05405.551944.00.409667557696363
52032014-01-27 13:54:073487.003.777777777777781847.26115380414132.00.537789354058044
62122014-01-28 12:44:161454.003.01847.26115380414112.00.166867978942862
72212014-01-28 17:11:35333.003.01847.2611538041452.00.0858210195354118
83012014-07-31 07:10:390.003.01432.965615.00.0781213517098348
93112014-09-08 09:58:50149.001.0141.8960522.00.199954136104512
103212014-09-09 10:49:480.001.0143.299814.00.15161252923937
113322014-09-22 13:22:25203.001.0141.127322.50.417461623963391
123412014-09-30 12:56:510.001.0142.746712.00.211093932210174
133512014-10-20 11:19:210.001.0144.408414.00.149370471025827
144012014-08-14 18:11:080.002.0110.468212.00.176278424893632
154132014-08-29 17:00:33442.004.01318.805864.00.303033318892354
164222014-12-07 18:25:20414.003.33333333333333702.060334.00.215521000283274
175022014-01-27 00:36:542408.014.0960.28195212.00.0486391653944605
185112014-01-27 01:17:090.004.0960.4212.00.111120003993773
196012014-07-30 17:13:070.003.0836.968315.00.0799361826415578
206112014-08-28 12:03:06534.003.02573.687122.00.145902571229701
216212014-08-29 16:27:07121.003.02578.3622666666732.00.101902835247903
226312014-09-01 23:58:0497.003.02577.4662532.00.10014591874866
236412014-09-11 19:32:520.003.0756.903112.00.137023874419858
246512014-10-10 12:25:060.003.02759.777812.00.126766078502665
256612014-11-03 01:01:070.003.0876.151612.00.135066898027681
266712014-11-03 11:09:560.001.02755.818212.00.190747110120129
276812014-11-04 20:26:420.001.0225.671712.00.206666819276909
286922014-11-21 08:31:09908.001.02610.0832666666732.00.46519458222375
2961022014-11-24 21:01:011473.001.02609.2428864.00.320559312392026
307052014-08-27 10:03:531091.004.166666666666675378.87395122.00.731208803627096
317112014-09-10 12:20:390.001.03936.106211.00.222369617950368
327232014-09-24 11:57:492030.013.0830.259142.00.746809728657831
337312014-10-06 18:32:36224.003.0826.9718522.00.132303404351664
347412014-10-29 05:58:4460.001.0862.746832.00.160515864407895
357512014-10-31 07:14:030.003.0863.078862.00.0629717632058175
367612014-11-06 10:47:251946.003.01847.2611538041442.00.212768759915639
377722014-11-06 11:27:4793.003.01847.2611538041442.00.227549632713416
387812014-11-07 11:03:11126.003.01847.2611538041432.00.101759156668488
397912014-11-10 16:27:130.005.01847.2611538041412.00.0823229504633143
4071012014-11-12 11:40:2843.001.01847.2611538041422.00.172178789723069
4171112014-11-20 07:56:48126.007.05457.3971522.00.0403551481630567
4271212014-11-20 09:02:530.003.0866.334912.00.131078774199718
4371312014-12-01 15:07:490.003.0862.332912.00.130435118963389
4471412014-12-18 11:04:141587.002.0616.597522.00.285305806952283
4571512014-12-18 12:16:290.002.0628.253812.00.160988939727481
4671612014-12-18 13:19:090.002.0628.253812.00.160202966761629
478012014-02-08 19:20:310.003.01658.195911.00.161519234636658
488112014-03-26 10:39:480.003.0450.407312.00.14064814658706
498232014-03-26 12:19:101519.003.0453.8837562.00.615157634255203
508312014-04-06 23:13:49530.001.0453.416552.00.15370436297465
518412014-04-19 12:30:500.002.01318.273211.00.196554253629183
528512014-04-19 13:11:530.002.01318.118511.00.195636150971596
538612014-04-21 09:53:490.003.01318.118511.00.158728164011895
548712014-08-13 20:53:481364.002.01067.9173522.00.264581377504695
558832014-08-16 21:43:25532.009.3333333333333310.566566666666732.00.23752210930637
568912014-08-18 06:11:310.007.07.484312.00.0557026167192674
5781022014-09-11 21:39:10108.002.020.698322.00.361953805824228
5881122014-09-14 02:44:58331.011.018.406932.00.408901454285757
599022014-01-26 23:45:072596.005.888888888888892353.924591.00.240904076884289
609112014-01-27 00:36:290.014.02354.144711.00.126749295648676
6110012014-08-11 13:03:57203.003.02958.1604523.00.10229971561409
6210112014-08-13 10:58:330.006.01804.076613.00.0563330008245311
6310212014-08-25 08:51:520.005.01830.004622.00.0731306610860517
6410312014-08-26 06:25:350.005.01830.004612.00.0850608359883184
6510412014-09-15 11:25:120.006.01847.2611538041412.00.0672237056428636
6610512014-10-14 08:21:100.007.01847.2611538041415.00.0291131307102079
6711032014-01-27 00:55:261878.004.666666666666671995.5654104.00.291216526340216
68111222014-02-15 01:15:302897.005.173913043478261987.6269173913364.00.999999947661436
6911212014-02-15 02:40:520.001.01995.70214.00.141910547298366
7011352014-02-15 04:30:132823.005.833333333333333086.0918666666764.00.904468660118949
7111412014-02-15 05:27:520.006.01987.604134.00.0325740395685856
7211532014-02-28 00:23:19707.001.06128.47568333333104.00.30087585260552
7311652014-07-23 01:13:021444.011.83333333333333367.251373.50.923560522218305
7411752014-10-04 06:17:31201.002.0367.487573.00.863709639231675
7511812014-10-06 01:05:530.001.0254.303113.00.174511485669671
7611912014-10-06 02:28:28123.001.0217.39223333333334.00.115446768081075
77111052014-10-13 03:02:342764.011.5200.137025143.00.901854863743191
78111112014-10-24 13:01:210.001.0217.435614.00.144601568257166
79111212014-10-24 22:46:570.001.0217.435613.00.171396220731711
80111332014-10-25 05:00:481685.011.0215.8254133.50.378479686174866
8112012014-08-11 07:46:590.004.02234.264132.00.0782311917400202
8212122014-08-11 08:24:33141.014.02234.264122.00.252638417951801
8313042014-07-09 12:04:531547.014.251847.2611538041481.00.756628224060677
8413112014-07-10 05:42:460.006.01847.2611538041471.00.0315408287604047
8513222014-07-11 06:55:20141.006.01847.26115380414121.00.0447101520215156
8613312014-07-11 07:40:330.006.01847.2611538041411.00.0819112691735906
8713432014-07-14 09:36:15592.012.666666666666671847.2611538041432.00.614915133006404
8813512014-07-15 09:01:430.003.01847.2611538041422.00.112886909051649
8913612014-07-16 07:02:120.000.01847.2611538041412.00.239900791653794
9013742014-07-17 13:33:44216.002.51847.2611538041452.00.74613064994227
9113812014-07-17 20:53:380.003.01847.2611538041412.00.129073806922214
9213912014-07-17 22:16:180.003.01847.2611538041412.00.128419799454148
93131012014-07-20 15:27:200.003.01847.2611538041412.00.12776861964389
94131112014-07-21 07:26:260.003.01847.2611538041412.00.127120260189066
95131222014-07-22 06:47:2269.003.01847.2611538041421.50.3054941900462
96131322014-08-18 13:14:50825.004.666666666666674880.02743333333141.00.0533732862884125
97131432014-08-18 15:05:221818.003.04881.464162.00.594519517619808
98131512014-08-19 10:23:12648.003.04880.41435102.00.0376841554129484
99131612014-08-19 14:14:09855.003.04925.120122.00.145334180009601
100131722014-08-21 04:07:522210.002.81847.26115380414102.00.25829708322599
Rows: 1-100 of 12242331 | Columns: 11

While analyzing the following boxplot (prediction partitioned by is_booking), we can notice that the cutoff is around 0.22 because most of the positive predictions have a probability between 0.23 and 0.5. Most of the negative predictions are between 0.05 and 0.2.

In [14]:
expedia["booking_prob_logit"].boxplot(by = "is_booking")
Out[14]:
<AxesSubplot:xlabel='"is_booking"', ylabel='"booking_prob_logit"'>

Let's confirm our hypothesis by computing the best cutoff.

In [15]:
model_logit.score(method = "best_cutoff")
Out[15]:
0.2204

Let's look at the efficiency of our model with a cutoff of 0.22.

In [16]:
model_logit.roc_curve()
model_logit.report(cutoff = 0.22)
Out[16]:
value
auc0.840145448458096
prc_auc0.5465504635612507
accuracy0.8021478916065903
log_loss0.182579215050228
precision0.5299008517143481
recall0.7976312038512661
f1_score0.6367687356440429
mcc0.5283725198559831
informedness0.601033970517129
markedness0.4644954086351514
csi0.46710250292335903
cutoff0.22
Rows: 1-12 | Columns: 2

We're left with an excellent model. With this, we can predict whether a user will book a hotel during a specific session and make adjustments to our site accordingly. For example, to influence a user to make a booking, we could propose new hotels.

Conclusion

We've solved our problem in a Pandas-like way, all without ever loading data into memory!


VerticaPy

Python API for Vertica Data Science at Scale

About the Author


Badr Ouali
Head of Data Science


Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can embrace data projects end to end through a clear understanding of the “big picture” as well as attention to details, resulting in achieving great business outcomes – a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues & peers and has a sweet spot for Python. He loves helping customers finding the best value from their data and empower them to solve their use-cases.