Booking

This example uses the 'Expedia' dataset to predict, based on site activity, whether a user is likely to make a booking. You can download the Jupyter Notebook of the study here and the the dataset here.

  • cnt: Number of similar events in the context of the same user session.
  • user_location_city: The ID of the city in which the customer is located.
  • is_package: 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise.
  • user_id: ID of the user
  • srch_children_cnt: The number of (extra occupancy) children specified in the hotel room.
  • channel: marketing ID of a marketing channel.
  • hotel_cluster: ID of a hotel cluster.
  • srch_destination_id: ID of the destination where the hotel search was performed.
  • is_mobile: 1 if the user is on a mobile device, 0 otherwise.
  • srch_adults_cnt: The number of adults specified in the hotel room.
  • user_location_country: The ID of the country in which the customer is located.
  • srch_destination_type_id: ID of the destination where the hotel search was performed.
  • srch_rm_cnt: The number of hotel rooms specified in the search.
  • posa_continent: ID of the continent associated with the site_name.
  • srch_ci: Check-in date.
  • user_location_region: The ID of the region in which the customer is located.
  • hotel_country: Hotel's country.
  • srch_co: Check-out date.
  • is_booking: 1 if a booking, 0 if a click.
  • orig_destination_distance: Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated.
  • hotel_continent: Hotel continent.
  • site_name: ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...).

We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a Virtual DataFrame of the dataset.

In [3]:
expedia = vp.read_csv('data/expedia.csv', parse_nrows=1000)
expedia.head(5)
Out[3]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
12013-01-07 00:00:022423505703[null]4618990092013-03-142013-03-1521166930125021241
22013-01-07 00:00:062366174211775713.6206137960092013-01-192013-01-2610188211036173058
32013-01-07 00:00:0611320515514703795.729811285750092013-01-192013-01-2210125064601250123091
42013-01-07 00:00:093716976141949[null]10804760192013-05-292013-06-05201763530125067510
52013-01-07 00:00:173716976141949[null]10188950002013-09-082013-09-102012721560225064559
Rows: 1-5 | Columns: 24

Data Exploration and Preparation

Sessionization is the process of gathering clicks for a certain period of time. We usually consider that after 30 minutes of inactivity, the user session ends (date_time - lag(date_time) > 30 minutes). For these kinds of use cases, aggregating sessions with meaningful statistics is the key for making accurate predictions.

We start by using the sessionize' method to create the variable 'session_id. We can then use this variable to aggregate the data.

In [4]:
expedia.sessionize(ts = "date_time", 
                   by = ["user_id"], 
                   session_threshold = "30 minutes", 
                   name = "session_id")
Out[4]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
123
session_id
Integer
12014-09-03 08:18:412366174904553.605901022014-09-062014-09-07221246936012501241190
22014-12-03 11:28:11236617442538343.786601022014-12-292014-12-3021182791012501230981
32014-12-03 11:28:58236617442538343.990101022014-12-292014-12-3021182791012501230401
42014-12-03 11:30:21236617442538343.796801022014-12-292014-12-3021182791012501230561
52014-12-03 11:32:43236617442538342.216301022014-12-292014-12-303218279102250358721
62014-12-06 01:38:262366174904588.148901022014-12-192014-12-202218279101250358282
72014-04-25 16:14:282366174374495405.488911022014-11-072014-11-09311628814014982052200
82014-04-25 16:14:462366174374495405.488911022014-11-072014-11-09512628814014982052200
92014-04-25 16:15:112366174374495405.488911022014-11-072014-11-09312628814014982052200
102014-04-25 16:15:502366174374495405.740911022014-11-072014-11-09312628814014982052600
112014-01-27 12:56:00236631125671[null]20192014-04-082014-04-1220111621102250703960
122014-01-27 13:00:31236631125671[null]20192014-04-082014-04-1220111621101250703720
132014-01-27 13:04:55236631125671[null]20192014-04-082014-04-1220111621101250703910
142014-01-27 13:06:31236631125671[null]20192014-04-082014-04-1220111621101250703910
152014-01-27 13:14:21236631125671[null]20092014-04-082014-04-1220111621102250703910
162014-01-27 13:18:43236631125671[null]20192014-04-082014-04-1220111621101250703940
172014-01-27 13:22:39236631125671[null]20192014-04-082014-04-1220111621101250703210
182014-01-27 13:39:30236631125671[null]20192014-04-082014-04-1120111621102250703960
192014-01-27 13:54:07236631125671[null]20192014-04-082014-04-1120111621102250703940
202014-01-28 12:20:02236631125671[null]20192014-04-082014-04-1120112008103250686951
212014-01-28 12:27:47236631125671[null]20192014-04-082014-04-1120112008101250686411
222014-01-28 12:28:31236631125671[null]20192014-04-082014-04-1120112008101250686731
232014-01-28 12:31:18236631125671[null]20192014-04-082014-04-1120112008101250686681
242014-01-28 12:32:35236631125671[null]20192014-04-082014-04-1120112008101250686911
252014-01-28 12:34:21236631125671[null]20192014-04-082014-04-1120112008101250686911
262014-01-28 12:37:49236631125671[null]20192014-04-082014-04-1120112008102250686181
272014-01-28 12:44:16236631125671[null]20092014-04-082014-04-1120112008101250686181
282014-01-28 17:06:02236631125671[null]20122014-04-082014-04-112018250104250628182
292014-01-28 17:11:35236631125671[null]20122014-04-082014-04-11201825010125062812
302014-07-31 07:10:392366321516421432.965630092015-02-072015-02-103222858660125070550
312014-09-08 09:56:2123664675222142.746730092014-10-112014-10-122018261101250646181
322014-09-08 09:58:5023664675222141.045430092014-10-112014-10-12201826110125064621
332014-09-09 10:49:4823664675222143.299830092015-02-012015-02-0222127699601250646482
342014-09-22 13:19:0223664675222141.127330032015-02-012015-02-0220127699601250646943
352014-09-22 13:22:2523664675222141.127330032015-02-012015-02-023018261101250646943
362014-09-30 12:56:5123664675222142.746730092014-10-112014-10-122018261101250646184
372014-10-20 11:19:2123664675222144.408430092015-02-012015-02-0222127704601250646475
382014-08-14 18:11:08236625845545110.468240092014-10-182014-10-202018283101250414210
392014-08-29 16:53:112366258455451318.415240072015-01-302015-02-036018250101250628711
402014-08-29 16:54:162366258455451318.817740072015-01-302015-02-036018250101250628711
412014-08-29 16:55:212366258455451319.002240072015-01-302015-02-036018250101250628381
422014-08-29 16:58:412366258455451318.138440072015-01-302015-02-032018250101250628341
432014-08-29 16:59:412366258455451321.061440072015-01-302015-02-032018250101250628711
442014-08-29 17:00:332366258455451317.399940072015-01-302015-02-032018250101250628181
452014-12-07 18:18:2623662584687978.186840012015-03-182015-03-2222111624601250666732
462014-12-07 18:19:1023662584687978.369140012015-03-182015-03-2222111624601250666732
472014-12-07 18:25:2023662584687149.62540012015-01-022015-01-0422144751012501343152
482014-01-26 23:56:46236617424103960.42501102014-03-212014-03-25111828110125066390
492014-01-27 00:00:31236617424103960.42501102014-03-222014-03-26111828110325066390
502014-01-27 00:03:44236617424103960.1149501102014-03-222014-03-262118281101250663680
512014-01-27 00:05:12236617424103960.5072501102014-03-222014-03-262118281101250663180
522014-01-27 00:06:18236617424103960.7861501102014-03-222014-03-261118281101250663420
532014-01-27 00:08:48236617424103960.0054501102014-03-222014-03-261118281101250663700
542014-01-27 00:10:17236617424103960.001501102014-03-222014-03-2611182811012250663410
552014-01-27 00:36:54236617424103960.001501102014-03-222014-03-261118281111250663410
562014-01-27 01:17:09236617424103960.4250122014-03-222014-03-26111828110125066391
572014-07-30 17:13:07236635651532836.968361092014-12-032014-12-0623112175601250366280
582014-08-28 11:54:122366246497952574.207561022014-12-032014-12-0620112175601250366131
592014-08-28 12:03:062366246497952573.166761022014-12-032014-12-0620112175601250366281
602014-08-29 16:25:062366246506612579.843361022014-12-032014-12-0620112175601250366282
612014-08-29 16:26:142366246506612577.550961022014-12-032014-12-0620112175601250366912
622014-08-29 16:27:072366246506612577.692661022014-12-032014-12-0620112175601250366132
632014-09-01 23:56:272366246506612577.239961022014-12-032014-12-0620112175602250366703
642014-09-01 23:58:042366246506612577.692661022014-12-032014-12-0620112175601250366133
652014-09-11 19:32:52236635622202756.903161022014-12-082014-12-11201825010125062814
662014-10-10 12:25:062366246506612759.777860092014-12-082014-12-11201825010125062815
672014-11-03 01:01:07236646249272876.151661022014-12-082014-12-112018250101250628546
682014-11-03 11:09:562366246506612755.818260002014-12-082014-12-092018250101250628547
692014-11-04 20:26:42236617424103225.671761092014-12-082014-12-092018250101250628548
702014-11-21 08:16:012366246506612609.687460002014-12-112014-12-1220112264601250368519
712014-11-21 08:18:122366246506612610.281660002014-12-112014-12-1220112264601250368479
722014-11-21 08:31:092366246506612610.280860002014-12-112014-12-1220112264601250368519
732014-11-24 20:36:282366246506612609.904360002014-12-112014-12-12221122646022503681710
742014-11-24 20:36:382366246506612610.40160002014-12-112014-12-12221122646012503681610
752014-11-24 20:37:162366246506612610.418160002014-12-112014-12-1222112264601250368410
762014-11-24 21:00:402366246506612607.752760002014-12-112014-12-12221525123012503681610
772014-11-24 21:01:012366246506612607.738360002014-12-112014-12-12221525123012503683910
782014-08-27 09:45:422366318316465378.691870022014-10-312014-11-0120179210562081480640
792014-08-27 09:51:322366318316465378.691870022015-01-292015-02-0820179210262081480640
802014-08-27 09:58:022366318316465378.691870022015-01-302015-02-0520179210262081480640
812014-08-27 09:58:552366318316465375.904570022014-10-312014-11-0120179210162081480860
822014-08-27 10:01:092366318316465380.631970022014-10-312014-11-0120179210162081480640
832014-08-27 10:03:532366318316465380.631970122015-01-302015-02-0520179210162081480640
842014-09-10 12:20:392366318316463936.106270022014-10-122014-10-13101226166016204145251
852014-09-24 11:23:592366318878830.259170092014-12-042014-12-072011953510225059352
862014-09-24 11:36:082366318878830.259170092014-12-042014-12-072011953511125059352
872014-09-24 11:57:492366318878830.259170192014-12-042014-12-072011953510125059352
882014-10-06 18:28:52236631833705826.944370002014-12-042014-12-0720119535101250593913
892014-10-06 18:32:36236631833705826.999470002014-12-042014-12-072011953510125059353
902014-10-29 05:57:4423663188031862.759970022014-11-112014-11-1220119535102250593324
912014-10-29 05:58:4423663188031862.733770022014-11-112014-11-122011953510125059374
922014-10-31 07:14:0323663188031863.078870142014-12-042014-12-0720119535106250593905
932014-11-06 10:14:59236631841478[null]70122014-12-042014-12-0720119535101250593326
942014-11-06 10:44:50236631841478[null]70122014-12-042014-12-0720119535101250593286
952014-11-06 10:45:59236631841478[null]70122014-12-042014-12-0720119535101250593426
962014-11-06 10:47:25236631841478[null]70122014-12-042014-12-0720119535101250593766
972014-11-06 11:26:14236631841478[null]70022014-12-042014-12-0720119535101250593337
982014-11-06 11:27:14236631841478[null]70122014-12-042014-12-0720119535101250593337
992014-11-06 11:27:47236631841478[null]70122014-12-042014-12-0720119535102250593327
1002014-11-07 11:01:05236631841478[null]70192014-12-042014-12-0720119535102250593738
Rows: 1-100 | Columns: 25

The duration of the trip should also influence/be indicative of the user's behavior on the site, so we'll take that into account.

In [5]:
expedia["trip_duration"] = expedia["srch_co"] - expedia["srch_ci"]

If a user looks at the same hotel several times, then it might mean that they're looking to book that hotel during the session.

In [6]:
expedia.analytic('mode', 
                 columns = "hotel_cluster", 
                 by = ["user_id",
                       "session_id"], 
                 name = "mode_hotel_cluster",
                 add_count = True)
Out[6]:
📅
date_time
Timestamp
123
site_name
Int
123
posa_continent
Int
123
user_location_country
Int
123
user_location_region
Int
123
user_location_city
Int
123
orig_destination_distance
Numeric(11,5)
123
user_id
Int
123
is_mobile
Int
123
is_package
Int
123
channel
Int
📅
srch_ci
Date
📅
srch_co
Date
123
srch_adults_cnt
Int
123
srch_children_cnt
Int
123
srch_rm_cnt
Int
123
srch_destination_id
Int
123
srch_destination_type_id
Int
123
is_booking
Int
123
cnt
Int
123
hotel_continent
Int
123
hotel_country
Int
123
hotel_market
Int
123
hotel_cluster
Int
123
session_id
Integer
123
trip_duration
Integer
123
mode_hotel_cluster
Integer
123
mode_hotel_cluster_count
Integer
12014-09-03 08:18:412366174904553.605901022014-09-062014-09-072212469360125012411901191
22014-12-03 11:32:43236617442538342.216301022014-12-292014-12-3032182791022503587211721
32014-12-03 11:30:21236617442538343.796801022014-12-292014-12-30211827910125012305611721
42014-12-03 11:28:58236617442538343.990101022014-12-292014-12-30211827910125012304011721
52014-12-03 11:28:11236617442538343.786601022014-12-292014-12-30211827910125012309811721
62014-12-06 01:38:262366174904588.148901022014-12-192014-12-2022182791012503582821281
72014-04-25 16:14:462366174374495405.488911022014-11-072014-11-095126288140149820522002203
82014-04-25 16:15:112366174374495405.488911022014-11-072014-11-093126288140149820522002203
92014-04-25 16:15:502366174374495405.740911022014-11-072014-11-093126288140149820526002203
102014-04-25 16:14:282366174374495405.488911022014-11-072014-11-093116288140149820522002203
112014-01-27 13:06:31236631125671[null]20192014-04-082014-04-12201116211012507039104913
122014-01-27 12:56:00236631125671[null]20192014-04-082014-04-12201116211022507039604913
132014-01-27 13:18:43236631125671[null]20192014-04-082014-04-12201116211012507039404913
142014-01-27 13:14:21236631125671[null]20092014-04-082014-04-12201116211022507039104913
152014-01-27 13:39:30236631125671[null]20192014-04-082014-04-11201116211022507039603913
162014-01-27 13:54:07236631125671[null]20192014-04-082014-04-11201116211022507039403913
172014-01-27 13:04:55236631125671[null]20192014-04-082014-04-12201116211012507039104913
182014-01-27 13:00:31236631125671[null]20192014-04-082014-04-12201116211012507037204913
192014-01-27 13:22:39236631125671[null]20192014-04-082014-04-12201116211012507032104913
202014-01-28 12:37:49236631125671[null]20192014-04-082014-04-11201120081022506861813182
212014-01-28 12:34:21236631125671[null]20192014-04-082014-04-11201120081012506869113182
222014-01-28 12:32:35236631125671[null]20192014-04-08