Booking
This example uses the 'Expedia' dataset to predict, based on site activity, whether a user is likely to make a booking. You can download the Jupyter Notebook of the study here and the dataset here.
The dataset includes the following columns:
- cnt: Number of similar events in the context of the same user session.
- user_location_city: The ID of the city in which the customer is located.
- is_package: 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise.
- user_id: ID of the user.
- srch_children_cnt: The number of (extra occupancy) children specified in the hotel room.
- channel: ID of a marketing channel.
- hotel_cluster: ID of a hotel cluster.
- srch_destination_id: ID of the destination where the hotel search was performed.
- is_mobile: 1 if the user is on a mobile device, 0 otherwise.
- srch_adults_cnt: The number of adults specified in the hotel room.
- user_location_country: The ID of the country in which the customer is located.
- srch_destination_type_id: ID of the type of destination where the hotel search was performed.
- srch_rm_cnt: The number of hotel rooms specified in the search.
- posa_continent: ID of the continent associated with the site_name.
- srch_ci: Check-in date.
- user_location_region: The ID of the region in which the customer is located.
- hotel_country: Hotel's country.
- srch_co: Check-out date.
- is_booking: 1 if a booking, 0 if a click.
- orig_destination_distance: Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated.
- hotel_continent: Hotel continent.
- site_name: ID of the Expedia point of sale (e.g. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...).
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN". For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset.
expedia = vp.read_csv('data/expedia.csv', parse_nrows=1000)
expedia.head(5)
Data Exploration and Preparation
Sessionization is the process of grouping a user's clicks into sessions of continuous activity. We usually consider that a user session ends after 30 minutes of inactivity (date_time - lag(date_time) > 30 minutes). For these kinds of use cases, aggregating sessions with meaningful statistics is key to making accurate predictions.
We start by using the 'sessionize' method to create the variable 'session_id'. We can then use this variable to aggregate the data.
expedia.sessionize(ts = "date_time",
                   by = ["user_id"],
                   session_threshold = "30 minutes",
                   name = "session_id")
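To make the 30-minute rule concrete, here is a minimal pure-Python sketch of the same idea. It is not how VerticaPy implements 'sessionize'. The function name `assign_sessions`, the tuple layout of the events, and the choice to number sessions from 0 are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def assign_sessions(events, threshold=timedelta(minutes=30)):
    """Return (user_id, timestamp, session_no) tuples for (user_id, timestamp) events.

    Hypothetical helper: a user's session number increases whenever the gap
    since that user's previous event is strictly greater than `threshold`.
    """
    last_seen = {}   # user_id -> timestamp of that user's previous event
    session_no = {}  # user_id -> current session number (starts at 0 by assumption)
    result = []
    # Sorting by (user_id, timestamp) puts each user's events in time order.
    for user_id, ts in sorted(events):
        previous = last_seen.get(user_id)
        if previous is None:
            session_no[user_id] = 0
        elif ts - previous > threshold:
            session_no[user_id] += 1
        last_seen[user_id] = ts
        result.append((user_id, ts, session_no[user_id]))
    return result


# Made-up events for illustration only.
events = [
    (1, datetime(2014, 8, 11, 10, 0)),
    (1, datetime(2014, 8, 11, 10, 10)),
    (1, datetime(2014, 8, 11, 10, 50)),  # 40 minutes after the previous click
    (2, datetime(2014, 8, 11, 10, 5)),
]
for row in assign_sessions(events):
    print(row)
```

In this sketch, the third click of user 1 arrives 40 minutes after the second one, so it opens a new session, while user 2 has an independent session counter.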
The duration of the trip is also likely to be indicative of the user's behavior on the site, so we'll take that into account.
expedia["trip_duration"] = expedia["srch_co"] - expedia["srch_ci"]
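As a rough illustration of the subtraction above, the following plain-Python sketch computes the length of a stay for a single pair of dates. The helper name `trip_duration_days` and the sample dates are hypothetical, and the unit (whole days) is an assumption of this sketch rather than a statement about the type of the resulting database column.

```python
from datetime import date


def trip_duration_days(check_in: date, check_out: date) -> int:
    """Hypothetical helper: whole days between check-in and check-out."""
    return (check_out - check_in).days


# Made-up dates for illustration only.
print(trip_duration_days(date(2014, 8, 27), date(2014, 8, 31)))
```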
If a user looks at the same hotel several times, then it might mean that they're looking to book that hotel during the session.
expedia.analytic('mode',
                 columns = "hotel_cluster",
                 by = ["user_id",
                       "session_id"],
                 name = "mode_hotel_cluster",
                 add_count = True)
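The aggregation above can be pictured with a small pure-Python sketch: for each (user_id, session_id) pair, find the most frequently viewed hotel cluster and how many times it was viewed. The function name `mode_per_session` and the row layout are hypothetical, and the tie-breaking rule (the first value encountered wins) is a property of `collections.Counter`, not necessarily of the database computation.

```python
from collections import Counter, defaultdict


def mode_per_session(rows):
    """Map (user_id, session_id) to (most frequent hotel_cluster, its count).

    Hypothetical helper: rows are (user_id, session_id, hotel_cluster) tuples.
    """
    counts = defaultdict(Counter)
    for user_id, session_id, hotel_cluster in rows:
        counts[(user_id, session_id)][hotel_cluster] += 1
    # most_common(1) returns a one-element list holding a (value, count) pair.
    return {key: counter.most_common(1)[0] for key, counter in counts.items()}


# Made-up rows for illustration only.
rows = [
    (1, 0, 5),
    (1, 0, 5),
    (1, 0, 7),
    (1, 1, 3),
]
print(mode_per_session(rows))
```

A high count relative to the number of clicks in the session suggests the user keeps returning to the same hotel cluster, which is the signal the text describes.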