Booking¶
This example uses the 'Expedia' dataset to predict, based on site activity, whether a user is likely to make a booking. You can download the Jupyter Notebook of the study here and the the dataset here.
- cnt: Number of similar events in the context of the same user session.
- user_location_city: The ID of the city in which the customer is located.
- is_package: 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise.
- user_id: ID of the user
- srch_children_cnt: The number of (extra occupancy) children specified in the hotel room.
- channel: marketing ID of a marketing channel.
- hotel_cluster: ID of a hotel cluster.
- srch_destination_id: ID of the destination where the hotel search was performed.
- is_mobile: 1 if the user is on a mobile device, 0 otherwise.
- srch_adults_cnt: The number of adults specified in the hotel room.
- user_location_country: The ID of the country in which the customer is located.
- srch_destination_type_id: ID of the destination where the hotel search was performed.
- srch_rm_cnt: The number of hotel rooms specified in the search.
- posa_continent: ID of the continent associated with the site_name.
- srch_ci: Check-in date.
- user_location_region: The ID of the region in which the customer is located.
- hotel_country: Hotel's country.
- srch_co: Check-out date.
- is_booking: 1 if a booking, 0 if a click.
- orig_destination_distance: Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated.
- hotel_continent: Hotel continent.
- site_name: ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...).
We will follow the data science cycle (Data Exploration - Data Preparation - Data Modeling - Model Evaluation - Model Deployment) to solve this problem.
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a Virtual DataFrame of the dataset.
expedia = vp.read_csv('data/expedia.csv', parse_nrows=1000)
expedia.head(5)
Data Exploration and Preparation¶
Sessionization is the process of gathering clicks for a certain period of time. We usually consider that after 30 minutes of inactivity, the user session ends (date_time - lag(date_time) > 30 minutes). For these kinds of use cases, aggregating sessions with meaningful statistics is the key for making accurate predictions.
We start by using the sessionize' method to create the variable 'session_id. We can then use this variable to aggregate the data.
expedia.sessionize(ts = "date_time",
by = ["user_id"],
session_threshold = "30 minutes",
name = "session_id")
The duration of the trip should also influence/be indicative of the user's behavior on the site, so we'll take that into account.
expedia["trip_duration"] = expedia["srch_co"] - expedia["srch_ci"]
If a user looks at the same hotel several times, then it might mean that they're looking to book that hotel during the session.
expedia.analytic('mode',
columns = "hotel_cluster",
by = ["user_id",
"session_id"],
name = "mode_hotel_cluster",
add_count = True)
We can now aggregate the session and get some useful statistics out of it:
- end_session_date_time : Date and time when the session ends.
- session_duration : Session duration.
- is_booking : 1 if the user booked during the session, 0 otherwise.
- trip_duration : Trip duration.
- orig_destination_distance : Average of the physical distances between the hotels and the customer.
- srch_family_cnt : The number of people specified in the hotel room.
import verticapy.stats as st
expedia = expedia.groupby(
columns = ["user_id",
"session_id",
"mode_hotel_cluster_count"],
expr = [st.max(expedia["date_time"])._as("end_session_date_time"),
((st.max(expedia["date_time"]) - st.min(expedia["date_time"])) / st.interval("1 second"))._as(
"session_duration"),
st.max(expedia["is_booking"])._as("is_booking"),
st.avg(expedia["trip_duration"])._as("trip_duration"),
st.avg(expedia["orig_destination_distance"])._as("avg_distance"),
st.sum(expedia["cnt"])._as("nb_click_session"),
st.median(expedia["srch_children_cnt"] + expedia["srch_adults_cnt"])._as("srch_family_cnt")])
Let's look at the missing values.
expedia.count_percent()
Let's impute the missing values for avg_distance and trip_duration.
expedia["avg_distance" ].fillna(method = "avg")
expedia["trip_duration"].fillna(method = "avg")
We can then look at the links between the variables. We will use Spearman's rank correleation coefficient to get all the monotonic relationships.
%matplotlib inline
expedia.corr(method = "spearman")
We can see huge links between some of the variables (mode_hotel_cluster_count and session_duration) and our response variable (is_booking). A logistic regression would work well in this case because the response and predictors have a monotonic relationship.
Machine Learning¶
Let's create our logistic regression model.
from verticapy.learn.linear_model import LogisticRegression
model_logit = LogisticRegression("expedia_logit",
max_iter = 1000,
solver = 'BFGS')
model_logit.fit(expedia,
["avg_distance",
"session_duration",
"nb_click_session",
"mode_hotel_cluster_count",
"session_id",
"srch_family_cnt",
"trip_duration"],
"is_booking")
None of our coefficients are rejected (pvalue = 0). Let's look at their importance.
model_logit.features_importance()
It looks like there are two main predictors: mode_hotel_cluster_count and trip_duration. According to our model, users likely to make a booking during a particular session will tend to:
- look at the same hotel many times
- look for a shorter trip duration
- not click as much (spend more time at the same web page).
Let's add our prediction to the vDataFrame.
model_logit.predict_proba(expedia,
name = "booking_prob_logit",
pos_label = 1)
While analyzing the following boxplot (prediction partitioned by is_booking), we can notice that the cutoff is around 0.22 because most of the positive predictions have a probability between 0.23 and 0.5. Most of the negative predictions are between 0.05 and 0.2.
expedia["booking_prob_logit"].boxplot(by = "is_booking")
Let's confirm our hypothesis by computing the best cutoff.
model_logit.score(method = "best_cutoff")
Let's look at the efficiency of our model with a cutoff of 0.22.
model_logit.roc_curve()
model_logit.report(cutoff = 0.22)
We're left with an excellent model. With this, we can predict whether a user will book a hotel during a specific session and make adjustments to our site accordingly. For example, to influence a user to make a booking, we could propose new hotels.
Conclusion¶
We've solved our problem in a Pandas-like way, all without ever loading data into memory!
VerticaPy
About the Author
Badr Ouali
Head of Data Science
Badr Ouali works as a Lead Data Scientist for Vertica worldwide. He can embrace data projects end to end through a clear understanding of the “big picture” as well as attention to details, resulting in achieving great business outcomes – a distinctive differentiator in his role. Badr enjoys sharing knowledge and insights related to data analytics with colleagues & peers and has a sweet spot for Python. He loves helping customers finding the best value from their data and empower them to solve their use-cases.
