Health Insurance Costs¶
In this example, we use a dataset of personal medical costs to create a model to estimate treatment costs. You can download the Jupyter notebook here.
The columns provided include:
- age: age of the primary beneficiary
- sex: insurance contractor's gender
- bmi: body mass index
- children: number of dependent children covered by health insurance
- smoker: smoker or non-smoker
- region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest.
- charges: individual medical costs billed by health insurance
Initialization¶
This example uses the following version of VerticaPy:
import verticapy as vp
vp.__version__
Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, see the connection tutorial.
vp.connect("VerticaDSN")
Let's create a new schema and assign the data to a vDataFrame object.
vp.drop("insurance", method="schema")
vp.create_schema("insurance")
data = vp.read_csv('data/insurance.csv', schema = 'insurance')
display(data)
Let's take a look at the first few entries in the dataset.
# returns the first five rows
data.head(5)
Data exploration¶
Let's check our dataset for missing values. If we find any, we'll have to impute them before we create any models.
# count the number of non-null entries per column
data.count_percent()
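To make the check concrete, here is a minimal plain-Python sketch of the kind of summary a count-and-percent check produces: the number of non-null entries per column and the share of rows they represent. The sample rows and the helper name `count_and_percent` are hypothetical, and this is not how VerticaPy computes the result (it pushes the work down to the database).

```python
# Hypothetical sample rows; None stands in for a SQL NULL.
rows = [
    {"age": 19, "bmi": 27.9, "smoker": True},
    {"age": 18, "bmi": None, "smoker": False},
    {"age": 28, "bmi": 33.0, "smoker": False},
    {"age": 33, "bmi": 22.7, "smoker": None},
]


def count_and_percent(rows):
    """Return {column: (non_null_count, percent_non_null)} for a list of dict rows."""
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        non_null = sum(1 for row in rows if row.get(col) is not None)
        summary[col] = (non_null, 100.0 * non_null / total)
    return summary


summary = count_and_percent(rows)
for col, (count, percent) in summary.items():
    print(f"{col}: count={count}, percent={percent:.1f}")
```

A column with no missing values shows a percentage of 100; anything lower points to entries that would need imputing.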
There aren't any missing values, so let's get a summary of the features.
# returns summary data of each feature
data.describe(method='all')
The dataset covers 1338 individuals up to age 64 from four different regions, each with up to six dependent children.
We might find some interesting patterns if we check age distribution, so let's create a histogram.
# histogram of age
data["age"].hist(method = "count", color = "#0073E7", h = 1)
We have a pretty obvious trend here: 18- and 19-year-olds are significantly more frequent than any older age. Each of the other ages has between 20 and 30 people.
Before we do anything else, let's discretize the age column using equal-width binning with a width of 5. Our goal is to see if there are any obvious patterns among the different age groups.
# discretize the age using a bin of 5
data["age"].discretize(method = "same_width", h = 5)
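As an illustration of equal-width binning, the sketch below assigns each value to a bin of width 5 by rounding down to the nearest multiple of the width. The helper `same_width_bin`, the sample ages, and the half-open `[lower, upper)` convention are assumptions made for this example; VerticaPy may label the intervals and treat the boundaries differently.

```python
import math


def same_width_bin(value, h):
    """Return the (lower, upper) bounds of the width-h bin that contains value."""
    lower = math.floor(value / h) * h
    return (lower, lower + h)


ages = [18, 19, 23, 30, 64]  # hypothetical sample ages
bins = [same_width_bin(age, 5) for age in ages]

for age, (lower, upper) in zip(ages, bins):
    print(f"{age} -> [{lower}, {upper})")
```

With a width of 5, ages 18 and 19 land in the same bin, which is why the discretized column is easier to compare across groups than the raw ages.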
Age probably influences one's body mass index (BMI), so let's compare the average BMI of each age group and look for patterns there. We'll use a bar graph this time.
# average of BMI for each age group
data.hchart(x = "age",
            y = "AVG(bmi)",
            aggregate = True,
            kind = "bar")
There's a pretty clear trend here, and we can say that, in general, older individuals tend to have greater BMIs.
Let's check the proportion of smokers in each age group. Before we do, we'll convert the 'smoker' values (loaded as booleans from 'yes' and 'no') to more convenient integer values: 1 for smokers and 0 for non-smokers.
# Importing the stats module
import verticapy.stats as st
# Applying the decode function
data["smoker_int"] = st.decode(data["smoker"], True, 1, 0)
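The call above follows the pattern of a SQL-style DECODE: compare a value against a search value, return the paired result on a match, and fall back to a default otherwise. The plain-Python function below is a rough sketch of that logic on hypothetical values, not VerticaPy's implementation, and it ignores SQL's special handling of NULL.

```python
def decode(value, *args):
    """Mimic a SQL-style DECODE.

    args holds (search, result) pairs, optionally followed by a default
    that is returned when no search value matches.
    """
    pairs, default = args, None
    if len(args) % 2 == 1:
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if value == search:
            return result
    return default


smoker = [True, False, False, True]  # hypothetical column values
smoker_int = [decode(s, True, 1, 0) for s in smoker]
print(smoker_int)
```

Read this way, `decode(x, True, 1, 0)` means "1 if x is True, otherwise 0".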
Now we can plot the average number of smokers for each age group.
# average of number of smokers per age group
data.hchart(x = "age",
            y = "AVG(smoker_int)",
            aggregate = True,
            kind = "bar")
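Because the encoded column only contains 0s and 1s, its average within a group is the proportion of smokers in that group. The sketch below shows the same grouped average in plain Python; the rows, the group labels, and the helper `average_by_group` are hypothetical and only illustrate the aggregation that the chart displays.

```python
from collections import defaultdict


def average_by_group(rows, group_key, value_key):
    """Return {group: mean of value_key} for a list of dict rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical rows: an age-group label and a 0/1 smoker flag.
rows = [
    {"age": "15-20", "smoker_int": 1},
    {"age": "15-20", "smoker_int": 0},
    {"age": "15-20", "smoker_int": 0},
    {"age": "15-20", "smoker_int": 0},
    {"age": "20-25", "smoker_int": 1},
    {"age": "20-25", "smoker_int": 0},
]

averages = average_by_group(rows, "age", "smoker_int")
for group, avg in averages.items():
    print(f"{group}: {avg:.2f}")
```

An average of 0.25 for a group therefore reads as "25% of this group smokes".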
Unfortunately, there's no obvious relationship between age and smoking habits, at least none that we can find from this graph.
Let's see if we can relate an individual's smoking habits with their sex.
# average of number of smokers per sex
data.hchart(x = "sex",
            y = "AVG(smoker_int)",
            aggregate = True,
            kind = "bar")
Now we're getting somewhere! Looks like we have noticeably more male smokers than female ones.
Let's see how an individual's BMI relates to their sex.
# average bmi per sex
data.hchart(x = "sex",
            y = "AVG(bmi)",
            aggregate = True,
            kind = "bar")
Males seem to have a slightly higher BMI, but it'd be hard to draw any conclusions from such a small difference.
Going back to our earlier patterns, let's check the distribution of sexes among age groups and see if the patterns we identified earlier skew toward one of the sexes.
# pivot table with number of each sex per age group
data.pivot_table(['age','sex'])
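Conceptually, a pivot table over two categorical columns counts how many rows fall into each combination of values. The plain-Python sketch below builds such a table of counts with `collections.Counter`; the sample pairs and group labels are hypothetical, and the layout is only an approximation of what the plotted pivot table conveys.

```python
from collections import Counter

# Hypothetical (age group, sex) pairs.
pairs = [
    ("15-20", "female"),
    ("15-20", "male"),
    ("15-20", "female"),
    ("20-25", "male"),
    ("20-25", "female"),
    ("20-25", "male"),
]

counts = Counter(pairs)
age_groups = sorted({age for age, _ in pairs})
sexes = sorted({sex for _, sex in pairs})

# Print one row per age group and one column per sex.
print("age".ljust(8) + "".join(sex.ljust(8) for sex in sexes))
for age in age_groups:
    cells = "".join(str(counts[(age, sex)]).ljust(8) for sex in sexes)
    print(age.ljust(8) + cells)
```

Similar counts across a row indicate that the sexes are evenly represented in that age group.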
It seems that sex is pretty evenly distributed in each age group.
Let's move on to costs: how much do people tend to spend on medical treatments?
data["charges"].hist(method = "count", color = "#0073E7")
Based on this graph, the majority of insurance holders tend to spend less than 15,000 and only a handful of people spend more than 50,000.
Encoding¶
Since our features vary in type, let's start by encoding our categorical features. Remember, we already encoded 'smoker' from boolean to integer. Let's label-encode some other features: sex, region, and age groups.
# encoding sex
data["sex"].label_encode()
# encoding region
data["region"].label_encode()
# encoding age
data["age"].label_encode()
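Label encoding replaces each distinct category with an integer code. The sketch below shows the idea in plain Python, assuming codes are assigned in alphabetical order of the categories; that ordering, the helper name `label_encode`, and the sample values are assumptions for illustration, and the codes VerticaPy assigns may differ.

```python
def label_encode(values):
    """Map each distinct value to an integer code and return (codes, mapping)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


# Hypothetical sample of the 'region' column.
regions = ["southwest", "southeast", "southeast", "northwest", "northeast"]
codes, mapping = label_encode(regions)

print(mapping)
print(codes)
```

Keep in mind that the integer codes imply an ordering that nominal categories such as region do not really have, which can matter for models that interpret the codes as magnitudes.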