Health Insurance Costs

In this example, we use a dataset of personal medical costs to create a model to estimate treatment costs. You can download the Jupyter notebook here.

The columns provided include:

  • age: age of the primary beneficiary
  • sex: insurance contractor's gender
  • bmi: body mass index
  • children: number of dependent children covered by health insurance
  • smoker: smoker on non-smoker
  • region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest.
  • charges: individual medical costs billed by health insurance

Initialization

This example uses the following version of VerticaPy:

In [1]:
import verticapy as vp
vp.__version__
Out[1]:
'0.9.0'

Connect to Vertica. This example uses an existing connection called "VerticaDSN." For details on how to create a connection, use see the connection tutorial.

In [2]:
vp.connect("VerticaDSN")

Let's create a new schema and assign the data to a vDataFrame object.

In [3]:
vp.drop("insurance", method="schema")
vp.create_schema("insurance")
data = vp.read_csv('data/insurance.csv', schema = 'insurance')
display(data)
The table "insurance"."insurance" has been successfully created.
123
age
Int
Abc
sex
Varchar(20)
123
bmi
Numeric(8,4)
123
children
Int
010
smoker
Boolean
Abc
region
Varchar(20)
123
charges
Float
118female20.790
southeast1607.5101
218female21.660
northeast14283.4594
318female24.091
southeast2201.0971
418female25.080
northeast2196.4732
518female26.3150
northeast2198.18985
618female26.730
southeast1615.7667
718female27.283
southeast18223.4512
818female28.2150
northeast2200.83085
918female29.1650
northeast7323.734819
1018female30.1150
northeast2203.47185
1118female30.1150
northeast21344.8467
1218female30.3050
northeast2203.73595
1318female31.130
southeast1621.8827
1418female31.350
southeast1622.1885
1518female31.354
northeast4561.1885
1618female31.920
northeast2205.9808
1718female32.122
southeast2801.2588
1818female33.1550
northeast2207.69745
1918female33.880
southeast11482.63485
2018female35.6250
northeast2211.13075
2118female36.850
southeast1629.8335
2218female36.850
southeast36149.4835
2318female37.291
southeast2219.4451
2418female38.170
southeast1631.6683
2518female38.280
southeast1631.8212
2618female38.280
southeast14133.03775
2718female38.6652
northeast3393.35635
2818female39.160
southeast1633.0444
2918female39.820
southeast1633.9618
3018female40.1850
northeast2217.46915
3118female40.260
southeast1634.5734
3218female40.280
northeast2217.6012
3318female42.240
southeast38792.6856
3418male15.960
northeast1694.7964
3518male17.292
northeast12829.4551
3618male21.470
northeast1702.4553
3718male21.5650
northeast13747.87235
3818male21.782
southeast11884.04858
3918male22.990
northeast1704.5681
4018male23.0850
northeast1704.70015
4118male23.210
southeast1121.8739
4218male23.321
southeast1711.0268
4318male23.750
northeast1705.6245
4418male25.1750
northeast15518.18025
4518male25.460
northeast1708.0014
4618male26.1250
northeast1708.92575
4718male26.182
southeast2304.0022
4818male27.361
northeast17178.6824
4918male28.311
northeast11272.33139
5018male28.50
northeast1712.227
5118male29.371
southeast1719.4363
5218male30.031
southeast1720.3537
5318male30.140
southeast1131.5066
5418male30.43
northeast3481.868
5518male31.682
southeast34303.1672
5618male31.730
northeast33732.6867
5718male33.330
southeast1135.9407
5818male33.5350
northeast34617.84065
5918male33.660
southeast1136.3994
6018male33.771
southeast1725.5523
6118male34.10
southeast1137.011
6218male34.430
southeast1137.4697
6318male35.21
southeast1727.54
6418male37.290
southeast1141.4451
6518male38.170
southeast36307.7983
6618male39.140
northeast12890.05765
6718male41.140
southeast1146.7966
6818male43.010
southeast1149.3959
6918male53.130
southeast1163.4627
7019female17.80
southwest1727.785
7119female18.60
southwest1728.897
7219female20.60
southwest1731.677
7319female21.70
southwest13844.506
7419female22.5150
northwest2117.33885
7519female23.42
southwest2913.569
7619female24.511
northwest2709.1119
7719female24.6051
northwest2709.24395
7819female24.70
southwest1737.376
7919female25.7451
northwest2710.82855
8019female27.90
southwest16884.924
8119female27.933
northwest18838.70366
8219female28.30
southwest17081.08
8319female28.310
northwest17468.9839
8419female28.41
southwest2331.519
8519female28.65
southwest4687.797
8619female28.880
northwest17748.5062
8719female28.90
southwest1743.214
8819female29.80
southwest1744.465
8919female30.020
northwest33307.5508
9019female30.4950
northwest2128.43105
9119female30.592
northwest24059.68019
9219female31.8251
northwest2719.27975
9319female32.110
northwest2130.6759
9419female32.490
northwest36898.73308
9519female32.90
southwest1748.774
9619female33.110
southeast34439.8559
9719female34.72
southwest36397.576
9819female35.150
northwest2134.9015
9919female36.5750
northwest2136.88225
10019female37.430
northwest2138.0707
Rows: 1-100 | Columns: 7

Let's take a look at the first few entries in the dataset.

In [4]:
# returns the first five rows
data.head(5)
Out[4]:
123
age
Int
Abc
sex
Varchar(20)
123
bmi
Numeric(8,4)
123
children
Int
010
smoker
Boolean
Abc
region
Varchar(20)
123
charges
Float
118female20.790
southeast1607.5101
218female21.660
northeast14283.4594
318female24.091
southeast2201.0971
418female25.080
northeast2196.4732
518female26.3150
northeast2198.18985
Rows: 1-5 | Columns: 7

Data exploration

Let's check our dataset for missing values. If we find any, we'll have to impute them before we create any models.

In [5]:
# count the number of non-null entries per column 
data.count_percent()
Out[5]:
count
percent
"age"1338.0100.0
"sex"1338.0100.0
"bmi"1338.0100.0
"children"1338.0100.0
"smoker"1338.0100.0
"region"1338.0100.0
"charges"1338.0100.0
Rows: 1-7 | Columns: 3

There aren't missing any values, so let's get a summary of the features.

In [6]:
# returns summary data of each feature
data.describe(method='all')
Out[6]:
123
"age"
Int
100%
123
"bmi"
Numeric(8,4)
100%
123
"children"
Int
100%
010
"smoker"
Boolean
100%
123
"charges"
Float
100%
Abc
"sex"
Varchar(20)
100%
Abc
"region"
Varchar(20)
100%
dtypeintnumeric(8,4)intbooleanfloatvarchar(20)varchar(20)
percent100100100100100100100
count1338133813381338133813381338
top1832.30
1639.5631malesoutheast
top_percent5.1570.97242.979.5220.14950.52327.205
avg39.207025411061330.66339686098651.09491778774290.20478325859491813270.42226514134.989536621823629.0
stddev14.04996037921626.098186911679011.205492739781910.40369403754561712110.0112366941.000319138568750.0
min1815.96001121.873949
approx_25%27.026.296250.00.04740.2871549
approx_50%39.030.41.00.09379.184749
approx_75%51.034.693752.00.016639.91251569
max6453.135163770.4280169
range4637.175162648.5541120
empty[null][null][null][null][null]00
Rows: 1-14 | Columns: 8

The dataset covers 1338 individuals up to age 64 from four different regions, each with up to six dependent children.

We might find some interesting patterns if we check age distribution, so let's create a histogram.

In [7]:
# histogram of age
data["age"].hist(method = "count", color = "#0073E7", h = 1)
Out[7]:
<AxesSubplot:xlabel='"age"', ylabel='Frequency'>

We have a pretty obvious trend here: the 18 and 19 year old age groups are significantly more frequent than any other, older age group. The other ages range from 20 to 30 people.

Before we do anything else, let's discretize the age column using equal-width binning with a width of 5. Our goal is to see if there are any obvious patterns among the different age groups.

In [8]:
# discretize the age using a bin of 5
data["age"].discretize(method = "same_width", h = 5)
Out[8]:
Abc
age
Varchar
100%
Abc
sex
Varchar(20)
100%
123
bmi
Numeric(8,4)
100%
123
children
Int
100%
010
smoker
Boolean
100%
Abc
region
Varchar(20)
100%
123
charges
Float
100%
1[15;19]female20.790
southeast1607.5101
2[15;19]female21.660
northeast14283.4594
3[15;19]female24.091
southeast2201.0971
4[15;19]female25.080
northeast2196.4732
5[15;19]female26.3150
northeast2198.18985
6[15;19]female26.730
southeast1615.7667
7[15;19]female27.283
southeast18223.4512
8[15;19]female28.2150
northeast2200.83085
9[15;19]female29.1650
northeast7323.734819
10[15;19]female30.1150
northeast2203.47185
11[15;19]female30.1150
northeast21344.8467
12[15;19]female30.3050
northeast2203.73595
13[15;19]female31.130
southeast1621.8827
14[15;19]female31.350
southeast1622.1885
15[15;19]female31.354
northeast4561.1885
16[15;19]female31.920
northeast2205.9808
17[15;19]female32.122
southeast2801.2588
18[15;19]female33.1550
northeast2207.69745
19[15;19]female33.880
southeast11482.63485
20[15;19]female35.6250
northeast2211.13075
21[15;19]female36.850
southeast1629.8335
22[15;19]female36.850
southeast36149.4835
23[15;19]female37.291
southeast2219.4451
24[15;19]female38.170
southeast1631.6683
25[15;19]female38.280
southeast1631.8212
26[15;19]female38.280
southeast14133.03775
27[15;19]female38.6652
northeast3393.35635
28[15;19]female39.160
southeast1633.0444
29[15;19]female39.820
southeast1633.9618
30[15;19]female40.1850
northeast2217.46915
31[15;19]female40.260
southeast1634.5734
32[15;19]female40.280
northeast2217.6012
33[15;19]female42.240
southeast38792.6856
34[15;19]male15.960
northeast1694.7964
35[15;19]male17.292
northeast12829.4551
36[15;19]male21.470
northeast1702.4553
37[15;19]male21.5650
northeast13747.87235
38[15;19]male21.782
southeast11884.04858
39[15;19]male22.990
northeast1704.5681
40[15;19]male23.0850
northeast1704.70015
41[15;19]male23.210
southeast1121.8739
42[15;19]male23.321
southeast1711.0268
43[15;19]male23.750
northeast1705.6245
44[15;19]male25.1750
northeast15518.18025
45[15;19]male25.460
northeast1708.0014
46[15;19]male26.1250
northeast1708.92575
47[15;19]male26.182
southeast2304.0022
48[15;19]male27.361
northeast17178.6824
49[15;19]male28.311
northeast11272.33139
50[15;19]male28.50
northeast1712.227
51[15;19]male29.371
southeast1719.4363
52[15;19]male30.031
southeast1720.3537
53[15;19]male30.140
southeast1131.5066
54[15;19]male30.43
northeast3481.868
55[15;19]male31.682
southeast34303.1672
56[15;19]male31.730
northeast33732.6867
57[15;19]male33.330
southeast1135.9407
58[15;19]male33.5350
northeast34617.84065
59[15;19]male33.660
southeast1136.3994
60[15;19]male33.771
southeast1725.5523
61[15;19]male34.10
southeast1137.011
62[15;19]male34.430
southeast1137.4697
63[15;19]male35.21
southeast1727.54
64[15;19]male37.290
southeast1141.4451
65[15;19]male38.170
southeast36307.7983
66[15;19]male39.140
northeast12890.05765
67[15;19]male41.140
southeast1146.7966
68[15;19]male43.010
southeast1149.3959
69[15;19]male53.130
southeast1163.4627
70[15;19]female17.80
southwest1727.785
71[15;19]female18.60
southwest1728.897
72[15;19]female20.60
southwest1731.677
73[15;19]female21.70
southwest13844.506
74[15;19]female22.5150
northwest2117.33885
75[15;19]female23.42
southwest2913.569
76[15;19]female24.511
northwest2709.1119
77[15;19]female24.6051
northwest2709.24395
78[15;19]female24.70
southwest1737.376
79[15;19]female25.7451
northwest2710.82855
80[15;19]female27.90
southwest16884.924
81[15;19]female27.933
northwest18838.70366
82[15;19]female28.30
southwest17081.08
83[15;19]female28.310
northwest17468.9839
84[15;19]female28.41
southwest2331.519
85[15;19]female28.65
southwest4687.797
86[15;19]female28.880
northwest17748.5062
87[15;19]female28.90
southwest1743.214
88[15;19]female29.80
southwest1744.465
89[15;19]female30.020
northwest33307.5508
90[15;19]female30.4950
northwest2128.43105
91[15;19]female30.592
northwest24059.68019
92[15;19]female31.8251
northwest2719.27975
93[15;19]female32.110
northwest2130.6759
94[15;19]female32.490
northwest36898.73308
95[15;19]female32.90
southwest1748.774
96[15;19]female33.110
southeast34439.8559
97[15;19]female34.72
southwest36397.576
98[15;19]female35.150
northwest2134.9015
99[15;19]female36.5750
northwest2136.88225
100[15;19]female37.430
northwest2138.0707
Rows: 1-100 of 1338 | Columns: 7

Age probably influences one's body mass index (BMI), so let's compare the average of body mass indexes of each age group and look for patterns there. We'll use a bar graph this time.

In [9]:
# average of BMI for each age group
data.hchart(x = "age",
            y = "AVG(bmi)", 
            aggregate = True,
            kind = "bar")
Out[9]:

There's a pretty clear trend here, and we can say that, in general, older individuals tend to have a greater BMIs.

Let's check the average number of smokers for each age-group. Before we do, we'll convert the 'yes' and 'no' 'smoker' values to more convenient boolean values.

In [10]:
# Importing the stats module
import verticapy.stats as st

# Applying the decode function
data["smoker_int"] = st.decode(data["smoker"], True, 1, 0)

Now we can plot the average number of smokers for each age group.

In [11]:
# average of number of smokers per age group
data.hchart(x = "age",
            y = "AVG(smoker_int)", 
            aggregate = True,
            kind = "bar")
Out[11]:

Unfortuantely, there's no obvious relationship between age and smoking habits - none that we can find from this graph, anyway.

Let's see if we can relate an individual's smoking habits with their sex.

In [12]:
# average of number of smokers per sex
data.hchart(x = "sex",
            y = "AVG(smoker_int)", 
            aggregate = True,
            kind = "bar")
Out[12]:

Now we're getting somewhere! Looks like we have noticeably more male smokers than female ones.

Let's see how an individual's BMI relates to their sex.

In [13]:
# average bmi per sex
data.hchart(x = "sex",
            y = "AVG(bmi)", 
            aggregate = True,
            kind = "bar")
Out[13]:

Males seem to have a slightly higher BMI, but it'd be hard to draw any conclusions from such a small difference.

Going back to our earlier patterns, let's check the distribution of sexes among age groups and see if the patterns we identified earlier skews toward one of the sexes.

In [14]:
# pivot table with number of each sex per age group
data.pivot_table(['age','sex'])
Out[14]:
"age"/"sex"
female
male
1[15;19]6671
2[20;24]6873
3[25;29]6772
4[30;34]6567
5[35;39]6263
6[40;44]6768
7[45;49]7272
8[50;54]7271
9[55;59]6563
10[60;64]5856
Rows: 1-10 | Columns: 3

It seems that sex is pretty evenly distributed in each age group.

Let's move onto costs: how much do people tend to spend on medical treatments?

In [15]:
data["charges"].hist(method = "count", color = "#0073E7")
Out[15]:
<AxesSubplot:xlabel='"charges"', ylabel='Frequency'>

Based on this graph, the majority of insurance holders tend to spend less than 1500 and only a handful of people spend more than 5000.

Encoding

Since our features vary in type, let's start by encoding our categorical features. Remember, we label-encoded 'smoker' from boolean. Let's label-encode some other features: sex, region, and age groups.

In [16]:
# encoding sex 
data["sex"].label_encode()

# encoding region
data["region"].label_encode()

# encoding age
data["age"].label_encode()
Out[16]: