
VerticaPy
Missing Values
Missing values occur when no data value is stored for a variable in an observation; they are most often represented with NULL or None. Leaving them unhandled can lead to unexpected results (for example, some ML algorithms can't handle missing values at all) and, worse, to incorrect conclusions.
There are 3 main types of missing values:
- MCAR (Missing Completely at Random) : The events that lead to any particular data item being missing occur entirely at random. For example, in IoT, we can lose sensor data in transmission.
- MAR (Missing [Conditionally] at Random) : The missingness is not completely random; it is related to some of the observed data, but not to the missing values themselves. For example, some students may not have answered specific questions on a test because they were absent during the relevant lesson.
- MNAR (Missing not at Random) : The value of the variable that’s missing is related to the reason it’s missing. For example, if someone didn’t subscribe to a loyalty program, we can leave the cell empty.
Different types of missing values tend to suggest different methods for imputing them. For example, when dealing with MCAR values, we can use mathematical aggregations (such as the average or median) to impute the missing values. For MNAR values, we can simply create another category. For MAR values, however, we need to investigate further before deciding how to impute the data.
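To make the first two strategies concrete, here is a minimal pure-Python sketch that does not use VerticaPy. The records, the column names, and the helper functions `fill_with_mean` and `fill_with_category` are all hypothetical and exist only for illustration.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"sensor": 20.0, "loyalty": "gold"},
    {"sensor": None, "loyalty": None},
    {"sensor": 24.0, "loyalty": "silver"},
    {"sensor": 22.0, "loyalty": None},
]


def fill_with_mean(rows, column):
    """MCAR-style imputation: replace missing values with the mean of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = mean(observed)
    return [{**r, column: fill_value if r[column] is None else r[column]} for r in rows]


def fill_with_category(rows, column, label):
    """MNAR-style imputation: treat 'missing' as a category of its own."""
    return [{**r, column: label if r[column] is None else r[column]} for r in rows]


# Both helpers return new lists, so the original records are left untouched.
imputed = fill_with_category(fill_with_mean(records, "sensor"), "loyalty", "not subscribed")
```

MAR values are deliberately left out of the sketch, since the right strategy depends on which observed columns explain the missingness.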
To see how to handle missing values in VerticaPy, we'll use the well-known 'Titanic' dataset.
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
We can examine the missing values with the 'count_percent' method, which reports how many values in each column are not missing.
vdf.count_percent()
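The idea behind such a summary can be sketched in plain Python: for each column, count the non-missing values and express that count as a percentage of all rows. This is only an illustration of the concept, not VerticaPy's implementation. The function below is a hypothetical stand-in and the rows are made-up values, not real Titanic records.

```python
def count_percent(rows, columns):
    """Return the non-missing count and percentage for each column."""
    total = len(rows)
    summary = {}
    for col in columns:
        non_null = sum(1 for r in rows if r.get(col) is not None)
        percent = 100.0 * non_null / total if total else 0.0
        summary[col] = {"count": non_null, "percent": percent}
    return summary


# Made-up rows; None marks a missing value.
passengers = [
    {"age": 29.0, "boat": "2", "fare": 211.3},
    {"age": None, "boat": None, "fare": 151.55},
    {"age": 30.0, "boat": None, "fare": 151.55},
    {"age": 25.0, "boat": None, "fare": None},
]

summary = count_percent(passengers, ["age", "boat", "fare"])
for column, stats in summary.items():
    print(column, stats["count"], f"{stats['percent']:.1f}%")
```

A column with a low percentage, like 'boat' in this toy data, is a candidate for a dedicated category rather than a numerical imputation.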
The missing values for 'boat' are MNAR; a missing value simply indicates that no lifeboat was recorded for the passenger. We can replace all the missing values with a new category 'No Lifeboat' using the 'fillna' method.
vdf["boat"].fillna("No Lifeboat")
vdf["boat"]
Missing values for 'age' seem to be MCAR, so the best way to impute them is with mathematical aggregations. Let's impute the age using the average age of passengers of the same sex and class.
vdf["age"].fillna(method="avg", by=["pclass", "sex"])
vdf["age"]
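Group-wise average imputation can be sketched in plain Python as follows. This is a simplified in-memory illustration, not what VerticaPy runs in the database. The helper `fill_with_group_mean` and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_group_mean(rows, column, by):
    """Replace missing values in `column` with the mean of the row's group."""
    # Collect the observed values of each group, keyed by the `by` columns.
    groups = defaultdict(list)
    for r in rows:
        if r[column] is not None:
            groups[tuple(r[k] for k in by)].append(r[column])
    group_means = {key: mean(values) for key, values in groups.items()}

    filled = []
    for r in rows:
        key = tuple(r[k] for k in by)
        if r[column] is None and key in group_means:
            filled.append({**r, column: group_means[key]})
        else:
            # Either the value is present, or the group has no observed value.
            filled.append(dict(r))
    return filled


# Made-up rows; None marks a missing age.
passengers = [
    {"pclass": 1, "sex": "female", "age": 30.0},
    {"pclass": 1, "sex": "female", "age": 40.0},
    {"pclass": 1, "sex": "female", "age": None},
    {"pclass": 3, "sex": "male", "age": 20.0},
    {"pclass": 3, "sex": "male", "age": None},
    {"pclass": 2, "sex": "male", "age": None},
]

filled = fill_with_group_mean(passengers, "age", ["pclass", "sex"])
```

In this sketch, a group with no observed value at all (the second-class male above) stays missing, so a fallback such as the overall average may still be needed.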
The features 'embarked' and 'fare' have only a couple of missing values. Instead of imputing them, we can just drop the affected rows with the 'dropna' method.
vdf["fare"].dropna()
vdf["embarked"].dropna()
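Conceptually, dropping missing values amounts to filtering out every row where one of the chosen columns is empty. A minimal pure-Python sketch follows, with a hypothetical helper `drop_missing` and made-up rows.

```python
def drop_missing(rows, columns):
    """Keep only the rows where every listed column has a value."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


# Made-up rows; None marks a missing value.
passengers = [
    {"name": "A", "fare": 7.25, "embarked": "S"},
    {"name": "B", "fare": None, "embarked": "S"},
    {"name": "C", "fare": 80.0, "embarked": None},
    {"name": "D", "fare": 13.0, "embarked": "C"},
]

kept = drop_missing(passengers, ["fare", "embarked"])
print(f"{len(passengers) - len(kept)} rows dropped, {len(kept)} kept")
```

Dropping rows is reasonable when only a handful are affected; when many rows would be lost, imputation usually preserves more information.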