VerticaPy

Python API for Vertica Data Science at Scale

Missing Values

Missing values occur when no data value is stored for the variable in an observation and are most often represented with a NULL or None. Not handling them can lead to unexpected results (for example, some ML algorithms can't handle missing values at all) and worse, it can lead to incorrect conclusions.

There are 3 main types of missing values:

  • MCAR (Missing Completely at Random) : The events that lead to any particular data-item being missing occur entirely at random. For example, in IOT, we can lose sensory data in transmission.
  • MAR (Missing {Conditionally} at Random) : Missing data doesn't happen at random and is instead related to some of the observed data. For example, some students may have not answered to some specific questions of a test because they were absent during the relevant lesson.
  • MNAR (Missing not at Random) : The value of the variable that’s missing is related to the reason it’s missing. For example, if someone didn’t subscribe to a loyalty program, we can leave the cell empty.

Different types of missing values tend to suggest different methods for imputing them. For example, when dealing with MCAR values, you can use mathematical aggregations to impute the missing values. For MNAR values, we can simply create another category. MAR values, however, we'll need to do some more investigation before deciding how to impute the data.

To see how to handle missing values in VerticaPy, we'll use the well-known 'Titanic' dataset.

In [64]:
from verticapy.datasets import load_titanic
vdf = load_titanic()
display(vdf)
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
home.dest
Varchar(100)
110female2.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
210male30.012113781151.55C22 C26S[null]135Montreal, PQ / Chesterville, ON
310female25.012113781151.55C22 C26S[null][null]Montreal, PQ / Chesterville, ON
410male39.0001120500.0A36S[null][null]Belfast, NI
510male71.000PC 1760949.5042[null]C[null]22Montevideo, Uruguay
610male47.010PC 17757227.525C62 C64C[null]124New York, NY
710male[null]00PC 1731825.925[null]S[null][null]New York, NY
810male24.001PC 17558247.5208B58 B60C[null][null]Montreal, PQ
910male36.0001305075.2417C6CA[null]Winnipeg, MN
1010male25.0001390526.0[null]C[null]148San Francisco, CA
1110male45.00011378435.5TS[null][null]Trenton, NJ
1210male42.00011048926.55D22S[null][null]London / Winnipeg, MB
1310male41.00011305430.5A21S[null][null]Pomeroy, WA
1410male48.000PC 1759150.4958B10C[null]208Omaha, NE
1510male[null]0011237939.6[null]C[null][null]Philadelphia, PA
1610male45.00011305026.55B38S[null][null]Washington, DC
1710male[null]0011379831.0[null]S[null][null][null]
1810male33.0006955.0B51 B53 B55S[null][null]New York, NY
1910male28.00011305947.1[null]S[null][null]Montevideo, Uruguay
2010male17.00011305947.1[null]S[null][null]Montevideo, Uruguay
2110male49.0001992426.0[null]S[null][null]Ascot, Berkshire / Rochester, NY
2210male36.0101987778.85C46S[null]172Little Onn Hall, Staffs
2310male46.010W.E.P. 573461.175E31S[null][null]Amenia, ND
2410male[null]001120510.0[null]S[null][null]Liverpool, England / Belfast
2510male27.01013508136.7792C89C[null][null]Los Angeles, CA
2610male[null]0011046552.0A14S[null][null]Stoughton, MA
2710male47.000572725.5875E58S[null][null]Victoria, BC
2810male37.011PC 1775683.1583E52C[null][null]Lakewood, NJ
2910male[null]0011379126.55[null]S[null][null]Roachdale, IN
3010male70.011WE/P 573571.0B22S[null]269Milwaukee, WI
3110male39.010PC 1759971.2833C85C[null][null]New York, NY
3210male31.010F.C. 1275052.0B71S[null][null]Montreal, PQ
3310male50.010PC 17761106.425C86C[null]62Deephaven, MN / Cedar Rapids, IA
3410male39.000PC 1758029.7A18C[null]133Philadelphia, PA
3510female36.000PC 1753131.6792A29C[null][null]New York, NY
3610male[null]00PC 17483221.7792C95S[null][null][null]
3710male30.00011305127.75C111C[null][null]New York, NY
3810male19.03219950263.0C23 C25 C27S[null][null]Winnipeg, MB
3910male64.01419950263.0C23 C25 C27S[null][null]Winnipeg, MB
4010male[null]0011377826.55D34S[null][null]Westcliff-on-Sea, Essex
4110male[null]001120580.0B102S[null][null][null]
4210male37.01011380353.1C123S[null][null]Scituate, MA
4310male47.00011132038.5E63S[null]275St Anne's-on-Sea, Lancashire
4410male24.000PC 1759379.2B86C[null][null][null]
4510male71.000PC 1775434.6542A5C[null][null]New York, NY
4610male38.001PC 17582153.4625C91S[null]147Winnipeg, MB
4710male46.000PC 1759379.2B82 B84C[null][null]New York, NY
4810male[null]0011379642.4[null]S[null][null][null]
4910male45.0103697383.475C83S[null][null]New York, NY
5010male40.0001120590.0B94S[null]110[null]
5110male55.0111274993.5B69S[null]307Montreal, PQ
5210male42.00011303842.5B11S[null][null]London / Middlesex
5310male[null]001746351.8625E46S[null][null]Brighton, MA
5410male55.00068050.0C39S[null][null]London / Birmingham
5510male42.01011378952.0[null]S[null]38New York, NY
5610male[null]00PC 1760030.6958[null]C14[null]New York, NY
5710female50.000PC 1759528.7125C49C[null][null]Paris, France New York, NY
5810male46.00069426.0[null]S[null]80Bennington, VT
5910male50.00011304426.0E60S[null][null]London
6010male32.500113503211.5C132C[null]45[null]
6110male58.0001177129.7B37C[null]258Buffalo, NY
6210male41.0101746451.8625D21S[null][null]Southington / Noank, CT
6310male[null]0011302826.55C124S[null][null]Portland, OR
6410male[null]00PC 1761227.7208[null]C[null][null]Chicago, IL
6510male29.00011350130.0D6S[null]126Springfield, MA
6610male30.00011380145.5[null]S[null][null]London / New York, NY
6710male30.00011046926.0C106S[null][null]Brockton, MA
6810male19.01011377353.1D30S[null][null]New York, NY
6910male46.0001305075.2417C6C[null]292Vancouver, BC
7010male54.0001746351.8625E46S[null]175Dorchester, MA
7110male28.010PC 1760482.1708[null]C[null][null]New York, NY
7210male65.0001350926.55E38S[null]249East Bridgewater, MA
7310male44.0201992890.0C78Q[null]230Fond du Lac, WI
7410male55.00011378730.5C30S[null][null]Montreal, PQ
7510male47.00011379642.4[null]S[null][null]Washington, DC
7610male37.001PC 1759629.7C118C[null][null]Brooklyn, NY
7710male58.00235273113.275D48C[null]122Lexington, MA
7810male64.00069326.0[null]S[null]263Isle of Wight, England
7910male65.00111350961.9792B30C[null]234Providence, RI
8010male28.500PC 1756227.7208D43C[null]189?Havana, Cuba
8110male[null]001120520.0[null]S[null][null]Belfast
8210male45.50011304328.5C124S[null]166Surbiton Hill, Surrey
8310male23.0001274993.5B24S[null][null]Montreal, PQ
8410male29.01011377666.6C2S[null][null]Isleworth, England
8510male18.010PC 17758108.9C65C[null][null]Madrid, Spain
8610male47.00011046552.0C110S[null]207Worcester, MA
8710male38.000199720.0[null]S[null][null]Rotterdam, Netherlands
8810male22.000PC 17760135.6333[null]C[null]232[null]
8910male[null]00PC 17757227.525[null]C[null][null][null]
9010male31.000PC 1759050.4958A24S[null][null]Trenton, NJ
9110male[null]0011376750.0A32S[null][null]Seattle, WA
9210male36.0001304940.125A10C[null][null]Winnipeg, MB
9310male55.010PC 1760359.4[null]C[null][null]New York, NY
9410male33.00011379026.55[null]S[null]109London
9510male61.013PC 17608262.375B57 B59 B63 B66C[null][null]Haverford, PA / Cooperstown, NY
9610male50.0101350755.9E44S[null][null]Duluth, MN
9710male56.00011379226.55[null]S[null][null]New York, NY
9810male56.0001776430.6958A7C[null][null]St James, Long Island, NY
9910male24.0101369560.0C31S[null][null]Huntington, WV
10010male[null]0011305626.0A19S[null][null]Streatham, Surrey
Rows: 1-100 | Columns: 14

We can examine the missing values with the 'count' method.

In [66]:
vdf.count_percent()
Out[66]:
count
percent
"pclass"1234.0100.0
"survived"1234.0100.0
"name"1234.0100.0
"sex"1234.0100.0
"sibsp"1234.0100.0
"parch"1234.0100.0
"ticket"1234.0100.0
"fare"1233.099.919
"embarked"1232.099.838
"age"997.080.794
"home.dest"706.057.212
"boat"439.035.575
"cabin"286.023.177
"body"118.09.562
Rows: 1-14 | Columns: 3

The missing values for 'boat' are MNAR; missing values simply indicate that the passengers didn't pay for a lifeboat. We can replace all the missing values with a new category 'No Lifeboat' using the 'fillna' method.

In [67]:
vdf["boat"].fillna("No Lifeboat")
vdf["boat"]
795 elements were filled.
Out[67]:
Abc
boat
Varchar(100)
1No Lifeboat
2No Lifeboat
3No Lifeboat
4No Lifeboat
5No Lifeboat
6No Lifeboat
7No Lifeboat
8No Lifeboat
9A
10No Lifeboat
11No Lifeboat
12No Lifeboat
13No Lifeboat
14No Lifeboat
15No Lifeboat
16No Lifeboat
17No Lifeboat
18No Lifeboat
19No Lifeboat
20No Lifeboat
21No Lifeboat
22No Lifeboat
23No Lifeboat
24No Lifeboat
25No Lifeboat
26No Lifeboat
27No Lifeboat
28No Lifeboat
29No Lifeboat
30No Lifeboat
31No Lifeboat
32No Lifeboat
33No Lifeboat
34No Lifeboat
35No Lifeboat
36No Lifeboat
37No Lifeboat
38No Lifeboat
39No Lifeboat
40No Lifeboat
41No Lifeboat
42No Lifeboat
43No Lifeboat
44No Lifeboat
45No Lifeboat
46No Lifeboat
47No Lifeboat
48No Lifeboat
49No Lifeboat
50No Lifeboat
51No Lifeboat
52No Lifeboat
53No Lifeboat
54No Lifeboat
55No Lifeboat
5614
57No Lifeboat
58No Lifeboat
59No Lifeboat
60No Lifeboat
61No Lifeboat
62No Lifeboat
63No Lifeboat
64No Lifeboat
65No Lifeboat
66No Lifeboat
67No Lifeboat
68No Lifeboat
69No Lifeboat
70No Lifeboat
71No Lifeboat
72No Lifeboat
73No Lifeboat
74No Lifeboat
75No Lifeboat
76No Lifeboat
77No Lifeboat
78No Lifeboat
79No Lifeboat
80No Lifeboat
81No Lifeboat
82No Lifeboat
83No Lifeboat
84No Lifeboat
85No Lifeboat
86No Lifeboat
87No Lifeboat
88No Lifeboat
89No Lifeboat
90No Lifeboat
91No Lifeboat
92No Lifeboat
93No Lifeboat
94No Lifeboat
95No Lifeboat
96No Lifeboat
97No Lifeboat
98No Lifeboat
99No Lifeboat
100No Lifeboat
Rows: 1-100 of 1234 | Column: boat | Type: Varchar(100)

Missing values for 'age' seem to be MCAR, so the best way to impute them is with mathematical aggregations. Let's impute the age using the average age of passengers of the same sex and class.

In [68]:
vdf["age"].fillna(method = "avg",
                  by = ["pclass", "sex"])
vdf["age"]
237 elements were filled.
Out[68]:
123
age
Float
136.0
231.0
321.0
450.0
545.0
631.0
755.0
860.0
935.0
1037.2635658914729
1139.0
1237.2635658914729
1339.0
1418.0
1548.0
1662.0
1752.0
1843.0
1937.2635658914729
2040.0
2123.0
2218.0
2339.0
2440.0
2530.0
2635.0
2724.0
2848.0
2918.0
3054.0
3133.0
3233.0
3343.0
3456.0
3530.0
3617.0
3722.0
3822.0
3939.0
4019.0
4123.0
4231.0
4337.0
4433.0
4537.2635658914729
4618.0
4716.0
4815.0
4958.0
5021.0
5151.0
5216.0
5355.0
5449.0
5530.0
5639.0
5745.0
5837.2635658914729
5938.0
6035.0
6135.0
6251.0
6344.0
6416.0
6552.0
6624.0
6735.0
6849.0
6925.0
7045.0
7158.0
7219.0
7337.2635658914729
7445.0
7522.0
7635.0
7735.0
7848.0
7922.0
8037.2635658914729
8130.0
8260.0
8323.0
8428.0
8524.0
8637.2635658914729
8737.2635658914729
8854.0
8938.0
9023.0
9148.0
9227.0
9354.0
9417.0
9527.0
9633.0
9738.0
9864.0
9936.0
10055.0
Rows: 1-100 of 1234 | Column: age | Type: Float

The features 'embarked' and 'fare' have a couple missing values. Instead of using a technique to impute them, we can just drop them with the 'dropna' method.

In [69]:
vdf["fare"].dropna()
vdf["embarked"].dropna()
1 element was filtered.
2 elements were filtered.
Out[69]:
123
pclass
Int
123
survived
Int
Abc
Varchar(164)
Abc
sex
Varchar(20)
123
age
Numeric(6,3)
123
sibsp
Int
123
parch
Int
Abc
ticket
Varchar(36)
123
fare
Numeric(10,5)
Abc
cabin
Varchar(30)
Abc
embarked
Varchar(20)
Abc
boat
Varchar(100)
123
body
Int
Abc
Varchar(100)
111female36.000PC 17760135.6333C32C8[null]
211female31.00016966134.5E39 E41C3[null]
311female21.00011379526.55[null]S8 10[null]
411female50.011113503211.5C80C4[null]
511female45.01136928164.8667[null]S8[null]
611female31.00236928164.8667C7S8[null]
711female55.000PC 17760135.6333C32C8[null]
811female60.01011081375.25D37C5[null]
911female35.000PC 17755512.3292[null]C3[null]
1011female37.263565891472900PC 1758579.2[null]CD[null]
1111female39.01117421110.8833C68C4[null]
1211female37.2635658914729101999652.0C126S5 7[null]
1311female39.01111041379.65E67S8[null]
1411female18.00211041379.65E68S8[null]
1511female48.0001746625.9292D17S8[null]
1611female52.0103694778.2667D20C4[null]
1711female43.0101177855.4417C116C5[null]
1811female37.263565891472910PC 17569146.5208B78C6[null]
1911female40.01116966134.5E34C3[null]
2011female23.0102122882.2667B45S7[null]
2111female18.0101369560.0C31S6[null]
2211female39.0101350755.9E44S11[null]
2311female40.000PC 17582153.4625C125S3[null]
2411female30.00011379831.0[null]C4[null]
2511female35.0101323657.75C28C11[null]
2611female24.000PC 1747769.3B35C9[null]
2711female48.013PC 17608262.375B57 B59 B63 B66C4[null]
2811female18.022PC 17608262.375B57 B59 B63 B66C4[null]
2911female54.010PC 1760359.4[null]C6[null]
3011female33.00011015286.5B77S8[null]
3111female33.000PC 1761327.7208A11C11[null]
3211female43.00124160211.3375B3S2[null]
3311female56.0011176783.1583C50C7[null]
3411female30.0001274993.5B73S3[null]
3511female17.010PC 17758108.9C65C8[null]
3611female22.01011377666.6C2S8[null]
3711female22.00111350961.9792B36C5[null]
3811female39.000PC 17758108.9C105C8[null]
3911female19.0021175226.2833D47S5[null]
4011female23.01035273113.275D36C6[null]
4111female31.01035273113.275D36C6[null]
4211female37.0101992890.0C78Q14[null]
4311female33.0101992890.0C78Q14[null]
4411female37.263565891472910PC 1760482.1708[null]C6[null]
4511female18.01011377353.1D30S10[null]
4611female16.00011015286.5B79S8[null]
4711female15.00124160211.3375B5S2[null]
4811female58.000PC 17569146.5208B80CNo Lifeboat[null]
4911female21.0001350277.9583D9S10[null]
5011female51.001PC 1759239.4D28S9[null]
5111female16.001PC 1759239.4D28S9[null]
5211female55.00011237727.7208[null]C6[null]
5311