# Linear Regression

Linear regression is one of the most popular regression algorithms and produces good predictions for well-prepared data. Its optimization function computes coefficients to express a response column as a linear relationship of its predictors.
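
To make "computing coefficients" concrete, here is a minimal sketch of ordinary least squares with a single predictor. It is written in plain Python rather than with VerticaPy; the function name `fit_simple_ols` and the toy data are illustrative assumptions, not part of any library.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 1 + 2x, so the fit should recover those values.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(hours, scores)
print(intercept, slope)
```

With several predictors the same idea applies, except that the coefficients are obtained by solving a linear system instead of a single ratio.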

You must verify the Gauss-Markov assumptions when using linear regression algorithms:

• Linearity : the model must be linear in the parameters estimated by the OLS method.
• Non-Collinearity : the regressors aren't perfectly correlated with each other.
• Exogeneity : the regressors aren't correlated with the error term.
• Homoscedasticity : the variance of the error term is constant, whatever the values of the regressors.
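
Two of these assumptions can be probed numerically. The sketch below is plain Python with hypothetical helper names (`pearson`, `residual_variance_ratio`): it measures the correlation between two regressors and compares the residual variance over the lower and upper halves of a regressor's range. Treat it as a rough diagnostic under those assumptions, not as a formal statistical test.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally sized numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def residual_variance_ratio(x, residuals):
    """Variance of residuals for large x divided by variance for small x.

    A ratio far from 1 hints at heteroscedasticity.
    """
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    return variance(high) / variance(low)


# x2 is an exact multiple of x1, so the two regressors are perfectly collinear.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]
collinearity = pearson(x1, x2)

# Residuals whose spread grows with x1 violate homoscedasticity.
residuals = [-1.0, 1.0, -3.0, 3.0]
ratio = residual_variance_ratio(x1, residuals)
print(collinearity, ratio)
```

A correlation close to 1 or -1 suggests dropping or combining one of the two regressors, and a variance ratio far from 1 suggests the error variance depends on the regressor.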

To create a good linear regression model, it's important to:

• Impute missing values
• Encode categorical features (linear regression only accepts numerical variables)
• Compute the correlation matrix to identify highly correlated predictors
• Decompose the data (optional)
• Normalize the data (optional, but recommended)
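
Three of the steps above (imputation, encoding, and normalization) can be sketched on a tiny in-memory sample. This is plain Python, not the VerticaPy API; the helper names and the three-row sample are hypothetical and serve only to show what each transformation does to a column.

```python
def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical list as one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


# Hypothetical sample: one missing age and one categorical column.
age = [12, None, 14]
travel = ["WALK", "CAR", "WALK"]

age_filled = impute_mean(age)      # the missing age becomes 13.0
travel_encoded = one_hot(travel)   # {"CAR": [0, 1, 0], "WALK": [1, 0, 1]}
age_scaled = zscore(age_filled)    # mean 0, unit standard deviation
print(age_filled, travel_encoded, age_scaled)
```

When the model includes an intercept, one of the encoded columns should be dropped. Otherwise the dummy columns sum to 1 on every row, which makes them perfectly collinear with the intercept.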

# Example without decomposition

Let's use the 'africa_education' dataset to compute a linear regression model of students' performance in school.

In [46]:
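import verticapy as vp
from verticapy.datasets import load_africa_education

# Assumption: the dataset is loaded with VerticaPy's bundled loader,
# which requires an active database connection.
africa = load_africa_education()

# Average each pair of score columns and give the predictors readable names.
africa = africa.select([
    "(zralocp + zmalocp) / 2 AS student_score",
    "(zraloct + zmaloct) / 2 AS teacher_score",
    "XNUMYRS AS teacher_year_teaching",
    "numstu AS number_students_school",
    "PENGLISH AS english_at_home",
    "PTRAVEL AS travel_distance",
    "PTRAVEL2 AS means_of_travel",
    "PMOTHER AS m_education",
    "PFATHER AS f_education",
    "PLIGHT AS source_of_lighting",
    "PABSENT AS days_absent",
    "zpsit AS sitting_place",
    "PAGE AS age",
    "zpses AS socio_eco_statut",
    "country_long AS country",
])
display(africa)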
| | student_score (Float) | teacher_score (Float) | teacher_year_teaching (Numeric(7,3)) | number_students_school (Integer) | english_at_home (Varchar(32)) | travel_distance (Varchar(22)) | means_of_travel (Varchar(26)) | m_education (Varchar(68)) | f_education (Varchar(68)) | source_of_lighting (Varchar(24)) | days_absent (Integer) | repeated_grades (Varchar(20)) | sitting_place (Varchar(54)) | age (Integer) | socio_eco_statut (Numeric(7,3)) | country (Varchar(24)) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 681.138508424325 | [null] | 26.0 | 24 | ALL THE TIME | >0.5-1KM | CAR | … | … | ELECTRIC | 0 | NEVER | I have my own sitting place | 12 | 15.0 | South Africa |
| 2 | 425.993367323877 | 537.289572911762 | 10.0 | 23 | SOMETIMES | >4.5-5KM | WALK | … | … | PARAFFIN/OIL | 0 | NEVER | I have my own sitting place | 14 | 5.0 | Namibia |
| 3 | 534.329515370892 | 537.289572911762 | 10.0 | 23 | SOMETIMES | >0.5-1KM | WALK | … | … | ELECTRIC | 0 | NEVER | I have my own sitting place | 13 | 7.0 | Namibia |
| 4 | 536.690743411639 | 537.289572911762 | 10.0 | 23 | SOMETIMES | UP TO 0.5KM | WALK | … | … | ELECTRIC | 0 | NEVER | I have my own sitting place | 12 | 8.0 | Namibia |
| 5 | 569.392927563969 | 537.289572911762 | 10.0 | 23 | SOMETIMES | UP TO 0.5KM | … | … | … | … | … | … | … | … | … | … |

Rows: 1-100 | Columns: 16

First, let's look for missing values.

In [2]:
africa.count_percent()
Out[2]:
| | count | percent |
|---|---|---|
| "number_students_school" | 60890.0 | 100.0 |
| "english_at_home" | 60890.0 | 100.0 |
| "travel_distance" | 60890.0 | 100.0 |
| "means_of_travel" | 60890.0 | 100.0 |
| "m_education" | 60890.0 | 100.0 |
| "source_of_lighting" | 60890.0 | 100.0 |
| "days_absent" | 60890.0 | 100.0 |
| "repeated_grades" | 60890.0 | 100.0 |
| "sitting_place" | 60890.0 | 100.0 |
| "age" | 60890.0 | 100.0 |
| "country" | 60890.0 | 100.0 |
| "socio_eco_statut" | 60832.0 | 99.905 |
| "student_score" | 60809.0 | 99.867 |
| "teacher_year_teaching" | 60708.0 | 99.701 |
| "f_education" | 60599.0 | 99.522 |
| "teacher_score" | 52122.0 | 85.6 |

Rows: 1-16 | Columns: 3
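
As a rough illustration of what this report contains, the following plain-Python sketch counts the non-null values in each column and expresses them as a percentage of the rows. The function is a hypothetical stand-in written for this example, not VerticaPy's implementation, and the four-row sample is made up.

```python
def count_percent(columns):
    """Return {name: (non-null count, percent of rows)} for each column."""
    report = {}
    for name, values in columns.items():
        count = sum(1 for v in values if v is not None)
        report[name] = (count, round(100.0 * count / len(values), 3))
    return report


# Hypothetical four-row sample with one missing teacher score.
sample = {
    "age": [12, 14, 13, 12],
    "teacher_score": [None, 537.3, 537.3, 537.3],
}
report = count_percent(sample)
print(report)
```

Applied to the real data, the same arithmetic gives 52122 / 60890, or about 85.6%, for "teacher_score", which is the least complete column.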

We'll simply drop the rows with missing values rather than impute them, which avoids adding bias to the data.

In [47]:
africa.dropna()
8988 elements were filtered.
Out[47]:
| | student_score (Float) | teacher_score (Float) | teacher_year_teaching (Numeric(7,3)) | number_students_school (Integer) | english_at_home (Varchar(32)) | travel_distance (Varchar(22)) | means_of_travel (Varchar(26)) | m_education (Varchar(68)) | f_education (Varchar(68)) | source_of_lighting (Varchar(24)) | days_absent (Integer) | repeated_grades (Varchar(20)) | sitting_place (Varchar(54)) | age (Integer) | socio_eco_statut (Numeric(7,3)) | country (Varchar(24)) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 425.993367323877 | 537.289572911762 | 10.0 | 23 | SOMETIMES | >4.5-5KM | WALK | … | … | PARAFFIN/OIL | 0 | NEVER | I have my own sitting place | 14 | 5.0 | Namibia |
| 2 | 534.329515370892 | 537.289572911762 | 10.0 | 23 | SOMETIMES | >0.5-1KM | WALK | … | … | ELECTRIC | 0 | NEVER | I have my own sitting place | 13 | 7.0 | Namibia |
| 3 | 536.690743411639 | 537.289572911762 | 10.0 | 23 | SOMETIMES | UP TO 0.5KM | WALK | … | … | ELECTRIC | 0 | NEVER | I have my own sitting place | 12 | 8.0 | Namibia |
| 4 | 569.392927563969 | 537.289572911762 | 10.0 | 23 | SOMETIMES | UP TO 0.5KM | WALK | … | … | ELECTRIC | 0 | NEVER | I have my own sitting place | 13 | 9.0 | Namibia |
| 5 | 542.037992351316 | 537.289572911762 | 10.0 | 23 | MOST OF THE TIME |