Correlation and Dependency#
Finding links between variables is a very important task. The main purpose of data science is to find relationships between variables, and to understand how these relationships can help us make better decisions.
Machine learning models are also sensitive to the number of variables and how they relate and affect each other, so finding correlations and dependencies can help us make better use of our machine learning algorithms.
Let’s use the Telco Churn dataset to understand how we can find links between different variables in VerticaPy.
[21]:
import verticapy as vp
vdf = vp.read_csv("data/churn.csv")
display(vdf)
Abc customerIDVarchar(20) | Abc genderVarchar(20) | 123 SeniorCitizenInteger | 010 PartnerBoolean | 010 DependentsBoolean | 123 tenureInteger | 010 PhoneServiceBoolean | Abc MultipleLinesVarchar(100) | Abc InternetServiceVarchar(22) | Abc OnlineSecurityVarchar(38) | Abc OnlineBackupVarchar(38) | Abc DeviceProtectionVarchar(38) | Abc TechSupportVarchar(38) | Abc StreamingTVVarchar(38) | Abc StreamingMoviesVarchar(38) | Abc ContractVarchar(28) | 010 PaperlessBillingBoolean | Abc PaymentMethodVarchar(50) | 123 MonthlyChargesNumeric(10) | 123 TotalChargesNumeric(11) | 010 ChurnBoolean | |
1 | 0002-ORFBO | Female | 0 | 9 | No | DSL | No | Yes | No | Yes | Yes | No | One year | Mailed check | 65.6 | 593.3 | |||||
2 | 0003-MKNFE | Male | 0 | 9 | Yes | DSL | No | No | No | No | No | Yes | Month-to-month | Mailed check | 59.9 | 542.4 | |||||
3 | 0004-TLHLJ | Male | 0 | 4 | No | Fiber optic | No | No | Yes | No | No | No | Month-to-month | Electronic check | 73.9 | 280.85 | |||||
4 | 0011-IGKFF | Male | 1 | 13 | No | Fiber optic | No | Yes | Yes | No | Yes | Yes | Month-to-month | Electronic check | 98.0 | 1237.85 | |||||
5 | 0013-EXCHZ | Female | 1 | 3 | No | Fiber optic | No | No | No | Yes | Yes | No | Month-to-month | Mailed check | 83.9 | 267.4 | |||||
6 | 0013-MHZWF | Female | 0 | 9 | No | DSL | No | No | No | Yes | Yes | Yes | Month-to-month | Credit card (automatic) | 69.4 | 571.45 | |||||
7 | 0013-SMEOE | Female | 1 | 71 | No | Fiber optic | Yes | Yes | Yes | Yes | Yes | Yes | Two year | Bank transfer (automatic) | 109.7 | 7904.25 | |||||
8 | 0014-BMAQU | Male | 0 | 63 | Yes | Fiber optic | Yes | No | No | Yes | No | No | Two year | Credit card (automatic) | 84.65 | 5377.8 | |||||
9 | 0015-UOCOJ | Female | 1 | 7 | No | DSL | Yes | No | No | No | No | No | Month-to-month | Electronic check | 48.2 | 340.35 | |||||
10 | 0016-QLJIS | Female | 0 | 65 | Yes | DSL | Yes | Yes | Yes | Yes | Yes | Yes | Two year | Mailed check | 90.45 | 5957.9 | |||||
11 | 0017-DINOC | Male | 0 | 54 | No phone service | DSL | Yes | No | No | Yes | Yes | No | Two year | Credit card (automatic) | 45.2 | 2460.55 | |||||
12 | 0017-IUDMW | Female | 0 | 72 | Yes | Fiber optic | Yes | Yes | Yes | Yes | Yes | Yes | Two year | Credit card (automatic) | 116.8 | 8456.75 | |||||
13 | 0018-NYROU | Female | 0 | 5 | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Electronic check | 68.95 | 351.5 | |||||
14 | 0019-EFAEP | Female | 0 | 72 | Yes | Fiber optic | Yes | Yes | Yes | No | Yes | No | Two year | Bank transfer (automatic) | 101.3 | 7261.25 | |||||
15 | 0019-GFNTW | Female | 0 | 56 | No phone service | DSL | Yes | Yes | Yes | Yes | No | No | Two year | Bank transfer (automatic) | 45.05 | 2560.1 | |||||
16 | 0020-INWCK | Female | 0 | 71 | Yes | Fiber optic | No | Yes | Yes | No | No | Yes | Two year | Credit card (automatic) | 95.75 | 6849.4 | |||||
17 | 0020-JDNXP | Female | 0 | 34 | No phone service | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Mailed check | 61.25 | 1993.2 | |||||
18 | 0021-IKXGC | Female | 1 | 1 | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Electronic check | 72.1 | 72.1 | |||||
19 | 0022-TCJCI | Male | 1 | 45 | No | DSL | Yes | No | Yes | No | No | Yes | One year | Credit card (automatic) | 62.7 | 2791.5 | |||||
20 | 0023-HGHWL | Male | 1 | 1 | No phone service | DSL | No | No | No | No | No | No | Month-to-month | Electronic check | 25.1 | 25.1 | |||||
21 | 0023-UYUPN | Female | 1 | 50 | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Electronic check | 25.2 | 1306.3 | |||||
22 | 0023-XUOPT | Female | 0 | 13 | Yes | Fiber optic | No | Yes | Yes | No | Yes | No | Month-to-month | Electronic check | 94.1 | 1215.6 | |||||
23 | 0027-KWYKW | Female | 0 | 23 | Yes | Fiber optic | No | No | No | No | Yes | No | Month-to-month | Electronic check | 83.75 | 1849.95 | |||||
24 | 0030-FNXPP | Female | 0 | 3 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Mailed check | 19.85 | 57.2 | |||||
25 | 0031-PVLZI | Female | 0 | 4 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Mailed check | 20.35 | 76.35 | |||||
26 | 0032-PGELS | Female | 0 | 1 | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Bank transfer (automatic) | 30.5 | 30.5 | |||||
27 | 0036-IHMOT | Female | 0 | 55 | No | Fiber optic | No | Yes | Yes | Yes | Yes | Yes | One year | Bank transfer (automatic) | 103.7 | 5656.75 | |||||
28 | 0040-HALCW | Male | 0 | 54 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Credit card (automatic) | 20.4 | 1090.6 | |||||
29 | 0042-JVWOJ | Male | 0 | 26 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Bank transfer (automatic) | 19.6 | 471.85 | |||||
30 | 0042-RLHYP | Female | 0 | 69 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Bank transfer (automatic) | 19.7 | 1396.9 | |||||
31 | 0048-LUMLS | Male | 0 | 37 | No | Fiber optic | No | No | No | No | Yes | Yes | One year | Credit card (automatic) | 91.2 | 3247.55 | |||||
32 | 0048-PIHNL | Female | 0 | 49 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Bank transfer (automatic) | 20.45 | 900.9 | |||||
33 | 0052-DCKON | Male | 0 | 66 | Yes | Fiber optic | Yes | Yes | Yes | Yes | Yes | Yes | One year | Bank transfer (automatic) | 115.8 | 7942.15 | |||||
34 | 0052-YNYOT | Female | 0 | 67 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Electronic check | 20.55 | 1343.4 | |||||
35 | 0056-EPFBG | Male | 0 | 20 | No phone service | DSL | Yes | No | Yes | Yes | No | No | Two year | Credit card (automatic) | 39.4 | 825.4 | |||||
36 | 0057-QBUQH | Female | 0 | 43 | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Electronic check | 25.1 | 1070.15 | |||||
37 | 0058-EVZWM | Female | 0 | 55 | Yes | Fiber optic | Yes | No | No | No | Yes | No | Month-to-month | Bank transfer (automatic) | 89.8 | 4959.6 | |||||
38 | 0060-FUALY | Female | 0 | 59 | Yes | Fiber optic | Yes | Yes | No | No | Yes | No | Month-to-month | Electronic check | 94.75 | 5597.65 | |||||
39 | 0064-SUDOG | Female | 0 | 12 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Bank transfer (automatic) | 20.3 | 224.5 | |||||
40 | 0064-YIJGF | Male | 0 | 27 | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Bank transfer (automatic) | 75.75 | 1929.0 | |||||
41 | 0067-DKWBL | Male | 1 | 2 | No | DSL | Yes | No | No | No | No | No | Month-to-month | Electronic check | 49.25 | 91.1 | |||||
42 | 0068-FIGTF | Female | 0 | 27 | No | DSL | No | Yes | Yes | Yes | Yes | Yes | One year | Mailed check | 78.2 | 2078.95 | |||||
43 | 0071-NDAFP | Male | 0 | 25 | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Credit card (automatic) | 25.5 | 630.6 | |||||
44 | 0074-HDKDG | Male | 0 | 25 | No | DSL | Yes | Yes | Yes | No | No | No | One year | Bank transfer (automatic) | 61.6 | 1611.0 | |||||
45 | 0076-LVEPS | Male | 0 | 29 | No phone service | DSL | Yes | Yes | Yes | Yes | No | No | Month-to-month | Mailed check | 45.0 | 1242.45 | |||||
46 | 0078-XZMHT | Male | 0 | 72 | Yes | DSL | No | Yes | Yes | Yes | Yes | Yes | Two year | Bank transfer (automatic) | 85.15 | 6316.2 | |||||
47 | 0080-EMYVY | Female | 0 | 14 | No | DSL | No | Yes | No | No | No | No | One year | Credit card (automatic) | 51.45 | 727.85 | |||||
48 | 0080-OROZO | Female | 0 | 35 | No | Fiber optic | No | No | Yes | Yes | Yes | Yes | One year | Electronic check | 99.25 | 3532.0 | |||||
49 | 0082-LDZUE | Male | 0 | 1 | No | DSL | No | No | No | No | No | No | Month-to-month | Mailed check | 44.3 | 44.3 | |||||
50 | 0082-OQIQY | Male | 0 | 29 | No | Fiber optic | No | No | No | Yes | Yes | Yes | Month-to-month | Electronic check | 94.2 | 2607.6 | |||||
51 | 0083-PIVIK | Male | 0 | 64 | Yes | DSL | Yes | Yes | Yes | Yes | Yes | No | One year | Electronic check | 81.25 | 5567.55 | |||||
52 | 0089-IIQKO | Female | 0 | 39 | Yes | Fiber optic | No | No | No | Yes | Yes | Yes | Month-to-month | Credit card (automatic) | 99.95 | 3767.4 | |||||
53 | 0093-EXYQL | Female | 1 | 40 | No | Fiber optic | No | No | No | No | Yes | Yes | Month-to-month | Electronic check | 91.55 | 3673.6 | |||||
54 | 0093-XWZFY | Male | 0 | 40 | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | Month-to-month | Credit card (automatic) | 104.5 | 4036.85 | |||||
55 | 0094-OIFMO | Female | 1 | 11 | No | Fiber optic | No | Yes | No | No | Yes | Yes | Month-to-month | Electronic check | 95.0 | 1120.3 | |||||
56 | 0096-BXERS | Female | 0 | 6 | Yes | DSL | No | No | No | No | No | No | Month-to-month | Electronic check | 50.35 | 314.55 | |||||
57 | 0096-FCPUF | Male | 0 | 30 | Yes | DSL | Yes | No | No | No | No | Yes | Month-to-month | Mailed check | 64.5 | 1888.45 | |||||
58 | 0098-BOWSO | Male | 0 | 27 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Electronic check | 19.4 | 529.8 | |||||
59 | 0100-DUVFC | Male | 1 | 70 | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Electronic check | 104.8 | 7308.95 | |||||
60 | 0103-CSITQ | Female | 0 | 57 | Yes | Fiber optic | No | Yes | Yes | Yes | Yes | Yes | One year | Bank transfer (automatic) | 109.4 | 6252.7 | |||||
61 | 0104-PPXDV | Male | 0 | 58 | No | DSL | No | No | Yes | No | No | No | One year | Credit card (automatic) | 50.3 | 2878.55 | |||||
62 | 0106-GHRQR | Male | 0 | 16 | Yes | DSL | No | Yes | Yes | No | No | Yes | Month-to-month | Bank transfer (automatic) | 71.4 | 1212.1 | |||||
63 | 0106-UGRDO | Female | 0 | 69 | Yes | Fiber optic | Yes | Yes | Yes | Yes | Yes | Yes | Two year | Electronic check | 116.0 | 8182.85 | |||||
64 | 0107-WESLM | Male | 0 | 1 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Electronic check | 19.85 | 19.85 | |||||
65 | 0107-YHINA | Male | 0 | 1 | Yes | Fiber optic | No | No | Yes | No | Yes | Yes | Month-to-month | Electronic check | 99.75 | 99.75 | |||||
66 | 0111-KLBQG | Male | 1 | 32 | No | Fiber optic | No | Yes | No | No | Yes | Yes | Month-to-month | Mailed check | 93.95 | 2861.45 | |||||
67 | 0112-QAWRZ | Male | 0 | 16 | Yes | Fiber optic | No | No | Yes | No | Yes | No | Month-to-month | Bank transfer (automatic) | 90.8 | 1442.2 | |||||
68 | 0112-QWPNC | Male | 0 | 49 | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Electronic check | 84.35 | 4059.35 | |||||
69 | 0114-IGABW | Female | 0 | 71 | No phone service | DSL | No | Yes | Yes | Yes | Yes | Yes | Two year | Bank transfer (automatic) | 58.25 | 4145.9 | |||||
70 | 0114-PEGZZ | Female | 0 | 33 | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | Month-to-month | Electronic check | 107.55 | 3645.5 | |||||
71 | 0114-RSRRW | Female | 0 | 10 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Bank transfer (automatic) | 19.95 | 187.75 | |||||
72 | 0115-TFERT | Male | 0 | 21 | Yes | Fiber optic | No | Yes | Yes | Yes | Yes | Yes | Month-to-month | Electronic check | 111.2 | 2317.1 | |||||
73 | 0117-LFRMW | Male | 0 | 37 | No phone service | DSL | Yes | Yes | Yes | No | No | No | Month-to-month | Bank transfer (automatic) | 40.2 | 1448.8 | |||||
74 | 0118-JPNOY | Female | 1 | 26 | No | Fiber optic | Yes | No | No | No | Yes | No | Month-to-month | Credit card (automatic) | 85.8 | 2193.65 | |||||
75 | 0121-SNYRK | Male | 0 | 50 | No phone service | DSL | Yes | No | No | Yes | No | No | One year | Mailed check | 35.4 | 1748.9 | |||||
76 | 0122-OAHPZ | Female | 0 | 7 | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Electronic check | 73.85 | 511.25 | |||||
77 | 0123-CRBRT | Female | 0 | 61 | Yes | DSL | Yes | Yes | Yes | Yes | Yes | Yes | Two year | Mailed check | 88.1 | 5526.75 | |||||
78 | 0125-LZQXK | Male | 0 | 15 | No | Fiber optic | No | No | Yes | Yes | Yes | Yes | Month-to-month | Electronic check | 101.35 | 1553.95 | |||||
79 | 0128-MKWSG | Female | 0 | 26 | No phone service | DSL | Yes | No | No | Yes | No | Yes | Month-to-month | Mailed check | 45.8 | 1147.0 | |||||
80 | 0129-KPTWJ | Male | 0 | 72 | No | Fiber optic | No | No | Yes | No | Yes | Yes | Month-to-month | Electronic check | 94.65 | 6747.35 | |||||
81 | 0129-QMPDR | Male | 0 | 44 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Bank transfer (automatic) | 20.5 | 865.05 | |||||
82 | 0130-SXOUN | Male | 0 | 66 | Yes | Fiber optic | No | Yes | No | No | No | Yes | Month-to-month | Credit card (automatic) | 89.4 | 5976.9 | |||||
83 | 0133-BMFZO | Female | 0 | 2 | No | Fiber optic | Yes | Yes | No | Yes | No | No | Month-to-month | Electronic check | 86.25 | 181.65 | |||||
84 | 0134-XWXCE | Female | 1 | 44 | No | DSL | No | No | Yes | Yes | Yes | Yes | One year | Bank transfer (automatic) | 74.85 | 3268.05 | |||||
85 | 0135-NMXAP | Female | 0 | 12 | Yes | Fiber optic | No | Yes | No | No | Yes | No | Month-to-month | Bank transfer (automatic) | 89.75 | 1052.4 | |||||
86 | 0136-IFMYD | Male | 1 | 69 | Yes | Fiber optic | No | Yes | Yes | Yes | Yes | Yes | Two year | Electronic check | 109.95 | 7634.25 | |||||
87 | 0137-OCGAB | Female | 0 | 1 | Yes | Fiber optic | No | Yes | No | No | No | No | Month-to-month | Mailed check | 80.2 | 80.2 | |||||
88 | 0137-UDEUO | Female | 0 | 3 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Mailed check | 19.85 | 63.75 | |||||
89 | 0139-IVFJG | Female | 0 | 2 | No | Fiber optic | Yes | Yes | No | No | Yes | No | Month-to-month | Electronic check | 90.35 | 190.5 | |||||
90 | 0141-YEAYS | Female | 1 | 27 | Yes | Fiber optic | No | Yes | Yes | No | No | No | Month-to-month | Bank transfer (automatic) | 86.45 | 2401.05 | |||||
91 | 0142-GVYSN | Male | 0 | 26 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Month-to-month | Electronic check | 20.3 | 511.25 | |||||
92 | 0147-ESWWR | Female | 1 | 39 | Yes | Fiber optic | No | No | No | Yes | Yes | Yes | Month-to-month | Electronic check | 101.25 | 3949.15 | |||||
93 | 0148-DCDOS | Male | 0 | 25 | Yes | Fiber optic | No | No | No | No | Yes | Yes | Month-to-month | Bank transfer (automatic) | 94.7 | 2362.1 | |||||
94 | 0151-ONTOV | Female | 0 | 1 | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Mailed check | 70.9 | 70.9 | |||||
95 | 0156-FVPTA | Male | 0 | 22 | No | DSL | Yes | No | No | Yes | No | No | Month-to-month | Electronic check | 54.2 | 1152.7 | |||||
96 | 0164-APGRB | Female | 0 | 72 | Yes | Fiber optic | Yes | Yes | Yes | Yes | Yes | Yes | Two year | Electronic check | 114.9 | 8496.7 | |||||
97 | 0164-XAIRP | Female | 0 | 24 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Bank transfer (automatic) | 19.55 | 470.2 | |||||
98 | 0168-XZKBB | Female | 0 | 19 | Yes | Fiber optic | No | No | No | No | Yes | No | Month-to-month | Electronic check | 86.85 | 1564.4 | |||||
99 | 0174-QRVVY | Male | 0 | 71 | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Credit card (automatic) | 25.35 | 1847.55 | |||||
100 | 0177-PXBAT | Male | 1 | 33 | Yes | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Month-to-month | Bank transfer (automatic) | 109.9 | 3694.7 |
The Pearson correlation coefficient is a very common correlation function. In this case, it helped us to find linear links between the variables. Having a strong Pearson relationship means that the two input variables are linearly correlated.
[2]:
vdf.corr(method = "pearson")
[2]:
We can see that ‘tenure’ is well-correlated to the ‘TotalCharges’, which makes sense.
[3]:
vdf.scatter(["tenure", "TotalCharges"])
[3]:
[4]:
vdf.corr(["tenure", "TotalCharges"], method = "pearson")
[4]:
0.825880460933202
Note, however, that having a low Pearson relationship imply that the variables aren’t correlated. For example, let’s compute the Pearson correlation coefficient between ‘tenure’ and ‘TotalCharges’ to the power of 20.
[5]:
vdf["TotalCharges^20"] = vdf["TotalCharges"] ** 20
vdf.scatter(["tenure", "TotalCharges^20"])
[5]:
[6]:
vdf.corr(["tenure", "TotalCharges^20"], method = "pearson")
[6]:
0.224994408804537
We know that the ‘tenure’ and ‘TotalCharges’ are strongly linearly correlated. However we can notice that the correlation between the ‘tenure’ and ‘TotalCharges’ to the power of 20 is not very high. Indeed, the Pearson correlation coefficient is not robust for monotonic relationships, but rank-based correlations are. Knowing this, we’ll calculate the Spearman’s rank correlation coefficient instead.
[7]:
vdf.corr(method = "spearman", show = False)
[7]:
"SeniorCitizen" | "Partner" | "Dependents" | "tenure" | "PhoneService" | "PaperlessBilling" | "MonthlyCharges" | "TotalCharges" | "Churn" | "TotalCharges^20" | |
"SeniorCitizen" | 1.0 | 0.0164786575974139 | -0.211185088493958 | 0.0190767898701152 | 0.00857640107927944 | 0.156529559311173 | 0.221092529102162 | 0.105795342303725 | 0.150889328176473 | 0.105795342303725 |
"Partner" | 0.0164786575974139 | 1.0 | 0.452676282929464 | 0.384665710284119 | 0.017705663223972 | -0.014876622287891 | 0.108410945895981 | 0.343930553215626 | -0.150447544959177 | 0.343930553215626 |
"Dependents" | -0.211185088493958 | 0.452676282929464 | 1.0 | 0.164485741353804 | -0.00176167854468371 | -0.111377229193644 | -0.107082725586711 | 0.0866797760484616 | -0.164221401579725 | 0.0866797760484616 |
"tenure" | 0.0190767898701152 | 0.384665710284119 | 0.164485741353804 | 1.0 | 0.00815081986907184 | 0.00792876239476321 | 0.276342245223708 | 0.883103368818293 | -0.369620778763435 | 0.883103368818293 |
"PhoneService" | 0.00857640107927944 | 0.017705663223972 | -0.00176167854468371 | 0.00815081986907184 | 1.0 | 0.0165048057325697 | 0.238826410230016 | 0.0838048547856037 | 0.0119419800290031 | 0.0838048547856037 |
"PaperlessBilling" | 0.156529559311173 | -0.014876622287891 | -0.111377229193644 | 0.00792876239476321 | 0.0165048057325697 | 1.0 | 0.346158879381323 | 0.151669712799097 | 0.191825331666468 | 0.151669712799097 |
"MonthlyCharges" | 0.221092529102162 | 0.108410945895981 | -0.107082725586711 | 0.276342245223708 | 0.238826410230016 | 0.346158879381323 | 1.0 | 0.633958405301206 | 0.184839285783758 | 0.633958405301206 |
"TotalCharges" | 0.105795342303725 | 0.343930553215626 | 0.0866797760484616 | 0.883103368818293 | 0.0838048547856037 | 0.151669712799097 | 0.633958405301206 | 1.0 | -0.233211018585104 | 1.0 |
"Churn" | 0.150889328176473 | -0.150447544959177 | -0.164221401579725 | -0.369620778763435 | 0.0119419800290031 | 0.191825331666468 | 0.184839285783758 | -0.233211018585104 | 1.0 | -0.233211018585104 |
"TotalCharges^20" | 0.105795342303725 | 0.343930553215626 | 0.0866797760484616 | 0.883103368818293 | 0.0838048547856037 | 0.151669712799097 | 0.633958405301206 | 1.0 | -0.233211018585104 | 1.0 |
[8]:
vdf.corr(method = "spearman")
[8]:
The Spearman’s rank correlation coefficient determines the monotonic relationships between the variables.
[9]:
vdf.corr(["tenure", "TotalCharges^20"], method = "spearman")
[9]:
0.883103368818293
We can notice that Spearman’s rank correlation coefficient stays the same if one of the variables can be expressed using a monotonic function on the other. The same applies to Kendall rank correlation coefficient.
[10]:
vdf.corr(method = "kendall")
[10]:
Notice that the Kendall rank correlation coefficient will also detect the monotonic relationship.
[11]:
vdf.corr(["tenure", "TotalCharges^20"], method = "kendall")
[11]:
0.731699318287362
However, the Kendall rank correlation coefficient is very computationally expensive, so we’ll generally use Pearson and Spearman when dealing with correlations between numerical variables.
Binary features are considered numerical, but this isn’t technically accurate. Since binary variables can only take two values, calculating correlations between a binary and numerical variable can lead to misleading results. To account for this, we’ll want to use the ‘Biserial Point’ method to calculate the Point-Biserial correlation coefficient. This powerful method will help us understand the link between a binary variable and a numerical variable.
[12]:
vdf.corr(method = "biserial")
[12]:
Lastly, we’ll look at the relationship between categorical columns. In this case, the ‘Cramer’s V’ method is very efficient. Since there is no position in the Euclidean space for those variables, the ‘Cramer’s V’ coefficients cannot be negative (which is a sign of an opposite relationship) and they will range in the interval [0,1].
[13]:
vdf.corr(method = "cramer")
[13]:
Sometimes, we just need to look at the correlation between a response and other variables. The parameter ‘focus’ will isolate and show us the specified correlation vector.
[14]:
vdf.corr(method = "cramer", focus = "Churn")
[14]:
Sometimes a correlation coefficient can lead to incorrect assumptions, so we should always look at the coefficient p-value.
[15]:
vdf.corr_pvalue("Churn", "customerID", method = "cramer",)
[15]:
(0.7810906445878953, 1.3659871749110484e-36)
We can see that churning correlates to the type of contract (monthly, yearly, etc.) which makes sense: you would expect that different types of contracts differ in flexibility for the customer, and particularly restrictive contracts may make churning more likely.
The type of internet service also seems to correlate with churning. Let’s split the different categories to binaries to understand which services can influence the global churning rate.
[16]:
vdf["InternetService"].one_hot_encode()
vdf.corr(method = "spearman",
focus = "Churn",
columns = ["InternetService_DSL",
"InternetService_Fiber_optic"])
[16]:
We can see that the Fiber Optic option in particular seems to be directly linked to a customer’s likelihood to churn. Let’s compute some aggregations to find a causal relationship.
[17]:
vdf["contract"].one_hot_encode()
vdf.groupby(["InternetService_Fiber_optic"],
["AVG(tenure) AS tenure",
"AVG(totalcharges) AS totalcharges",
'AVG("contract_month-to-month") AS "contract_month-to-month"',
'AVG("monthlycharges") AS "monthlycharges"'])
[17]:
123 InternetService_Fiber_opticInteger | 123 tenureFloat(22) | 123 totalchargesFloat(22) | 123 contract_month-to-monthFloat(22) | 123 monthlychargesFloat(22) | |
1 | 0 | 31.9422346085635 | 1558.06548526423 | 0.442614644033443 | 43.7882442361287 |
2 | 1 | 32.9179586563307 | 3205.30457041344 | 0.68733850129199 | 91.5001291989664 |
It seems that users with the Fiber Optic option tend more to churn not because of the option itself, but probably because of the type of contracts and the monthly charges the users are paying to get it. Be careful when dealing with identifying correlations! Remember: correlation doesn’t imply causation!
Another important type of correlation is the autocorrelation. Let’s use the Amazon dataset to understand it.
[18]:
from verticapy.datasets import load_amazon
vdf = load_amazon()
display(vdf)
📅 dateDate | Abc stateVarchar(32) | 123 numberInteger | |
1 | 1998-01-01 | ACRE | 0 |
2 | 1998-01-01 | ALAGOAS | 0 |
3 | 1998-01-01 | AMAPÁ | 0 |
4 | 1998-01-01 | AMAZONAS | 0 |
5 | 1998-01-01 | BAHIA | 0 |
6 | 1998-01-01 | CEARÁ | 0 |
7 | 1998-01-01 | DISTRITO FEDERAL | 0 |
8 | 1998-01-01 | ESPÍRITO SANTO | 0 |
9 | 1998-01-01 | GOIÁS | 0 |
10 | 1998-01-01 | MARANHÃO | 0 |
11 | 1998-01-01 | MATO GROSSO | 0 |
12 | 1998-01-01 | MATO GROSSO DO SUL | 0 |
13 | 1998-01-01 | MINAS GERAIS | 0 |
14 | 1998-01-01 | PARANÁ | 0 |
15 | 1998-01-01 | PARAÍBA | 0 |
16 | 1998-01-01 | PARÁ | 0 |
17 | 1998-01-01 | PERNAMBUCO | 0 |
18 | 1998-01-01 | PIAUÍ | 0 |
19 | 1998-01-01 | RIO DE JANEIRO | 0 |
20 | 1998-01-01 | RIO GRANDE DO NORTE | 0 |
21 | 1998-01-01 | RIO GRANDE DO SUL | 0 |
22 | 1998-01-01 | RONDÔNIA | 0 |
23 | 1998-01-01 | RORAIMA | 0 |
24 | 1998-01-01 | SANTA CATARINA | 0 |
25 | 1998-01-01 | SERGIPE | 0 |
26 | 1998-01-01 | SÃO PAULO | 0 |
27 | 1998-01-01 | TOCANTINS | 0 |
28 | 1998-02-01 | ACRE | 0 |
29 | 1998-02-01 | ALAGOAS | 0 |
30 | 1998-02-01 | AMAPÁ | 0 |
31 | 1998-02-01 | AMAZONAS | 0 |
32 | 1998-02-01 | BAHIA | 0 |
33 | 1998-02-01 | CEARÁ | 0 |
34 | 1998-02-01 | DISTRITO FEDERAL | 0 |
35 | 1998-02-01 | ESPÍRITO SANTO | 0 |
36 | 1998-02-01 | GOIÁS | 0 |
37 | 1998-02-01 | MARANHÃO | 0 |
38 | 1998-02-01 | MATO GROSSO | 0 |
39 | 1998-02-01 | MATO GROSSO DO SUL | 0 |
40 | 1998-02-01 | MINAS GERAIS | 0 |
41 | 1998-02-01 | PARANÁ | 0 |
42 | 1998-02-01 | PARAÍBA | 0 |
43 | 1998-02-01 | PARÁ | 0 |
44 | 1998-02-01 | PERNAMBUCO | 0 |
45 | 1998-02-01 | PIAUÍ | 0 |
46 | 1998-02-01 | RIO DE JANEIRO | 0 |
47 | 1998-02-01 | RIO GRANDE DO NORTE | 0 |
48 | 1998-02-01 | RIO GRANDE DO SUL | 0 |
49 | 1998-02-01 | RONDÔNIA | 0 |
50 | 1998-02-01 | RORAIMA | 0 |
51 | 1998-02-01 | SANTA CATARINA | 0 |
52 | 1998-02-01 | SERGIPE | 0 |
53 | 1998-02-01 | SÃO PAULO | 0 |
54 | 1998-02-01 | TOCANTINS | 0 |
55 | 1998-03-01 | ACRE | 0 |
56 | 1998-03-01 | ALAGOAS | 0 |
57 | 1998-03-01 | AMAPÁ | 0 |
58 | 1998-03-01 | AMAZONAS | 0 |
59 | 1998-03-01 | BAHIA | 0 |
60 | 1998-03-01 | CEARÁ | 0 |
61 | 1998-03-01 | DISTRITO FEDERAL | 0 |
62 | 1998-03-01 | ESPÍRITO SANTO | 0 |
63 | 1998-03-01 | GOIÁS | 0 |
64 | 1998-03-01 | MARANHÃO | 0 |
65 | 1998-03-01 | MATO GROSSO | 0 |
66 | 1998-03-01 | MATO GROSSO DO SUL | 0 |
67 | 1998-03-01 | MINAS GERAIS | 0 |
68 | 1998-03-01 | PARANÁ | 0 |
69 | 1998-03-01 | PARAÍBA | 0 |
70 | 1998-03-01 | PARÁ | 0 |
71 | 1998-03-01 | PERNAMBUCO | 0 |
72 | 1998-03-01 | PIAUÍ | 0 |
73 | 1998-03-01 | RIO DE JANEIRO | 0 |
74 | 1998-03-01 | RIO GRANDE DO NORTE | 0 |
75 | 1998-03-01 | RIO GRANDE DO SUL | 0 |
76 | 1998-03-01 | RONDÔNIA | 0 |
77 | 1998-03-01 | RORAIMA | 0 |
78 | 1998-03-01 | SANTA CATARINA | 0 |
79 | 1998-03-01 | SERGIPE | 0 |
80 | 1998-03-01 | SÃO PAULO | 0 |
81 | 1998-03-01 | TOCANTINS | 0 |
82 | 1998-04-01 | ACRE | 0 |
83 | 1998-04-01 | ALAGOAS | 0 |
84 | 1998-04-01 | AMAPÁ | 0 |
85 | 1998-04-01 | AMAZONAS | 0 |
86 | 1998-04-01 | BAHIA | 0 |
87 | 1998-04-01 | CEARÁ | 0 |
88 | 1998-04-01 | DISTRITO FEDERAL | 0 |
89 | 1998-04-01 | ESPÍRITO SANTO | 0 |
90 | 1998-04-01 | GOIÁS | 0 |
91 | 1998-04-01 | MARANHÃO | 0 |
92 | 1998-04-01 | MATO GROSSO | 0 |
93 | 1998-04-01 | MATO GROSSO DO SUL | 0 |
94 | 1998-04-01 | MINAS GERAIS | 0 |
95 | 1998-04-01 | PARANÁ | 0 |
96 | 1998-04-01 | PARAÍBA | 0 |
97 | 1998-04-01 | PARÁ | 0 |
98 | 1998-04-01 | PERNAMBUCO | 0 |
99 | 1998-04-01 | PIAUÍ | 0 |
100 | 1998-04-01 | RIO DE JANEIRO | 0 |
Our goal is to predict the number of forest fires in Brazil. To do this, we can draw an autocorrelation plot and a partial autocorrelation plot.
[19]:
vdf.acf(column = "number",
ts = "date",
by = ["state"],
p = 48,
method = "pearson")
[19]:
[20]:
vdf.pacf(column = "number",
ts = "date",
by = ["state"],
p = 48)
[20]:
We can see the seasonality forest fires.
It’s mathematically impossible to build the perfect correlation function, but we still have several powerful functions at our disposal for finding relationships in all kinds of datasets.