Correlation and Dependency

Finding links between variables is one of the central tasks of data science: we want to uncover relationships between variables and understand how they can help us make better decisions.

Machine learning models are also sensitive to the number of variables and how they relate and affect each other, so finding correlations and dependencies can help us make better use of our machine learning algorithms.

Let’s use the Telco Churn dataset to understand how we can find links between different variables in VerticaPy.

[21]:
import verticapy as vp

vdf = vp.read_csv("data/churn.csv")
display(vdf)
Column              Type
customerID          Varchar(20)
gender              Varchar(20)
SeniorCitizen       Integer
Partner             Boolean
Dependents          Boolean
tenure              Integer
PhoneService        Boolean
MultipleLines       Varchar(100)
InternetService     Varchar(22)
OnlineSecurity      Varchar(38)
OnlineBackup        Varchar(38)
DeviceProtection    Varchar(38)
TechSupport         Varchar(38)
StreamingTV         Varchar(38)
StreamingMovies     Varchar(38)
Contract            Varchar(28)
PaperlessBilling    Boolean
PaymentMethod       Varchar(50)
MonthlyCharges      Numeric(10)
TotalCharges        Numeric(11)
Churn               Boolean
Rows: 1-100 | Columns: 21

The Pearson correlation coefficient is a very common correlation function. It measures linear relationships between variables: a coefficient close to 1 or -1 means that the two input variables are strongly linearly correlated.

[2]:
vdf.corr(method = "pearson")
[2]:

We can see that ‘tenure’ is strongly correlated with ‘TotalCharges’, which makes sense: the longer customers stay, the more they have been charged in total.

[3]:
vdf.scatter(["tenure", "TotalCharges"])
[3]:
[4]:
vdf.corr(["tenure", "TotalCharges"], method = "pearson")
[4]:
0.825880460933202
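
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The following is a minimal pure-Python sketch of that definition, independent of VerticaPy; the function name `pearson` and the toy data are illustrative assumptions, not part of the library or the churn dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and centered squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): y is an exact linear function of x.
x = [1, 2, 3, 4, 5]
y = [2 * v + 1 for v in x]
r_linear = pearson(x, y)

# The same x against a steep power of itself: still increasing, no longer linear.
r_power = pearson(x, [v ** 20 for v in x])

print(r_linear, r_power)
```

On this toy data the coefficient is 1 for the linear case and noticeably lower once the relationship is bent by a steep power, even though one variable still fully determines the other.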

Note, however, that a low Pearson coefficient doesn’t imply that the variables aren’t related. For example, let’s compute the Pearson correlation coefficient between ‘tenure’ and ‘TotalCharges’ raised to the power of 20.

[5]:
vdf["TotalCharges^20"] = vdf["TotalCharges"] ** 20
vdf.scatter(["tenure", "TotalCharges^20"])
[5]:
[6]:
vdf.corr(["tenure", "TotalCharges^20"], method = "pearson")
[6]:
0.224994408804537

We know that ‘tenure’ and ‘TotalCharges’ are strongly linearly correlated. However, the correlation between ‘tenure’ and ‘TotalCharges’ raised to the power of 20 is much lower (about 0.22 versus 0.83). The Pearson correlation coefficient only captures linear relationships, so it is not robust to monotonic but nonlinear transformations; rank-based correlations are. Knowing this, we’ll calculate Spearman’s rank correlation coefficient instead.

[7]:
vdf.corr(method = "spearman", show = False)
[7]:
                    "SeniorCitizen"       "Partner"             "Dependents"          "tenure"              "PhoneService"        "PaperlessBilling"    "MonthlyCharges"      "TotalCharges"        "Churn"               "TotalCharges^20"
"SeniorCitizen"     1.0                   0.0164786575974139    -0.211185088493958    0.0190767898701152    0.00857640107927944   0.156529559311173     0.221092529102162     0.105795342303725     0.150889328176473     0.105795342303725
"Partner"           0.0164786575974139    1.0                   0.452676282929464     0.384665710284119     0.017705663223972     -0.014876622287891    0.108410945895981     0.343930553215626     -0.150447544959177    0.343930553215626
"Dependents"        -0.211185088493958    0.452676282929464     1.0                   0.164485741353804     -0.00176167854468371  -0.111377229193644    -0.107082725586711    0.0866797760484616    -0.164221401579725    0.0866797760484616
"tenure"            0.0190767898701152    0.384665710284119     0.164485741353804     1.0                   0.00815081986907184   0.00792876239476321   0.276342245223708     0.883103368818293     -0.369620778763435    0.883103368818293
"PhoneService"      0.00857640107927944   0.017705663223972     -0.00176167854468371  0.00815081986907184   1.0                   0.0165048057325697    0.238826410230016     0.0838048547856037    0.0119419800290031    0.0838048547856037
"PaperlessBilling"  0.156529559311173     -0.014876622287891    -0.111377229193644    0.00792876239476321   0.0165048057325697    1.0                   0.346158879381323     0.151669712799097     0.191825331666468     0.151669712799097
"MonthlyCharges"    0.221092529102162     0.108410945895981     -0.107082725586711    0.276342245223708     0.238826410230016     0.346158879381323     1.0                   0.633958405301206     0.184839285783758     0.633958405301206
"TotalCharges"      0.105795342303725     0.343930553215626     0.0866797760484616    0.883103368818293     0.0838048547856037    0.151669712799097     0.633958405301206     1.0                   -0.233211018585104    1.0
"Churn"             0.150889328176473     -0.150447544959177    -0.164221401579725    -0.369620778763435    0.0119419800290031    0.191825331666468     0.184839285783758     -0.233211018585104    1.0                   -0.233211018585104
"TotalCharges^20"   0.105795342303725     0.343930553215626     0.0866797760484616    0.883103368818293     0.0838048547856037    0.151669712799097     0.633958405301206     1.0                   -0.233211018585104    1.0
Rows: 1-10 | Columns: 11
[8]:
vdf.corr(method = "spearman")
[8]:

Spearman’s rank correlation coefficient measures monotonic relationships between the variables.

[9]:
vdf.corr(["tenure", "TotalCharges^20"], method = "spearman")
[9]:
0.883103368818293

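Notice that Spearman’s rank correlation coefficient is unchanged when one of the variables is replaced by a monotonically increasing function of itself: the value for ‘tenure’ and ‘TotalCharges^20’ is identical to the value for ‘tenure’ and ‘TotalCharges’ in the matrix above. The same applies to the Kendall rank correlation coefficient.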
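
The invariance follows from how the coefficient is defined: Spearman’s coefficient is the Pearson coefficient computed on ranks, and an increasing transformation leaves the ranks untouched. Here is a small pure-Python sketch of that idea, assuming average ranks for ties; the helper names (`pearson`, `average_ranks`, `spearman`) and the toy data are illustrative and not VerticaPy functions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (hypothetical); y_pow is an increasing transformation of y.
x = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2.0, 0.5, 3.5, 1.0, 4.0, 8.0, 1.5, 7.0]
y_pow = [v ** 20 for v in y]

r, r_pow = pearson(x, y), pearson(x, y_pow)
rho, rho_pow = spearman(x, y), spearman(x, y_pow)
print(r, r_pow)      # Pearson drops after the transformation
print(rho, rho_pow)  # Spearman is identical before and after
```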

[10]:
vdf.corr(method = "kendall")
[10]:

Notice that the Kendall rank correlation coefficient will also detect the monotonic relationship.

[11]:
vdf.corr(["tenure", "TotalCharges^20"], method = "kendall")
[11]:
0.731699318287362

However, the Kendall rank correlation coefficient is very computationally expensive, so we’ll generally use Pearson and Spearman when dealing with correlations between numerical variables.
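
To give a sense of where the cost comes from, here is a naive sketch of Kendall’s tau (the tau-a variant, which ignores ties). It compares every pair of observations, so the work grows as n(n-1)/2 for n rows. This is only an illustration of the definition; which variant and which algorithm VerticaPy uses internally is not stated here, and the function name `kendall_tau_a` is hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Naive Kendall tau-a: compares every pair of observations, so O(n^2)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # both variables move in the same direction
        elif s < 0:
            discordant += 1   # the variables move in opposite directions
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy data (hypothetical): two of the ten pairs are out of order.
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
tau = kendall_tau_a(x, y)
n_pairs = len(x) * (len(x) - 1) // 2
print(tau, n_pairs)
```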

Binary features are often treated as numerical, but this isn’t technically accurate. Since a binary variable can only take two values, computing an ordinary correlation between a binary and a numerical variable can give misleading results. To account for this, we’ll use the ‘biserial’ method, which calculates the Point-Biserial correlation coefficient and helps us understand the link between a binary variable and a numerical variable.

[12]:
vdf.corr(method = "biserial")
[12]:
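
As a rough illustration of what the Point-Biserial coefficient measures, the sketch below uses the textbook formula: the difference between the two group means, scaled by the overall standard deviation and by the group proportions. The function name `point_biserial` and the toy values are assumptions made for this example and are not taken from VerticaPy or from the churn dataset.

```python
import math

def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a numeric one."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n, n1, n0 = len(values), len(group1), len(group0)
    mean_all = sum(values) / n
    # Population standard deviation of the numeric variable.
    std = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / std * math.sqrt(n1 * n0 / n ** 2)

# Toy data (hypothetical): churned customers have short tenures.
churned = [1, 1, 1, 0, 0, 0, 0, 0]
tenure = [2, 5, 1, 40, 33, 60, 24, 12]
r_pb = point_biserial(churned, tenure)
print(r_pb)  # negative: the churned group has the lower mean tenure
```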

Lastly, we’ll look at the relationship between categorical columns. In this case, the ‘Cramer’s V’ method is very effective. Since categorical values have no position in Euclidean space, ‘Cramer’s V’ coefficients cannot be negative (a negative sign would indicate an opposite relationship); they always lie in the interval [0,1].

[13]:
vdf.corr(method = "cramer")
[13]:
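
To make the [0,1] range concrete, here is a minimal sketch of Cramer’s V built from the chi-squared statistic of the contingency table, without any bias correction. It assumes each variable has at least two distinct categories; the function name `cramers_v` and the toy data are illustrative and not part of VerticaPy.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramer's V from the chi-squared statistic of the contingency table."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    chi2 = 0.0
    for a in count_x:
        for b in count_y:
            # Count expected in this cell if the two variables were independent.
            expected = count_x[a] * count_y[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_x), len(count_y))
    return math.sqrt(chi2 / (n * (k - 1)))

# Toy data (hypothetical): one pair fully dependent, one pair independent.
v_dependent = cramers_v(["a", "a", "b", "b"], ["u", "u", "v", "v"])
v_independent = cramers_v(["a", "a", "b", "b"], ["u", "v", "u", "v"])
print(v_dependent, v_independent)
```

The first pair of columns determines each other completely, and the second pair carries no information about each other, which gives the two ends of the interval.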

Sometimes, we just need to look at the correlation between a response and other variables. The parameter ‘focus’ will isolate and show us the specified correlation vector.

[14]:
vdf.corr(method = "cramer", focus = "Churn")
[14]:

Sometimes a correlation coefficient can lead to incorrect assumptions, so we should always look at the coefficient p-value.

[15]:
vdf.corr_pvalue("Churn", "customerID", method = "cramer")
[15]:
(0.7810906445878953, 1.3659871749110484e-36)

We can see that churning correlates with the type of contract (monthly, yearly, etc.), which makes sense: different types of contracts offer the customer different degrees of flexibility, and particularly restrictive contracts may make churning more likely.

The type of internet service also seems to correlate with churning. Let’s split its categories into binary variables to understand which services influence the overall churn rate.

[16]:
vdf["InternetService"].one_hot_encode()
vdf.corr(method = "spearman",
         focus = "Churn",
         columns = ["InternetService_DSL",
                    "InternetService_Fiber_optic"])
[16]:
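
Conceptually, one-hot encoding turns a categorical column into one 0/1 column per category, which is what allows a numerical method such as Spearman to be applied. The sketch below shows the idea in plain Python. The helper `one_hot` is hypothetical, the naming scheme (prefix, underscore, category with spaces replaced by underscores) is an assumption modeled on the column names used above, and, unlike some library implementations, it keeps every category rather than dropping one.

```python
def one_hot(values, prefix):
    """Return one 0/1 list per distinct category, keyed by a derived column name."""
    columns = {}
    for category in sorted(set(values)):
        # Assumed naming scheme: prefix + "_" + category, spaces become underscores.
        name = f"{prefix}_{category.replace(' ', '_')}"
        columns[name] = [int(v == category) for v in values]
    return columns

# Toy column (hypothetical values).
internet_service = ["DSL", "Fiber optic", "No", "Fiber optic", "DSL"]
encoded = one_hot(internet_service, "InternetService")
for name, column in encoded.items():
    print(name, column)
```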

We can see that the Fiber Optic option in particular seems to be directly linked to a customer’s likelihood to churn. Let’s compute some aggregations to look for a possible explanation.

[17]:
vdf["contract"].one_hot_encode()
vdf.groupby(["InternetService_Fiber_optic"],
            ["AVG(tenure) AS tenure",
             "AVG(totalcharges) AS totalcharges",
             'AVG("contract_month-to-month") AS "contract_month-to-month"',
             'AVG("monthlycharges") AS "monthlycharges"'])
[17]:
   InternetService_Fiber_optic  tenure            totalcharges      contract_month-to-month  monthlycharges
   Integer                      Float(22)         Float(22)         Float(22)                Float(22)
1  0                            31.9422346085635  1558.06548526423  0.442614644033443        43.7882442361287
2  1                            32.9179586563307  3205.30457041344  0.68733850129199         91.5001291989664
Rows: 1-2 | Columns: 5

It seems that users with the Fiber Optic option are more likely to churn not because of the option itself, but probably because of the type of contract they have and the monthly charges they pay to get it. Be careful when interpreting correlations. Remember: correlation doesn’t imply causation!

Another important type of correlation is autocorrelation. Let’s use the Amazon dataset to understand it.

[18]:
from verticapy.datasets import load_amazon
vdf = load_amazon()
display(vdf)
     date        state                number
     Date        Varchar(32)          Integer
1    1998-01-01  ACRE                 0
2    1998-01-01  ALAGOAS              0
3    1998-01-01  AMAPÁ                0
4    1998-01-01  AMAZONAS             0
5    1998-01-01  BAHIA                0
6    1998-01-01  CEARÁ                0
7    1998-01-01  DISTRITO FEDERAL     0
8    1998-01-01  ESPÍRITO SANTO       0
9    1998-01-01  GOIÁS                0
10   1998-01-01  MARANHÃO             0
11   1998-01-01  MATO GROSSO          0
12   1998-01-01  MATO GROSSO DO SUL   0
13   1998-01-01  MINAS GERAIS         0
14   1998-01-01  PARANÁ               0
15   1998-01-01  PARAÍBA              0
16   1998-01-01  PARÁ                 0
17   1998-01-01  PERNAMBUCO           0
18   1998-01-01  PIAUÍ                0
19   1998-01-01  RIO DE JANEIRO       0
20   1998-01-01  RIO GRANDE DO NORTE  0
21   1998-01-01  RIO GRANDE DO SUL    0
22   1998-01-01  RONDÔNIA             0
23   1998-01-01  RORAIMA              0
24   1998-01-01  SANTA CATARINA       0
25   1998-01-01  SERGIPE              0
26   1998-01-01  SÃO PAULO            0
27   1998-01-01  TOCANTINS            0
28   1998-02-01  ACRE                 0
29   1998-02-01  ALAGOAS              0
30   1998-02-01  AMAPÁ                0
31   1998-02-01  AMAZONAS             0
32   1998-02-01  BAHIA                0
33   1998-02-01  CEARÁ                0
34   1998-02-01  DISTRITO FEDERAL     0
35   1998-02-01  ESPÍRITO SANTO       0
36   1998-02-01  GOIÁS                0
37   1998-02-01  MARANHÃO             0
38   1998-02-01  MATO GROSSO          0
39   1998-02-01  MATO GROSSO DO SUL   0
40   1998-02-01  MINAS GERAIS         0
41   1998-02-01  PARANÁ               0
42   1998-02-01  PARAÍBA              0
43   1998-02-01  PARÁ                 0
44   1998-02-01  PERNAMBUCO           0
45   1998-02-01  PIAUÍ                0
46   1998-02-01  RIO DE JANEIRO       0
47   1998-02-01  RIO GRANDE DO NORTE  0
48   1998-02-01  RIO GRANDE DO SUL    0
49   1998-02-01  RONDÔNIA             0
50   1998-02-01  RORAIMA              0
51   1998-02-01  SANTA CATARINA       0
52   1998-02-01  SERGIPE              0
53   1998-02-01  SÃO PAULO            0
54   1998-02-01  TOCANTINS            0
55   1998-03-01  ACRE                 0
56   1998-03-01  ALAGOAS              0
57   1998-03-01  AMAPÁ                0
58   1998-03-01  AMAZONAS             0
59   1998-03-01  BAHIA                0
60   1998-03-01  CEARÁ                0
61   1998-03-01  DISTRITO FEDERAL     0
62   1998-03-01  ESPÍRITO SANTO       0
63   1998-03-01  GOIÁS                0
64   1998-03-01  MARANHÃO             0
65   1998-03-01  MATO GROSSO          0
66   1998-03-01  MATO GROSSO DO SUL   0
67   1998-03-01  MINAS GERAIS         0
68   1998-03-01  PARANÁ               0
69   1998-03-01  PARAÍBA              0
70   1998-03-01  PARÁ                 0
71   1998-03-01  PERNAMBUCO           0
72   1998-03-01  PIAUÍ                0
73   1998-03-01  RIO DE JANEIRO       0
74   1998-03-01  RIO GRANDE DO NORTE  0
75   1998-03-01  RIO GRANDE DO SUL    0
76   1998-03-01  RONDÔNIA             0
77   1998-03-01  RORAIMA              0
78   1998-03-01  SANTA CATARINA       0
79   1998-03-01  SERGIPE              0
80   1998-03-01  SÃO PAULO            0
81   1998-03-01  TOCANTINS            0
82   1998-04-01  ACRE                 0
83   1998-04-01  ALAGOAS              0
84   1998-04-01  AMAPÁ                0
85   1998-04-01  AMAZONAS             0
86   1998-04-01  BAHIA                0
87   1998-04-01  CEARÁ                0
88   1998-04-01  DISTRITO FEDERAL     0
89   1998-04-01  ESPÍRITO SANTO       0
90   1998-04-01  GOIÁS                0
91   1998-04-01  MARANHÃO             0
92   1998-04-01  MATO GROSSO          0
93   1998-04-01  MATO GROSSO DO SUL   0
94   1998-04-01  MINAS GERAIS         0
95   1998-04-01  PARANÁ               0
96   1998-04-01  PARAÍBA              0
97   1998-04-01  PARÁ                 0
98   1998-04-01  PERNAMBUCO           0
99   1998-04-01  PIAUÍ                0
100  1998-04-01  RIO DE JANEIRO       0
Rows: 1-100 | Columns: 3

Our goal is to predict the number of forest fires in Brazil. To do this, we can draw an autocorrelation plot and a partial autocorrelation plot.

[19]:
vdf.acf(column = "number",
        ts = "date",
        by = ["state"],
        p = 48,
        method = "pearson")
[19]:
[20]:
vdf.pacf(column = "number",
         ts = "date",
         by = ["state"],
         p = 48)
[20]:

We can see the seasonality of forest fires.
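
To illustrate how seasonality shows up in an autocorrelation plot, the sketch below computes the sample autocorrelation of a synthetic monthly series with a 12-month cycle. It is a simplified stand-in: it uses generated data rather than the Amazon dataset, handles a single series rather than one per state, and does not cover partial autocorrelation. The function name `autocorrelation` is illustrative.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    # Covariance between the series and a copy of itself shifted by `lag`.
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Synthetic monthly series with a 12-month cycle (ten years of data).
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
acf_6 = autocorrelation(series, 6)
acf_12 = autocorrelation(series, 12)
print(acf_6, acf_12)
```

The coefficient is strongly positive at a lag of one full cycle and strongly negative at half a cycle, which is the alternating pattern a seasonal series produces in an autocorrelation plot.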

It’s mathematically impossible to build the perfect correlation function, but we still have several powerful functions at our disposal for finding relationships in all kinds of datasets.