Linear Regression
Linear Regression is a fundamental algorithm in machine learning and statistics used for predicting a continuous outcome variable based on one or more predictor variables. It models the relationship between the independent variables and the dependent variable by fitting a linear equation to observed data. Linear Regression is widely employed for tasks such as forecasting, risk assessment, and understanding the underlying relationships within datasets.
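To make "fitting a linear equation to observed data" concrete, here is a minimal sketch of ordinary least squares for a single predictor, using only the Python standard library. The function name and the sample values are illustrative assumptions and are not part of the benchmark; neither Vertica nor Spark is claimed to compute the fit this way.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data only (not taken from the benchmark).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(f"y = {intercept:.3f} + {slope:.3f} * x")
```

With several predictors the same idea applies, but the closed form becomes a matrix equation and large problems are usually solved iteratively.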
Vertica vs Spark
Important
This benchmark compares the performance of Vertica’s Linear Regression algorithm with its counterpart in Apache Spark. The comparison focuses on speed, accuracy, and scalability, and is meant to help practitioners choose the Linear Regression implementation best suited to their use case.
Dataset
For this benchmark, we created an artificial dataset generated from a Linear Regression model with added noise.
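The exact generator is not described here, so the following is only a sketch of how such a dataset could be produced. The feature range, the coefficients, the noise level, and the function name `make_linear_dataset` are all assumptions chosen for illustration.

```python
import random


def make_linear_dataset(n_rows, coefficients, intercept=0.0, noise_std=1.0, seed=0):
    """Generate (features, target) rows from a linear model plus Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the data is reproducible
    rows = []
    for _ in range(n_rows):
        # Assumed feature distribution: uniform on [-1, 1].
        features = [rng.uniform(-1.0, 1.0) for _ in coefficients]
        signal = intercept + sum(c * x for c, x in zip(coefficients, features))
        target = signal + rng.gauss(0.0, noise_std)
        rows.append((features, target))
    return rows


# Hypothetical small example: 1000 rows, 3 predictor columns.
rows = make_linear_dataset(
    n_rows=1000, coefficients=[2.0, -1.0, 0.5], intercept=3.0, noise_std=0.1
)
print(len(rows), len(rows[0][0]))
```

A larger noise level relative to the signal lowers the best achievable R-squared, which is one way datasets of the same shape can end up with very different scores.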
Test Environment
Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
---|---|---|---|---|---|---|---|---|
8.0.1 | On Premise VM | 3 node cluster | 36, 2 threads per core | 755 GB | Enterprise | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.4GHz |
Vertica: max iter = 100, e = 10^-6
Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
---|---|---|---|---|---|---|---|---|
2.02 | N/A | N/A | 36, 2 threads per core | 755 GB | N/A | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.4GHz |
Spark: max iter = 100, e = 10^-6
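Both engines are run with a cap of 100 iterations and a tolerance of 10^-6. How each engine tests for convergence is not stated here, so the sketch below assumes the tolerance applies to the gradient, and it uses Newton's method on a one-predictor least-squares problem. The function names are hypothetical. Because the least-squares objective is quadratic, a single Newton step lands on the minimizer, so the loop stops after one iteration regardless of the iteration cap.

```python
def gradient(b0, b1, xs, ys):
    """Gradient of the halved mean squared error with respect to (b0, b1)."""
    n = len(xs)
    residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
    g0 = sum(residuals) / n
    g1 = sum(r * x for r, x in zip(residuals, xs)) / n
    return g0, g1


def newton_fit(xs, ys, max_iter=100, tol=1e-6):
    """Fit y = b0 + b1 * x with Newton's method; return (b0, b1, iterations)."""
    n = len(xs)
    # The Hessian is constant for least squares: [[1, mean(x)], [mean(x), mean(x^2)]].
    h01 = sum(xs) / n
    h11 = sum(x * x for x in xs) / n
    det = h11 - h01 * h01
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        g0, g1 = gradient(b0, b1, xs, ys)
        # Newton update: subtract H^{-1} g, using the closed-form 2x2 inverse.
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (g1 - h01 * g0) / det
        # Assumed stopping rule: largest gradient component below the tolerance.
        if max(abs(g) for g in gradient(b0, b1, xs, ys)) < tol:
            return b0, b1, iteration
    return b0, b1, max_iter


# Illustrative data lying exactly on y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]
b0, b1, iterations = newton_fit(xs, ys)
print(b0, b1, iterations)
```

Quasi-Newton methods such as BFGS approximate the Hessian rather than computing it, so they typically need a few more iterations to reach the same tolerance.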
Comparison
Data | | | Vertica 8.01 (With BFGS Optimizer) | | | | Vertica 8.01 (With Newton Optimizer) | | | | Spark 2.0.1 (l-bfgs) | | | Spark 2.0.1 (Newton Optimizer) | | | Spark loading data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Columns | Rows | Size | Total Time | Total Time | Number of Iterations | R-squared | Total Time | Total Time | Number of Iterations | R-squared | Training Time | Number of Iterations | R-squared | Training Time | Number of Iterations | R-squared | Loading Time | Number of Partitions |
100 | 1M | 800MB | 4.49 | 5.07 | 4 | 0.86 | 4.81 | 4.56 | 1 | 0.86 | 1.43 | 5 | 0.860057845 | 0.7 | 1 | 0.860057845 | 33.46 | 138 |
100 | 10M | 8GB | 26.39 | 23.63 | 4 | 0.758 | 26.04 | 23.19 | 1 | 0.758 | 96.98 | 4 | 0.758026674 | 2.09 | 1 | 0.758026674 | 136.9 | 288 |
100 | 100M | 80GB | 84.7 | 89.26 | 4 | 0.9044 | 85.93 | 96.38 | 1 | 0.9044 | 216 | 4 | 0.904440664 | 68.47 | 1 | 0.904440664 | 370.93 | 431 |
100 | 1B | 800GB | 1748.51 | 2038.15 | 4 | 0.9999 | 1808.56 | 2167.1 | 1 | 0.9999 | 2568.68 (no cache data) | 4 | 0.99999958 | 1788.75 (no cache data) | 1 | 0.99999958 | 0 | 4143 |
10 | 10M | 800MB | 3.52 | 3.39 | 4 | 0.9999 | 2.82 | 3.1 | 1 | 0.9999 | 5.07 | 4 | 0.999995264 | 2.9 | 1 | 0.999995264 | 21.72 | 144 |
20 | 10M | 1.6GB | 5.19 | 4.98 | 4 | 0.9999 | 4.49 | 4.52 | 1 | 0.9999 | 5.43 | 4 | 0.999998038 | 6.77 | 1 | 0.999998038 | 13.97 | 276 |
500 | 10M | 40GB | 151 | 148.68 | 4 | 0.5311 | 146.97 | 141.21 | 1 | 0.5311 | 40.74 | 5 | 0.5311704 | 34.39 | 1 | 0.5311704 | 204.32 | 288 |
1598 | 10M | ~128GB | 1750 | 1488.67 | 6 | 0.9999 | 1405.46 | 1404.83 | 1 | 0.9999 | 304.12 | 6 | 0.999999973 | 1295.26 | 1 | 0.999999973 | 708.53 | 656 |
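The accuracy figures in the comparison are R-squared scores. The formula used by each engine is not given here; assuming the usual coefficient of determination (one minus the ratio of residual to total sum of squares), it can be sketched as follows. The function name and the sample values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not taken from the benchmark).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
score = r_squared(y_true, y_pred)
print(round(score, 4))
```

Under this definition, two implementations that converge to the same coefficients report the same score up to rounding, whichever optimizer they used.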
Browse through the tabs to see the time comparison: