Random Forest#
Random Forest is an ensemble learning method for classification and regression tasks. It builds a multitude of decision trees during training and outputs the mode of the individual trees' predictions for classification, or their mean for regression. Because it combines a diverse ensemble of trees, Random Forest is more robust and less prone to overfitting than a single decision tree.
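The aggregation step described above can be sketched in a few lines of Python. This is only an illustration of taking the mode or the mean across trees, not the Vertica or Spark implementation; the function names and the sample tree outputs are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Return the most common class label among the trees (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions (the mean)."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for a single input row.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_estimates = [2.0, 3.0, 2.5, 3.5, 4.0]

forest_label = aggregate_classification(class_votes)
forest_value = aggregate_regression(value_estimates)
print(forest_label, forest_value)
```

Training the individual trees (bootstrap sampling, random feature selection) is omitted here; only the final voting or averaging step is shown.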
Vertica vs Spark ML#
Important
This benchmark compares Vertica's Random Forest algorithm with the implementation in Apache Spark. It focuses on speed, accuracy, and scalability, with the goal of showing the strengths and limitations of the two implementations and how well Vertica's Random Forest suits diverse data science applications relative to Spark.
Dataset#
Size: 25 M
No. of Rows | No. of Columns |
---|---|
25 M | 106 |
Data type: Float
Note
To reach a larger dataset size, we duplicated rows.
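The benchmark does not say how the rows were duplicated. As a minimal sketch, assuming the base rows are simply repeated in order until a target row count is reached, the idea looks like this (the helper name `duplicate_rows` and the sample rows are hypothetical):

```python
from itertools import cycle, islice


def duplicate_rows(rows, target_count):
    """Repeat the given rows cyclically until target_count rows exist."""
    return list(islice(cycle(rows), target_count))


# Hypothetical base rows with float columns.
base_rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

enlarged = duplicate_rows(base_rows, 7)
print(len(enlarged))
```

Duplication of this kind increases the row count without adding new information, so it mainly stresses run time and scalability.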
Test Environment#
Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
---|---|---|---|---|---|---|---|---|
11.1.0-0 | On-Premises VM | 1 node | 8 | 20393864 kB | Enterprise | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
11.1.0-0 | On-Premises VM | 4 nodes | 8 | 20393864 kB | Enterprise | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
3.2.1 | On-Premises VM | 1 node | 8 | 20393864 kB | NA | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
3.2.1 | On-Premises VM | 4 nodes | 8 | 20393864 kB | NA | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
Comparison#
 | Training | Prediction - 25 M | Accuracy | AUC |
---|---|---|---|---|
Spark | 1096 | 1581 | 248.4 | 240.6 |
Vertica | 650.27 | 150.09 | 1.24 | 1.11 |
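To make the gap easier to read, the ratio of Spark to Vertica can be computed for each task. The sketch below copies the values from the table above and assumes they are elapsed times in the same unit, so that a ratio above 1 means Vertica finished faster; the variable names are illustrative.

```python
# Values copied from the comparison table above (unit as reported there).
spark = {
    "Training": 1096,
    "Prediction - 25 M": 1581,
    "Accuracy": 248.4,
    "AUC": 240.6,
}
vertica = {
    "Training": 650.27,
    "Prediction - 25 M": 150.09,
    "Accuracy": 1.24,
    "AUC": 1.11,
}

# Ratio of Spark time to Vertica time for each task.
speedup = {task: spark[task] / vertica[task] for task in spark}

for task, ratio in speedup.items():
    print(f"{task}: Spark / Vertica = {ratio:.1f}")
```

The ratios are unitless, so they hold regardless of the unit in which the times were recorded.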
Metrics | Vertica | Spark |
---|---|---|
Accuracy | 0.90 | 0.89 |
AUC | 0.94 | 0.75 |
The tables above give the time and accuracy comparison.
 | Training | Prediction - 25 M | Accuracy | AUC |
---|---|---|---|---|
Spark | 409.5 | 1326.3 | 70.72 | 66.93 |
Vertica | 249.64 | 69.25 | 1.26 | 0.43 |
Metrics | Vertica | Spark |
---|---|---|
Accuracy | 0.90 | 0.89 |
AUC | 0.95 | 0.75 |
The tables above give the time and accuracy comparison.
Vertica vs Madlib#
Important
Vertica Version: 23.3.0-5
Comparison with the Madlib Random Forest model.
Dataset#
No. of Columns |
---|
106 |
Data type: Float
Note
To reach a larger dataset size, we duplicated rows.
Test Environment#
Cluster | OS | OS Version | RAM | Processor frequency | Processor cores |
---|---|---|---|---|---|
3 node cluster | Red Hat Enterprise Linux | 8.5 (Ootpa) | 32727072 kB | 2.4 GHz | 4 |
Comparison#
Important
All Madlib runs failed on a dataset of this size, so the benchmark was abandoned.