Random Forest#

Random Forest is an ensemble learning method for classification and regression. It builds many decision trees during training and outputs the mode of their predictions for classification, or the mean for regression. By combining a diverse set of trees, it is more robust and less prone to overfitting than a single decision tree.
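
The aggregation step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code of either engine benchmarked here; the function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the most common class label among the per-tree votes (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Return the average of the per-tree numeric predictions (the mean)."""
    return mean(tree_outputs)


# Hypothetical predictions from individual trees for a single input row.
votes = ["spam", "ham", "spam", "spam", "ham"]
outputs = [2.0, 3.0, 4.0]

print(aggregate_classification(votes))
print(aggregate_regression(outputs))
```

A single tree that votes the wrong way is outvoted by the rest, which is the intuition behind the ensemble's robustness.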

Vertica vs Spark ML#

Important

Version Details
Vertica: 11.1.0-0
Spark: 3.2.1

This benchmark compares Vertica’s Random Forest algorithm with the Apache Spark ML implementation on three criteria: speed, accuracy, and scalability. The goal is to show where each implementation is strong and where it falls short, and to help judge how well Vertica’s Random Forest suits data science workloads relative to Spark.

Dataset#

Amazon

Size: 25 M

No. of Rows	No. of Columns
25 M	106

Data type: Float

Note

To reach a larger dataset size, we duplicated rows.
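
The page does not say how the rows were duplicated. One plausible approach, shown here only as a sketch, is to cycle through the source rows until a target row count is reached; `inflate_rows` and the sample data are hypothetical.

```python
from itertools import cycle, islice


def inflate_rows(rows, target_count):
    """Repeat the input rows in order until target_count rows are produced."""
    if not rows:
        raise ValueError("need at least one source row")
    return list(islice(cycle(rows), target_count))


# Tiny stand-in for the float-valued source table (hypothetical values).
source = [[1.0, 2.0], [3.0, 4.0]]
inflated = inflate_rows(source, 5)
print(len(inflated))
```

Duplicated rows add volume but no new information, so a dataset built this way mainly stresses runtime and memory.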

Test Environment#

Vertica

Single Node

Version	Instance Type	Cluster	vCPU (per node)	Memory (per node)	Deploy Mode	OS	OS Version	Processor freq. (per node)
11.1.0-0	On-Premises VM	1 node	8	20393864 kB	Enterprise	Red Hat Enterprise Linux	7.6 (Maipo)	2.3 GHz

Multi Node

Version	Instance Type	Cluster	vCPU (per node)	Memory (per node)	Deploy Mode	OS	OS Version	Processor freq. (per node)
11.1.0-0	On-Premises VM	4 nodes	8	20393864 kB	Enterprise	Red Hat Enterprise Linux	7.6 (Maipo)	2.3 GHz

Spark

Single Node

Version	Instance Type	Cluster	vCPU (per node)	Memory (per node)	Deploy Mode	OS	OS Version	Processor freq. (per node)
3.2.1	On-Premises VM	1 node	8	20393864 kB	NA	Red Hat Enterprise Linux	7.6 (Maipo)	2.3 GHz

Multi Node

Version	Instance Type	Cluster	vCPU (per node)	Memory (per node)	Deploy Mode	OS	OS Version	Processor freq. (per node)
3.2.1	On-Premises VM	4 nodes	8	20393864 kB	NA	Red Hat Enterprise Linux	7.6 (Maipo)	2.3 GHz

Comparison#

Single Node

Time in secs#
	Training	Prediction - 25 M	Accuracy computation	AUC computation
Spark	1096	1581	248.4	240.6
Vertica	650.27	150.09	1.24	1.11
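
To make the single-node timings easier to compare, the sketch below divides each Spark time by the matching Vertica time. The numbers are copied from the table above; the dictionary keys and the `speedup` helper are illustrative, and the last two columns are assumed to be the time spent computing each metric.

```python
# Timings in seconds, copied from the single-node table.
# Assumption: "accuracy" and "auc" are the times spent computing those metrics.
spark = {"training": 1096.0, "prediction": 1581.0, "accuracy": 248.4, "auc": 240.6}
vertica = {"training": 650.27, "prediction": 150.09, "accuracy": 1.24, "auc": 1.11}


def speedup(baseline, candidate):
    """Return baseline time divided by candidate time for every step."""
    return {step: baseline[step] / candidate[step] for step in baseline}


ratios = speedup(spark, vertica)
for step, ratio in ratios.items():
    print(f"{step}: Vertica is {ratio:.1f}x faster than Spark")
```

On these figures the gap is modest for training (about 1.7x) and much larger for prediction (about 10.5x).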

Metrics	Vertica	Spark
Accuracy	0.90	0.89
AUC	0.94	0.75
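
The two metrics above measure different things, which may help explain why the accuracy values are close while the AUC values are not. The following is a minimal sketch of both metrics in plain Python, not the way Vertica or Spark computes them; the labels and scores are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n^2), so it is
    only meant for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy depends only on the labels after thresholding, whereas AUC depends on how well the scores rank positives above negatives, so two models can agree on one and differ on the other.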

Multi Node

Time in secs#
	Training	Prediction - 25 M	Accuracy computation	AUC computation
Spark	409.5	1326.3	70.72	66.93
Vertica	249.64	69.25	1.26	0.43
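
Since scalability is one of the stated criteria, the single-node and multi-node timings can be combined into a rough scaling estimate. The sketch below uses the training and prediction times from the two timing tables. The speedup is the single-node time divided by the 4-node time, and parallel efficiency is that speedup divided by the node count. The data layout and helper name are illustrative.

```python
# Training and prediction times in seconds, copied from the two timing tables.
single_node = {
    "Spark": {"training": 1096.0, "prediction": 1581.0},
    "Vertica": {"training": 650.27, "prediction": 150.09},
}
multi_node = {
    "Spark": {"training": 409.5, "prediction": 1326.3},
    "Vertica": {"training": 249.64, "prediction": 69.25},
}
NODES = 4  # size of the multi-node cluster in the test environment


def scaling(single, multi, nodes):
    """Return {(engine, step): (speedup, parallel_efficiency)}."""
    result = {}
    for engine, steps in single.items():
        for step, single_time in steps.items():
            gain = single_time / multi[engine][step]
            result[(engine, step)] = (gain, gain / nodes)
    return result


report = scaling(single_node, multi_node, NODES)
for (engine, step), (gain, efficiency) in report.items():
    print(f"{engine} {step}: {gain:.2f}x speedup, {efficiency:.0%} parallel efficiency")
```

On these figures both engines train roughly 2.6x to 2.7x faster on four nodes, while prediction scales by about 2.2x on Vertica and about 1.2x on Spark.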

Metrics	Vertica	Spark
Accuracy	0.90	0.89
AUC	0.95	0.75

Vertica vs Madlib#

Important

Vertica Version: 23.3.0-5

Comparison with the Madlib Random Forest model.

Dataset#

Amazon

No. of Columns
106

Data type: Float

Note

To reach a larger dataset size, we duplicated rows.

Test Environment#

Cluster	OS	OS Version	RAM	Processor frequency	Processor cores
3 node cluster	Red Hat Enterprise Linux	8.5 (Ootpa)	32727072 kB	2.4 GHz	4

Comparison#

Important

All Madlib runs failed on a dataset of this size, so the benchmark was abandoned.