
Random Forest#

Random Forest is an ensemble learning method used for both classification and regression. It builds many decision trees during training and outputs the mode of the individual trees' predictions for classification, or their mean for regression. By combining a diverse set of trees, Random Forest is more robust than a single decision tree and less prone to overfitting.
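The aggregation step described above can be illustrated with a minimal sketch. It assumes each tree has already produced a prediction for one input row; the function names and sample values are hypothetical, and the sketch does not reflect how Vertica or Spark combine trees internally.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Return the most common class label among the trees' votes (the mode)."""
    # most_common(1) returns [(label, count)] for the top label;
    # equal counts are resolved by first occurrence in the input.
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for a single input row.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
value_estimates = [2.0, 3.0, 4.0, 3.0, 3.0]

print(aggregate_classification(class_votes))  # spam
print(aggregate_regression(value_estimates))  # 3.0
```

Some implementations average per-class probabilities instead of counting hard votes, so treat the majority vote here as one possible aggregation rule.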

Vertica vs Spark ML#

Important

Version Details
Vertica: 11.1.0-0
Spark: 3.2.1

This benchmark compares the performance of Vertica’s Random Forest algorithm with the implementation in Apache Spark. The comparison focuses on speed, accuracy, and scalability, and is intended to show the strengths and limitations of each implementation and how well Vertica’s Random Forest suits data science workloads relative to Spark’s well-established implementation.

Dataset#

Size: 25 M

| No. of Rows | No. of Columns |
|---|---|
| 25 M | 106 |

Datatypes of data: Float

Note

To reach this dataset size, we duplicated rows.
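The source does not say how the rows were duplicated. As a minimal sketch of one way to do it, the hypothetical helper below repeats a small base dataset in order until a target row count is reached; the name `duplicate_rows` and the sample rows are illustrative only.

```python
from itertools import cycle, islice


def duplicate_rows(rows, target_count):
    """Repeat `rows` in order until exactly `target_count` rows are produced."""
    if not rows:
        raise ValueError("rows must not be empty")
    # cycle() loops over the base rows indefinitely; islice() stops at the target.
    return list(islice(cycle(rows), target_count))


# Hypothetical three-row base dataset with float features.
base = [(0.1, 1.5), (0.7, 2.2), (0.3, 0.9)]
enlarged = duplicate_rows(base, 8)

print(len(enlarged))  # 8
print(enlarged[3])    # (0.1, 1.5)
```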

Test Environment#

| System | Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|---|
| Vertica | 11.1.0-0 | On-Premises VM | 1 node | 8 | 20393864 kB | Enterprise | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
| Vertica | 11.1.0-0 | On-Premises VM | 4 nodes | 8 | 20393864 kB | Enterprise | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
| Spark | 3.2.1 | On-Premises VM | 1 node | 8 | 20393864 kB | NA | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |
| Spark | 3.2.1 | On-Premises VM | 4 nodes | 8 | 20393864 kB | NA | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |

Comparison#

Time in secs#

| System | Training | Prediction - 25 M | Accuracy | AUC |
|---|---|---|---|---|
| Spark | 1096 | 1581 | 248.4 | 240.6 |
| Vertica | 650.27 | 150.09 | 1.24 | 1.11 |

| Metrics | Vertica | Spark |
|---|---|---|
| Accuracy | 0.90 | 0.89 |
| AUC | 0.94 | 0.75 |
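For readers less familiar with the two metrics, the following is a minimal sketch of how accuracy and AUC can be computed for a binary classifier. It is not the code used in the benchmark; the labels, scores, and the 0.5 decision threshold are hypothetical. Accuracy depends on the thresholded predictions, while AUC depends only on how the scores rank positives against negatives, so the two metrics can differ noticeably for the same model.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted scores for four rows.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(labels, predictions))  # 0.75
print(auc(labels, scores))            # 0.75
```

The pairwise loop is quadratic in the number of rows, so it is suitable only for illustration, not for a 25 M row dataset.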


Time in secs#

| System | Training | Prediction - 25 M | Accuracy | AUC |
|---|---|---|---|---|
| Spark | 409.5 | 1326.3 | 70.72 | 66.93 |
| Vertica | 249.64 | 69.25 | 1.26 | 0.43 |

| Metrics | Vertica | Spark |
|---|---|---|
| Accuracy | 0.90 | 0.89 |
| AUC | 0.95 | 0.75 |
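To make the timing gap easier to read, the sketch below derives Spark-to-Vertica time ratios from the two sets of timings reported in this section. The timings are copied from the results; the dictionary names `first_set` and `second_set` and the helper `speedups` are illustrative labels, since the results do not say which cluster each set belongs to.

```python
# Times in seconds as (Spark, Vertica), copied from the two result sets.
first_set = {
    "Training": (1096, 650.27),
    "Prediction": (1581, 150.09),
    "Accuracy": (248.4, 1.24),
    "AUC": (240.6, 1.11),
}
second_set = {
    "Training": (409.5, 249.64),
    "Prediction": (1326.3, 69.25),
    "Accuracy": (70.72, 1.26),
    "AUC": (66.93, 0.43),
}


def speedups(times):
    """Return Spark time divided by Vertica time for each step."""
    return {step: spark / vertica for step, (spark, vertica) in times.items()}


for name, times in (("first set", first_set), ("second set", second_set)):
    for step, ratio in speedups(times).items():
        print(f"{name}: {step}: Vertica is {ratio:.2f}x faster")
```

With these numbers, Vertica trains roughly 1.7 and 1.6 times faster than Spark in the two sets, and predicts roughly 10.5 and 19.2 times faster.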


Vertica vs Madlib#

Important

Vertica Version: 23.3.0-5

Comparison with the Madlib Random Forest model.

Dataset#

No. of Columns: 106

Datatypes of data: Float

Note

To increase the dataset size, we duplicated rows.

Test Environment#

| Cluster | OS | OS Version | RAM | Processor frequency | Processor cores |
|---|---|---|---|---|---|
| 3 node cluster | Red Hat Enterprise Linux | 8.5 (Ootpa) | 32727072 kB | 2.4 GHz | 4 |

Comparison#

Important

All Madlib runs failed for a dataset of this size, so the benchmark was abandoned.