Naive Bayes#
Naive Bayes is a probabilistic classification algorithm based on Bayes’ theorem. It assumes that the features are conditionally independent given the class. This simplifying assumption keeps the model cheap to train and apply, which makes Naive Bayes well suited to a wide range of classification tasks. It computes the probability of each class given the input features and predicts the most probable class, a straightforward yet effective approach to predictive modeling.
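To make the independence assumption concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It assumes Gaussian likelihoods, a common choice for float features like those in this benchmark. The function names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are hypothetical; this is not the code of either Vertica or Spark.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one row."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        # log P(class) + sum of log P(feature | class): the sum over
        # features is where the independence assumption is used.
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (illustrative only): two float features, two classes.
rows = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.2], [3.1, 3.9], [2.9, 4.0]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [1.1, 2.0]))
```

Working in log space avoids numeric underflow when many per-feature probabilities are multiplied, which matters for wide tables.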
Vertica vs Spark#
Important
The goal is to compare the performance of Vertica’s Naive Bayes algorithm directly with the implementation in Apache Spark. The evaluation focuses on speed, accuracy, and scalability, and highlights the comparative strengths and limitations of the two implementations. The results are meant to help practitioners judge where Vertica’s Naive Bayes algorithm applies across data science scenarios and make informed algorithmic choices.
Dataset#
Size: 25 M
| No. of Rows | No. of Columns |
|---|---|
| 25 M | 106 |

Data type: Float
Test Environment#
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 11.1.0-0 | On-Premises VM | 1 node | 8 | 20393864 kB | Enterprise | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 11.1.0-0 | On-Premises VM | 4 nodes | 8 | 20393864 kB | Enterprise | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 3.2.1 | On-Premises VM | 1 node | 8 | 20393864 kB | NA | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 3.2.1 | On-Premises VM | 4 nodes | 8 | 20393864 kB | NA | Red Hat Enterprise Linux | 7.6 (Maipo) | 2.3 GHz |

Comparison#
| | Training | Prediction - 25 M | Accuracy | AUC |
|---|---|---|---|---|
| Spark | 145.7 | 1095.79 | 150.55 | 146.58 |
| Vertica | 9.08 | 207.56 | 0.99 | 2.19 |

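The gap between the two engines is easier to read as a ratio. The sketch below divides each Spark figure in the table above by the matching Vertica figure. It assumes the values are elapsed times in a common unit, which the source does not state; the dictionary layout and the `speedups` helper are illustrative names only.

```python
# Figures copied from the table above; the unit is not given in the source.
spark = {
    "Training": 145.7,
    "Prediction - 25 M": 1095.79,
    "Accuracy": 150.55,
    "AUC": 146.58,
}
vertica = {
    "Training": 9.08,
    "Prediction - 25 M": 207.56,
    "Accuracy": 0.99,
    "AUC": 2.19,
}


def speedups(slow, fast):
    """Return slow/fast for every step present in both mappings."""
    return {step: slow[step] / fast[step] for step in slow if step in fast}


ratios = speedups(spark, vertica)
for step, ratio in ratios.items():
    print(f"{step}: Spark / Vertica = {ratio:.1f}x")
```

Under that assumption, Spark's training figure is roughly 16 times Vertica's and its prediction figure roughly 5 times, while the two metric computations differ by a much larger factor.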

| Metrics | Vertica | Spark |
|---|---|---|
| Accuracy | 0.85 | 0.85 |
| AUC | 0.85 | 0.77 |

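Two models can share the same accuracy and still differ in AUC, because accuracy looks only at thresholded labels while AUC looks at how the scores rank positives against negatives. The following sketch shows both metrics in plain Python. The helper names, the 0.5 threshold, and the toy scores are assumptions for illustration; the pairwise AUC loop is far too slow for 25 M rows and is not how either engine computes the metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

Changing the scores without moving any of them across the threshold leaves accuracy unchanged but can raise or lower the AUC.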
Browse through the tabs to see the time and accuracy comparison:

| | Training | Prediction - 25 M | Accuracy | AUC |
|---|---|---|---|---|
| Spark | 69.16 | 1134.03 | 64.46 | 63.70 |
| Vertica | 4.83 | 103.9 | 0.74 | 0.78 |

| Metrics | Vertica | Spark |
|---|---|---|
| Accuracy | 0.85 | 0.85 |
| AUC | 0.85 | 0.77 |

Browse through the tabs to see the time and accuracy comparison: