XGBoost#
Important
XGBoost is a highly optimized distributed gradient boosting library renowned for its efficiency, flexibility, and portability. Operating within the Gradient Boosting framework, XGBoost implements powerful machine learning algorithms, specifically designed for optimal performance.
This benchmark aims to assess the performance of Vertica’s XGBoost algorithm in comparison to various XGBoost implementations, including those in Spark, Dask, Redshift, and Python.
Implementations to consider:
Amazon Redshift
Python
Dask
PySpark
By conducting this benchmark, we seek to gain insights into the comparative strengths and weaknesses of these implementations. Our evaluation will focus on factors such as speed, accuracy, and scalability. The results of this study will contribute to a better understanding of the suitability of Vertica’s XGBoost algorithm for diverse data science applications.
Below are the machine details on which the tests were carried out:
| Cluster | OS | OS Version | RAM (per node) | Processor freq. (per node) | Processor cores (per node) |
|---|---|---|---|---|---|
| 4 node | Red Hat Enterprise Linux | 8.7 (Ootpa) | 755 GB | 2.3 GHz | 36, 2 threads per core |
Datasets#
Higgs Boson

| No. of Columns |
|---|
| 29 |

Datatypes of data: Float

Amazon

| No. of Columns |
|---|
| 106 |

Datatypes of data: Float
Test Environment details#
Below are the configurations for each algorithm that was tested:
Vertica parameters:

- PlannedConcurrency (general pool): 72
- Memory budget for each query (general pool): ~10 GB
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 23.4 | On Premise VM | 4 node | 36, 2 threads per core | 755 GB | Enterprise | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.3 GHz |
Amazon Redshift parameters:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| Jan 2023 | ra3.16xlarge | 4 node | 48 | 384 GB | N/A |
Amazon SageMaker parameters:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| Jan 2023 | ml.m5.24xlarge | 1 node | 96 | 384 GB | N/A |

For 1 billion rows, a different configuration was used:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| Jan 2023 | ml.m5.24xlarge | 3 nodes | 96 | 384 GB | N/A |
Python parameters:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| 3.9.15 | N/A | N/A | N/A | N/A | N/A |
PySpark parameters:

We used the PySpark XGBoost integration, version 1.7.0.

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | Executor Memory | Driver Memory |
|---|---|---|---|---|---|---|---|
| 3.3.1 | N/A | N/A | 36 (per worker) | N/A | client | 70 GB | 50 GB |
Parameters#
| Platform | Num Trees | Tree Depth | Number of Bins | Feature Importance (Top 5) |
|---|---|---|---|---|
| Vertica | 10 | 10 | 150 | col26, col27, col28, col23, col25 |
| Amazon Redshift | 100 | 10 | 150 | col25, col27, col26, col22, col24 |
| Python | 10 | 10 | 150 | col26, col28, col27, col23, col6 |
| Dask (Python) | 10 | 10 | 150 | col26, col28, col27, col23, col6 |
| Spark | 100 | 10 | 150 | col25, col27, col26, col22, col5 |
| Platform | Num Trees | Tree Depth | Number of Bins | Feature Importance (Top 5) |
|---|---|---|---|---|
| Vertica | 10 | 6 | 32 | col26, col27, col28, col23, col25 |
| Amazon Redshift | 10 | 6 | 256 | col25, col27, col26, col22, col24 |
| Python | 10 | 6 | 256 | col26, col28, col27, col23, col6 |
| Dask (Python) | 10 | 6 | 256 | col26, col28, col27, col23, col6 |
| Spark | 100 | 6 | 256 | col25, col27, col26, col22, col5 |
Analysis#
The comparison analysis on both datasets follows.

Parameters:

- Number of trees: 10
- Tree depth: 10
- Number of bins: 150

Below are the results for different dataset sizes. Browse through the tabs to look at each one.
| | Run 1 - Time (mins) | Run 2 - Time (mins) | Run 3 - Time (mins) | Average (mins) |
|---|---|---|---|---|
| Vertica v12.0.4 | 219.18 | 219.14 | 219.03 | 219.12 |
| Vertica v23.4 | 106.76 | 108.02 | 107.56 | 107.45** |
| Amazon Redshift | 1.37 | 1.37 | 1.37 | 1.37* |
| Amazon SageMaker | Training did not complete in 12 hours. | | | |
| Python | Memory error | | | |
| PySpark | 1108.08 | 1066.39 | 1083.06 | 1085.84 |
Since the accuracy is similar, we will only show the runtime comparison below:
Important
Amazon Redshift only trained on a sample of 33,617 rows. Thus, we have removed it from further analysis.
| | Run 1 - Time (mins) | Run 2 - Time (mins) | Run 3 - Time (mins) | Average (mins) |
|---|---|---|---|---|
| Vertica v12.0.4 | 32.9 | 32.4 | 32.3 | 32.5 |
| Vertica v23.4 | 13.75 | 13.75 | 13.78 | 13.76 |
| Amazon Redshift | 1.37 | 1.37 | 1.37 | 1.37* |
| Amazon SageMaker | 9.25 | 9.17 | 8.9 | 9.11 |
| Python | 5.67 | 5.78 | 5.61 | 5.69 |
| PySpark | 94.23 | 97.98 | 98.18 | 96.8 |
Since the accuracy is similar, we will only show the runtime comparison below:
Important
Amazon Redshift only trained on a sample of 33,617 rows.
| Platform | Feature Importance (Top 5) | Run 1 - Time (mins) | Run 1 - Accuracy (%) | Run 2 - Time (mins) | Run 2 - Accuracy (%) | Run 3 - Time (mins) | Run 3 - Accuracy (%) | Average - Time (mins) | Average - Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|
| Vertica | col26, col28, col27, col23, col25 | 6.14 | 72.52 | 6.07 | 72.52 | 6.08 | 72.52 | 6.1 | 72.52 |
| Amazon Redshift | col25, col27, col26, col22, col24 | 1.38 | 70.5 | 1.47 | 70.51 | 1.37 | 70.56 | 1.41* | 70.52 |
| Amazon SageMaker | col25, col27, col26, col22, col24 | 2.05 | 73.26 | 2.04 | 73.26 | 2.15 | 73.26 | 2.08 | 73.26 |
| Python | col26, col28, col27, col23, col6 | 0.47 | 73.29 | 0.48 | 73.29 | 0.47 | 73.29 | 0.47 | 73.29 |
| PySpark | col25, col27, col26, col22, col5 | 7.27 | 73.29 | 7.23 | 73.29 | 7.28 | 73.29 | 7.26 | 73.29 |
Since the accuracy is similar, we will only show the runtime comparison below:
Important
Amazon Redshift only trained on a sample of 33,617 rows.
Below are the results from different parameter experiments. Browse through the tabs to look at each one.
Training time taken:

| Platform | XGBoost Parameters | Run 1 - Time (mins) | Run 1 - Accuracy (%) | Run 2 - Time (mins) | Run 2 - Accuracy (%) | Run 3 - Time (mins) | Run 3 - Accuracy (%) | Run 4 - Time (mins) | Run 4 - Accuracy (%) | Average - Time (mins) | Average - Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vertica | max_depth=10, nbins=150 | 6.12 | 100 | 6.1 | 100 | 6.1 | 100 | 6.1 | 100 | 6.105 | 100 |
| Amazon Redshift | max_depth=10, max_bin=150 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 |
| Python | max_depth=10, max_bin=150 | 9.56 | 100 | 8.91 | 100 | 10.39 | 100 | 10.26 | 100 | 9.78 | 100 |
| PySpark | max_depth=10, max_bin=150 | 119.6 | 100 | 118.28 | 100 | 124.94 | 100 | 125.43 | 100 | 122.08 | 100 |
Since the accuracy is similar, we will only show the runtime comparison below:
Training time taken:

| Platform | XGBoost Parameters | Run 1 - Time (mins) | Run 1 - Accuracy (%) | Run 2 - Time (mins) | Run 2 - Accuracy (%) | Run 3 - Time (mins) | Run 3 - Accuracy (%) | Run 4 - Time (mins) | Run 4 - Accuracy (%) | Average - Time (mins) | Average - Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vertica | max_depth=10, nbins=150 | 40.57 | 100 | 40.58 | 100 | 40.54 | 100 | 40.43 | 100 | 40.53 | 100 |
| Amazon Redshift | max_depth=10, max_bin=150 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 |
| Python | max_depth=10, max_bin=150 | 9.77 | 100 | 9.05 | 100 | 10.31 | 100 | 10.18 | 100 | 9.8275 | 100 |
| PySpark | max_depth=10, max_bin=150 | 119.5 | 100 | 118.54 | 100 | 119.06 | 100 | 119.25 | 100 | 119.0875 | 100 |
Since the accuracy is similar, we will only show the runtime comparison below:
Vertica EON vs Vertica Enterprise#
Important
Vertica Version: 11.1.0-0
Dataset#
Amazon
| No. of Rows | No. of Columns |
|---|---|
| 25 M | 106 |
Datatypes of data: Float
Note
To obtain a larger dataset, we duplicated rows.
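As an illustration of that note, row duplication takes only a few lines; the pandas usage below is a sketch with made-up data, not the benchmark's actual preparation code.

```python
# Illustrative only: enlarge a dataset by duplicating its rows.
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})
bigger = pd.concat([df] * 4, ignore_index=True)  # 4x the original row count
```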
Test Environment#
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) | Processor cores (per node) | Type | No. of nodes | Storage type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11.1.0-0 | r4.8xlarge | 3 nodes | N/A | 244 GB | Eon | Red Hat Enterprise Linux | 8.5 (Ootpa) | 2.4 GHz | N/A | 32 | 3 | SSD |
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) | Processor cores (per node) | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 11.1.0-0 | On Premise VM | 3 node cluster | N/A | 32727072 kB | Enterprise | Red Hat Enterprise Linux | 8.5 (Ootpa) | 2.4 GHz | 4 | 32 |
Comparison#
| Metrics | Vertica EON | Vertica Enterprise |
|---|---|---|
| Training | 1381.36 | 1260.09 |
| Predicting (25M) | 128.86 | 119.83 |