XGBoost#
Important
XGBoost is a highly optimized distributed gradient boosting library renowned for its efficiency, flexibility, and portability. Operating within the Gradient Boosting framework, XGBoost implements powerful machine learning algorithms, specifically designed for optimal performance.
This benchmark aims to assess the performance of Vertica’s XGBoost algorithm in comparison to various XGBoost implementations, including those in Spark, Dask, Redshift, and Python.
Implementations to consider:
Amazon Redshift
Python
Dask
PySpark
By conducting this benchmark, we seek to gain insights into the comparative strengths and weaknesses of these implementations. Our evaluation will focus on factors such as speed, accuracy, and scalability. The results of this study will contribute to a better understanding of the suitability of Vertica’s XGBoost algorithm for diverse data science applications.
Below are the machine details on which the tests were carried out:
| Cluster | OS | OS Version | RAM (per node) | Processor freq. (per node) | Processor cores (per node) |
|---|---|---|---|---|---|
| 4 node | Red Hat Enterprise Linux | 8.7 (Ootpa) | 755 GB | 2.3 GHz | 36, 2 threads per core |
Datasets#
Higgs Boson

| No. of Columns |
|---|
| 29 |

Datatypes of data: Float

Amazon

| No. of Columns |
|---|
| 106 |

Datatypes of data: Float
Test Environment details#
Below are the configurations for each algorithm that was tested:
Vertica parameters:

- PlannedConcurrency (general pool): 72
- Memory budget for each query (general pool): ~10 GB
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 23.4 | On Premise VM | 4 node | 36, 2 threads per core | 755 GB | Enterprise | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.3 GHz |
Amazon Redshift parameters:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| Jan 2023 | ra3.16xlarge | 4 node | 48 | 384 GB | N/A |
Amazon SageMaker parameters:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| Jan 2023 | ml.m5.24xlarge | 1 node | 96 | 384 GB | N/A |

For 1 billion rows, a different configuration was used:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| Jan 2023 | ml.m5.24xlarge | 3 nodes | 96 | 384 GB | N/A |
Python parameters:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
|---|---|---|---|---|---|
| 3.9.15 | N/A | N/A | N/A | N/A | N/A |
PySpark parameters:

We used the PySpark XGBoost integration, version 1.7.0.

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | Executor Memory | Driver Memory |
|---|---|---|---|---|---|---|---|
| 3.3.1 | N/A | N/A | 36 (per worker) | N/A | client | 70 GB | 50 GB |
Parameters#
| Platform | Num Trees | Tree Depth | Number of Bins | Feature Importance (Top 5) |
|---|---|---|---|---|
| Vertica | 10 | 10 | 150 | col26, col27, col28, col23, col25 |
| Amazon Redshift | 100 | 10 | 150 | col25, col27, col26, col22, col24 |
| Python | 10 | 10 | 150 | col26, col28, col27, col23, col6 |
| Dask (Python) | 10 | 10 | 150 | col26, col28, col27, col23, col6 |
| Spark | 100 | 10 | 150 | col25, col27, col26, col22, col5 |
| Platform | Num Trees | Tree Depth | Number of Bins | Feature Importance (Top 5) |
|---|---|---|---|---|
| Vertica | 10 | 6 | 32 | col26, col27, col28, col23, col25 |
| Amazon Redshift | 10 | 6 | 256 | col25, col27, col26, col22, col24 |
| Python | 10 | 6 | 256 | col26, col28, col27, col23, col6 |
| Dask (Python) | 10 | 6 | 256 | col26, col28, col27, col23, col6 |
| Spark | 100 | 6 | 256 | col25, col27, col26, col22, col5 |
Analysis#
The comparison analysis on both datasets follows.

Parameters:

- Number of trees: 10
- Tree depth: 10
- Number of bins: 150

Below are the results for different dataset sizes. Browse through the tabs to look at each one.
| | Run 1 - Time (mins) | Run 2 - Time (mins) | Run 3 - Time (mins) | Average (mins) |
|---|---|---|---|---|
| Vertica v12.0.4 | 219.18 | 219.14 | 219.03 | 219.12 |
| Vertica v23.4 | 106.76 | 108.02 | 107.56 | 107.45** |
| Amazon Redshift | 1.37 | 1.37 | 1.37 | 1.37* |
| Amazon SageMaker | Training did not complete in 12 hours. | | | |
| Python | Memory error | | | |
| PySpark | 1108.08 | 1066.39 | 1083.06 | 1085.84 |
Since the accuracy is similar, we will only show the runtime comparison below:
Important
Amazon Redshift only trained on a sample of 33,617 rows. Thus, we have removed it from further analysis.
| | Run 1 - Time (mins) | Run 2 - Time (mins) | Run 3 - Time (mins) | Average (mins) |
|---|---|---|---|---|
| Vertica v12.0.4 | 32.9 | 32.4 | 32.3 | 32.5 |
| Vertica v23.4 | 13.75 | 13.75 | 13.78 | 13.76 |
| Amazon Redshift | 1.37 | 1.37 | 1.37 | 1.37* |
| Amazon SageMaker | 9.25 | 9.17 | 8.9 | 9.11 |
| Python | 5.67 | 5.78 | 5.61 | 5.69 |
| PySpark | 94.23 | 97.98 | 98.18 | 96.8 |
Since the accuracy is similar, we will only show the runtime comparison below:
Important
Amazon Redshift only trained on a sample of 33,617 rows.
| Platform | Feature Importance (Top 5) | Run 1 - Time (mins) | Run 1 - Accuracy (%) | Run 2 - Time (mins) | Run 2 - Accuracy (%) | Run 3 - Time (mins) | Run 3 - Accuracy (%) | Average - Time (mins) | Average - Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|
| Vertica | col26, col28, col27, col23, col25 | 6.14 | 72.52 | 6.07 | 72.52 | 6.08 | 72.52 | 6.1 | 72.52 |
| Amazon Redshift | col25, col27, col26, col22, col24 | 1.38 | 70.5 | 1.47 | 70.51 | 1.37 | 70.56 | 1.41* | 70.52 |
| Amazon SageMaker | col25, col27, col26, col22, col24 | 2.05 | 73.26 | 2.04 | 73.26 | 2.15 | 73.26 | 2.08 | 73.26 |
| Python | col26, col28, col27, col23, col6 | 0.47 | 73.29 | 0.48 | 73.29 | 0.47 | 73.29 | 0.47 | 73.29 |
| PySpark | col25, col27, col26, col22, col5 | 7.27 | 73.29 | 7.23 | 73.29 | 7.28 | 73.29 | 7.26 | 73.29 |
Since the accuracy is similar, we will only show the runtime comparison below:
Important
Amazon Redshift only trained on a sample of 33,617 rows.
Below are the results from different parameter experiments. Browse through the tabs to look at each one.
Training time taken:

| Platform | XGBoost Parameters | Run 1 - Time (mins) | Run 1 - Accuracy (%) | Run 2 - Time (mins) | Run 2 - Accuracy (%) | Run 3 - Time (mins) | Run 3 - Accuracy (%) | Run 4 - Time (mins) | Run 4 - Accuracy (%) | Average - Time (mins) | Average - Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vertica | max_depth=10, nbins=150 | 6.12 | 100 | 6.1 | 100 | 6.1 | 100 | 6.1 | 100 | 6.105 | 100 |
| Amazon Redshift | max_depth=10, max_bin=150 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 |
| Python | max_depth=10, max_bin=150 | 9.56 | 100 | 8.91 | 100 | 10.39 | 100 | 10.26 | 100 | 9.78 | 100 |
| PySpark | max_depth=10, max_bin=150 | 119.6 | 100 | 118.28 | 100 | 124.94 | 100 | 125.43 | 100 | 122.08 | 100 |
Since the accuracy is similar, we will only show the runtime comparison below:
Training time taken:

| Platform | XGBoost Parameters | Run 1 - Time (mins) | Run 1 - Accuracy (%) | Run 2 - Time (mins) | Run 2 - Accuracy (%) | Run 3 - Time (mins) | Run 3 - Accuracy (%) | Run 4 - Time (mins) | Run 4 - Accuracy (%) | Average - Time (mins) | Average - Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vertica | max_depth=10, nbins=150 | 40.57 | 100 | 40.58 | 100 | 40.54 | 100 | 40.43 | 100 | 40.53 | 100 |
| Amazon Redshift | max_depth=10, max_bin=150 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 |
| Python | max_depth=10, max_bin=150 | 9.77 | 100 | 9.05 | 100 | 10.31 | 100 | 10.18 | 100 | 9.8275 | 100 |
| PySpark | max_depth=10, max_bin=150 | 119.5 | 100 | 118.54 | 100 | 119.06 | 100 | 119.25 | 100 | 119.0875 | 100 |
Since the accuracy is similar, we will only show the runtime comparison below:
Vertica EON vs Vertica Enterprise#
Important
Vertica Version: 11.1.0-0
Dataset#
Amazon
| No. of Rows | No. of Columns |
|---|---|
| 25 M | 106 |
Datatypes of data: Float
Note
To obtain a larger dataset, we duplicated rows.
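As an illustration of that note, row duplication takes only a few lines; the pandas usage below is a sketch with made-up data, not the benchmark's actual preparation code.

```python
# Illustrative only: enlarge a dataset by duplicating its rows.
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})
bigger = pd.concat([df] * 4, ignore_index=True)  # 4x the original row count
```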
Test Environment#
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) | Processor cores (per node) | Type | No. of nodes | Storage type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11.1.0-0 | r4.8xlarge | 3 nodes | N/A | 244 GB | Eon | Red Hat Enterprise Linux | 8.5 (Ootpa) | 2.4 GHz | N/A | 32 | 3 | SSD |
| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) | Processor cores (per node) | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 11.1.0-0 | On Premise VM | 3 node cluster | N/A | 32727072 kB | Enterprise | Red Hat Enterprise Linux | 8.5 (Ootpa) | 2.4 GHz | 4 | 32 |
Comparison#
| Metrics | Vertica EON | Vertica Enterprise |
|---|---|---|
| Training | 1381.36 | 1260.09 |
| Predicting (25M) | 128.86 | 119.83 |