
XGBoost#

Vertica vs Amazon Redshift | Python | PySpark#

Important

Version Details
Vertica: 23.4
Amazon Redshift: Jan 2023
Amazon Sagemaker: Jan 2023
Python Native XGBoost: 3.9.15
PySpark: 3.3.1

XGBoost is an optimized, distributed gradient boosting library designed for efficiency, flexibility, and portability. It implements machine learning algorithms under the gradient boosting framework.

This benchmark aims to assess the performance of Vertica’s XGBoost algorithm in comparison to various XGBoost implementations, including those in Spark, Dask, Redshift, and Python.

Implementations to consider:

  • Amazon Redshift

  • Python

  • Dask

  • PySpark

By conducting this benchmark, we seek to gain insights into the comparative strengths and weaknesses of these implementations. Our evaluation will focus on factors such as speed, accuracy, and scalability. The results of this study will contribute to a better understanding of the suitability of Vertica’s XGBoost algorithm for diverse data science applications.
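
The results later in this document report several timed runs per platform and their average. As a minimal sketch of how such timings could be collected (this is not the harness used for the benchmark; `time_runs` and `dummy_train` are hypothetical names introduced for illustration), a training call can be wrapped and timed repeatedly:

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_runs(train_fn: Callable[[], object], n_runs: int = 3) -> Tuple[List[float], float]:
    """Call train_fn n_runs times; return per-run minutes and their mean."""
    minutes = []
    for _ in range(n_runs):
        start = time.perf_counter()
        train_fn()  # in a real benchmark this would fit one XGBoost model
        minutes.append((time.perf_counter() - start) / 60.0)
    return minutes, statistics.mean(minutes)


def dummy_train() -> None:
    # Hypothetical stand-in for a real training call.
    sum(i * i for i in range(100_000))


runs, average = time_runs(dummy_train, n_runs=3)
print([round(m, 4) for m in runs], round(average, 4))
```

Using a monotonic clock such as `time.perf_counter` avoids distortions from system clock adjustments during long training jobs.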

Below are the machine details on which the tests were carried out:

| Cluster | OS | OS Version | RAM (per node) | Processor freq. (per node) | Processor cores (per node) |
| --- | --- | --- | --- | --- | --- |
| 4 node | Red Hat Enterprise Linux | 8.7 (Ootpa) | 755 GB | 2.3 GHz | 36, 2 threads per core |

Datasets#

  • Higgs Boson

    No. of Columns: 29

    Datatypes of data: Float

  • Amazon

    No. of Columns: 106

    Datatypes of data: Float

Test Environment details#

Below are the configurations for each algorithm that was tested:

Vertica

Parameters:

  • PlannedConcurrency (general pool): 72

  • Memory budget for each query (general pool): ~10GB

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 23.4 | On Premise VM | 4 node | 36, 2 threads per core | 755 GB | Enterprise | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.3 GHz |

Amazon Redshift

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
| --- | --- | --- | --- | --- | --- |
| Jan 2023 | ra3.16xlarge | 4 node | 48 | 384 GB | N/A |

Amazon SageMaker

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
| --- | --- | --- | --- | --- | --- |
| Jan 2023 | ml.m5.24xlarge | 1 node | 96 | 384 GB | N/A |

For the 1 billion row dataset, a different configuration was used:

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
| --- | --- | --- | --- | --- | --- |
| Jan 2023 | ml.m5.24xlarge | 3 nodes | 96 | 384 GB | N/A |

Python

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode |
| --- | --- | --- | --- | --- | --- |
| 3.9.15 | N/A | N/A | N/A | N/A | N/A |

PySpark

PySpark XGBoost version 1.7.0 was used.

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | Executor Memory | Driver Memory |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | N/A | N/A | 36 (per worker) | N/A | client | 70GB | 50GB |
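
The PySpark settings above (client deploy mode, 70GB executor memory, 50GB driver memory) correspond to standard `spark-submit` options. The following sketch assembles such a command line; the script name `train_xgboost.py` is a hypothetical placeholder, and the exact command used for the benchmark is not given in this document.

```python
import shlex
from typing import List


def spark_submit_command(script: str) -> List[str]:
    """Build a spark-submit argument list matching the table above."""
    return [
        "spark-submit",
        "--deploy-mode", "client",     # driver runs on the submitting machine
        "--executor-memory", "70g",    # memory per executor
        "--driver-memory", "50g",      # memory for the driver process
        script,
    ]


cmd = spark_submit_command("train_xgboost.py")  # hypothetical script name
print(shlex.join(cmd))
```

In client mode the driver JVM is already running by the time application code executes, so driver memory is normally set on the command line (or in the Spark defaults file) rather than from inside the application.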

Parameters#

| Platform | Num Trees | Tree Depth | Number of Bins | Feature Importance (Top 5) |
| --- | --- | --- | --- | --- |
| Vertica | 10 | 10 | 150 | col26, col27, col28, col23, col25 |
| Amazon Redshift | 100 | 10 | 150 | col25, col27, col26, col22, col24 |
| Python | 10 | 10 | 150 | col26, col28, col27, col23, col6 |
| Dask (Python) | 10 | 10 | 150 | col26, col28, col27, col23, col6 |
| Spark | 100 | 10 | 150 | col25, col27, col26, col22, col5 |

| Platform | Num Trees | Tree Depth | Number of Bins | Feature Importance (Top 5) |
| --- | --- | --- | --- | --- |
| Vertica | 10 | 6 | 32 | col26, col27, col28, col23, col25 |
| Amazon Redshift | 10 | 6 | 256 | col25, col27, col26, col22, col24 |
| Python | 10 | 6 | 256 | col26, col28, col27, col23, col6 |
| Dask (Python) | 10 | 6 | 256 | col26, col28, col27, col23, col6 |
| Spark | 100 | 6 | 256 | col25, col27, col26, col22, col5 |

Analysis#

The comparison analysis on both datasets follows:

Parameters:

  • Number of trees: 10

  • Tree depth: 10

  • Number of bins: 150
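
The same three settings go by different names on different platforms: the results tables in this document list `max_depth` and `nbins` for Vertica, and `max_depth` and `max_bin` for the other platforms. The sketch below shows one way to express the configuration for both naming schemes. The helper functions are hypothetical, and `max_ntree` and `tree_method="hist"` are not taken from this document: they are, respectively, the Vertica tree-count parameter and the histogram-based tree method that `max_bin` applies to in the XGBoost library.

```python
from typing import Dict

N_TREES, DEPTH, BINS = 10, 10, 150  # values from the parameter list above


def vertica_params(n_trees: int, depth: int, bins: int) -> Dict[str, int]:
    """Parameter names as used by Vertica's XGBoost functions."""
    return {"max_ntree": n_trees, "max_depth": depth, "nbins": bins}


def xgboost_params(depth: int, bins: int) -> Dict[str, object]:
    """Parameter names as used by the native XGBoost library."""
    # max_bin controls histogram binning, hence the "hist" tree method.
    return {"max_depth": depth, "max_bin": bins, "tree_method": "hist"}


v_params = vertica_params(N_TREES, DEPTH, BINS)
x_params = xgboost_params(DEPTH, BINS)
print(v_params)
print(x_params)

# With the xgboost package installed and a DMatrix named dtrain, the native
# library takes the number of trees as a separate argument:
#   booster = xgboost.train(x_params, dtrain, num_boost_round=N_TREES)
```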

Below are the results for different dataset sizes. Browse through the tabs to look at each one.

1B Rows#

| Platform | Run 1 - Time (mins) | Run 2 - Time (mins) | Run 3 - Time (mins) | Average (mins) |
| --- | --- | --- | --- | --- |
| Vertica v12.0.4 | 219.18 | 219.14 | 219.03 | 219.12 |
| Vertica v23.4 | 106.76 | 108.02 | 107.56 | 107.45** |
| Amazon Redshift | 1.37 | 1.37 | 1.37 | 1.37* |
| Amazon SageMaker | Training did not complete within 12 hours | | | |
| Python | Memory Error | | | |
| PySpark | 1108.08 | 1066.39 | 1083.06 | 1085.84 |

Since the accuracy is similar, we will only show the runtime comparison below:

Important

Amazon Redshift trains on only a sample of 33,617 rows, so it is excluded from further analysis.
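
To put the 1B-row averages in relative terms, the short calculation below divides each platform's average training time by the Vertica v23.4 average. Only the three platforms that completed training on the full dataset are included; the numbers are copied from the table above, and the variable names are illustrative.

```python
# Average training times in minutes for the 1B-row dataset (from the table above).
avg_minutes = {
    "Vertica v12.0.4": 219.12,
    "Vertica v23.4": 107.45,
    "PySpark": 1085.84,
}

baseline = avg_minutes["Vertica v23.4"]
ratios = {name: minutes / baseline for name, minutes in avg_minutes.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the Vertica v23.4 training time")
```

On these figures, Vertica v12.0.4 takes roughly twice as long as v23.4, and PySpark roughly ten times as long.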

100 M Rows#

| Platform | Run 1 - Time (mins) | Run 2 - Time (mins) | Run 3 - Time (mins) | Average (mins) |
| --- | --- | --- | --- | --- |
| Vertica v12.0.4 | 32.9 | 32.4 | 32.3 | 32.5 |
| Vertica v23.4 | 13.75 | 13.75 | 13.78 | 13.76 |
| Amazon Redshift | 1.37 | 1.37 | 1.37 | 1.37* |
| Amazon SageMaker | 9.25 | 9.17 | 8.9 | 9.11 |
| Python | 5.67 | 5.78 | 5.61 | 5.69 |
| PySpark | 94.23 | 97.98 | 98.18 | 96.8 |

Since the accuracy is similar, we will only show the runtime comparison below:

Important

Amazon Redshift trains on only a sample of 33,617 rows.

10.5 M Rows#

| Platform | Feature importance (top 5) | Run 1 Time Taken (minutes) | Run 1 Accuracy (%) | Run 2 Time Taken (minutes) | Run 2 Accuracy (%) | Run 3 Time Taken (minutes) | Run 3 Accuracy (%) | Average Time Taken (minutes) | Average Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vertica | col26, col28, col27, col23, col25 | 6.14 | 72.52 | 6.07 | 72.52 | 6.08 | 72.52 | 6.1 | 72.52 |
| Amazon Redshift | col25, col27, col26, col22, col24 | 1.38 | 70.5 | 1.47 | 70.51 | 1.37 | 70.56 | 1.41* | 70.52 |
| Amazon SageMaker | col25, col27, col26, col22, col24 | 2.05 | 73.26 | 2.04 | 73.26 | 2.15 | 73.26 | 2.08 | 73.26 |
| Python | col26, col28, col27, col23, col6 | 0.47 | 73.29 | 0.48 | 73.29 | 0.47 | 73.29 | 0.47 | 73.29 |
| PySpark | col25, col27, col26, col22, col5 | 7.27 | 73.29 | 7.23 | 73.29 | 7.28 | 73.29 | 7.26 | 73.29 |

Since the accuracy is similar, we will only show the runtime comparison below:

Important

Amazon Redshift trains on only a sample of 33,617 rows.

Below are the results from different experiments. Browse through the tabs to look at each one.

Default Parameters#

| Platform | Time Taken (mins) | Accuracy (%) |
| --- | --- | --- |
| Vertica | 1.27 | 67.23 |
| Amazon Redshift | 8 | 71.44 |
| Python | 3.84 | 74.15 |
| PySpark | 51.77 | 74.15 |

Custom Parameters#

| Platform | Time Taken (mins) | Accuracy (%) |
| --- | --- | --- |
| Vertica | 24.95 | 72.52 |
| Amazon Redshift | 7 | 70.89 |
| Python | 4.33 | 75.69 |
| PySpark | 56.7 | 75.69 |

Below are the results from different experiments of parameters. Browse through the tabs to look at each one.

Training time Taken

Default Parameters#

| Platform | XGBoost Parameters | Run 1 Time Taken (minutes) | Run 1 Accuracy (%) | Run 2 Time Taken (minutes) | Run 2 Accuracy (%) | Run 3 Time Taken (minutes) | Run 3 Accuracy (%) | Run 4 Time Taken (minutes) | Run 4 Accuracy (%) | Average Time Taken (minutes) | Average Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vertica | max_depth=10,nbins=150 | 6.12 | 100 | 6.1 | 100 | 6.1 | 100 | 6.1 | 100 | 6.105 | 100 |
| Amazon Redshift | max_depth=10,max_bin=150 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 |
| Python | max_depth=10,max_bin=150 | 9.56 | 100 | 8.91 | 100 | 10.39 | 100 | 10.26 | 100 | 9.78 | 100 |
| PySpark | max_depth=10,max_bin=150 | 119.6 | 100 | 118.28 | 100 | 124.94 | 100 | 125.43 | 100 | 122.08 | 100 |

Since the accuracy is similar, we will only show the runtime comparison below:

Training time Taken

Custom Parameters#

| Platform | XGBoost Parameters | Run 1 Time Taken (minutes) | Run 1 Accuracy (%) | Run 2 Time Taken (minutes) | Run 2 Accuracy (%) | Run 3 Time Taken (minutes) | Run 3 Accuracy (%) | Run 4 Time Taken (minutes) | Run 4 Accuracy (%) | Average Time Taken (minutes) | Average Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vertica | max_depth=10,nbins=150 | 40.57 | 100 | 40.58 | 100 | 40.54 | 100 | 40.43 | 100 | 40.53 | 100 |
| Amazon Redshift | max_depth=10,max_bin=150 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 | 7 | 100 |
| Python | max_depth=10,max_bin=150 | 9.77 | 100 | 9.05 | 100 | 10.31 | 100 | 10.18 | 100 | 9.8275 | 100 |
| PySpark | max_depth=10,max_bin=150 | 119.5 | 100 | 118.54 | 100 | 119.06 | 100 | 119.25 | 100 | 119.0875 | 100 |

Since the accuracy is similar, we will only show the runtime comparison below:

Vertica EON vs Vertica Enterprise#

Important

Vertica Version: 11.1.0-0

Dataset#

Amazon

| No. of Rows | No. of Columns |
| --- | --- |
| 25 M | 106 |

Datatypes of data: Float

Note

In order to get a larger size, we duplicated rows.
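
The document does not say how the rows were duplicated. One simple way to do it, sketched below under that caveat, is to cycle through the original rows until a target row count is reached; `upsample_by_duplication` and the sample rows are hypothetical.

```python
from itertools import cycle, islice
from typing import Iterator, Sequence, Tuple

Row = Tuple[float, ...]


def upsample_by_duplication(rows: Sequence[Row], target_rows: int) -> Iterator[Row]:
    """Yield target_rows rows by repeating the original rows in order."""
    if not rows:
        raise ValueError("cannot duplicate rows from an empty dataset")
    return islice(cycle(rows), target_rows)


# Tiny illustrative dataset: three rows of float columns.
sample = [(0.1, 0.2, 1.0), (0.3, 0.4, 0.0), (0.5, 0.6, 1.0)]
enlarged = list(upsample_by_duplication(sample, 7))
print(len(enlarged))
```

Duplicated rows increase the volume of data to process without adding new information, which suits a runtime comparison such as this one.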

Test Environment#

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) | Processor cores (per node) | Type | No. of nodes | Storage type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11.1.0-0 | r4.8xlarge | 3 nodes | N/A | 244 GB | Eon | Red Hat Enterprise Linux | 8.5 (Ootpa) | 2.4GHz | N/A | 32 | 3 | SSD |

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) | Processor cores (per node) | Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11.1.0-0 | On Premise VM | 3 node cluster | N/A | 32727072 kB | Enterprise | Red Hat Enterprise Linux | 8.5 (Ootpa) | 2.4GHz | 4 | 32 |

Comparison#

Time Taken (seconds)#

| Metrics | Vertica EON | Vertica Enterprise |
| --- | --- | --- |
| Training | 1381.36 | 1260.09 |
| Predicting (25M) | 128.86 | 119.83 |
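
The gap between the two deployment modes is easier to read as a percentage. The following sketch computes the relative extra time of Eon over Enterprise for each metric, along with prediction throughput in rows per second, using the figures from the table above and the 25 M row count. The variable names are illustrative.

```python
# (Eon seconds, Enterprise seconds) from the comparison table above.
timings_s = {
    "Training": (1381.36, 1260.09),
    "Predicting (25M)": (128.86, 119.83),
}

# Extra time of Eon relative to Enterprise, in percent.
overhead_pct = {
    metric: (eon - enterprise) / enterprise * 100.0
    for metric, (eon, enterprise) in timings_s.items()
}

ROWS = 25_000_000
eon_predict_s, enterprise_predict_s = timings_s["Predicting (25M)"]
throughput = {
    "Eon": ROWS / eon_predict_s,
    "Enterprise": ROWS / enterprise_predict_s,
}

for metric, pct in overhead_pct.items():
    print(f"{metric}: Eon takes {pct:.1f}% longer than Enterprise")
for mode, rows_per_s in throughput.items():
    print(f"{mode}: {rows_per_s:,.0f} rows/s predicted")
```

On these figures Eon is roughly 10% slower for training and roughly 8% slower for prediction. The two clusters differ in hardware as well as deployment mode, so the gap should not be attributed to the mode alone.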