Linear Regression#

Linear Regression is a fundamental algorithm in machine learning and statistics used for predicting a continuous outcome variable based on one or more predictor variables. It models the relationship between the independent variables and the dependent variable by fitting a linear equation to observed data. Linear Regression is widely employed for tasks such as forecasting, risk assessment, and understanding the underlying relationships within datasets.
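
As a minimal illustration of "fitting a linear equation to observed data", the sketch below fits a single-predictor model y ≈ intercept + slope · x with the closed-form least-squares solution. The function name `fit_simple_linear_regression` and the sample values are made up for this example; it is not the implementation used by either engine in the benchmark.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Illustrative data only (not taken from the benchmark).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_simple_linear_regression(xs, ys)

# Predict the outcome for a new predictor value.
prediction = intercept + slope * 6.0
```

With more than one predictor the same idea applies, but the coefficients are found by solving a linear system or by an iterative optimizer.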

Vertica vs Spark#

Important

Version Details
Vertica: 8.0.1
Spark: 2.02

This benchmark compares the performance of Vertica’s Linear Regression algorithm with its counterpart in Apache Spark. The analysis focuses on speed, accuracy, and scalability, and highlights the strengths and limitations of each implementation to help practitioners choose the most suitable Linear Regression solution for their use case.

Dataset#

For this benchmark, we created an artificial dataset generated from a Linear Regression model with some added noise.
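
The generator used for the benchmark is not shown here. As a rough sketch of the idea, the code below draws random features, applies a known linear model, and adds Gaussian noise to the target. The function name `make_linear_dataset`, the coefficient values, the feature distribution, and the noise level are all assumptions chosen for illustration.

```python
import random


def make_linear_dataset(n_rows, coefficients, intercept=0.0, noise_std=1.0, seed=0):
    """Return a list of (features, target) pairs from a noisy linear model."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for _ in range(n_rows):
        # One standard-normal feature per coefficient (an assumption).
        features = [rng.gauss(0.0, 1.0) for _ in coefficients]
        signal = intercept + sum(c * x for c, x in zip(coefficients, features))
        target = signal + rng.gauss(0.0, noise_std)  # additive Gaussian noise
        rows.append((features, target))
    return rows


# Hypothetical parameters: 3 columns, 1000 rows.
dataset = make_linear_dataset(
    n_rows=1000, coefficients=[2.0, -1.0, 0.5], intercept=3.0, noise_std=0.1
)
```

A larger `noise_std` relative to the signal lowers the best achievable R-squared, which is one way datasets with different fit quality can be produced.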

Test Environment#

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 8.0.1 | On Premise VM | 3 node cluster | 36, 2 threads per core | 755 GB | Enterprise | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.4GHz |

Vertica: max iter = 100, e = 10^-6

| Version | Instance Type | Cluster | vCPU (per node) | Memory (per node) | Deploy Mode | OS | OS Version | Processor freq. (per node) |
|---|---|---|---|---|---|---|---|---|
| 2.02 | N/A | N/A | 36, 2 threads per core | 755 GB | N/A | Red Hat Enterprise Linux | 8.7 (Ootpa) | 2.4GHz |

Spark: max iter = 100, e = 10^-6
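
Both engines were run with a cap of 100 iterations and e = 10^-6. The sketch below shows how such a cap and tolerance can interact in an iterative solver. It is an illustration under stated assumptions: it treats e as a convergence tolerance on the gradient magnitude (the exact stopping rule of each engine is not stated here), it fits a single weight for y ≈ w · x, and it uses plain gradient descent as a stand-in for BFGS/L-BFGS, which are not implemented. The function name `fit_weight` and the learning rate are hypothetical.

```python
def fit_weight(xs, ys, method, max_iter=100, epsilon=1e-6, learning_rate=0.05):
    """Fit y ~ w * x by minimizing mean squared error.

    Returns (w, steps_taken). Stops when |gradient| < epsilon
    or after max_iter update steps.
    """
    n = len(xs)
    # Second derivative of the loss; constant because the loss is quadratic.
    hessian = 2.0 * sum(x * x for x in xs) / n
    w = 0.0
    for steps_taken in range(max_iter):
        gradient = 2.0 * sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        if abs(gradient) < epsilon:
            return w, steps_taken  # converged within tolerance
        if method == "newton":
            w -= gradient / hessian  # Newton step
        else:
            w -= learning_rate * gradient  # plain gradient-descent step
    return w, max_iter  # hit the iteration cap


# Illustrative data only (not taken from the benchmark).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w_newton, newton_steps = fit_weight(xs, ys, method="newton")
w_gd, gd_steps = fit_weight(xs, ys, method="gradient_descent")
```

Because squared-error loss is quadratic in the weight, the Newton update reaches the minimizer in a single step, which is consistent with the iteration count of 1 reported for the Newton optimizer in the comparison. First-order and quasi-Newton methods generally need several steps.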

Comparison#

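| Columns | Rows | Size | Vertica 8.0.1 (BFGS): Total Time | Vertica 8.0.1 (BFGS): Total Time | Vertica 8.0.1 (BFGS): Number of Iterations | Vertica 8.0.1 (BFGS): RSQUARED | Vertica 8.0.1 (Newton): Total Time | Vertica 8.0.1 (Newton): Total Time | Vertica 8.0.1 (Newton): Number of Iterations | Vertica 8.0.1 (Newton): RSQUARED | Spark 2.0.1 (L-BFGS): Training Time | Spark 2.0.1 (L-BFGS): Number of Iterations | Spark 2.0.1 (L-BFGS): RSQUARED | Spark 2.0.1 (Newton): Training Time | Spark 2.0.1 (Newton): Number of Iterations | Spark 2.0.1 (Newton): RSQUARED | Spark loading data: Loading Time | Spark loading data: Number of Partitions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 1M | 800MB | 4.49 | 5.07 | 4 | 0.86 | 4.81 | 4.56 | 1 | 0.86 | 1.43 | 5 | 0.860057845 | 0.7 | 1 | 0.860057845 | 33.46 | 138 |
| 100 | 10M | 8GB | 26.39 | 23.63 | 4 | 0.758 | 26.04 | 23.19 | 1 | 0.758 | 96.98 | 4 | 0.758026674 | 2.09 | 1 | 0.758026674 | 136.9 | 288 |
| 100 | 100M | 80GB | 84.7 | 89.26 | 4 | 0.9044 | 85.93 | 96.38 | 1 | 0.9044 | 216 | 4 | 0.904440664 | 68.47 | 1 | 0.904440664 | 370.93 | 431 |
| 100 | 1B | 800GB | 1748.51 | 2038.15 | 4 | 0.9999 | 1808.56 | 2167.1 | 1 | 0.9999 | 2568.68 (no cache data) | 4 | 0.99999958 | 1788.75 (no cache data) | 1 | 0.99999958 | 0 | 4143 |
| 10 | 10M | 800MB | 3.52 | 3.39 | 4 | 0.9999 | 2.82 | 3.1 | 1 | 0.9999 | 5.07 | 4 | 0.999995264 | 2.9 | 1 | 0.999995264 | 21.72 | 144 |
| 20 | 10M | 1.6GB | 5.19 | 4.98 | 4 | 0.9999 | 4.49 | 4.52 | 1 | 0.9999 | 5.43 | 4 | 0.999998038 | 6.77 | 1 | 0.999998038 | 13.97 | 276 |
| 500 | 10M | 40GB | 151 | 148.68 | 4 | 0.5311 | 146.97 | 141.21 | 1 | 0.5311 | 40.74 | 5 | 0.5311704 | 34.39 | 1 | 0.5311704 | 204.32 | 288 |
| 1598 | 10M | ~128GB | 1750 | 1488.67 | 6 | 0.9999 | 1405.46 | 1404.83 | 1 | 0.9999 | 304.12 | 6 | 0.999999973 | 1295.26 | 1 | 0.999999973 | 708.53 | 656 |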
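
The RSQUARED values in the comparison measure how much of the variance in the target the fitted model explains. The sketch below computes the standard coefficient of determination, 1 - SS_res / SS_tot. Whether either engine applies any adjustment to this definition is not stated here, and the function name `r_squared` and the sample values are illustrative only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not taken from the benchmark).
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.2, 8.9]
score = r_squared(actual, predicted)
```

A score close to 1 means the predictions track the targets closely. A lower score on a synthetic dataset generally reflects more noise relative to the linear signal.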

Browse through the tabs to see the time comparison: