Linear Regression#

Linear Regression is a fundamental algorithm in machine learning and statistics used for predicting a continuous outcome variable based on one or more predictor variables. It models the relationship between the independent variables and the dependent variable by fitting a linear equation to observed data. Linear Regression is widely employed for tasks such as forecasting, risk assessment, and understanding the underlying relationships within datasets.
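
As a minimal illustration of "fitting a linear equation to observed data", the sketch below fits a single-predictor model by ordinary least squares using the closed-form solution. The function name `fit_simple_linear_regression` and the sample values are illustrative assumptions, not part of the benchmark.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing squared error for y ~ intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(x, y)
prediction = intercept + slope * 6.0  # predict the outcome for a new x value
print(f"intercept={intercept:.2f}, slope={slope:.2f}, prediction={prediction:.2f}")
```

With several predictors the same idea applies, but the coefficients are obtained by solving a linear system or by an iterative optimizer rather than by this two-line formula.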

Vertica vs Spark#

Important

Version Details
Vertica: 8.0.1
Spark: 2.02

This benchmark compares the performance of Vertica’s Linear Regression algorithm with its counterpart in Apache Spark. It measures speed (training and total time), accuracy (R-squared), and scalability across datasets ranging from 1M to 1B rows and from 10 to 1598 columns. The results highlight the strengths and limitations of each implementation to help practitioners choose the Linear Regression solution best suited to their use case.

Dataset#

For this benchmark, we generated an artificial dataset from a Linear Regression model with added noise.
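
A dataset of this kind can be sketched as follows. The source does not say how the features, coefficients, or noise were drawn, so the standard-normal features, uniform coefficients, Gaussian noise, and the helper name `make_linear_dataset` are all assumptions made for illustration.

```python
import random


def make_linear_dataset(n_rows, n_cols, noise_std=0.1, seed=0):
    """Generate (X, y, coefficients, intercept) from a noisy linear model.

    Assumed setup: features ~ N(0, 1), coefficients ~ U(-1, 1),
    additive Gaussian noise with standard deviation noise_std.
    """
    rng = random.Random(seed)
    coefficients = [rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
    intercept = rng.uniform(-1.0, 1.0)
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        # Noise-free linear response, then add the noise term.
        target = intercept + sum(c * v for c, v in zip(coefficients, row))
        target += rng.gauss(0.0, noise_std)
        X.append(row)
        y.append(target)
    return X, y, coefficients, intercept


# Small example; the benchmark itself uses up to 1B rows and 1598 columns.
X, y, coefficients, intercept = make_linear_dataset(n_rows=1000, n_cols=10)
print(len(X), len(X[0]), len(y))
```

A larger noise level relative to the signal lowers the best achievable R-squared, which is one way a generator like this can produce datasets with different accuracy ceilings.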

Test Environment#

Vertica

Version	Instance Type	Cluster	vCPU (per node)	Memory (per node)	Deploy Mode	OS	OS Version	Processor freq. (per node)
8.0.1	On Premise VM	3 node cluster	36, 2 threads per core	755 GB	Enterprise	Red Hat Enterprise Linux	8.7 (Ootpa)	2.4GHz

Vertica: max iter = 100, e = 10^-6

Spark

Version	Instance Type	Cluster	vCPU (per node)	Memory (per node)	Deploy Mode	OS	OS Version	Processor freq. (per node)
2.02	N/A	N/A	36, 2 threads per core	755 GB	N/A	Red Hat Enterprise Linux	8.7 (Ootpa)	2.4GHz

Spark: max iter = 100, e = 10^-6
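
Both engines are run with the same two settings: a cap of 100 iterations and e = 10^-6. Assuming e is a convergence tolerance (the source does not define it), the sketch below shows how such settings typically bound an iterative solver. It uses Newton's method on a single-predictor model; the function name `newton_linear_fit` and the gradient-based stopping rule are illustrative assumptions, not either engine's actual implementation.

```python
def newton_linear_fit(x, y, max_iter=100, tol=1e-6):
    """Fit y ~ b0 + b1 * x with Newton's method on half the mean squared error.

    Returns (b0, b1, iterations_used).
    """
    n = len(x)
    # Hessian of the objective; it is constant because the model is linear.
    h00 = 1.0
    h01 = sum(x) / n
    h11 = sum(xi * xi for xi in x) / n
    det = h00 * h11 - h01 * h01

    def gradient(b0, b1):
        g0 = sum(b0 + b1 * xi - yi for xi, yi in zip(x, y)) / n
        g1 = sum((b0 + b1 * xi - yi) * xi for xi, yi in zip(x, y)) / n
        return g0, g1

    b0, b1 = 0.0, 0.0
    iterations = 0
    for iterations in range(1, max_iter + 1):
        g0, g1 = gradient(b0, b1)
        # Newton step: solve H * d = g for the 2x2 system, then move against d.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 - d0, b1 - d1
        # Assumed stopping rule: largest gradient component falls below tol.
        if max(abs(g) for g in gradient(b0, b1)) < tol:
            break
    return b0, b1, iterations


# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1, iterations = newton_linear_fit(x, y, max_iter=100, tol=1e-6)
print(b0, b1, iterations)
```

Because the least-squares objective is quadratic, a full Newton step lands on the minimum directly, so the tolerance is met well before the iteration cap. Quasi-Newton methods such as BFGS approximate the Hessian and generally need a few more steps.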

Comparison#

Data			Vertica 8.0.1 (With BFGS Optimizer)				Vertica 8.0.1 (With Newton Optimizer)				Spark 2.0.1 (L-BFGS)			Spark 2.0.1 (Newton Optimizer)			Spark loading data
Columns	Rows	Size	Total Time	Total Time	Number of Iterations	RSQUARED	Total Time	Total Time	Number of Iterations	RSQUARED	Training Time	Number of Iterations	RSQUARED	Training Time	Number of Iterations	RSQUARED	Loading Time	Number of Partitions
100	1M	800MB	4.49	5.07	4	0.86	4.81	4.56	1	0.86	1.43	5	0.860057845	0.7	1	0.860057845	33.46	138
100	10M	8GB	26.39	23.63	4	0.758	26.04	23.19	1	0.758	96.98	4	0.758026674	2.09	1	0.758026674	136.9	288
100	100M	80GB	84.7	89.26	4	0.9044	85.93	96.38	1	0.9044	216	4	0.904440664	68.47	1	0.904440664	370.93	431
100	1B	800GB	1748.51	2038.15	4	0.9999	1808.56	2167.1	1	0.9999	2568.68(no cache data)	4	0.99999958	1788.75(no cache data)	1	0.99999958	0	4143
10	10M	800MB	3.52	3.39	4	0.9999	2.82	3.1	1	0.9999	5.07	4	0.999995264	2.9	1	0.999995264	21.72	144
20	10M	1.6GB	5.19	4.98	4	0.9999	4.49	4.52	1	0.9999	5.43	4	0.999998038	6.77	1	0.999998038	13.97	276
500	10M	40GB	151	148.68	4	0.5311	146.97	141.21	1	0.5311	40.74	5	0.5311704	34.39	1	0.5311704	204.32	288
1598	10M	~128GB	1750	1488.67	6	0.9999	1405.46	1404.83	1	0.9999	304.12	6	0.999999973	1295.26	1	0.999999973	708.53	656
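
The RSQUARED columns report the coefficient of determination. As a reminder of what that number measures, here is a minimal sketch using the standard definition, 1 - SS_res / SS_tot. The helper name `r_squared` and the sample values are illustrative assumptions, and the engines' own implementations may differ in detail.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations and predictions from a fitted line y = 1.09 + 1.97 * x.
y_true = [3.1, 4.9, 7.2, 8.8, 11.0]
y_pred = [1.09 + 1.97 * x for x in [1.0, 2.0, 3.0, 4.0, 5.0]]

score = r_squared(y_true, y_pred)
print(f"R-squared: {score:.4f}")
```

A value near 1 means the model explains almost all of the variance in the response, while a value such as 0.53 indicates that a large share of the variance is noise that no linear model can capture.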

[Figures: time comparison charts for the BFGS and Newton optimizers, each with one chart for the 100M case and one for the 10M case.]