Downloading the Machine Learning Example Data

You need several data sets to run the machine learning examples. You can download them from the Vertica GitHub repository. These examples introduce the machine learning functionality provided by Vertica.

You can download the example data in either of two ways:

Loading the Example Data

You can load the example data by either:

Example Data Descriptions

The repository contains the following data sets.

Name Description
agar_dish

Synthetic data set meant to represent clustering of bacteria on an agar dish. Contains the following columns: id, x-coordinate, and y-coordinate.

agar_dish_1

375 rows sampled randomly from the original 500 rows of the agar_dish data set.

agar_dish_2

125 rows sampled randomly from the original 500 rows of the agar_dish data set.

baseball

Contains statistics from a fictional baseball league. The statistics included are: first name, last name, date of birth, team name, homeruns, hits, batting average, and salary.

faithful

Wait times between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

Reference

Härdle, W. (1991) Smoothing Techniques with Implementation in S. New York: Springer.

Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365.

faithful_testing

Roughly 60% of the original 272 rows of the faithful data set.

faithful_training

Roughly 40% of the original 272 rows of the faithful data set.

house84

The house84 data set records how each member of the U.S. House of Representatives voted in 16 votes. Contains the following columns: id, party, and vote1 through vote16.

Reference

Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985.

iris

The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Reference

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

iris1

90 rows sampled randomly from the original 150 rows in the iris data set.

iris2

60 rows sampled randomly from the original 150 rows in the iris data set.

mtcars

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Reference

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

salary_data

Contains fictional employee data. The data included are: employee id, first name, last name, years worked, and current salary.

transaction_data

Contains fictional credit card transactions with a BOOLEAN column indicating whether there was fraud associated with the transaction. The data included are: first name, last name, store, cost, and fraud.
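
Several of the data sets above are random samples of a parent data set, for example agar_dish_1 and agar_dish_2 (375 and 125 of 500 rows) and iris1 and iris2 (90 and 60 of 150 rows). The following is a minimal Python sketch of how such a split could be produced. It is not the procedure used to build the shipped files: the function name split_rows, the seed, and the integer ids are hypothetical, and it assumes the two subsets are complementary and non-overlapping, which the row counts are consistent with but the descriptions do not state.

```python
import random


def split_rows(row_ids, first_size, seed=0):
    """Randomly partition row_ids into two disjoint subsets.

    Hypothetical helper for illustration only. Returns a pair of sorted
    lists: the first holds first_size ids, the second holds the rest.
    """
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(row_ids)
    rng.shuffle(shuffled)       # random order, no id repeated or dropped
    return sorted(shuffled[:first_size]), sorted(shuffled[first_size:])


# Assumption: ids 1..500 stand in for the 500 rows of agar_dish.
agar_ids = range(1, 501)
part_1, part_2 = split_rows(agar_ids, 375)

print(len(part_1), len(part_2))  # 375 125
```

Calling split_rows(range(1, 151), 90) gives the 90/60 proportions described for iris1 and iris2 in the same way.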