Business Team Meeting Discussion Working Concept
When learning new database applications, a good place to start is with some compelling, real-world data. It’s not necessarily so easy to find. When looking around at publicly-available data sets, we found the road-safety data provided by the government of the United Kingdom under their Open Government License for public sector information. We loaded the data into Vertica and were able to come up with some interesting visualizations.
U.K. Traffic Data: Map of Traffic Accidentsby Severity
About This Project
My team tests the integration of Vertica with third-party products. We have used the road-safety data sets and others to enrich our understanding of the products that integrate with Vertica. In this blog, we’ll share our experience with the road-safety data and give you the tools for experimenting with it on your own.
Choosing the Data
When we were looking for data, we initially looked for fairly high-volume data setsa logical choice since Vertica’s specialty is Big Data. The traffic safety data sets were actually relatively small, with a million to two million records in the main fact tables. But we decided it was worth investigating because of the interesting attributes of the records. Also, we thought that the load process would be fairly straightforward since the data was in CSV format.
Loading the Data
Actually, we did hit a few gotchas when loading the data. Some of the CSV files for the dimensional information were actually in an Excel formatnot a big deal but it meant opening the spreadsheets and exporting as CSV. After the export, we discovered some commas we hadn’t anticipated, so we had to edit those records manually. Then we discovered that some of the more recent files contained additional age fields, so we had to edit our load script to allow for that. We also edited the script to find and replace missing dimension values and perform a few other checks we hadn’t thought of previously. We probably could have saved ourselves a bit of time and effort with a little more advance planning…. But we did end up with a clean, complete data set that was interesting to investigate.
Try Out Our Scripts
The load scripts we used (Create Tables.sql) are attached to this blog.
Feel free to try them out. We haven’t gone back to clean up the spelling errors and naming inconsistencies. So please just allow for those. Our focus was on getting the load to work so we could look at the data.
We used Vertica 7.2. If you would like to use the scripts but dont have a Vertica database, you can download the Community Edition free of charge from my.vertica.com. The Community Edition has more than enough capacity for this exercise.
Try the Visualization Software
My team works with many technology partners that offer visualization tools. We could have chosen any of them for this exercise. We chose Tableau because version 10 had just been released, and it was a good opportunity for us to explore its new features and assess how they work with Vertica. We would like to do more data explorations like this one using other tools in the future. Please let us know if you’d like us to use your visualization product.
The Tableau 10 workbook we created (UK Traffic Accidents.zip) is attached to this blog.
If you would like to try out our workbook but do not have Tableau, you can download a trial version of Tableau that will work with this example from the Tableau website.
Technical Note: Since we did not add constraints to the database, we enabled the Assume Referential Integrity options on our data sources. This allowed Tableau to drop unnecessary dimension tables from the query when we were not looking at those values.
What We Learned from the Data
We spent some time exploring the traffic safety data we had loaded into Vertica. Here are some of our observations:
- Although the number of accidents is generally decreasing over time, there was a bump in 2014. At that time, the 30 MPH speed limit was reduced to 20 MPH in some areas. When the speed limit was lowered, the number of accidents in those areas actually increased.
- Most accidents involve two vehicles. But serious accidents are almost as likely to only involve a single vehicle. The likelihood of a fatal accident involving a single vehicle is even higher.
- On weekdays, accidents are most likely to occur during commuting hours, when people are driving to and from work.
- On weekends, more accidents happen in the afternoon.
- Even though cars are getting safer and fewer people in cars are dying in traffic accidents, most traffic fatalities are still people in cars. The next highest category of fatalities is pedestrians. The number of fatalities on smaller motorcycles (less than 125cc) is increasing.
- More males than females die in traffic accidents. Out of all fatalities, the highest number are males aged 2635.
- The newer data sets include more details about driver age. But the data is a bit suspect, because age is tracked in five-year intervals. It would be interesting to find out if the data is really showing a trend or if rounding errors are skewing the results.
- New vehicles less than a year old are the most likely to be involved in an accident.
- Overall, the number of traffic accidents is decreasing. However, there are several areas where the number of traffic accidents is actually increasing. This is especially clear in the Birmingham area, where the number of accidents has been increasing for each
of the last 3 years of data since 2012.
We hope you found this investigation interesting. If you have comments, we would like to hear them.
If you tried out this exercise, using Tableau or any other visualization tool, what information did you discover in the data? We’d be happy to take a look at your ideas. And we could potentially help you leverage your reports or workbooks to generate interest in your visualization product.
We are always looking for interesting data. If you know of any large, public data sets that contain lots of information, please let us know.
UK Traffic Accidents.zip