Parallel data management in databases is tough to learn as a student. It can feel like one more esoteric theory that never quite sticks. Now a class at Brandeis University brings that theory down to earth and gives students a real chance to see how it all works.
For the last 10 years, Dr. Olga Papaemmanouil (she gave us permission to simply call her “Olga”) has been a professor of Computer Science at Brandeis University. By the time she received her PhD from Brown University, she had gotten to know and work with some of the academics whose research led to Vertica’s design and columnar architecture, including Michael Stonebraker.
As a professor in CS, she believes in teaching from a practical standpoint, not just an academic one. On a recent sabbatical, Olga reached out to the team of Vertica engineers she had known for years, aware that many of her research interests overlapped with theirs. Soon she began developing a class that would not simply offer an overview of how parallel systems and databases are designed and built today, but would concretely address the challenges and tradeoffs in parallel architecture, including the practical considerations of moving workloads to the cloud.
Her new course, launched in the Spring 2020 semester, offers hands-on training and serves as an introduction to distributed and parallel databases. She uses Vertica for this class, for a variety of reasons.
Choosing Vertica for the classroom
Her goal is to give students real experience, and to help them build a portfolio they can point to upon graduation. “In the past,” she says, “database courses in computer science had students read a book, write a SQL statement, and see how fast it runs. My course attempts to go a little deeper into the query plan, see how the optimizer chose to optimize, and to understand the data distribution behind it.”
Typically, a first course in databases focuses on the front-end experience, explaining the SQL language. Students don’t learn about optimizations in distributed and parallel processing systems, or why a particular partitioning scheme did or did not work well.
“Vertica offers a processing platform that integrates many of the essential data distribution and parallel concepts that students read about in textbooks, such as column-oriented storage, vertical partitioning (aka projections), horizontal fragmentation (aka segmentation in Vertica), distributed joins and aggregations. Using Vertica as our case study system helps students dig deeper into what is going on behind the front end. It shows how query plans and storage optimizations interact and the impact of these interactions on query performance. Students are able to see database engineering at work,” she says. “Vertica isn’t just a columnar analytics database. It offers students a chance to directly explore concepts like load balancing, and the difference between an architecture with storage and compute together and an architecture that separates compute from storage.”
A hands-on data experience
Designed for a classroom of 30-40 students, the course is an elective for both graduate and undergraduate students. “This class is their opportunity to learn how real data management systems work, and that’s how I create the course assignments. I ask students to run and optimize queries, distribute the data, and understand how those decisions impact performance.”
One of the goals is to have students apply what they’ve learned by using the database. “So for example they create projections on a TPC-H benchmark table, collect data, and report back on how it was distributed on the cluster. How was it partitioned? Did we end up with an even distribution? Did it help the performance of your query workloads? Why or why not?”
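To make the "did we end up with an even distribution?" question concrete, here is a minimal Python sketch of hash-based segmentation. It is not Vertica code and does not use Vertica's hash function; the row layout, the node count of 4, and the helper names (node_for, distribution, skew) are all assumptions made for illustration. It compares segmenting on a high-cardinality key with segmenting on a column that has only three distinct values.

```python
import hashlib
from collections import Counter


def node_for(key, num_nodes):
    """Map a segmentation key to a node index using a stable hash.

    Illustrative only: a real system uses its own hash function.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribution(keys, num_nodes):
    """Return the number of rows that land on each node."""
    counts = Counter(node_for(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


def skew(per_node):
    """Ratio of the fullest node to the average node (1.0 means perfectly even)."""
    mean = sum(per_node) / len(per_node)
    return max(per_node) / mean if mean else 0.0


NUM_NODES = 4  # hypothetical cluster size

# Hypothetical rows shaped like (order_key, order_status).
orders = [(i, "FOP"[i % 3]) for i in range(30_000)]

by_key = distribution((row[0] for row in orders), NUM_NODES)
by_status = distribution((row[1] for row in orders), NUM_NODES)

print("segmented by order_key:   ", by_key, "skew =", round(skew(by_key), 3))
print("segmented by order_status:", by_status, "skew =", round(skew(by_status), 3))
```

With only three distinct status values spread over four nodes, at least one node receives no rows at all, which is the kind of imbalance the assignment asks students to detect and explain.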
For her class, Olga uses the Community Edition of Vertica because “it’s a very hybrid system without being extremely complex to understand. A high percentage of topics I want to cover can be illustrated with it: for instance, query optimizations, shared-nothing architectures, separation of computation logic and storage schemas, the impact of data sorting variations, redundant storage for high reliability and availability.”
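One way to see why "data sorting variations" matter in a column store is to look at how sort order changes the compressibility of a single column. The sketch below uses generic run-length encoding in Python; it is an assumption-laden illustration, not Vertica's storage format, and the column values and helper names (rle_encode, rle_decode) are hypothetical.

```python
import random
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


rng = random.Random(0)  # fixed seed so the example is repeatable

# Hypothetical low-cardinality column, stored in arrival order.
statuses = [rng.choice(["F", "O", "P"]) for _ in range(10_000)]

unsorted_runs = rle_encode(statuses)
sorted_runs = rle_encode(sorted(statuses))

print("rows in column:        ", len(statuses))
print("runs in arrival order: ", len(unsorted_runs))
print("runs when sorted:      ", len(sorted_runs))
```

Sorting the column collapses it to one run per distinct value, while the same data in arrival order needs thousands of runs. The trade-off is that a table can only be physically sorted one way per copy, which is part of why redundant, differently sorted copies of the data come up in the same discussion.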
Even before Olga began using Vertica to demonstrate concepts in her class, many of her students at Brandeis went on to work at Vertica after graduation. That’s a great reflection of her commitment to giving her students the practical tools they need.
Check out the Community Edition of Vertica if you want to try things out for yourself. The Vertica Academy has free video instruction on how to get the most out of it.
Or, if you’re a Brandeis student, check out Professor Olga’s class!