One on One with Davin Potts: 6. Advantages of In-Database Machine Learning
At the recent Data Day Texas event, I sat down with Davin Potts and had a long conversation about a wide variety of subjects. I divided the conversation into multiple chunks by subject, and have been posting them one chunk at a time. In the first post, we discussed the wide variety of programming languages and tools in use for data science projects right now, and how he became a core Python committer. In the second post, we discussed the advantages of KNIME for a data science consultant like Potts, and the advantages of using SQL in a database to do data manipulation and analysis. In the third post, we dove into a cool new feature coming in the next version of Python. In the fourth post, Potts gave a few tips on how anyone who uses open source projects like Python can contribute in an important way without being an expert. In the fifth post, we discussed how open source and Vertica interact, with focus on the new open source Python interface for Vertica.
This is the final installment of my discussion with Davin Potts. I have to say it was a lot of fun catching up and talking shop. In this final interview post, we talked about some of the advantages of doing machine learning inside a database that a lot of folks don’t know about.
Paige Roberts: One misconception I’ve had for a long time, probably from hanging out with the Hadoop and Spark crowd, was that you need to do machine learning in something like Spark or Python. You pull data out of the database and put it in a dataframe or something, and then you do machine learning. Then, you put your results back in the database. It was kind of an epiphany to realize that the data is already there in a table. Why move it?
Davin Potts: I’ve never seen anyone do a careful survey, all I have are anecdotes, but I get that same impression. Relatively few people are doing their machine learning work inside of the database. And I think that’ll change with time, but it’s not going to happen overnight because whatever machine learning they were doing before, they were already doing it in a particular way.
And when they shift to doing that inside of the database, there’s also a mental shift. Like during the talk, I put up the first slide about running Python code inside of Postgres. There were actually two people who I saw in the audience do a double take. Like, “What?”
Roberts: You can’t do that.
Potts: Yeah. First, there was that. Then, I saw on their faces a “Wow,” and then there was a smirk of, “No, that’s crazy.”
People need to overcome that. If they do, the reward is the performance. You’re not going to do machine learning in a database for the cool factor. You need to do it because it’s more performant.
Because you don’t have to move your data around. You don’t have that IO and CPU hit from data movement, or from data transformation. You don’t have to have any of that impact. And you don’t have to downsample or anything. You just leave the data where it is, and do your machine learning there.
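A minimal sketch of that point, not taken from the conversation: it fits a one-variable linear regression two ways, once by pulling every row into Python and once by letting the database compute the sums so only a single row leaves it. SQLite (from Python's standard library) stands in for an analytic database here, and the `readings` table and both function names are hypothetical.

```python
import sqlite3

# Hypothetical toy table; SQLite is only a stand-in for a real analytic database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(1000)],
)


def fit_by_moving_data(conn):
    """Pull every row out of the database, then fit y = slope * x + intercept."""
    rows = conn.execute("SELECT x, y FROM readings").fetchall()
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


def fit_in_database(conn):
    """Let the database compute the sums; only one summary row comes back."""
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM readings"
    ).fetchone()
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n
```

Both functions return the same coefficients. The difference is how much data crosses the database boundary: every row in the first case, five numbers in the second.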
The other challenge is that since they weren’t already doing machine learning in SQL, you’re asking them to make a transition to a new language as well. Look at any of Vertica’s competitors, any of the data stores, there’s usually one premier language. PL/pgSQL for Postgres, Java for Oracle, and go down the list. Some of them don’t even have a choice, other than just kind of “Well, you get ANSI SQL, and why would you ever need anything else?” “Yeah. Okay, MySQL.”
But the notion of things like what Microsoft has started doing, like embedding Python and R inside of the database. That’s a serious commitment on the part of any of those companies.
It’s a way to give more people more reasons not just to adopt the database, but to spend their whole lives inside of it. That might be a reasonable strategy, but I’m sure watching Microsoft will show us how much that actually makes a difference between…
Roberts: You don’t have to watch Microsoft. Vertica did that years ago. You can build machine learning algorithms in Vertica, in R, in Python, in Java, in SQL, in whatever it is you want to do it in. And our problem is, no one knows that.
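As a rough illustration of calling Python from SQL (an aside, not something shown in the interview), the sketch below registers a Python function with SQLite through `sqlite3.create_function` and then applies it to every row in a query. This is only an analogy: SQLite is embedded, so the function runs in the client process, whereas Vertica or Postgres would run user-defined code on the server through their own extension mechanisms, which are not shown here. The `events` table, the `logistic_score` function, and the weight and bias values are all hypothetical.

```python
import math
import sqlite3


def logistic_score(x, weight, bias):
    """Hypothetical one-feature logistic model used to score a row."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL under the same name (3 arguments).
conn.create_function("logistic_score", 3, logistic_score)

conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, x REAL)")
conn.executemany("INSERT INTO events (x) VALUES (?)", [(-2.0,), (0.0,), (2.0,)])

# The scoring happens inside the query; no dataframe is built on the side.
scored = conn.execute(
    "SELECT id, logistic_score(x, 1.5, 0.0) FROM events ORDER BY id"
).fetchall()
```

The query returns one `(id, score)` pair per row, so the model is applied where the data lives and only the scores come back.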
Potts: Actually, I didn’t know that either until recently.
Roberts: See? Nobody knows. And that’s a problem. It needs to change. People think, Vertica is not a baby NoSQL database, so it must not be able to do what I want it to do, because I’m doing something cutting edge, like tracking streaming aircraft data. I just walked out of a talk downstairs. A guy was talking about how they built a special database to do time series analysis because “there just wasn’t a good database for that.” And I’m like, “Hello?” We’re awesome at time series analysis. It wasn’t even a columnar database. It’s not like columnar analytics databases are a brand new idea. This is something that’s been around a while. If you’re going to do analytics, you need to use a columnar format, if you want good performance and scale.
Potts: That’s like basement level. That’s below the foundation. [laughs]
Roberts: It’s interesting to me, sometimes, to see this attitude about data management software right now: If it’s not open source, and it’s not brand new, it must not be any good. There is that big chunk of prejudice that established proprietary software has to overcome.
Potts: There is. It’s funny how that has shifted, right? If we go back, hmm, maybe 15 years ago, people were still using stupid phrases like “open sores.”
Now, oh man, now the proprietary code is the bad guy. That’s not cool either.
Roberts: I was glad to hear Gwen Shapira say something about that: no, you can’t do it all in open source. Use some of the vendor’s stuff. Just be ready to escape if you need to, so you’re not locked in. Good advice.
Potts: If you pick on the Python developers, with the few exceptions that are weirdos like me that are consultants, which are very few, the vast majority of them work for companies with proprietary software products.
And some of them may have open source or significant open source components, so a lot of big companies have both proprietary and open source things. But proprietary is part of what the open source community works on. It’s part of their day jobs. It shouldn’t feel like an either/or thing.
Roberts: They should know. I mean, you would think that if you work on proprietary software, you know that it’s good software. You built it.
Potts: Yep. I think it’s the fan boys and fan girls running around wanting to rally behind a banner, and create the appearance of sides that need to be rooted for, that exacerbate the situation. There will always be people like that, but the hype machine will shift, and will come back to a more reasonable middle ground. It helps when the proprietary tools start interacting with the open source tools and show a willingness to do so. When you create the perception of “we’re against that,” people reject you.
Roberts: Yeah. Microsoft figured that out.
Potts: Microsoft figured that out. They had to get rid of that guy that knew how to throw chairs in order to figure that out, but they figured it out. And so the notion of Vertica, I don’t know of a great open source story around Vertica. I’ve never heard of Vertica being down on open source either. But the notion of embracing the fact that there are open source things inside of the database that you can already do, like yesterday, I think that’s a cool story.
I really appreciate Davin Potts giving me so much of his time and his wisdom. He’s been doing data science since before anyone called it that. Thanks for reading. And if you want to try out some of the things we chatted about, here’s a link to the free Vertica Community Edition, and the equally free KNIME community edition. Happy analyzing!
And be sure to check out the earlier posts from this interview:
In the first post, we discussed the wide variety of programming languages and tools in use for data science projects right now, and how he became a core Python committer. In the second post, we discussed the advantages of KNIME for a data science consultant like Potts, and the advantages of using SQL to do data manipulation and analysis. In the third post, Potts shared some exciting news about the upcoming Python 3.8 release. In the fourth post, Potts gave a few tips on how anyone who uses open source projects like Python can contribute in an important way without being an expert. In the fifth post, we discussed how open source and Vertica interact, with focus on the new open source Python interface for Vertica.
Learn more about in-database machine learning in Vertica.
Learn more about doing time series analysis in Vertica Analytics Platform.
Learn more about the intersection between Vertica and open source.
Try out Vertica for free.