At the recent Data Day Texas event, I sat down with Davin Potts and had a long conversation about a wide variety of subjects. I divided the conversation into multiple chunks by subject, and will post them one subject at a time. In the first post, we discussed the wide variety of programming languages and tools in use for data science projects right now, and how Potts became a core Python committer. In this second post, we discuss the advantages of KNIME for a data science consultant, and the advantages of using SQL in the database to do data manipulation and analysis.
Paige Roberts: So, I just attended your talk on not choosing sides when choosing tools. (Choosing Sides When Choosing Tools Hurts) As a consultant, you can’t choose sides. You have to work with whatever your customer wants. So, I know you’ve been a long-time user of KNIME, one of Vertica’s partners. You used to do talks on KNIME back when I was hosting the Austin KNIME meetups, and you used to even work at KNIME, right?
I was one of the founders of the company.
Paige Roberts: One of the founders? I didn’t know that. So, what are the advantages of KNIME for a consultant who has to go in and use whatever is required?
So, one of the neat payoffs, especially when starting an engagement with a new group, where not everybody in the room knows you: I’m trying to convey that I understand some of what they’re talking about. With KNIME, I’m able to make some initial traction in being able to show that understanding, not just verbally, but in a very visual way.
KNIME gives that visual presentation of “See, here I am reading in your data. Here I am transforming something about your data. Here I am calculating something new from your data. And now, I’m presenting information about the data back to you. All in that graphical interface.” It provides a really nice way to communicate first and foremost.
Whereas if I start out by writing code, no one has ever claimed that that’s an exciting or engaging way to present information. Let’s just put a bunch of code up on the screen. No. Even with tools like Jupyter Notebooks which are fantastic, you’re still struggling to explain to the non-technical people. They’re not interested in the code. They want to get past the code quickly to the graphics, to the visuals.
And with KNIME, they feel like it’s almost all approachable. They can wrap their heads around what they’re seeing at a level that they want to operate at. And if they want to delve deeper, they can. So, in terms of helping new engagements, KNIME is an excellent tool for consultants.
Roberts: Communicating complex concepts has been my job for years. The communication aspect is one of the things that I always thought was pretty impressive about KNIME. But the other aspect, you emphasized in your talk: You don’t have to pick a single stack. You want to use Spark, you want to use Python, you want to use R, you want to use Java, you want to … whatever it is that you want to use, you can. You can put it all in a KNIME flow. And you demonstrated that.
And, of course, now that I’m working with Vertica, I was particularly interested in the emphasis you put on using SQL. You can do in-database SQL queries, and data manipulation. You don’t have to take data out of the database, then operate on the data. Just pass in SQL and go on.
Right. For that initial part of the conversation with a new client, KNIME is great. But one of the biggest issues within virtually every company is siloed data. Maybe it’s just human nature that we create these silos. For better or for worse, it’s what happens all too often. So, the ability to quickly tap into that silo is essential.
Like you were saying, as a consultant, I try to adjust to whatever it is that the client has chosen as their technology stack. And I’m happy to do that, be flexible, and contribute in a meaningful way in a lot of different tech stacks. But I can’t do them all, and there’s no hope for one person doing that. The ability to quickly tap into the silos with KNIME means I can demonstrate something, but it’s not just visuals. I can take it into a production environment, on any stack. That is something that I have done with a lot of groups, and will continue to do with KNIME.
So, it’s not just about: Give me a nice graphical quick feedback experience that feels rewarding. It’s actually something that they can think about taking to production as well. Not every company is going to want to do that, and that has to be okay. And so when they want certain things to be implemented in Scala because that’s the one true language, or it has to be inside of Fortran because that’s the one true language. There might be a company like that, right? That also has to be okay in the end.
If you go in trying to convince people “Stop using your favorite tool. Use my company’s tool instead.” That is a hard slog. And a number of the other companies here as sponsors of Data Day Texas are in that game of trying to convince people: Stop using your old database. Use ours instead.
I might know something about that. Yeah.
More power to them. And I’m sure each of those tools brings some cool new features that, for the right people, are an excellent choice, but that is such a hard fight. And as a tool vendor company, they can’t be flexible in the same way as a consultant can. But a consultant can’t do some other things that they’re able to do as larger companies.
To me, now working at a specific database vendor, one of the nice things about KNIME is, even if I go in and I convince a customer that whatever database they had before, “That’s a bad idea, you should use my database. And here, let’s switch you all over to Vertica.” The key workflows that the company counts on are still going to work because KNIME works with whatever database you have, and whatever other tech you have. I think that’s powerful, that flexibility.
I think to a very significant extent, companies, Vertica included, to pick on them for a moment, the relationship between the database and the application developers is not always a healthy relationship, right? The application developer doesn’t often understand what a database can actually do for them. And to a certain extent, it’s almost like a religious lack of belief or belief structure.
It can be a holy war.
And so trying to beat the application developer over the head and say, “No, Vertica will totally kick butt. It’ll do exactly what you need. You should totally use it.” Their boss may even go to them and say, “Thou shalt use Vertica.” And they may use it under protest or duress. But they may not use it in a way that really benefits them. So, you get that schism.
Some of what helps is the database tools making themselves easier to use by providing different sorts of APIs, providing things other than SQL. There are a lot of different strategies that different groups have pursued. I’m sure all of those have helped different people.
The thing that we’ll still struggle with is the cost when that schism remains and the database is being misused on the application developer side. The cost that we pay in terms of performance often comes from the application pulling too much data out of the database, and doing things in the application code that should’ve been done inside of Vertica.
Or they’re holding on to data in the application when they should let the database do its stuff…
Let the database do what databases do well.
They’re creating risk as well. They’re not able to write code that operates as fast as what Vertica is capable of because Vertica has years and years of effort in optimizations that have gone into it.
Vertica focuses on just that one thing, crunching a lot of data at optimal performance. That’s what we’re good at.
Exactly. But when we move that data across the wire, we pay a significant penalty.
And when we transition the data from how it’s represented in the data store into the application code, there’s a translation event, so that costs CPU cycles. The transmission also costs IO cycles, and we’re paying double duty on them.
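To make the point about moving data across the wire concrete, here is a minimal sketch, not something shown in the interview. It uses Python's built-in sqlite3 module purely as a stand-in for a database such as Vertica, and the `sales` table, its columns, and both function names are hypothetical. The first function pulls every row into the application and aggregates there; the second sends the aggregation to the database as SQL and fetches only the result.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real analytic
# database such as Vertica. The "sales" table and its data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [("north" if i % 2 == 0 else "south", float(i)) for i in range(10_000)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


def totals_in_application(conn):
    """Pull every row out of the database, then aggregate in Python."""
    fetched = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in fetched:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(fetched)


def totals_in_database(conn):
    """Let the database aggregate; only the summary rows come back."""
    fetched = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(fetched), len(fetched)


app_totals, app_rows = totals_in_application(conn)
db_totals, db_rows = totals_in_database(conn)

# Same answer either way, but very different amounts of data transferred.
print("rows fetched by the application approach:", app_rows)
print("rows fetched by the in-database approach:", db_rows)
```

Both approaches produce the same totals, but the first one moves all 10,000 rows into the application while the second moves two. With a real networked database, each of those extra rows would also pay the transmission and translation costs described above.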
Be sure to read the first part of my discussion with Davin Potts, where we talked about how he became a core Python committer, and the wide variety of tools currently being used to build data science workflows. In our next discussion, Davin shares some exciting news about a huge new feature coming in Python 3.8.
Learn more about KNIME
Learn more about Vertica