At the recent Data Day Texas event, I sat down with Davin Potts and had a long conversation about a wide variety of subjects. I divided the conversation into multiple chunks by subject, and have been posting them one chunk at a time. In the first post, we discussed the wide variety of programming languages and tools in use for data science projects right now, and how he became a core Python committer. In the second post, we discussed the advantages of KNIME for a data science consultant like Potts, and the advantages of using SQL in a database to do data manipulation and analysis. In this post, we dive into a cool new feature coming in the next version of Python, and how everyone can help with open source.
Davin Potts: Something new is planned as a part of the upcoming release of Python. It should be more along the lines of what I talked about earlier.
Paige Roberts: Shared memory.
Davin Potts: Shared memory is not a new idea at all. If anything is new, it’s the idea of shared memory having a modern use. The old-school version that became widespread was System V shared memory. To indicate that it was old, they used the Roman numeral V instead of a 5.
Paige Roberts: [laughs]
Potts: Nowadays, we have somewhat more modern incarnations of it, directly derived from it, but they go by different names. POSIX shared memory is available on all of the Unix platforms in a consistent way. And on Windows, because Windows sometimes feels it needs to do things differently, they have Named Shared Memory.
But exposing it in a language like Python with a single, consistent API that works across all of the modern platforms everybody is focused on gives us one tool to use everywhere. Your code can still stay platform independent.
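The single consistent API Potts describes shipped in Python 3.8 as the standard library’s multiprocessing.shared_memory module. A minimal sketch of the workflow (the size and payload here are just for illustration):

```python
from multiprocessing import shared_memory

# Create a named block of shared memory. The OS backs it with POSIX
# shared memory on Unix and Named Shared Memory on Windows, but the
# Python API is identical on every platform.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"               # write through the raw memoryview

# Any other process (or handle) can attach to the same bytes by name.
other = shared_memory.SharedMemory(name=shm.name)
data = bytes(other.buf[:5])          # reads the very same memory, no copy

other.close()                        # detach this handle
shm.close()
shm.unlink()                         # the creator frees the block when done
```

The name attribute is the whole cross-process handshake: pass that string to another process and it sees the same bytes.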
Roberts: Without moving your data around and translating it constantly and having that slow down.
Potts: You can avoid that cost. And that matters especially in Python where, to be nerdy about it, we work with distinct processes. People tend to think of using threads to get parallel performance out of their code. It’s the go-to solution that we’ve all been taught. Writing multi-threaded code is the first thing we think of, but it’s not the only choice. And one of the reasons for its popularity is that all of the threads can see all of the same things in memory at the same time.
So, we avoid the need to translate and communicate and transmit data. That’s a huge win. The gotcha is that you can see everything in memory across all of the threads.
And manipulate it and they can bump into each other and–yeah.
And very bad things happen. So, to protect against that, we have the concept of locks and semaphores. But in modern languages, people also talk about the concept of thread-local storage. The idea is: I can hear too many people talking in memory. Too much noise. What I need is a quiet space to be by myself. That’s thread-local storage: the things that I create there, none of the other threads can see or touch or manipulate. I get my quiet space.
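Python exposes this idea directly as threading.local(). A small sketch of the “quiet space” (the worker function and names are illustrative):

```python
import threading

local = threading.local()   # each thread gets its own private attributes
seen = {}

def worker(name):
    local.value = name              # invisible to every other thread
    seen[name] = local.value        # record what this thread sees

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread never set local.value, so for it the attribute
# simply does not exist -- the workers' values never leaked out.
main_has_value = hasattr(local, "value")
```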
Which is great, but then you can lose that advantage that you had before when the memory was being shared.
So, the idea is this: with shared memory, you can create processes that don’t trip over one another and do things in parallel, but traditionally you had to transmit the data, communicate it, translate it. Shared memory flips the threading model. Instead of everything being shared by default, so that you have to create a private little space for the things you really don’t want to share, everything is exclusive to a process, and you create a shared space where you do want to share things. You can’t accidentally overshare.
So, you only put the things you want shared in the shared space. It makes sense.
That’s the idea. And the technique has been used to great effect for decades now, from System V to the POSIX shared memory stuff in C and C++ especially, but the shared memory construct is accessible in lots of different areas. The focus for the next Python release is a module that was created as a prototype and has been tested and beaten upon. It’s remained unchanged for six months now, and it’s actually been around for closer to a year and a half, so it seems to be stable and ready for everyone’s use.
When do you think Python 3.8 is going to come out?
The releases are on a fixed schedule, published long ahead of time. It’s an 18-month release cycle, and so the release is going to be in December this year.
There’s a nice long alpha cycle and beta cycle before we actually get to that point. So the first alpha is actually going to start in a week.
Oh. So if people want to try it out and bang on it, to make sure it’s good before it goes live?
And in the meantime, they can actually get the source code and build that directly, but people usually have to be a bit more devoted to actually want to go and do that.
It could be somebody like you. [laughs]
Yes, it turns out to be really, really easy to do. Granted, not everyone is a developer, but the task is: download the source code, unpack the tarball, and type “./configure” and “make”. That should work without any other special flags on pretty much any of the platforms.
So, yeah, there are a gazillion flags on it, but you probably don’t need to mess with them.
Just do that and it’ll probably work. And amazingly enough, the whole thing will build in, I don’t know, a minute or two, depending upon how old your system is, but, yeah. Not everybody is gonna do that. That’s okay. That’s fine.
You don’t want everybody doing that. You just want the people who are really dedicated to do that.
Honestly, I would be happy to have more people helping, testing things. I can’t imagine reaching the point where I start saying, “Yeah, okay, that’s too many.”
As far as this new feature, if you make it better for Python and you make it shared, that means it’s also better for anybody else who wants to come in and work with the same data.
So that’s the central message around that in particular. When I’ve talked with other core developers about what people outside the land of Python could do with it once we’ve exposed it, they see the opportunities, but at the same time they have reservations. Man, if we go out promising that this is some magic pill that solves everybody’s ills, that will end badly.
Overpromising can lead to a lot of disappointment and disillusionment.
Yeah. I see some other really interesting opportunities for using it, and that’s part of my inspiration, but the main idea is trying to make it simple for everybody to use in Python. I mean, with shared memory, even just explaining what the concept is takes some work.
It’s not something that you learn on the first day of programming class in grade school. But imagine making it easy enough for people to use as an almost everyday tool, because it just becomes part of their workflow. It doesn’t have to reach that point. But try to think along the lines of: could it ever be useful in that way?
It helps guide your thinking so that you don’t just resign yourself to: only the real nerds are going to poke at this, so I’m not going to bother with documentation on this one.
Ack. That’s a self-fulfilling prophecy right there. [laughs]
So, what could you accomplish with Python shared memory, if it was a normal thing that everyone used?
I see it first and foremost from a “running big code on parallel-capable hardware” point of view: this makes what I do faster, and not by a little bit, by a lot, versus what I otherwise would do. It’s a huge performance increase, orders of magnitude depending upon the situation.
But this is also for people who aren’t performance driven; they’re just trying to explore their data in Jupyter Notebooks. They can have one notebook up with so much data loaded that they can’t load the data again in another notebook to play with it at the same time. Shared memory means they could load the data once in memory and use it across many different notebooks.
In that case, you’re not even thinking about performance. Just, how many copies of my data can I store in memory?
It’s actually a convenience thing, where you’re constrained by the laptop in front of you. In tests over the last few months, I needed to load on the order of 20 gigs of data into memory, and I only have 16 gigs on my Mac laptop, so I knew I had to switch to another machine. I had a 32-gig machine, but I still couldn’t hold on to two copies at the same time. Shared memory gave me a way to actually do lots of things on the hardware that I have.
On a very basic machine.
And you say, “Well, in that case, you should go and buy a real server with a lot of memory.”
That can be damned expensive, and you can’t haul one of those around everywhere.
So, what I was doing at first was leaving the data largely on disk and caching chunks of it. Memory was swapping back and forth, and my tasks were taking pretty long. And then I had the accidental thought of “Wait a minute, what if I use the shared memory stuff that I’ve been working on for the last year plus?” And, holy cow… I was like, “THIS is the primary feature. The performance stuff is nice, but THIS is it. This is the killer feature.”
So, you can use less memory and accomplish more things. That is huge.
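The “load once, view from many places” idea above can be sketched with the 3.8 module. In real life the second consumer would be another notebook’s kernel attaching by name; here both views live in one script, and the sizes are illustrative:

```python
from multiprocessing import shared_memory

N = 1000

# "Load" a dataset into shared memory exactly once.
shm = shared_memory.SharedMemory(create=True, size=8 * N)
writer = shm.buf.cast("d")           # view the raw bytes as C doubles
for i in range(N):
    writer[i] = float(i)

# A second consumer attaches by name: same bytes, no second copy,
# so holding many views costs no more RAM than holding one.
reader_shm = shared_memory.SharedMemory(name=shm.name)
reader = reader_shm.buf.cast("d")
total = sum(reader)                  # 0 + 1 + ... + 999

# Release the cast views before closing, then free the block.
writer.release()
reader.release()
reader_shm.close()
shm.close()
shm.unlink()
```

The same pattern works for larger structures; a library like NumPy can wrap its arrays around the shared buffer instead of the plain memoryview cast used here.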
Don’t miss the earlier parts of this discussion with Davin Potts. In the first post, we discussed the wide variety of programming languages and tools in use for data science projects right now, and how he became a core Python committer. In the second post, we discussed the advantages of KNIME for a data science consultant like Potts, and the advantages of using SQL to do data manipulation and analysis. In the next post, we’ll talk about how you can help with your favorite open source project without being an expert.
Learn more about the open source Python-Vertica interface.
Learn more about how you can help with the Python 3.8 test cycle.
Get a copy of Python 3.8 alpha 2 and test.