
Comment by ggm

3 years ago

I'm using PyPy to analyse 350m DNS events a day, through Python cached dicts to avoid DNS lookup stalls. I'm getting a 95% dict cache hit rate, and use threads with queue locks.

Moving to PyPy definitely sped me up a bit, though not as much as I'd hoped; the cost is probably all in string indexing into dicts and dict management. I may recode into a radix tree. It's hard to work out in advance how different that would be: people have optimised the core data structures pretty well.
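The pattern described above, a shared dict cache in front of slow lookups with worker threads synchronising on a lock, might be sketched roughly like this. All names here are hypothetical, and `resolve` stands in for whatever real DNS call is behind the cache:

```python
import threading

class DNSCache:
    """Dict cache in front of a slow lookup, shared across threads."""

    def __init__(self, resolve):
        self._resolve = resolve      # the slow lookup (e.g. a real DNS query)
        self._cache = {}
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def lookup(self, name):
        # Fast path: check the shared dict under the lock.
        with self._lock:
            if name in self._cache:
                self.hits += 1
                return self._cache[name]
        # Slow path runs outside the lock, so concurrent misses on the
        # same name may resolve twice; last writer wins, which is fine
        # for an idempotent lookup.
        result = self._resolve(name)
        with self._lock:
            self.misses += 1
            self._cache[name] = result
        return result
```

A sketch like this also makes it easy to instrument the hit rate, since `hits` and `misses` are updated under the same lock as the dict.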

The uplift from normal Python was trivial. Most of the dev time was spent fixing pip3 for PyPy on Debian, not knowing which apt packages to load, with a lot of "stop using pip" messaging.

Debian is its own worst enemy with things like this. It's why we eventually moved off it at a previous job: deploying Python server applications on it was dreadful.

I’m sure it’s better if you’re deploying an appliance that you hand off and never touch again, but for evolving modern Python servers it’s not well suited.

  • Yes, 1000x this. What is it with them that makes them feel entitled to a special "dist-packages" instead of the default "site-packages"? This drives me nuts when I have a bunch of native packages I want to bundle into our in-house Python deployment. CentOS and Ubuntu are vanilla, and only Debian (mind-bogglingly) deviates from the well-trodden path.

    I still haven't figured out how to beat this dragon. All suggestions welcome!

    • > What is it with them that makes them feel entitled to a special "dist-packages" instead of the default "site-packages"? This drives me nuts when I have a bunch of native packages I want to bundle into our in-house Python deployment. CentOS and Ubuntu are vanilla, and only Debian (mind-bogglingly) deviates from the well-trodden path.

      Hi, I'm one of the people who look after this bit of Debian (and it's exactly the same in Ubuntu, FWIW).

      It's like that to solve a problem (of course, everything has a reason). The idea is that Debian provides a Python that's deeply integrated into Debian packages. But if you want to build your own Python from source, you can. What you build will use site-packages, so it won't have any overlap with Debian's Python.

      Unfortunately, while this approach was designed to be something all package-managed distributions could do, nobody else has adopted it, and consequently the code to make it work has never been pushed upstream. So, it's left as a Debian/Ubuntu oddity that confuses people. Sorry about that.

      My recommendations are: 1. If you want more control over your Python than you get from Debian's package-managed python, build your own from source (or use a docker image that does that). 2. Deploy your apps with virtualenvs or system-level containers per app.
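The split being described is visible from Python itself: a Debian/Ubuntu packaged interpreter reports an install path ending in `dist-packages`, while a from-source build or a virtualenv reports `site-packages`. A small illustrative check, using only the stdlib:

```python
import sys
import sysconfig

# Where third-party pure-Python packages install for this interpreter:
# ends in "dist-packages" on Debian/Ubuntu's packaged Python,
# "site-packages" for a from-source build or a virtualenv.
purelib = sysconfig.get_path("purelib")
print(purelib)

# A venv is detectable because sys.prefix diverges from the base install.
in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
print("running inside a venv:", in_venv)
```

Running this under the system interpreter versus inside a venv is a quick way to see which world a given deployment is actually in.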

    • dist-packages is the right way to handle Python libs. Would you prefer the distro package manager clashing with pip, never knowing who installed what, and breaking things when updates are made?

    • I usually make a venv in ~/.venv and then activate it at the top of any python project. Makes it much easier to deal with dependencies when they're all in one place.


    • IMO bespoke containers using whatever Python package manager makes sense for each project. Or make the leap to Nix(OS), and then still have to force every Python project into compliance, which can be very easy if the PyPI packages you need are already in the main Nix repo (nixpkgs) or very difficult if it depends on a lot of uncommon packages, uses poetry, etc.

      Since PEP 665 was rejected, the Python ecosystem continues to lack a reasonable package manager, and the lack of hash-based lock files prevents building on top of the current Python project/package managers.

  • What distro did you move to? IME Debian as a base image for Python app containers is also kind of a pain.

    • We moved to stripped down Debian images in containers and made sure to not use any of the Debian packaging ecosystem.

  • It works completely fine in my experience.

    • Lucky you. Having gone through multiple Debian upgrades, a Python 2->3 migration on Debian, and a move from Debian Python packaging to pip/PyPI, it was a whole world of pain that cost us months of development time over the years, as well as a substantial amount of downtime.

If you have very large dicts, you might find this hash table I wrote for spaCy helpful: https://github.com/explosion/preshed . You need to key the data with 64-bit keys. We use this wrapper around murmurhash for it: https://github.com/explosion/murmurhash

There's no docs so obviously this might not be for you. But the software does work, and is efficient. It's been executed many many millions of times now.

  • I'm keyed on strings, not 64-bit keys. But thanks, nice to share ideas.

    • The idea is to hash the string into a 64-bit key. You can store the string in a value, or you can have a separate vector and make the value a struct that has the key and the value.

      The chance of colliding on the 64-bit space is low if the hash distributes evenly, so you just yolo it.
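The scheme described above can be sketched with just the stdlib. The linked preshed/murmurhash libraries do this in C; `blake2b` here is only a stand-in hash, and the table layout (storing the original string alongside the count in the value) is one of the two options mentioned:

```python
import hashlib

def key64(s: str) -> int:
    # Hash the string down to a 64-bit integer key. With an evenly
    # distributing hash, 64-bit collisions are rare enough to ignore.
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

# 64-bit key -> (original string, count)
table = {}

def add(s: str) -> None:
    k = key64(s)
    entry = table.get(k)
    if entry is None:
        table[k] = (s, 1)            # keep the string in the value
    else:
        table[k] = (entry[0], entry[1] + 1)
```

In pure Python the dict still hashes the integer key, so the win is smaller than in C, but it illustrates the keying scheme.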

> it's probably all about string index into dict and dict management

Cool. Is the performance here something you would like to pursue? If so, could you open an issue [0] with some kind of reproducer?

[0] https://foss.heptapod.net/pypy/pypy/-/issues

  • I'm thinking about how to demonstrate the problem. I have a large pickle, but pickle load/dump times across gc.disable()/gc.enable() really don't say much.

    I need to find out how to instrument the seek/add cost of threads against the shared dict under a lock.

    My gut feel is that if I inlined things instead of calling out to functions I'd shave off a bit more too. So saying "slower than expected" may be unfair, because there are limits to how much you can speed this kind of thing up. That's why I wondered if alternate data structures were a better fit.

    It's variable-length string indexes into lists/dicts of integer counts. The advantage of a radix trie would be finding the record in time roughly proportional to the length in bits of the string, and the keys do form prefix sets.
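A minimal character-level trie for those string-keyed counts might look like the sketch below. A real radix trie would additionally compress single-child chains into one node; this only illustrates the prefix-sharing idea, with hypothetical names throughout:

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # next character -> child node
        self.count = 0       # count stored at the node ending a key

root = TrieNode()

def incr(key: str) -> None:
    # Walk (and create) one node per character, then bump the count.
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def get(key: str) -> int:
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count
```

Keys sharing a prefix (e.g. names under one domain) share nodes, and lookup cost is bounded by key length rather than by hashing the whole string, though per-node dict hops in pure Python may eat that advantage.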

> The uplift from normal Python was trivial.

By definition if you lift something it is going to go up, but what does this mean?

  • If you replace your Python engine, you may have to replace your imports.

    Some engines can't build and deploy all imports.

    Some engines demand syntactic sugar to do their work. PyPy doesn't.

One should really consider using containers in this situation.

  • Can you describe what in this situation warrants it?

    I'm very curious about where the line is/should be.

    • In my experience, leaving the system Python interpreter the way it was shipped will save you enormous headaches down the road. Any time I find myself needing additional Python packages, I will at minimum create a virtualenv, or ideally a container.
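The venv-per-app advice above needs nothing beyond the stdlib `venv` module. A hypothetical sketch; a temp directory is used only so the example is self-contained, where a real deployment would use a fixed per-app path:

```python
import os
import tempfile
import venv

# Create an isolated environment instead of installing into the
# system interpreter. Path is illustrative.
env_dir = os.path.join(tempfile.mkdtemp(), "app-venv")
venv.create(env_dir, with_pip=False)  # with_pip=True to get pip inside it

# The env has its own interpreter config and its own site-packages,
# leaving the distro's Python untouched.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))
```

With `with_pip=True`, the env's own `pip` can then install the app's dependencies without ever touching dist-packages.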