Comment by tgbugs

3 years ago

We use pypy3 on musl via Gentoo in production to run dataset validation pipelines. The easiest place to see that we use pypy3 is probably [1]; the build process and the patches we carry are under [2].

We also use pypy3 to accelerate rdflib parsing and serialization of various RDF formats. See for example [3].

Thanks to you and the whole PyPy team!

1. https://github.com/tgbugs/dockerfiles/blob/6f4ad5d873b7ab267...

2. https://github.com/tgbugs/dockerfiles/blob/6f4ad5d873b7ab267...

3. https://github.com/SciCrunch/sparc-curation/blob/0fdf393e26f...

You're welcome, thanks for sharing. Do you have any numbers about speed vs. an alternative?

  • I don't have anything rigorous, but I can say that I see the usual ~4x speedup when using rdflib to parse large files, so a 20-minute workload on cpython drops to 4 or 5 minutes on pypy3.

    I just reran one of my usual benchmarks: 2 minutes for pypy3 (pypy 7.3.12, python 3.10.12) with peak memory usage of about 8 GB, versus 4.8 minutes for python3.11 (3.11.4) with peak memory usage of about 3.6 GB, a 2.4x speedup. On another computer running the exact same workload I see 6.3 minutes versus 19 minutes (a 3x speedup) with the same peak memory usage.

    I don't have any numbers for the dataset pipelines because I never ran them in production on cpython; I went straight to pypy3. It is easy to switch between the two implementations in this context, so I could run a side-by-side comparison (with the usual caveat that it would be completely non-rigorous).

    I also have some internal notes related to a project that I didn't list because it isn't public, isn't in production, and the benchmarks were collected quite a while ago. There I saw a 4x increase in throughput when pulling large amounts of data from a postgresql database: from 20mbps on cpython 3.6 to 80mbps on pypy3.
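
The comparisons above boil down to running the same script under both interpreters and recording wall time and peak memory. A minimal sketch of such a harness follows; since the real workload (rdflib parsing private Turtle files) isn't reproducible here, a stdlib JSON parse stands in as a placeholder workload, which is an assumption of this sketch, not what the benchmarks above actually ran:

```python
import json
import resource
import sys
import time

def bench(label, fn):
    """Time fn() and report elapsed wall time plus peak RSS for this process."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{sys.implementation.name} {label}: {elapsed:.3f}s, peak RSS {peak}")
    return result

# Placeholder workload: parse a large in-memory JSON document.
# In the real comparison this would be rdflib's Graph().parse()
# on a multi-gigabyte Turtle file.
doc = json.dumps([{"id": i, "value": str(i) * 10} for i in range(200_000)])
parsed = bench("json parse", lambda: json.loads(doc))
```

Running the identical script under `python3` and `pypy3` then gives two directly comparable lines of output, which is roughly how the numbers above were collected.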