Comment by paddy_m

1 day ago

I have been working on Buckaroo - my table display library for dataframes in notebook environments. Buckaroo adds table and analytics features like histograms, summary stats, sorting, and search to every dataframe. Recently I have been working to make it handle large datasets better.

This involves making it lazy for Polars, allowing it to read arbitrarily large files without loading the entire dataframe into memory. When a large dataframe initially displays, no summary stats are available. Summary stats are computed in the background in groups of columns, and the results are cached per column. To accomplish this I wrote a Polars plugin in Rust that computes hashes of columns. Dealing with large data like this is tricky: operations sometimes crash, sometimes take all available memory, and sometimes they just run for a very long time. I have also been building an execution framework for Buckaroo. It uses multiprocessing-based timeouts and the per-column caching to execute summary stats in the background.
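Roughly, the background execution looks something like this. This is a simplified sketch, not the actual Buckaroo code: the parquet path, the choice of stats, and the column grouping are all placeholders.

```python
import multiprocessing as mp
import queue

import polars as pl


def summarize_group(path, cols, out):
    # Lazily scan the file and aggregate only this group of columns, so the
    # full dataframe never has to be loaded into memory at once.
    lf = pl.scan_parquet(path)
    stats = lf.select(
        [pl.col(c).min().alias(f"{c}__min") for c in cols]
        + [pl.col(c).max().alias(f"{c}__max") for c in cols]
        + [pl.col(c).null_count().alias(f"{c}__nulls") for c in cols]
    ).collect()
    out.put(stats.to_dicts()[0])


def stats_with_timeout(path, cols, timeout=30.0):
    # Run the aggregation in a child process so a hang, crash, or OOM kill
    # only loses this group of columns instead of taking down the kernel.
    out = mp.Queue()
    proc = mp.Process(target=summarize_group, args=(path, cols, out))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return None  # timed out; the caller can retry with fewer columns
    try:
        return out.get(timeout=1)
    except queue.Empty:
        return None  # the child crashed before producing a result
```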

Being able to control execution and recover from timeouts, crashes, and memory exhaustion opens up some interesting debugging tools. I have written methods that take arbitrary groups of Polars expressions and produce a minimal reproduction test case through a git-bisect-like process.
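As a rough illustration of the bisect idea (again just a sketch: the in-process try/except here stands in for the real timeout/crash-recovery runner):

```python
import polars as pl


def minimize_exprs(lf: pl.LazyFrame, exprs: list[pl.Expr]) -> list[pl.Expr]:
    # Given a group of expressions that fails when evaluated together,
    # repeatedly halve the group, keeping whichever half still fails,
    # until a small reproducing subset remains.
    def fails(subset):
        # Placeholder check; in the real setting this runs in a separate
        # process with timeouts and memory limits.
        try:
            lf.select(subset).collect()
            return False
        except Exception:
            return True

    current = exprs
    while len(current) > 1:
        mid = len(current) // 2
        left, right = current[:mid], current[mid:]
        if fails(left):
            current = left      # bug reproduces with the first half
        elif fails(right):
            current = right     # bug reproduces with the second half
        else:
            break               # failure needs expressions from both halves
    return current
```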

All of this ensures that if the individual columns of a dataframe fit into memory, summary stats will be computed for the entire dataframe in the background. And because the results are cached, the next time you open the same dataframe the stats will display instantly. When exploring data I do this manually in an ad hoc way (splitting up a dataframe by columns and rows), but it is error prone. This should all be automatic.
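For the caching piece, the shape of it is roughly this (a toy sketch: Buckaroo's real column hashing is done by the Rust plugin, and the in-memory dict stands in for a persistent cache):

```python
import hashlib

import polars as pl

# In-memory cache for illustration; a persistent store is what lets stats
# survive across sessions and display instantly on reopen.
CACHE: dict = {}


def column_key(df: pl.DataFrame, col: str) -> str:
    # Stand-in for the dedicated hashing plugin: combine polars' per-row
    # hashes into one digest. Fine for a sketch, not for huge columns.
    row_hashes = df[col].hash().to_list()
    return hashlib.sha256(repr(row_hashes).encode()).hexdigest()


def cached_stats(df: pl.DataFrame, col: str) -> dict:
    key = (col, column_key(df, col))
    if key not in CACHE:
        s = df[col]
        CACHE[key] = {"min": s.min(), "max": s.max(), "nulls": s.null_count()}
    return CACHE[key]
```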

I will be presenting this at PyData Boston in December.

The Column's the limit: interactive exploration of larger than memory data sets in a notebook with Polars and Buckaroo