← Back to context

Comment by rapatel0

1 day ago

Look into [Ibis](https://ibis-project.org/). It's a dataframe library built on duckdb. It supports lazy execution, greater than memory datastructures, remote s3 data and is insanely fast. Also works with basically any backend (postgres, mysql, parquet/csv files, etc) though there are some implementation gaps in places.

I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.

It's not a perfect apples to apples pandas replacement but really a nice layer on top of sql. after learning it, I'm almost as fast as I was on pandas with expressions.

I made the switch to Ibis a few months ago and have been really enjoying it. It works with all the plotting libraries including seaborn and plotnine. And it makes switching from testing on a CSV to running on a SQL/Spark a one-line change. It's just really handy for analysis (similar to the tidyverse).