Comment by bravura

2 years ago

Can you explain your use-case and why DuckDB is better?

Considering switching from pandas and want to understand what is my best bet. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.

3 comments

bravura

rapatel0 2 years ago

Look into [Ibis](https://ibis-project.org/). It's a dataframe library built on duckdb. It supports lazy execution, greater than memory datastructures, remote s3 data and is insanely fast. Also works with basically any backend (postgres, mysql, parquet/csv files, etc) though there are some implementation gaps in places.

I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.

It's not a perfect apples to apples pandas replacement but really a nice layer on top of sql. after learning it, I'm almost as fast as I was on pandas with expressions.

techwizrd 2 years ago

I made the switch to Ibis a few months ago and have been really enjoying it. It works with all the plotting libraries including seaborn and plotnine. And it makes switching from testing on a CSV to running on a SQL/Spark a one-line change. It's just really handy for analysis (similar to the tidyverse).

sega_sai 2 years ago

I am not necessarily saying duckdb is better. I personally just found it easier, clearer to write a sql query for any complicated set of joins/group by processing than to try to do that in pandas.