
Comment by satvikpendem

5 hours ago

"If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton

Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary.

Can one criticize Pandas by comparing it to R's native data frames, which have existed since R's inception in the '90s?

I (and many others) hated Pandas long before Polars was a thing. The main problem is that it's a DSL that doesn't really work well with the rest of Python (that and multi-index is awful outside of the original financial setting). If you're doing pure data science work it doesn't really come up, but as soon as you need to transform that work into a production solution it starts to feel quite gross.

Before Polars my solution was (and still largely remains) to do most of the relational data transformations in the data layer, and then use dicts, lists and numpy for all the additional downstream transformations. This made it much easier to break out of the "DS bubble" and incorporate solutions into main products.
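A minimal sketch of that pattern, with made-up names and data, assuming the relational work (joins, filters, aggregations) already happened in the data layer (e.g. SQL):

```python
import numpy as np

# Rows as plain dicts, e.g. as returned by a DB cursor after the
# relational transformations were done in the data layer.
rows = [
    {"user_id": 1, "revenue": 120.0},
    {"user_id": 2, "revenue": 80.0},
    {"user_id": 3, "revenue": 200.0},
]

# Downstream transformations stay in plain Python + numpy, so they
# compose with the rest of the product code without a DataFrame DSL.
revenue = np.array([r["revenue"] for r in rows])
normalized = revenue / revenue.sum()

result = [{**r, "share": float(s)} for r, s in zip(rows, normalized)]
print(result[0]["share"])  # 120 / 400 = 0.3
```

Nothing here knows about pandas, so the same code runs unchanged inside an API handler or a batch job.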

"revolutionary"? It just copied and pasted the decades-old R (previously "S") dataframe into Python, including all the paradigms (with worse ergonomics, since it's not baked into the language).

  • No other modern language will compete with R on ergonomics because of how it allows functions to read the context they’re called in, and S expressions are incredibly flexible. The R manual is great.

    To say pandas just copied it but worse is overly dismissive. The core of pandas has always been indexing/reindexing, split-apply-combine, and slicing views.

    It’s a different approach than R’s data tables or frames.

    • > allows functions to read the context they’re called in

      Can you show an example? Seems interesting considering that code knowing about external context is not generally a good pattern when it comes to maintainability (security, readability).

      I’ve lived through some horrific 10M line coldfusion codebases that embraced this paradigm to death - they were a whole other extreme where you could _write_ variables in the scope of where you were called from!


  • This is an interesting question.

    Dataframes first appeared in S-PLUS in 1991-1992. Then R reimplemented S, and from the mid-1990s onwards R started to grow in popularity in statistics. As free and open source software, R started to take over the market among statisticians and other people who were using other statistical software, mainly SAS, SPSS and Stata.

    Given that S and R existed, why were they mostly not picked up by data analysts and programmers in 1995-2008, and only Python and Pandas made dataframes popular from 2008 onwards?

  • Exactly. I was programming in R in 2004 and Pandas didn't exist. I remember trying Pandas once, and it felt unergonomic for data analysis and lacked R's vast ecosystem of statistical analysis libraries.

  • It was revolutionary to Python. Without NumPy and Pandas, ML in Python would never have been a thing.

    (Yes, yes - I know some people wish that were the case!)
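For readers unfamiliar with the terms above (split-apply-combine, reindexing), a minimal pandas sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

# Split-apply-combine: split by "group", apply a sum, combine results.
totals = df.groupby("group")["value"].sum()
print(totals["a"])  # 1 + 3 = 4

# Reindexing: align to an explicit index, filling gaps with a default.
aligned = totals.reindex(["a", "b", "c"], fill_value=0)
print(aligned["c"])  # 0
```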

Indeed, even Rust was created by learning from decades of memory-management mistakes and from established patterns like C++'s famous RAII.

With all these great observations made, the quote still stands: "If I have seen further, it is by standing on the shoulders of giants." When people say they feel a sense of community, this is exactly what it means in software: we build something, others learn from it, and make better versions. The origin of an inspiration is in no way lesser than what it inspired.