Comment by rich_sasha

2 months ago

The article starts well, trying to condense pandas' gazillion inconsistent and continuously deprecated functions, each with tens of keyword arguments, into a small set of composable operations. But then it lost me.

The more interesting nugget for me is the project they mention, Modin (https://modin.readthedocs.io/en/latest/index.html), which apparently went to the effort of analysing common pandas usage and compressed the API into a mere handful of operations. Which sounds great!

Sadly for me, the purpose seems rather to have been to recreate the full pandas API, backed by things like Ray and Dask. So it's the same API, just much faster.

To me it's a shame. Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful. The speed is usually not a concern for me - slow operations often seem to be avoidable, and my data tends to fit in (a lot of) RAM.

I can't see that their more condensed API is public facing and usable.

The pandas API is awful, but it's kind of interesting why. It was started as a financial time series manipulation library ('panels') in a hedge fund and a lot of the quirks come from that. For example the unique obsession with the 'index' - functions seemingly randomly returning dataframes with column data as the index, or having to write index=False every single time you write to disk, or it appending the index to the Series numpy data leading to incredibly confusing bugs. That comes from the assumption that there is almost always a meaningful index (timestamps).

  • > The pandas API is awful

    I hate to be the "you're holding it wrong" guy but 90% of "Pandas bad!" posts I find are either outright misinformed or mischaracterizing one person's particular opinion as some kind of common truth. This one is both!

    > That comes from the assumption that there is almost always a meaningful index (timestamps)

    The index can be literally any unique row label or ID. It's idiosyncratic among "data frames" (SQL has no equivalent concept, and the R community has disowned theirs), but it's really not such a crazy thing to have row labels built into your data table. Excel supports this in several different ways (frozen columns, VLOOKUP) and users expect it in just about any table-oriented GUI tool.

    > having to write index=False every single time you write to disk

    If you're actually using the index as it's meant to be used, you'd see why this isn't the default setting.

    > functions seemingly randomly returning dataframes with column data as the index

    I assume you're talking about the behavior of .groupby() and .rolling()? It's never been random. Under-documented, with group_keys= and related options that are hard to reason about, yes. But not random.

    > appending the index to the Series numpy data leading to incredibly confusing bugs

    I've been using Pandas professionally almost daily since 2015 and I have no idea what this means.
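For anyone following along, here is a minimal sketch of the two index complaints being argued over: .groupby() moving the grouping column into the index, and the index getting written to disk by default. The frame and its column names ("team", "score") are made up for illustration, and an in-memory buffer stands in for a file.

```python
import io

import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 2, 3]})

# Default behaviour: the grouping column becomes the index of the result.
by_index = df.groupby("team").sum()

# as_index=False keeps "team" as an ordinary column instead.
by_column = df.groupby("team", as_index=False).sum()

# Writing with the default index=True and reading back yields an extra,
# unnamed leading column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf)

# Writing with index=False round-trips the original columns only.
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
clean_round_trip = pd.read_csv(buf_no_index)

print(by_index.index.name)        # name of the index after groupby
print(list(by_column.columns))    # columns when as_index=False
print(list(round_trip.columns))   # columns after the default round trip
```

Whether the default is the right one depends on whether the index carries meaning in your data, which is the disagreement in this thread.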

    • I think the commenter you are replying to might well understand these nuances. The point is not that Pandas is inscrutable, but instead that it's annoying to use in many common use cases.

    • > but it's really not such a crazy thing to have row labels built into your data table.

      Sometimes you need data in a certain order. Sometimes there is no primary key. And it is nuts how janky the pandas API is if you just want the index to mean the current order of the dataframe and nothing else. Oh, you did a pivot? I'm just going to make those pivot columns a row label now, if that's alright with you. I don't do that for all functions though, so you're going to have to remember which ones. Oh, you want to sort a dataframe? You better make damn sure you reset the index if you're planning to use that with data from another dataframe (e.g. x + y on data from separate dataframes), otherwise I'm going to align the data on indices, and you can't stop me. Also, want to call pyplot.plot(df['column'])? Yeah, I'm giving it the data in index order, obviously. I don't care about that sort you just did. Oh, you want to port this data to Excel? Well, if your row labels aren't meaningful and you don't want "Unnamed: 0", you're going to have to tell me not to write them. You need to manipulate a multi-index? You're so cute. Have fun with that, buddy.

      There is a reason no other dataframe library does this - because it's confusing and cognitive overhead that doesn't need to exist. I've used pandas since ~2013, had this chat with colleagues and many recommend just giving in and maintaining an index throughout. Except I've read their pandas and it sucks because now _you_ need to reason about what is currently the index - because it actually needs to change a lot to do normal things with data. I just use .reset_index copiously and try to make it behave like a normal dataframe library because it's just easier to understand later. Pandas has not earned the right to redefine what a dataframe means.

      At the absolute least, index behaviour should be opt-in, not something imposed on the user.
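A small sketch of the sort-then-add situation described above, assuming two made-up Series x and y that start with the same default integer index. It only illustrates that arithmetic aligns on index labels rather than on row position, and that dropping the labels with reset_index(drop=True) makes the addition positional.

```python
import pandas as pd

# Hypothetical data; both Series start with the default index 0, 1, 2.
x = pd.Series([30, 10, 20])
y = pd.Series([1, 2, 3])

# Sorting reorders the rows but each value keeps its original label.
x_sorted = x.sort_values()

# Addition matches rows by label, so the sort has no effect on which
# values get paired up.
aligned = x_sorted + y

# Discarding the old labels pairs values by their current position.
positional = x_sorted.reset_index(drop=True) + y

print(aligned.loc[0], aligned.loc[1], aligned.loc[2])
print(positional.tolist())
```

Note that this uses reset_index, not reindex: reindex conforms data to a given set of labels, which is a different operation.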

Check out polars. I find it much more intuitive than pandas as it looks closer to SQL (and I learned SQL first). Maybe you'll feel the same way!

  • I've looked at Polars. My sense is that Pandas is an interactive data analysis library poorly suited to production uses, and Polars is the other way around. Seemed quite verbose for example. Sometimes doing `series["2026"]` is exactly the right thing to type.

    • With some of the newest 3.x changes like copy-on-write, I find pandas getting quite verbose now as well.

      In a world where AI is writing the code, I guess I shouldn't complain, but when I am discovering something the AI of choice yet again missed, both pandas and polars still feel verbose and lacking sugar.

  • Agreed — I much prefer polars, too. IIRC the latest major version of pandas even introduced some polars-style syntax.
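For context on the `series["2026"]` shorthand mentioned a few comments up: with a DatetimeIndex, pandas treats a partial date string as a selection over the whole period. A minimal sketch, assuming a made-up daily series that straddles the 2025/2026 boundary:

```python
import pandas as pd

# Hypothetical daily data: 2025-12-30 through 2026-01-03.
idx = pd.date_range("2025-12-30", periods=5, freq="D")
series = pd.Series(range(5), index=idx)

# Partial string indexing: select every row whose timestamp falls in 2026.
in_2026 = series["2026"]

# The same selection spelled with .loc.
in_2026_loc = series.loc["2026"]

print(in_2026.tolist())
```

This is the kind of terseness that only works because the index is meaningful, which cuts both ways in the argument above.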

> Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful.

Having previously inherited (and now dispossessed) an un-disentangleable pile of Python, pandas, and SQL hacks reminiscent of a spreadsheet rammed with inscrutable Excel formulae, I have no idea how data scientists collaborate on anything with this technology. It's like when bioinformatics was full of write-only Perl code that was maybe executed successfully once for the purposes of a study or paper, and was kept around for future archaeologists to hopefully one day resuscitate when the need may arise again.

If programmers are expected to just throw garbage like this at the next asshole with the misfortune to have to maintain code that was never designed to be maintained, it's not a surprise that the industry is once again moving towards write-only code, this time produced at scale by LLMs.

It's like we're back to Visual Studio Ultimate slopping out 10k lines of XAML in response to your dragging and dropping in the WYSIWYG. There is a reason nobody does this any more.