Comment by sweezyjeezy
2 months ago
The pandas API is awful, but it's kind of interesting why. It was started as a financial time series manipulation library ('panels') in a hedge fund and a lot of the quirks come from that. For example the unique obsession with the 'index' - functions seemingly randomly returning dataframes with column data as the index, or having to write index=False every single time you write to disk, or it appending the index to the Series numpy data leading to incredibly confusing bugs. That comes from the assumption that there is almost always a meaningful index (timestamps).
> The pandas API is awful
I hate to be the "you're holding it wrong" guy but 90% of "Pandas bad!" posts I find are either outright misinformed or mischaracterizing one person's particular opinion as some kind of common truth. This one is both!
> That comes from the assumption that there is almost always a meaningful index (timestamps)
The index can be literally any unique row label or ID. It's idiosyncratic among "data frames" (SQL has no equivalent concept, and the R community has disowned theirs), but it's really not such a crazy thing to have row labels built into your data table. Excel supports this in several different ways (frozen columns, VLOOKUP) and users expect it in just about any table-oriented GUI tool.
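For anyone who hasn't used the index this way, here is a minimal sketch of the "row label" idea. The `sku` and `price` columns are made up for illustration and are not from the thread. Note that pandas does not enforce uniqueness on the index; it only reports it.

```python
import pandas as pd

# Hypothetical data: 'sku' and 'price' are illustrative names only.
prices = pd.DataFrame(
    {"sku": ["A1", "B2", "C3"], "price": [9.5, 12.0, 3.25]}
).set_index("sku")

# Label-based lookup, roughly the job VLOOKUP does in a spreadsheet.
b2_price = prices.loc["B2", "price"]

# pandas does not require unique labels; it only tells you whether they are.
labels_unique = prices.index.is_unique
```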
> having to write index=False every single time you write to disk
If you're actually using the index as it's meant to be used, you'd see why this isn't the default setting.
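Both sides of the index=False argument can be seen in a small round trip. This is only a sketch with a toy frame; `round_trip` is a hypothetical helper written for this example, not a pandas function.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # default RangeIndex

def round_trip(frame, **to_csv_kwargs):
    """Write to an in-memory CSV and read it back with default settings."""
    buf = io.StringIO()
    frame.to_csv(buf, **to_csv_kwargs)
    buf.seek(0)
    return pd.read_csv(buf)

with_index = round_trip(df)                  # index written as an unnamed column
without_index = round_trip(df, index=False)  # only the real columns

print(list(with_index.columns))     # ['Unnamed: 0', 'a', 'b']
print(list(without_index.columns))  # ['a', 'b']

# When the index carries meaning, the default keeps it on disk.
labelled = df.set_index("a")
buf = io.StringIO()
labelled.to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col="a")
```

With a meaningless RangeIndex the default adds a junk column; with a meaningful index the default is what lets `restored` come back with its labels intact.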
> functions seemingly randomly returning dataframes with column data as the index
I assume you're talking about the behavior of .groupby() and .rolling()? It's never been random. Under-documented, yes, and group_keys= and related options are hard to reason about. But not random.
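A small sketch of the groupby case, assuming a toy frame with made-up `team` and `score` columns. It only covers `as_index`, the simplest of the options in question, and leaves `group_keys=` and `.rolling()` aside.

```python
import pandas as pd

# Hypothetical data: column names are illustrative only.
df = pd.DataFrame({"team": ["x", "x", "y"], "score": [1, 2, 5]})

# Default: the grouping key becomes the index of the result.
keyed = df.groupby("team").sum()

# as_index=False keeps the key as an ordinary column with a fresh RangeIndex.
flat = df.groupby("team", as_index=False).sum()

print(keyed.index.name)    # team
print(list(flat.columns))  # ['team', 'score']
```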
> appending the index to the Series numpy data leading to incredibly confusing bugs
I've been using Pandas professionally almost daily since 2015 and I have no idea what this means.
I think the commenter you are replying to might well understand these nuances. The point is not that Pandas is inscrutable, but instead that it's annoying to use in many common use cases.
> but it's really not such a crazy thing to have row labels built into your data table.
Sometimes you need data in a certain order. Sometimes there is no primary key. And it is nuts how janky the pandas API is if you just want the index to mean the current order of the dataframe and nothing else. Oh, you did a pivot? I'm just going to make those pivot columns a row label now, if that's alright with you. I don't do that for all functions though; you're going to have to remember which ones. Oh, you want to sort a dataframe? You'd better make damn sure you reset the index if you're planning to use that with data from another dataframe (e.g. x + y on data from separate dataframes), otherwise I'm going to align the data on indices, and you can't stop me. Also, want to call pyplot.plot(df['column'])? Yeah, I'm giving it the data in index order, obviously; I don't care about that sort you just did. Oh, you want to port this data to Excel? Well, if your row labels aren't meaningful and you don't want "Unnamed: 0", you're going to have to tell me not to. You need to manipulate a multi-index? You're so cute. Have fun with that, buddy.
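The sort-then-add complaint is easy to reproduce. A minimal sketch with two toy Series (the values are arbitrary): after sorting, `+` still matches rows by label, not by position.

```python
import pandas as pd

x = pd.Series([3, 1, 2])      # labels 0, 1, 2
y = pd.Series([10, 20, 30])   # labels 0, 1, 2

x_sorted = x.sort_values()    # values 1, 2, 3 now carry labels 1, 2, 0

# Addition aligns on labels, so the sort is effectively ignored:
# label 0 -> 3 + 10, label 1 -> 1 + 20, label 2 -> 2 + 30
aligned = x_sorted + y

# Dropping the old labels first gives the positional result instead.
positional = x_sorted.reset_index(drop=True) + y

print(positional.tolist())    # [11, 22, 33]
```

Going through `.to_numpy()` on both sides is another way to opt out of alignment, at the cost of losing the labels altogether.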
There is a reason no other dataframe library does this: it's confusing, and it's cognitive overhead that doesn't need to exist. I've used pandas since ~2013, had this chat with colleagues, and many recommend just giving in and maintaining an index throughout. Except I've read their pandas and it sucks, because now _you_ need to reason about what is currently the index, and it actually needs to change a lot to do normal things with data. I just use .reset_index() copiously and try to make it behave like a normal dataframe library because it's just easier to understand later. Pandas has not earned the right to redefine what a dataframe means.
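For reference, the "reset_index everywhere" style looks roughly like this after a pivot. It's a sketch on made-up data (`day`, `city`, `temp` are illustrative names), not a claim about how anyone in this thread writes it.

```python
import pandas as pd

# Hypothetical long-format data.
long = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "city": ["A", "B", "A", "B"],
    "temp": [10, 12, 11, 13],
})

# pivot moves 'day' into the index and names the column axis 'city'.
wide = long.pivot(index="day", columns="city", values="temp")

# Flatten back to plain columns with a positional RangeIndex.
flat = wide.reset_index()
flat.columns.name = None  # drop the leftover 'city' label on the column axis

print(list(flat.columns))  # ['day', 'A', 'B']
```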
At the absolute least, index behaviour should be opt-in, not something imposed on the user.