Comment by sweezyjeezy

2 months ago

> but it's really not such a crazy thing to have row labels built into your data table.

Sometimes you need data in a certain order. Sometimes there is no primary key. And it is nuts how janky the pandas API is if you just want the index to mean the current order of the dataframe and nothing else. Oh you did a pivot? I'm just going to make those pivot columns a row label now if that's alright with you. I don't do that for all functions though, you're going to have to remember which ones. Oh you want to sort a dataframe? You better make damn sure you reindex if you're planning to use that with data from another dataframe (e.g. x + y on data from separate dataframes), otherwise I'm going to align the data on indices, and you can't stop me. Also - want to call pyplot.plot(df['column'])? Yeah I'm giving it the data in index order obviously I don't care about that sort you just did. Oh you want to port this data to excel? Well if your row labels aren't meaningful and you don't want "Unnamed: 0" you're going to have to tell me not to. You need to manipulate a multi-index? You're so cute. Have fun with that buddy.

There is a reason no other dataframe library does this - because it's confusing and cognitive overhead that doesn't need to exist. I've used pandas since ~2013, had this chat with colleagues and many recommend just giving in and maintaining an index throughout. Except I've read their pandas and it sucks because now _you_ need to reason about what is currently the index - because it actually needs to change a lot to do normal things with data. I just use .reset_index copiously and try to make it behave like a normal dataframe library because it's just easier to understand later. Pandas has not earned the right to redefine what a dataframe means.

At the absolute least, index behaviour should be opt-in, not something imposed on the user.

0 comments

sweezyjeezy

No comments yet

Contribute on Hacker News ↗