Comment by stared
2 days ago
Yes, every time I write `df[df.sth == val]`, a tiny part of me dies.
For comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R ends up cleaner than Python, that says a lot (as a side note: the same story with ggplot2 and matplotlib).
Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write `df.loc[:, "col1"]` and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
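For example, nothing flags a typo'd column name until the code actually runs; a minimal sketch (the frame and the typo below are made up):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

df.loc[:, "col1"]        # fine
try:
    df.loc[:, "col_1"]   # typo in the column name: nothing catches this statically
except KeyError as exc:
    print("only discovered at runtime:", exc)
```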
I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.
> Yes, every time I write `df[df.sth == val]`, a tiny part of me dies.
That's because it's a bad way to use Pandas, even though it is the most popular and oftentimes recommended way. But the thing is, you can just write "safe", immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:
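(The column names and values below are just placeholders to show the shape of the style.)

```python
import pandas as pd

df = pd.DataFrame({"sth": ["a", "b", "a"], "x": [1.0, 2.0, 3.0], "y": [0.2, 0.7, 0.9]})

result = (
    df
    .loc[lambda d: d.sth == "a"]    # filter on the dataframe as it is at this step
    .assign(z=lambda d: d.x * d.y)  # derive a new column without mutating df
    .query("z > 0.1")               # another filter, still on the current state
    .sort_values("z")
)
```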
Plus, with the latest Pandas versions supporting Arrow data types, Polars' performance improvements over Pandas are considerably less impressive.
Column-level name checking would be awesome, but unfortunately no Python library supports that, and it will likely never be possible unless some big changes are made to the Python type hint system.
I’m not really sure why you think `df.loc[lambda d: d.y > 0.5]` is stylistically superior to `df[df.y > 0.5]`.
I agree it comes in handy quite often, but that still doesn’t make it great to write compared to what SQL or dplyr offer in terms of choosing columns to filter on (`where y > 0.5` in SQL, `filter(y > 0.5)` in dplyr).
It is superior because you don't need to assign your dataframe to a variable (`df`), then update that variable or create a new one every time you need to do that operation. That makes it both safer (you're guaranteed to filter on the current version of the dataframe) and more concise.
For the rest of your comment: it's the best you can do in Python. Sure, you could write SQL, but then you're mixing text queries with Python data manipulation, and I would dread that. And SQL-only scripting is really out of the question.
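Concretely, the contrast looks something like this (made-up columns, just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [0.2, 0.7, 0.9]})

# Reassignment style: every step rebinds a name, and it's easy to build a
# mask from an old version of the dataframe by accident.
tmp = df[df.y > 0.5]
tmp = tmp.assign(z=tmp.x * tmp.y)
tmp = tmp[tmp.z > 1.0]

# Chained style: no intermediate names; each lambda receives the dataframe
# exactly as it is at that point in the chain.
result = (
    df
    .loc[lambda d: d.y > 0.5]
    .assign(z=lambda d: d.x * d.y)
    .loc[lambda d: d.z > 1.0]
)
```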
It's superior because it is safer, not because the API (or the requirement to use a lambda) looks better. The lambda allows the operation to work on the current state of the dataframe in the chained operation rather than on the original dataframe. Alternatively, you could use `.query("y > 0.5")`, which also works on the current state of the dataframe.
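A toy example of the "current state" point (invented numbers, with the rescale exaggerated on purpose):

```python
import pandas as pd

df = pd.DataFrame({"y": [0.02, 0.07, 0.9]})

# The boolean mask is built from the original `df`, even though the chain
# has already rescaled `y` by the time the mask is applied.
stale = df.assign(y=df.y * 10)[df.y > 0.5]

# `lambda d:` and .query() both see the dataframe as it exists after the
# .assign(), i.e. the current state of the chain.
current_lambda = df.assign(y=df.y * 10).loc[lambda d: d.y > 0.5]
current_query = df.assign(y=df.y * 10).query("y > 0.5")
```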
(I'm the first to complain about the many warts in Pandas; I've written multiple books about it. This is annoying, but it is much better than `df[df.y > 0.5]`.)
Using `lambda` without care is dangerous because it risks not being vectorized at all, and thus being super slow, operating one row at a time. Is `d` a single row, the entire series, or the entire dataframe?
In this case `d` is the entire dataframe. It's just a way of "piping" the object without having to rename it.
You are probably thinking of `df.apply(lambda row: ..., axis=1)`, which operates on one row at a time and is indeed very slow since it's not vectorized. The lambda here is different: it receives the whole dataframe, and the operation inside it is vectorized.
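To spell out the difference (no timings shown, but on a large frame the second version is dramatically slower):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y": np.random.rand(1_000_000)})

# Vectorized: the lambda is called once with the whole dataframe, and the
# comparison runs as a single NumPy operation over the column.
fast = df.loc[lambda d: d.y > 0.5]

# Not vectorized: the lambda is called once per row (a million times here),
# which is the slow pattern being warned about above.
slow = df[df.apply(lambda row: row.y > 0.5, axis=1)]
```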
Agreed 100%. I am using this method-chaining style all the time and it works like a charm.
I mean, yes, there are Arrow data types, but they have a long way to go before reaching full parity with the NumPy-backed version.
All I want is for the IDE and Python to correctly infer types and column names for all of these array objects. 99% of the pain for me is navigating around SQL return values and CSVs as pieces of text instead of code.
Nonsense. If you understand how `df[df.sth == val]` works, you'll see it's great. If you don't, you can also do `df.query("sth == val")`.