Comment by lmeyerov
1 year ago
Where I really want this is pandas. The community has been smoothing the basic typing story over the last couple of years, which helps with deprecations & basic API misuses. However, I'm excited for shape/dependent typing over dataframe column names, as that would get more into our typical case of data & logic errors.
You might want to check pola.rs then; it's backed by the Apache Arrow memory model and it's written in Rust. All the columns have a defined type and you can easily catch a mistake when loading data.
Unless I'm misunderstanding, Arrow solves the data representation on disk/memory, both for pandas and polars, while I'm writing about type inferencing during static analysis, which Arrow doesn't solve.
Having a type checking system respect Arrow schemas is indeed our ideal. Will polars, during mypy static type checking invocations, catch something like `df.this_col_is_missing` as an error? If so, that's what we want, that's great!
FWIW, we donated some of the first versions of what became apache arrow ;-)
I've been hunting down column level typing for a while and did not realise polars had this! That's an absolute game changer, especially if it could cover things like nullability, uniqueness etc.
It's not static, it's basically the same as pandas. Your editor will not know the type of a given column or whether it even exists; all of that happens at runtime.
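To illustrate the point with a toy class (not polars itself): column names are plain runtime strings, so a type checker sees any `str` key as valid and the existence check only fires when the code runs.

```python
# Toy stand-in for a dataframe, illustrating why column lookups are
# invisible to mypy: the name is just a runtime string.
class ToyFrame:
    def __init__(self, columns: dict[str, list]) -> None:
        self._columns = columns

    def __getitem__(self, name: str) -> list:
        # The existence check happens here, at runtime, not during mypy.
        if name not in self._columns:
            raise KeyError(f"no such column: {name!r}")
        return self._columns[name]

df = ToyFrame({"price": [1.0, 2.0]})
df["price"]        # fine
try:
    df["pricee"]   # typo: mypy sees a valid str, error surfaces only now
except KeyError as e:
    print(e)
```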
do you have a reference for how to use static typing for polars columns? I haven't seen this in their docs...
Pandera helps with some of this. Check it out -- https://pandera.readthedocs.io/en/stable/
We've used it to great effect.
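For anyone unfamiliar with the idea: the kind of runtime schema contract Pandera provides can be sketched in plain Python (Pandera's real API is `pa.DataFrameSchema` / `pa.Column`; the function below is a hypothetical stdlib-only stand-in, not Pandera code).

```python
# Stdlib-only sketch of runtime schema validation in the spirit of Pandera:
# a schema maps column names to expected types, and validation raises on
# missing columns or wrong value types.
from typing import Any

def validate(rows: list[dict[str, Any]], schema: dict[str, type]) -> None:
    """Raise if any row is missing a column or has a wrongly typed value."""
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if col not in row:
                raise ValueError(f"row {i}: missing column {col!r}")
            if not isinstance(row[col], typ):
                raise TypeError(f"row {i}: {col!r} expected {typ.__name__}")

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
validate(rows, {"id": int, "name": str})  # passes silently
```

As the next comment notes, all of this runs at call time; mypy never sees the schema.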
This is neat, I like the direction!
As far as I can tell, it's runtime, not static, so it won't help during our mypy static checks at all?
As intuited by the poster above, we already do generally stick to Apache Arrow column types for data we want to control. Anything we do there is already checked dynamically, such as at file loads and network IO (essentially contracts), and Arrow IO conversions generally already do checks at those points. I guess this is a lightweight way to add stronger dynamically-checked contracts at intermediate function points?
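The "contracts at intermediate function points" idea can be sketched as a decorator that checks required columns at a function boundary. Everything here is hypothetical (toy `Frame` alias, made-up `expects` decorator), just to show the shape of the dynamic check.

```python
# Hypothetical sketch: a decorator enforcing a column contract at a
# function boundary, using a dict-of-lists as a toy dataframe.
import functools
from typing import Callable

Frame = dict[str, list]  # toy stand-in for a dataframe

def expects(*columns: str) -> Callable:
    def wrap(fn: Callable[[Frame], Frame]) -> Callable[[Frame], Frame]:
        @functools.wraps(fn)
        def inner(df: Frame) -> Frame:
            # Runtime contract: fail fast if any required column is absent.
            missing = [c for c in columns if c not in df]
            if missing:
                raise ValueError(f"{fn.__name__}: missing columns {missing}")
            return fn(df)
        return inner
    return wrap

@expects("price", "qty")
def total(df: Frame) -> Frame:
    return {**df, "total": [p * q for p, q in zip(df["price"], df["qty"])]}

out = total({"price": [2.0, 3.0], "qty": [1, 4]})
print(out["total"])  # [2.0, 12.0]
```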
Column misnaming/typo is indeed a problem in pandas. I think a powerful IDE could do the trick though.
Sort of... the IDE would want the mypy (or otherwise) typings to surface that. Internally, the dataframe library should make it easier for the IDE to see that, vs today's norm of tracking just "Any" / "Index" / "Series" / ... .
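One way a library could surface column names to mypy and IDEs instead of returning `Any`/`Series` is `Literal`-typed overloads on column access. This is a hypothetical sketch of the idea, not an API pandas or polars actually offers.

```python
# Sketch: overloading column access on Literal column names lets mypy
# infer per-column types and flag typos. Hypothetical API.
from typing import Literal, overload

class TypedFrame:
    def __init__(self, price: list[float], qty: list[int]) -> None:
        self._data = {"price": price, "qty": qty}

    @overload
    def col(self, name: Literal["price"]) -> list[float]: ...
    @overload
    def col(self, name: Literal["qty"]) -> list[int]: ...
    def col(self, name: str) -> list:
        return self._data[name]

df = TypedFrame(price=[1.5], qty=[2])
df.col("price")      # mypy infers list[float]
# df.col("prise")    # mypy error: no overload matches Literal["prise"]
```

The obvious catch is that the column set must be known statically, which is why real libraries tend to fall back to `Any`.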
I wish timezone-naive and timezone-aware Timestamps would be different types.
See this post for a comparison of Python datetime libraries. The datetype, heliclockter and whenever libraries have different types for them.
https://dev.arie.bovenberg.net/blog/python-datetime-pitfalls...
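With stdlib `datetime`, naive and aware timestamps share one type, so the best you can do without those libraries is a runtime guard at a function boundary. A minimal sketch (the `require_aware` helper is made up for illustration):

```python
# Stdlib datetime uses a single type for naive and aware timestamps,
# so this check is runtime-only; to mypy, both calls below look identical.
from datetime import datetime, timezone

def require_aware(ts: datetime) -> datetime:
    """Reject timezone-naive timestamps."""
    if ts.tzinfo is None or ts.tzinfo.utcoffset(ts) is None:
        raise ValueError("naive datetime where an aware one is required")
    return ts

require_aware(datetime(2024, 1, 1, tzinfo=timezone.utc))  # ok
try:
    require_aware(datetime(2024, 1, 1))  # naive: only caught at runtime
except ValueError as e:
    print(e)
```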