Comment by lmeyerov
1 year ago
Where I really want this is pandas. The community has been smoothing the basic typing story over the last couple of years, which helps with deprecations & basic API misuses. However, I'm excited for shape/dependent typing over dataframe column names, as that would get more into our typical case of data & logic errors.
You might want to check pola.rs then; it's backed by the Apache Arrow memory model and it's written in Rust. All the columns have a defined type and you can easily catch a mistake when loading data.
Unless I'm misunderstanding, Arrow solves the data representation on disk/memory, both for pandas and polars, while I'm writing about type inferencing during static analysis, which Arrow doesn't solve.
Having a type checking system respect Arrow schemas is indeed our ideal. Will polars, during mypy static type checking invocations, catch something like `df.this_col_is_missing` as an error? If so, that's what we want, that's great!
FWIW, we donated some of the first versions of what became apache arrow ;-)
I've been hunting down column level typing for a while and did not realise polars had this! That's an absolute game changer, especially if it could cover things like nullability, uniqueness etc.
It's not static, it's basically the same as pandas. Your editor will not know the type of a given column or whether it even exists; all of that happens at runtime.
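To illustrate the point with a toy class (not polars itself): column names are plain runtime strings, so a type checker sees any `str` key as valid and the existence check only fires when the code runs.

```python
# Toy stand-in for a dataframe, illustrating why column lookups are
# invisible to mypy: the name is just a runtime string.
class ToyFrame:
    def __init__(self, columns: dict[str, list]) -> None:
        self._columns = columns

    def __getitem__(self, name: str) -> list:
        # The existence check happens here, at runtime, not during mypy.
        if name not in self._columns:
            raise KeyError(f"no such column: {name!r}")
        return self._columns[name]

df = ToyFrame({"price": [1.0, 2.0]})
df["price"]        # fine
try:
    df["pricee"]   # typo: mypy sees a valid str, error surfaces only now
except KeyError as e:
    print(e)
```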
do you have a reference for how to use static typing for polars columns? I haven't seen this in their docs...
Pandera helps with some of this. Check it out -- https://pandera.readthedocs.io/en/stable/
We've used it to great effect.
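For anyone unfamiliar with the idea: the kind of runtime schema contract Pandera provides can be sketched in plain Python (Pandera's real API is `pa.DataFrameSchema` / `pa.Column`; the function below is a hypothetical stdlib-only stand-in, not Pandera code).

```python
# Stdlib-only sketch of runtime schema validation in the spirit of Pandera:
# a schema maps column names to expected types, and validation raises on
# missing columns or wrong value types.
from typing import Any

def validate(rows: list[dict[str, Any]], schema: dict[str, type]) -> None:
    """Raise if any row is missing a column or has a wrongly typed value."""
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if col not in row:
                raise ValueError(f"row {i}: missing column {col!r}")
            if not isinstance(row[col], typ):
                raise TypeError(f"row {i}: {col!r} expected {typ.__name__}")

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
validate(rows, {"id": int, "name": str})  # passes silently
```

As the next comment notes, all of this runs at call time; mypy never sees the schema.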
This is neat, I like the direction!
As far as I can tell, it's runtime, not static, so it won't help during our mypy static checks at all?
As intuited by the poster above, we already do generally stick to Apache Arrow column types for data we want to control. Anything we do there is already checked dynamically, such as at file loads and network IO (essentially contracts), and Arrow IO conversions generally already do checks at those points. I guess this is a lightweight way to add stronger dynamically-checked contracts at intermediate function points?
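The "contracts at intermediate function points" idea can be sketched as a decorator that checks required columns at a function boundary. Everything here is hypothetical (toy `Frame` alias, made-up `expects` decorator), just to show the shape of the dynamic check.

```python
# Hypothetical sketch: a decorator enforcing a column contract at a
# function boundary, using a dict-of-lists as a toy dataframe.
import functools
from typing import Callable

Frame = dict[str, list]  # toy stand-in for a dataframe

def expects(*columns: str) -> Callable:
    def wrap(fn: Callable[[Frame], Frame]) -> Callable[[Frame], Frame]:
        @functools.wraps(fn)
        def inner(df: Frame) -> Frame:
            # Runtime contract: fail fast if any required column is absent.
            missing = [c for c in columns if c not in df]
            if missing:
                raise ValueError(f"{fn.__name__}: missing columns {missing}")
            return fn(df)
        return inner
    return wrap

@expects("price", "qty")
def total(df: Frame) -> Frame:
    return {**df, "total": [p * q for p, q in zip(df["price"], df["qty"])]}

out = total({"price": [2.0, 3.0], "qty": [1, 4]})
print(out["total"])  # [2.0, 12.0]
```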
Column misnaming/typo is indeed a problem in pandas. I think a powerful IDE could do the trick though.
Sort of... the IDE would want the mypy (or otherwise) typings to surface that. Internally, the dataframe library should make it easier for the IDE to see that, vs today's norm of tracking just "Any" / "Index" / "Series" / ... .
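One way a library could surface column names to mypy and IDEs instead of returning `Any`/`Series` is `Literal`-typed overloads on column access. This is a hypothetical sketch of the idea, not an API pandas or polars actually offers.

```python
# Sketch: overloading column access on Literal column names lets mypy
# infer per-column types and flag typos. Hypothetical API.
from typing import Literal, overload

class TypedFrame:
    def __init__(self, price: list[float], qty: list[int]) -> None:
        self._data = {"price": price, "qty": qty}

    @overload
    def col(self, name: Literal["price"]) -> list[float]: ...
    @overload
    def col(self, name: Literal["qty"]) -> list[int]: ...
    def col(self, name: str) -> list:
        return self._data[name]

df = TypedFrame(price=[1.5], qty=[2])
df.col("price")      # mypy infers list[float]
# df.col("prise")    # mypy error: no overload matches Literal["prise"]
```

The obvious catch is that the column set must be known statically, which is why real libraries tend to fall back to `Any`.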
I wish timezone-naive and timezone-aware Timestamps would be different types.
See this post for a comparison of Python datetime libraries. The datetype, heliclockter and whenever libraries have different types for them.
https://dev.arie.bovenberg.net/blog/python-datetime-pitfalls...
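With stdlib `datetime`, naive and aware timestamps share one type, so the best you can do without those libraries is a runtime guard at a function boundary. A minimal sketch (the `require_aware` helper is made up for illustration):

```python
# Stdlib datetime uses a single type for naive and aware timestamps,
# so this check is runtime-only; to mypy, both calls below look identical.
from datetime import datetime, timezone

def require_aware(ts: datetime) -> datetime:
    """Reject timezone-naive timestamps."""
    if ts.tzinfo is None or ts.tzinfo.utcoffset(ts) is None:
        raise ValueError("naive datetime where an aware one is required")
    return ts

require_aware(datetime(2024, 1, 1, tzinfo=timezone.utc))  # ok
try:
    require_aware(datetime(2024, 1, 1))  # naive: only caught at runtime
except ValueError as e:
    print(e)
```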