Comment by jamesblonde

2 months ago

In Data for ML, everything has switched from NumPy (Pandas) to Arrow (Polars, DuckDB, Spark, Pandas 2.x, etc.). However, Scikit-Learn is still a holdout, so it's Arrow from your data sources all the way to the pre-processing pipelines in Scikit-Learn, where you have to go back to NumPy. In practice, it now makes more sense to separate feature pipelines in Arrow from training pipelines with Pandas/NumPy and Scikit-Learn.*

*This is ML, not Deep Learning or Transformers.

1 comment

kccqzy 2 months ago

Most Arrow arrays can be converted into NumPy arrays in a zero-copy manner. And having used both, I personally think Arrow is way buggier than NumPy: PyArrow segfaults for me about once a month when writing pure Python; NumPy has never segfaulted on me.