Comment by postalcoder

12 days ago

I've migrated my workflows from pandas to polars to reap what has been, in my experience, a 10-20x speedup on average. I can't imagine anything bringing me back short of a performance miracle. LLMs have made syntax almost a non-barrier.

Went from pandas to polars to duckdb. As mentioned elsewhere, SQL is the most readable for me, and an LLM does most of the coding on my end (quant). So I need it at the most readable and rudimentary/step-wise level.
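
For illustration only (not the commenter's actual code): a minimal sketch of what "step-wise" SQL can look like, with each CTE doing one small, reviewable thing. It assumes a hypothetical `trades` table and uses Python's built-in `sqlite3` as a stand-in so it runs without extra installs; the query itself uses only standard window functions and CTEs, so the same shape should carry over to duckdb, though that is not tested here.

```python
import sqlite3

# In-memory database with a hypothetical table of daily closing prices.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (symbol TEXT, day TEXT, close REAL)")
con.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [
        ("AAA", "2024-01-01", 100.0),
        ("AAA", "2024-01-02", 110.0),
        ("AAA", "2024-01-03", 99.0),
        ("BBB", "2024-01-01", 50.0),
        ("BBB", "2024-01-02", 55.0),
    ],
)

# Each CTE is one step, so a reviewer can check the steps one at a time.
query = """
WITH ordered AS (          -- step 1: attach the previous close per symbol
    SELECT symbol, day, close,
           LAG(close) OVER (PARTITION BY symbol ORDER BY day) AS prev_close
    FROM trades
),
returns AS (               -- step 2: daily return, dropping each first day
    SELECT symbol, day, close / prev_close - 1.0 AS ret
    FROM ordered
    WHERE prev_close IS NOT NULL
)
SELECT symbol,             -- step 3: summarize per symbol
       COUNT(*) AS n_days,
       MAX(ret) AS best_day
FROM returns
GROUP BY symbol
ORDER BY symbol
"""

rows = con.execute(query).fetchall()
for symbol, n_days, best_day in rows:
    print(symbol, n_days, round(best_day, 4))
```

Splitting the logic into named steps is what makes LLM-written queries easy to audit: any single CTE can be selected from on its own to inspect the intermediate result.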

OT, but I can’t imagine data science being a job category for too long. It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.

  • As a long time DS I sadly feel we filled the field with people who don’t do any actual data science or engineering. A lot of it is glorified BI users who at most pull some averages and run half baked AB tests.

    I don’t think the field will go away with AI, frankly with LLMs I’ve automated that bottom 80% of queries I used to have to do for other users and now I just focus on actual hard problems.

    That “build a self serve dashboard” or number fetching is now an agentic tool I built.

    But the real meat of “my business specializes in X, we need models to do this well” has not yet been replaceable. I think most hard DS work is internal so isn’t in training sets (yet).

  • Even before LLMs, Data Science was being replaced by more specialization, IME.

    Data Engineers took over the plumbing once they moved on from Scala and Spark. ML Engineers took over the modeling (and LLMs are now killing this job too, as it’s rare to need model training outside of big labs). Data analysts have to know SQL and python these days, and most DS are now just this, but with a nicer title and higher pay.

    Once upon a time I thought DS would be much more about deeper statistics and causal inference, but those have proven to be rare, niche needs outside soft science academia.

    • Reading a comment like this makes me realize how broad the title “Data Scientist” is, especially this tidbit:

      > as it’s rare to need model training outside of big labs

      Do you think there are pre-trained models for e.g. process optimization for the primary metallurgy process for steel manufacturing? Industrial engineers don’t know anything about machine learning (by trade), and there are companies that bring specialized Data Science know-how to that industry to improve processes using modern data-driven methods, especially model building.

      It’s almost like 99% of comments on this topic think that DS begins at image classification and ends at LLMs, with maybe a little bit of landing page A/B testing or something. Wild.

      > Once upon a time I thought DS would be much more about deeper statistics and causal inference, but those have proven to be rare, niche needs outside soft science academia.

      This is my entire career lol.

  • > It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.

    Depends what your definition of “to go” means. Responsibilities swallowed by peers? Sure, and new job titles might pop up like Research & Development Engineer or something.

    The discipline of creating automated systems to extract insights from data to create business value? I can’t really see that going anywhere. I mean, why tf would we be building so many data centers if there’s no value in the data they’re storing.

  • << It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.

    This is interesting. I wanted to dig into it a little since I am not sure I am following the logic of that statement.

    Do you mean that AI would take over the field, because by default most people there are already not producing anything that a simple 'talk to data' LLM won't deliver?

    • Not GP, but as a data engineer who has worked with data scientists for 20 years, I think the assessment is unfortunately true.

      I used to work on teams where DS would put a ton of time into building quality models, gating production with defensible metrics. Now, my DS counterparts are writing prompts and calling it a day. I'm not at all convinced that the results are better, but I guess if you don't spend time (=money) on the work, it's hard to argue with the ROI?

also migrated, but to duckdb.

It's funny to look back at the tricks that were needed to get gpt3 and 3.5 to write SQL (e.g. "you are a data analyst looking at a SQL database with table [tables]"). It's almost effortless now.

Same. I don't even use an LLM normally, as I found polars' syntax to be very intuitive. I just searched my ChatGPT history, and the only times I used it were when I was dealing with list and struct columns, which don't exist in pandas.

  • iirc part of pandas’ popularity was that it modeled some of R’s ergonomics. What a time in history, when such things mattered! (To be clear, I’m not making fun of pandas. It was the bridge I crossed that moved me from living in Excel to living in code.)

    • I learned about pandas with R in my class way back when. At the time, it seemed like magic. In a sense, it still does, but things evolve.

Do you not experience LLM generated code constantly trying to use Pandas' methods/syntax for Polars objects?

  • Yes, ChatGPT 5.2 Pro absolutely still does this. Just ask it for a pivot table using Polars and it will probably spit out code with Pandas arguments that doesn’t work.

  • There were some growing pains in the gpt-3.5 to gpt-4 era, but not nowadays (shoutout to the now-defunct Phind, which was a game changer back then).

    • The fact they pivoted away from their very compelling core offering (AI stack overflow) to compete with Lovable etc. in the "AI generated apps" giant fight continues to baffle me. Though I guess model updates ate their lunch.

The speedup you claim is going to be contingent on how you use Pandas, with which data types, and which version of Pandas.

Same. Polars also works with TypeScript, which I used at some point to move my data from backend to frontend.

"10-20x speedup on average."

Is this everyone's experience?

  • It depends on the specifics, but I converted a couple of scripts recently that would take minutes to run with Pandas that only took seconds to run with Polars. I was pretty impressed.

  • That was probably about what I got when I migrated some heavy number crunching code from Pandas to Polars a few years ago. Maybe even better than that.

  • It’s a typical experience. Polars is fast, and Pandas is very slow and memory-hungry. It would be one thing if Pandas had a good API, but it doesn’t.