Comment by dmurray
4 days ago
That's not why it's used in data science though. Lots of data scientists use Python all day and have no concept of ever working in a different field.
It's used in data science because it's used in data science.
It's used in data science because no other language has this level of library support.
And it got this unprecedented level of support because, right from the start, it made clear syntax and (perceived) simplicity its focus.
There is also a sort of cumulative effect from being nice for algorithmic work.
Guido's long-term strategy won out over numerous other strong candidates for this role.
I think the key thing not obvious to most data scientists is they're not using python because it meets their needs, it's because we've failed them. twice.
1. data scientists aren't programmers, so why do they need a programming language? the tools they should be using don't exist. they'd need programmers to make them, and all we have to offer is... more programming languages.
2. the giant problem at the heart of modern software: the most important feature of a modern programming language is being easy to read and write. this feature is conspicuously absent from most important languages.
they're trapped. they can't do what they need without a programming language but there are only a handful they can possibly use. the real reason python ended up with such good library support is they never really had a choice.
When the first scientific libraries were written for Python, most alternatives didn't even consider being readable or convenient. The choice was more like C/C++/Fortran vs Python.
And then Python went into a self-reinforcing loop, with the scientific community coming up with more and more ways to improve Python support for the kind of interactive work that data analysis required. Think IPython -> Jupyter -> Jupyter forks and other Python-centric notebook systems.
So when data analysis evolved into data science and machine learning, GPU-first library vendors already faced a crowd of people who knew Python.
It is crazy how right now one can utilize hundreds of GPUs through these bits of dirty Python wrapped in JSON.
Partially, but it's also because 90% of your work in "data science" isn't direct analysis.
You need to get the data from somewhere. Do you need to scrape it? Python is okay at scraping. Oh, after it's scraped, we looked at it and it's in ObtuseBinaryFormat0.0.LOL.Beta and, what do you know, somebody wrote a converter for that for Python. And we need to clean all the broken entries out of that, and Python is decent at that. Etc.
The trick is that while Python may or may not be anybody's first choice for a particular task, Python is an okay second or third choice for most tasks.
So, you can learn Python. Or you can learn <best language> and <something else>. And if <something else> is Python, was <best language> sufficiently better than Python to be worth the time spent learning it?
But data science usually isn't an island.
Use whatever you want on your one-off personal projects, but use something friendlier to non-data-science work if you ever want your model to run directly in a production workflow.
Productionizing R models is quite painful. The normal way is to just rewrite the model in something other than R.
I've soured a lot on directly productionizing data science code. It's normally an unmaintainable mess.
If you write it in R and then rewrite it in C (better: rewrite it in English with the R as helpful annotations, then have someone else rewrite it in C), at least there is some chance you've thought about the abstractions and operations that are actually necessary for your problem.
That's probably true now, but at one point, they were looking for people to start doing data science, and were pulling people from other domains.