Ok so I am always interested in these papers as a chemist. Often, we find that LLMs are terrible at chemistry. This is because the lived experience of a chemist is fundamentally different from the education they receive. Often, a master's student takes 6 months to become productive at research in a new subfield. A PhD, around 3 months.
Most chemists will begin to develop an intuition. This is where the issues develop.
This intuition is a combination of the chemist's mental model and how the sensory environment stimulates it. As a polymer chemist, in a certain system brown might mean I'm seeing scattering, hence particles. My system is supposed to be homogeneous, so I bin the reaction.
It is well known that good grades don't make good researchers. That's because researchers aren't doing rote recall.
So the issue is this: we ask the LLM how many proton environments are in this NMR spectrum?
We should ask: I’m intercalating Li into a perovskite using BuLi. Why does the solution turn pink?
I think a huge reason why LLMs are so far ahead in programming is because programming exists entirely in a known digital environment, totally severed from our own. To become a master programmer all you need is a laptop and an internet connection. The nature of it existing entirely in a parallel digital universe just lends itself perfectly to training.
All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs; I just think it is in a form that is too difficult to train on right now. However, if you could train a model on it, I strongly suspect they would get to the same level they are at today with software.
> I think a huge reason why LLMs are so far ahead in programming
Are they? Last time I checked (a couple of seconds ago), they still made silly mistakes and hallucinated wildly.
Example: https://imgur.com/a/Cj2y8km (AI teaching me about the Coltrane operator, that obviously does not exist).
17 replies →
>the lived experience of a chemist is fundamentally different from the education they receive. Most chemists will begin to develop an intuition.
Is this a documentation problem? The LLMs are only trained on what is written down. Seems to track with another comment further down quoting:
"Models are limited in ability to answer knowledge-intensive questions, probably because the required knowledge cannot easily be accessed via papers but rather by lookup in specialized databases, which the humans used to answer such questions"
>using BuLi. Why does the solution turn pink?
I would say odds are it's because of an impurity. My first guess might be the solvent, if there is more in play than the reagents or reactants. Maybe it could be confirmed or ruled out by some carefully planned filtration beforehand, which might not even be that difficult. I doubt I would try much further than that unless it was a bad problem.
Although, for instance, an alternative simple purification like distillation is pretty much routine for aniline to get some colorless material, and that's some pretty rough stuff to handle.
Now, I was once a young chemist facing AI. I ended up highly focused on going forward in ways that would not be "taken over" by AI, and I knew I couldn't be slow or a recession still might catch up with me; plus the 1990s were approaching fast ;)
By the mid-1990s I figured there was no way the stuff they have in this paper had not been well investigated.
I always knew it would take people who had way more megabytes than I could afford.
Sheesh, did I overestimate the progress people were making when I wasn't looking.
Just out of curiosity (not knowing anything about butyllithium other than what I've read on 'Things I Won't Work With'), is this answer from o3-pro even close?
https://chatgpt.com/share/685041db-c324-800b-afc6-5cb2c5ef31...
I'm sure an LLM knows more about computer science than a human programmer.
Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.
Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
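To give a flavour of how obscure that gets: a FRACTRAN program is just a list of fractions, and a whole interpreter fits in a few lines. This is only a rough sketch in Python for illustration (the function name and the adder example are my own, not anything from the paper):

    from fractions import Fraction

    def run_fractran(program, n, max_steps=10_000):
        """Repeatedly multiply n by the first fraction that gives an
        integer; halt when no fraction does."""
        for _ in range(max_steps):
            for f in program:
                if (n * f).denominator == 1:
                    n = int(n * f)
                    break
            else:
                return n  # no fraction applied: the program halts
        raise RuntimeError("step limit reached")

    # Conway-style adder: starting from 2**a * 3**b it halts at 3**(a+b).
    adder = [Fraction(3, 2)]
    print(run_fractran(adder, 2**3 * 3**4))  # prints 2187 == 3**7

The point is just how little it takes to define one of these languages, and the model has presumably seen plenty of text like this.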
It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.
It's impressive until you realize its limitations.
Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.
Also, those limitations keep dropping away every six months.
> Do you know every common programing language?
A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".
Her response at the time was "Do they have anything interesting to say in any of them?"
As a foreign English speaker, it's a huge pet peeve of mine when people use acronyms without having written out the full term first. Especially when the acronym is already a word or expression and looking it up just returns a bunch of useless examples (oh!). Eventually I'll find out the meaning (other half), and it always turns out they only saved a total of six or seven letters, which can be typed in less than 0.5 seconds, but in exchange they made their sentence more or less incomprehensible for a large group of people.
14 replies →
> OH
Other half? I've never seen this acronym before.
Is your other half Richard Feynman?
sounds snarky and defensive, tbh
> Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.
And herein lies the fundamental power of the LLM, and why it can even solve "impressive" problems: it is able to navigate a space that humans can't trivially navigate - massive amounts of information, and the ability to parse through walls of simple logic/text.
LLMs are at their best when the context capacity of the human is stretched and the task doesn't really take any reasoning but requires an extraction of some basic, common pattern.
2 replies →
Binwalk, Unicorn... as if that were advanced wizardry. Unix systems have had file(1) since forever, and binutils to and from every arch.
1 reply →
> There's simply so much to know that the LLM has an inherent advantage.
But do they understand it? I mean, a child can use swear words, but does it understand the meaning of the swear words? In another comment, somebody's OH also made a point about artistic ability and the utility of the words spoken.
Does a submarine swim?
It doesn't matter to my employment prospects if the AI "understands" or "thinks", whatever is meant by that, but rather whether potential employers reckon it's good enough to not bother employing me.
So impressive that every complex SUBLEQ program I've tried with an LLM failed really fast.
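For anyone who hasn't run into it: SUBLEQ is a one-instruction architecture (subtract and branch if the result is less than or equal to zero), which is exactly why non-trivial programs in it are brutal to write or check by hand. A minimal interpreter sketch in Python, with a toy program I made up for illustration:

    def run_subleq(mem, pc=0, max_steps=100_000):
        """Each instruction is three cells a, b, c:
        mem[b] -= mem[a]; if the result is <= 0, jump to c (negative c halts)."""
        for _ in range(max_steps):
            a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
            mem[b] -= mem[a]
            if mem[b] <= 0:
                if c < 0:
                    return mem
                pc = c
            else:
                pc += 3
        raise RuntimeError("step limit reached")

    # One instruction: clear cell 3 by subtracting it from itself, then halt.
    print(run_subleq([3, 3, -1, 5])[3])  # prints 0

Tracing a long chain of these by eye is exactly the kind of deep, stateful reasoning that trips the models up.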
But the LLM can already connect things that you cannot, by virtue of its breadth. Some may disagree, but I think it will soon go deeper too.
Received 01 April 2024
Accepted 26 March 2025
Published 20 May 2025
Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.
How so?
To me it looks like the paper was submitted last year but the peer reviewers identified issues with the paper which required revision before the final acceptance in March.
We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.
How has that work been made obsolete?
How so? All the models they've tested are obsolete, multiple generations behind high-end versions.
(Though even these obsolete models did better than the best humans and domain experts).
1 reply →
Yes, this paper and many others will be forgotten as soon as they leave the front page. Afterwards, no one refers to articles like these here. People just talk about anecdotes and personal experiences. Not that I think this is bad.
Fast-moving field? This is a chemistry paper, not an ML paper. ML people have their conferences, which run on much shorter timeframes.
shows the value of preprint servers like arxiv.org and chemrxiv.org
Nice benchmark but the human comparison is a little lacking. They claim to have surveyed 19 experts, though the vast majority of them have only a master's degree. This would be akin to comparing LLM programming expertise to a sample of programmers with less than 5 years of experience.
I'm also not sure it's a fair comparison to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice-versa. This is what they did here, it seems.
I'll get some downvotes for this, but the difference between a PhD and a master's degree is mostly work experience, plus an element of workload hazing and snobbery.
Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.
Sure, but all we know is that these "13 have a master’s degree (and are currently enrolled in Ph.D. studies)". We only know they have at least "2 years of experience in chemistry after their first university-level course in chemistry."
How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?
I think the breadth vs depth thing applies here as well; the PhD will of course know more about the topic they're researching.
Also, books, books are really good for finding knowledge!
Seriously, the "LLMs as a cultural technology" framing casts them as a super-interactive indexing system. I find that's a useful lens for understanding this kind of study.
I asked several LLMs, after jailbreaking them with prompts, to provide viable synthesis routes for various psychoactive substances, and they did a remarkable job.
This was neat to see but also raised some eyebrows for me. A clever kid with some pharmacology knowledge and a basic understanding of organic chemistry could get up to no good.
Especially since you can ask the model to use commonly available reagents and precursors, and for synthesis routes that use the least amount of equipment and glassware.
You need a decent amount of experience to make psychoactive substances. Chemistry is one of those things that looks like you just follow the steps, but in practice it requires a ton of intuition and "feeling it". You can see this if you watch NileRed on YouTube; he is a pretty experienced chemist, and even then he still flops all the time trying to replicate reactions right out of the book.
Besides, the books PiHKAL and TiHKAL lay out how to make most psychoactive substances, and those books have been online for free for decades now.[1][2] Maybe there are easier routes and recipes with easier-to-acquire precursors, but I doubt those would be hard to find. The hardest part by far is the chemistry intuition.
[1]https://erowid.org/library/books_online/pihkal/pihkal.shtml [2]https://erowid.org/library/books_online/tihkal/tihkal.shtml
> You can see this if you watch NileRed on youtube
Or Extractions & Ire[1], along with his other channel Explosions & Fire[2]; he's a PhD student trying to do chemistry in his shed, literally, using stuff you can get from a well-stocked hardware store or such.
Often the steps seem straightforward, but there are details that are not covered in the papers, or the contaminants from using some brand-name household product rather than a pure source screw it up.
Still, his videos are usually quite entertaining regardless of results.
[1]: https://www.youtube.com/@ExtractionsAndIre
[2]: https://www.youtube.com/@explosionsandfire
TiHKAL and PiHKAL are full of syntheses that require equipment and reagents far beyond what a hobbyist would be able to source.
There are various "one-pot" techniques for certain compounds if one is sufficiently clever.
For example, a certain cathinone can be produced by combining ephedrine/pseudoephedrine with a household product that oxidizes secondary alcohols to ketones and letting it sit.
My limited knowledge of both chemistry and LLMs tells me that subtly incorrect chemistry can have disastrous effects, while being subtly incorrect is an LLM superpower, which suggests that this is precisely the inevitable outcome.
What LLMs?
I'm a chemist, and I asked one to show me the structure of a common molecule and it kept getting it really wrong.
> [..] models are [...] limited in [...] ability to answer knowledge-intensive questions [...], they did not memorize the relevant facts. [...] This is probably because the required knowledge cannot easily be accessed via papers [...] but rather by lookup in specialized databases [...], which the humans [...] used to answer such questions [...]. This indicates that there is [...] room for improving [...] by training [...] on more specialized data sources or integrating them with specialized databases.
> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.
Does that mean the world of chemists will be eaten by LLMs? Will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.
It's increasingly looking like if you're young enough most knowledge work will be eaten by LLMs (or the thing that comes next) within your lifetime.
Hopefully we'll see humans assisted by AI, and induced demand, for a good while, but the idea of people working unassisted in knowledge work is gonna go the way of artisan clothing.
so much for those birth rates
How much of this is because Scale AI and others have had human “taskers” create huge amounts of domain-specific content for OpenAI and other foundation model providers?
Nothing to see here unless you have some kind of unsatisfied interest in the future of AI :\
This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)
If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)
Same thing with the industrial environment: some people have just been away from it for too long, regardless of how much familiarity they once had. You need to brush up; sometimes the same plant is like a whole different world if you haven't been back in a while.
BASF Group - will they speak in public? Probably not, given what is at stake, IMHO.