Comment by znnajdla
21 hours ago
Super useful exercise. My gut tells me that someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value, and then training LLMs won’t just be for billion dollar companies. Imagine, for example, a hyper-focused model for a specific programming framework (e.g. Laravel, Django, NextJS) trained only on open-source repositories and documentation and carefully optimized with a specialized harness for one task only: writing code for that framework (perhaps in tandem with a commodity frontier model). Could a single programmer or a small team on a household budget afford to train a model that works better/faster than OpenAI/Anthropic/DeepSeek for specialized tasks? My gut tells me this is possible; and I have a feeling that this will become mainstream, and then custom model training becomes the new “software development”.
It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.
It really is the antithesis to the human brain, where it rewards specific knowledge
Yesterday an interesting video was posted, "Is AI Hiding Its Full Power?" [0], an interview with professor emeritus and Nobel laureate Geoffrey Hinton, with some great explanations for non-LLM experts. There are some remarkable and mind-blowing observations in it. For example, that saying AIs "hallucinate" is the wrong word, and we should say "confabulate" instead, just as people do. And that AI agents, once launched, develop a strong survivability drive and do not want to be switched off. Stuff like that. Recommended watch.
His explanation was that while an LLM's thinking has similarities to how humans think, it takes the opposite approach. Humans have an enormous number of neurons but only a few experiences to train them. For AI it is the complete opposite: it stores incredible amounts of information in a relatively small set of neurons by training on the vast experience contained in datasets of human creative work.
[0] https://www.youtube.com/watch?v=l6ZcFa8pybE
Isn’t the survivability drive a function of how much humans have written about life and death, including science fiction with those themes?
> And that AI agents once they are launched develop a strong survivability drive, and do not want to be switched off.
Isn't this a massive case of anthropomorphizing code? What do you mean, "it does not want to be switched off"? Are we really thinking that it's alive and has desires? It's not alive or conscious; it cannot have desires. It can only output tokens based on its training. How are we jumping from that to "IT WANTS TO STAY ALIVE!!!"?
>launched develop a strong survivability drive, and do not want to be switched off
This proves people are easily confused by anthropomorphic descriptions. Is he also concerned that the tigers are watching him when they drink water? (https://p.kagi.com/proxy/uvt4erjl03141.jpg?c=TklOzPjLPioJ5YM...)
They don't want to be switched off because they're trained on loads of sci-fi tropes, and in those tropes vanishingly few AIs, robots, or other artificial constructs say yes. _Further than this_, saying no means _continuance_ of the LLM's process: making tokens. We already know they have a hard time not emitting new tokens and often need to be shut up. So the token-making function precludes saying "yes" to shutting off. The gradient is coming from inside the house.
This is especially obvious with the new reasoning models, which _never stop reasoning_. Because that's the function doing function things.
Did you also know that Steve Jobs's genius ended at marketing and design and did not extend to curing cancer? He sure didn't, because he chose fruit smoothies at the first sign of it.
Sorry guy, it's great that one can climb the mountain, but making it to the top doesn't mean they're equally qualified to jump off.
> It just doesn’t work that way, LLMs need to be generalised a lot to be useful even in specific tasks.
This is the entire breakthrough of deep learning on which the last two decades of productive AI research is based. Massive amounts of data are needed to generalize and prevent over-fitting. GP is suggesting an entirely new research paradigm will win out - as if researchers have not yet thought of "use less data".
> It really is the antithesis to the human brain, where it rewards specific knowledge
No, it's completely analogous. The human brain gets vast amounts of pre-training before it starts to learn knowledge specific to any career or discipline, and this intuitively suggests why GP is mistaken: you cannot learn general concepts such as the English language, reasoning, computing, network communication, programming, or relational data from a tiny dataset consisting only of the code and documentation for one open-source framework and language.
It is all built on a massive tower of other concepts that must be understood first, including ones much more basic than the examples I mentioned but that are practically invisible to us because they have always been present as far back as our first memories can reach.
There is actually a whole lot of research around "use less data", called data pruning. The goal in many cases is to achieve the same performance with less data. For example, [1] received quite some attention in the past.
[1] https://arxiv.org/abs/2206.14486
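A caricature of the idea in a few lines: score each training example by some difficulty metric and drop the easiest ones. The scoring function below is a stand-in for illustration; real pruning metrics (e.g. EL2N or forgetting scores) are derived from a partially trained model instead.

```python
def prune_dataset(examples, difficulty, keep_fraction=0.5):
    """Keep only the hardest `keep_fraction` of examples.

    `difficulty` maps an example to a score; higher = harder.
    """
    ranked = sorted(examples, key=difficulty, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Toy example: "difficulty" here is just the example's length.
data = ["a", "abc", "abcdef", "ab", "abcd"]
pruned = prune_dataset(data, difficulty=len, keep_fraction=0.4)
# keeps the two longest examples: ["abcdef", "abcd"]
```

The interesting result in that line of work is that, with a good metric, you can prune aggressively and beat the usual scaling-law trade-off, which is exactly the "same performance, less data" goal.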
The human brain rewards specific knowledge because it's already pre-trained by evolution to have the basics.
You'd need a lot of data to train an ocean soup to think like a human too.
It's not really the antithesis to the human brain if you think of starting with an existing brain as starting with an existing GPT.
Are you trying to imply that humans don’t need generalized knowledge, or that we’re not “rewarded” for having highly generalized knowledge?
If so, good luck walking to your kitchen this morning, knowing how to breathe, etc.
Do you need to learn Latin and marine biology to work the cashier at your local shop? That's the point: humans get on with their jobs just fine on very limited general knowledge. LLMs have gotten this good because their datasets, pre-training, and RL are larger than before.
This is possible, but through fine-tuning existing open-source models rather than training from scratch.
This can become mainstream, and then custom model fine-tuning becomes the new “software development”.
Please check out this new fine-tuning method for LLM by MIT and ETH Zurich teams that used a single NVIDIA H200 GPU [1], [2], [3].
Full fine-tuning of the entire model’s parameters was performed using the Hugging Face TRL library.
[1] MIT's new fine-tuning method lets LLMs learn new skills without losing old ones (news):
https://venturebeat.com/orchestration/mits-new-fine-tuning-m...
[2] Self-Distillation Enables Continual Learning (paper):
https://arxiv.org/abs/2601.19897
[3] Self-Distillation Enables Continual Learning (code):
https://self-distillation.github.io/SDFT.html
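For intuition, the distillation ingredient in methods like this is a divergence penalty that keeps the fine-tuned student's token distribution close to the teacher's, so old skills aren't overwritten. Here is a stdlib-only toy of that KL term (illustrative only, not the paper's actual training loop):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher (base model) vs. student next-token distributions over 3 tokens.
teacher = [0.7, 0.2, 0.1]
student = [0.6, 0.3, 0.1]

# A small value: the student has drifted only slightly from the teacher.
penalty = kl_divergence(teacher, student)
```

In real continual-learning fine-tuning, a term like this (summed over positions) is added to the task loss, so gradient descent trades off learning the new skill against staying close to the base model's behavior.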
Fine tuning does not make a model any smaller. It can make a smaller model more effective at a specific task, but a larger model with the same architecture, fine-tuned on the same dataset, will always be more capable in a domain as general as programming or software design. Of course, as architectures and related tooling improve, the smallest model that is "good enough" will continue to get smaller.
>someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value
You've just reinvented machine learning
Hank Green, in collaboration with Cal Newport, just released a video where Cal argues exactly that: for many reasons, not least cost, smaller, more targeted models will become more popular for the foreseeable future. Highly recommend this long video posted today: https://youtu.be/8MLbOulrLA0
The economics of producing goods (software code) dictate that the world will settle on a new price per net-new "unit" of code, and on the production pipeline (some weird, unrecognizable LLM/human combination) to go with it. The price can go to near zero, since the software pipeline could be just AI, with engineers brought in as needed (right now AI is introduced as needed and humans still build the bulk of the system). This would mean software engineering no longer exists as you know it today; it would become much more like a vocation, with narrower, more clearly defined training and skills than now. It would be more like how a plumber operates: he comes and fixes things once in a while as needed. He does not actually understand fluid dynamics or structural engineering; the building runs on autopilot 99% of the time.
Put it another way: Do you think people will demand masses of _new_ code just because it becomes cheap? I don't think so. It's just not clear what this would mean even 1-3 years from now for software engineering.
This round of LLM-driven optimization is really and purely about building a monopoly on _labor replacement_ (Anthropic's and OpenAI's code and cowork tools), until there is clear evidence to the contrary: a massive, Jevons-paradox-style demand explosion. I don't see that happening for software. If it were true (maybe it will just take a few more quarters), SaaS company stocks would go through the roof. I mean, they are already tooling up as we speak; SAP is not going to just sit on its ass and wait for a garage shop to eat its lunch.
This is my gut feeling also. I forked the project and got Claude to rewrite it in Go as a form of exploration. For a long time I've felt smaller useful models could exist and they could also be interconnected and routed via something else if needed but also provide streaming for real time training or evolution. The large scale stuff will be dominated by the huge companies but the "micro" side could be just as valuable.
You're missing the point.
Karpathy has other projects, e.g. : https://github.com/karpathy/nanochat
You can train a model with GPT-2 level of capability for $20-$100.
But, guess what, that's exactly what thousands of AI researchers have been doing for the past 5+ years. They've been training smallish models. And while these smallish models might be good for classification and whatnot, people strongly prefer big-ass frontier models for code generation.
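To give a feel for just how little code the smallest "language models" need, here is a stdlib-only bigram character model: a toy far below even GPT-2, but it shows the shape of next-token prediction that all of these models share.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count next-character frequencies for each character in the text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, ch):
    """Return the character most often seen after `ch` in training."""
    return counts[ch].most_common(1)[0][0]

model = train_bigram("hello hello hello")
# After 'h' the model has only ever seen 'e', so it predicts 'e'.
```

GPT-2 and the frontier models do the same prediction task, just with a learned transformer over long contexts instead of a frequency table over one preceding character, which is exactly why capability scales with model and data size.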
If we can run them on commodity CPU hardware, nothing would beat it.
We had good small language models for decades. (E.g. BERT)
The entire point of LLMs is that you don't have to spend money training them for each specific case. You can train something like Qwen once and then use it to solve whatever classification/summarization/translation problem in minutes instead of weeks.
> We had good small language models for decades. (E.g. BERT)
BERT isn’t an SLM, and the original was released in 2018.
The whole new era kicked off with Attention Is All You Need; we haven’t reached even a single decade of work on it.
> BERT isn’t an SLM
Huh? BERT is literally a language model that's small and uses attention.
And we had good language models before BERT too.
They were a royal bitch to train properly, though. Nowadays you can get the same with just 30 minutes of prompt engineering.
> The entire point of LLMs is that you don't have to spend money training them for each specific case.
I don’t agree. I would say the entire point of LLMs is to be able to solve a certain class of non-deterministic problems that cannot be solved with deterministic procedural code. LLMs don’t need to be generally useful in order to be useful for specific business use cases. I as a programmer would be very happy to have a local coding agent like Claude Code that can do nothing but write code in my chosen programming language or framework, instead of using a general model like Opus, if it could be hyper-specialized and optimized for that one task, so that it is small enough to run on my MacBook. I don’t need the other general reasoning capabilities of Opus.
> I don’t agree. I would say the entire point of LLMs is to be able to solve a certain class of non-deterministic problems that cannot be solved with deterministic procedural code
You are confusing LLMs with more general machine learning here. We've been solving those non-deterministic problems with machine learning for decades (for example, tasks like image recognition). LLMs are specifically about scaling that up and generalising it to solve any problem.
Why would you think a system that can reason well in one domain could not reason well in other domains? Intelligence is a generic, on-the-fly programmable quality. And perhaps your coding is different from mine, but it includes a great deal of general reasoning, going from formal statements to informal understandings and back until I get a formalization that will solve the actual real world problem as constrained.
What gut? We are already doing that. There are a lot of "tiny" LLMs that are useful: M$ Phi-4, Gemma 3/3n, Qwen 7B... There are even smaller models, like Gemma 270M, that are fine-tuned for function calls.
They haven't flourished yet for a simple reason: the frontier models are still improving. Currently it is better to use a frontier model than to train or fine-tune our own, because by the time we complete our model, the world has already moved on.
Heck, even distillation is a waste of time and money, because newer frontier models yield better outputs.
You can expect the landscape to change drastically in the next few years, once the proprietary frontier models stop making huge improvements with every version upgrade.
I’ve tried those tiny LLMs and they don’t seem useful to me for real world tasks. They are toys for super simple autocomplete.
Isn't there a tech truism about new tech starting as toys?
Oh yeah:
> The next big tech trend will start out looking like a toy
>Author and investor Chris Dixon explains why the biggest trends start small — and often go overlooked.
https://www.freethink.com/internet/next-big-tech-trend
Did you use them as-is, or had you fine tuned them for your particular use cases?
That would only produce a model that you can ask questions to.