Comment by yogthos
21 hours ago
You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.
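To make the retry idea concrete, here's a minimal sketch of that loop. This is not ATLAS's actual code: `generate` and `run_tests` are hypothetical callables I made up to stand in for the model call and the sandboxed test run. The second function is the back-of-the-envelope math for why several attempts help, assuming each attempt passes independently with probability `p`. That assumption is only a rough approximation once retries see feedback from earlier failures.

```python
from typing import Callable, Optional


def solve_with_retries(
    generate: Callable[[str, Optional[str]], str],  # hypothetical: (task, feedback) -> code
    run_tests: Callable[[str], Optional[str]],      # hypothetical: code -> error text, or None on pass
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes the tests, or None if none do."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task, feedback)
        error = run_tests(candidate)
        if error is None:
            return candidate
        # Tell the next attempt what went wrong with this one.
        feedback = error
    return None


def chance_any_passes(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - p) ** k
```

For example, if a single attempt works half the time, `chance_any_passes(0.5, 3)` gives 0.875 under the independence assumption.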
But this can be pretty slow, since you have to run the code in an isolated environment, check the outputs, and wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has a shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
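"Similar fingerprints" is usually measured with something like cosine similarity between the embedding vectors. I don't know which metric ATLAS uses, if any, so treat this as a generic illustration of the idea rather than what the repo does:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Two vectors pointing the same way score close to 1, and unrelated (orthogonal) ones score 0.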
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
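The selection step itself is tiny. Here's a sketch under the assumption that you already have an embedding function and a trained scorer. `embed` and `cost` are hypothetical placeholders, not names from the ATLAS repo:

```python
from typing import Callable, List, Sequence

Embedding = List[float]


def pick_lowest_cost(
    candidates: Sequence[str],
    embed: Callable[[str], Embedding],   # hypothetical: code -> fingerprint vector
    cost: Callable[[Embedding], float],  # hypothetical: fingerprint -> score, lower is better
) -> str:
    """Score every candidate's fingerprint and return the lowest-scoring one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [cost(embed(code)) for code in candidates]
    # On ties, min() keeps the earliest candidate.
    best_index = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]
```

Only the returned candidate would then go to the sandbox, which is where the time savings come from.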
Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.
> it's not clear to me based on the description how this could all be done efficiently.
Depends how you define efficiency. The power use of this rig is a lot less than the large data centers that serve trillion parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than the frontier models charge.
Seems like the key insight is to train a small model that acts as a heuristic for embeddings that resemble quality code. I imagine a lot depends on how well this model is trained. And you could probably create specialized versions for different languages and domains.
Another interesting approach could be to use this setup with a language like Clojure or Common Lisp that facilitates interactive development. If you could hook the agent up directly to a REPL in a running program, it could run tests with a lot less overhead.
I'm super confused. The small "cost field" model (`rag-api/geometric_lens/cost_field.py`) was trained on PASS_TASKS like "Write a function that counts vowels in a string." and FAIL_TASKS like "Write a function that converts a regular expression string to an NFA using Thompson's construction, then converts the NFA to a DFA."
So it seems like it's a difficulty classifier for task descriptions written in English.
This is then used to score embeddings of Python code, which is a completely different distribution.
Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.
But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
I tried to read the project documentation, but I got overwhelmed by the aimless AI-generated docs, which have the nebulous goal of documenting absolutely everything but never explain anything.
If the author actually wanted to explain his project, he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference-time learning is cost-prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference-time learning to become viable, and that's exactly what ATLAS does to achieve a 74.6% pass rate on LiveCodeBench and thereby outperform Claude Sonnet with a small 14B open-weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"
That's a very long sequence of unexplained buzzwords that could mean absolutely anything.