Comment by dash2
1 year ago
Something major is missing from the LLM toolkit at the moment: it can't actually run (and e.g. test or benchmark) its own code. Without that, the LLM is flying blind. I guess there are big security risks involved in making this happen. I wonder if anyone has figured out what kind of sandbox could safely be handed to an LLM.
I have experimented with using an LLM to improve the unit test coverage of a project. If you provide the model with test execution results and updated test coverage information, which can be automated, the LLM can indeed fix bugs and improve the tests it created. I found it has a high success rate at creating working unit tests with good coverage. I just used Docker to isolate the LLM-generated code from the rest of my system.
You can find more details about this experiment in a blog post: https://mixedbit.org/blog/2024/12/16/improving_unit_test_cov...
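For anyone curious about the shape of the automation, here's a minimal sketch; the image tag and the call_llm / apply_patch callables are placeholders, not what the blog post actually uses:

    import subprocess

    TEST_IMAGE = "my-project-tests:latest"  # placeholder: image with the project's deps preinstalled

    def run_tests_in_docker(repo_dir: str) -> str:
        """Run pytest with coverage in a throwaway, network-less container; return the output."""
        result = subprocess.run(
            ["docker", "run", "--rm", "--network=none",
             "-v", f"{repo_dir}:/app", "-w", "/app",
             TEST_IMAGE, "pytest", "--cov=.", "--cov-report=term"],
            capture_output=True, text=True, timeout=600,
        )
        return result.stdout + result.stderr

    def improve_tests(repo_dir, call_llm, apply_patch, rounds=5):
        """call_llm and apply_patch stand in for your model client and patching logic."""
        for _ in range(rounds):
            report = run_tests_in_docker(repo_dir)
            changes = call_llm(
                "Latest test results and coverage report:\n" + report +
                "\nImprove the tests: raise coverage and fix any failures.")
            apply_patch(repo_dir, changes)  # write the model's edits back into the repo

The container only sees the mounted repo and has no network access, so the worst a bad test can do is mess up the working copy.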
It depends a lot on the language. I recently tried this with Aider, Claude, and Rust, and after writing one function and its tests the model couldn't even get the code compiling, much less the tests passing. After 6-8 rounds with no progress I gave up.
Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python where it only has to handle the edge cases it wrote tests for and not all the ones the compiler finds for you.
I've found something similar: when you keep telling the LLM what the compiler says, it keeps adding more and more complexity to try to fix the error, and either it works by chance (leaving you with way overengineered code) or it just never works.
I've very rarely seen it simplify things to get the code to work.
Suggestion: Now take the code away, and have the chatbot generate code that passes the tests it wrote.
(In theory, you get a clean-room implementation of the original code. If you do this please ping me because I'd love to see the results.)
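Roughly the loop I have in mind, with call_llm and run_suite as stand-ins for whatever model client and sandbox you'd use:

    def clean_room(original_code: str, call_llm, run_suite) -> str:
        """code -> tests -> code, hiding the original for the second step."""
        tests = call_llm(
            "Write a thorough pytest suite for this module:\n" + original_code)
        # From here on the original code is withheld; only its tests remain.
        candidate = call_llm(
            "Write a module that makes this pytest suite pass:\n" + tests)
        passed, report = run_suite(candidate, tests)  # execute in your sandbox of choice
        return candidate if passed else report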
That’s sort of interesting. If code -> tests -> code really is enough to get a clean-room implementation, I wonder if this sort of tool would test that.
OpenAI is moving in that direction. The Canvas mode of ChatGPT can now run its own Python in a WASM interpreter, client side, and interpret the results. They also have a server-side, VM-sandboxed Code Interpreter mode.
There are a lot of things that people ask LLMs to do, often in a "gotcha" type context, that would be best served by the model actually generating code to solve the problem rather than by endlessly building models with more parameters and more layers. Math questions, data analysis questions, etc. We're getting there.
The new Cursor agent is able to check the linter output for warnings and errors, and will continue to iterate (for a reasonable number of steps) until it has cleared them up. It's not quite executing, but it does improve output quality. It can even back itself out of a corner by restoring a previous checkpoint.
It works remarkably well with typed Python, but struggles miserably with Rust despite having better error reporting.
It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
> It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
What do you mean? Memory management is not related to files in Rust (or most languages).
When did they say that?
I believe that Claude has been running JavaScript code for itself for a bit now[1]. I could have sworn it also runs Python code, but I cannot find any post concretely describing it. I've seen it "iterate" on code by itself a few times now, where it will run a script, maybe run into an error, and instantly re-write it to fix that error.
[1]: https://www.anthropic.com/news/analysis-tool
Gemini can run Python using the Code Execution or Function Calling APIs.
https://ai.google.dev/gemini-api/docs/code-execution
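If I recall the Python SDK correctly, turning this on looks roughly like the snippet below; the model name and the tools argument follow the linked docs and may well have changed since:

    import google.generativeai as genai

    genai.configure(api_key="...")  # your API key
    model = genai.GenerativeModel(
        model_name="gemini-1.5-flash", tools="code_execution")
    response = model.generate_content(
        "Write and run Python code to sum the first 50 prime numbers.")
    print(response.text)  # includes the generated code and its execution output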
I've been closely following this area - LLMs with the ability to execute code in a sandbox - for a while.
ChatGPT was the first to introduce this capability with Code Interpreter mode back in around March 2023: https://simonwillison.net/tags/code-interpreter/
This lets ChatGPT write and then execute Python code in a Kubernetes sandbox. It can run other languages too, but that's not documented or supported. I've even had it compile and execute C before: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
Gemini can run Python (including via the Gemini LLM API if you turn on that feature) but it's a lot more restricted than ChatGPT - I don't believe it can install extra wheels, for example.
Claude added the ability to write and execute JavaScript recently (October), which happens in a sandbox in the user's browser, not on their servers: https://simonwillison.net/2024/Oct/24/claude-analysis-tool/
Claude also has Artifacts, which can write a UI in HTML and JavaScript and show it to the user... but can't actually execute code in a way that's visible to the LLM itself, so it doesn't serve the same feedback-loop purpose as those other tools. https://simonwillison.net/tags/claude-artifacts/
In December ChatGPT added Canvas which can execute Python in the user's browser, super confusing because they already have a separate Python system in Code Interpreter: https://simonwillison.net/2024/Dec/10/chatgpt-canvas/#canvas...
Running code would be a downstream (client) concern. There's the ability to get structured data from LLMs (usually called 'tool use' or 'function calling'), which is the first port of call. Then running it is usually an iterative agent<>agent task where fixes need to be made. FWIW Langchain seems to be what people use to link things together, but I find it overkill.* In terms of actually running the code, there are a bunch of tools popping up at different points in the pipeline (replit, agentrun, riza.io, etc)
What we really need (from an end-user POV) is that kinda 'resting assumption' that LLMs we talk to via chat clients are verifying any math they do. For actual programming, I like Replit, Cursor, ClaudeEngineer, Aider, Devin. There are a bunch of others. All of them now seem to include ongoing 'agentic' steps where they keep trying until they get the response they want, with you as the human in the chain, approving each step (usually).
* I (messing locally with my own tooling and chat client) just ask the LLM for what I want, delimited in some way by a boundary I can easily check for, and then I'll grab whatever is in it and run it in a worker or semi-sandboxed area. I'll halt the stream then do another call to the LLM with the latest output so it can continue with a more-informed response.
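For what it's worth, the 'grab the delimited block and run it' step needs very little machinery. A rough sketch, where call_llm is a placeholder for your client, the <<<CODE>>> markers are an arbitrary boundary I made up, and a subprocess with a timeout stands in for a real sandbox:

    import re
    import subprocess
    import sys

    CODE_BLOCK = re.compile(r"<<<CODE>>>(.*?)<<<END>>>", re.DOTALL)

    def run_round(prompt: str, call_llm) -> str:
        reply = call_llm(prompt + "\nWrap any runnable Python in <<<CODE>>> ... <<<END>>>.")
        match = CODE_BLOCK.search(reply)
        if not match:
            return reply
        result = subprocess.run(
            [sys.executable, "-c", match.group(1)],
            capture_output=True, text=True, timeout=30)
        # Second call: let the model continue with the real output in hand.
        return call_llm(
            "Your code produced this output:\n" + result.stdout + result.stderr +
            "\nContinue, or fix the code if needed.")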
This is a major issue when it comes to things like GitHub Copilot Workspace, which is a project that promises a development environment purely composed of instructing an AI to do your bidding like fix this issue, add this feature. Currently it often writes code using packages that don't exist, or it uses an old version of a package that it saw most during training. It'll write code that just doesn't even run (like putting comments in JSON files).
The best way I can describe working with GitHub Copilot Workspace is like working with an intern who's been stuck on an isolated island for years, has no access to technology, and communicates with you by mailing letters with code handwritten on them that he thinks will work. And also if you mail too many letters back and forth he gets mad and goes to sleep for the day saying you reached a "rate limit". It's just not how software development works
The only proper way to code with an LLM is to run its code, give it feedback on what's working and what isn't, and restate what it should do. Then repeat.
The problem with automating it is that the number of environments you'd need to support in order to run arbitrary code is practically infinite, and with local dependencies it's genuinely impossible unless there's direct integration, which means running it on your machine. And that means giving an opaque service full access to your environment. Or at best, a local model that's still a binary blob capable of outputting virtually anything, but at least it won't spy on you.
Any LLM-coding agent that doesn't work inside the same environment as the developer will be a dead end or a toy.
I use ChatGPT to ask for code examples or sketching out pieces of code, but it's just not going to be nearly as good as anything in an IDE. And once it runs in the IDE then it has access to what it needs to be in a feedback loop with itself. The user doesn't need to see any intermediate steps that you would do with a chatbot where you say "The code compiles but fails two tests what should I do?"
Don't they? It highly depends on the errors. It could be anything from a simple syntax error to a library version mismatch or a functionality deprecation that requires some genuine work to resolve and would need at least some input from the user.
Furthermore, LLMs make those kinds of "simple" errors less and less often, especially if the environment is well defined. "Write a python script" can go horribly wrong, but "Write a python 3.10 script" is most likely gonna run fine and instead have semantic issues where it made assumptions about the problem because the instructions were vague. Performance should increase with more user input, not less.
It can't be done in the LLM itself of course, but the wrapper you're talking about already exists in multiple projects competing on SWE-bench. The simplest one is aider with --auto-test https://aider.chat/docs/usage/lint-test.html
There are also large applications like https://devin.ai/ or https://github.com/AI-App/OpenDevin.OpenDevin
We have it run code, and the biggest thing we find is that it gets into a loop quite fast if it doesn't recognise the error: it fixes it by causing other errors, then fixes those by reintroducing the initial error.
Somewhat related - I wonder if LLMs are trained with a compiler in the loop to ensure they understand the constraints of each language.
This is a good idea. You could take a set of problems, have the LLM solve them, then continuously rewrite the LLM's context window to introduce subtle bugs or coding errors into previous code submissions (using another LLM to stay fully hands-off), and have it try to fix the issues by debugging the resulting compiler or test errors. I don't know to what extent this is already done.
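A sketch of what that loop could look like, with saboteur_llm and fixer_llm as placeholders for the two models, and plain execution standing in for the compiler since Python has no compile step:

    import subprocess
    import sys

    def run_snippet(code: str):
        """Execute the snippet; stderr plays the role of 'compiler or test errors'."""
        return subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=30)

    def make_repair_episode(solution: str, saboteur_llm, fixer_llm, max_rounds=3) -> str:
        """One model plants a subtle bug; the other must debug it from the error output."""
        code = saboteur_llm("Introduce one subtle bug into this code:\n" + solution)
        for _ in range(max_rounds):
            result = run_snippet(code)
            if result.returncode == 0:
                break  # in practice you'd also run the project's test suite here
            code = fixer_llm("This code fails with:\n" + result.stderr +
                             "\nFix it:\n" + code)
        return code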
I don't think that's always true. Gemini seemed to actually run at least some programs, which I believe because asking it to write a Python program that would take forever to run used to make the request time out. For example, the prompt "Write a python script that prints 'Hello, World', then prints a billion random characters" used to just time out on Gemini.
I think there should be a guard that checks the code before running it. It can be a human or another LLM reviewing the code for safety. I'm working on an AI assistant for data science tasks. It works in a Jupyter-like environment, and a human executes the final code by running a cell.
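As an illustration of that guard idea, here's a deliberately crude static screen; a human or a second model reviewing the cell would catch far more than this blocklist does:

    import ast

    FORBIDDEN_CALLS = {"eval", "exec", "system", "popen", "rmtree", "remove", "unlink"}
    FORBIDDEN_IMPORTS = {"socket", "subprocess", "ctypes"}

    def looks_safe(code: str) -> bool:
        """Rough pre-screen before a human (or another LLM) reviews the cell."""
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                names = [alias.name.split(".")[0] for alias in node.names]
                if isinstance(node, ast.ImportFrom) and node.module:
                    names.append(node.module.split(".")[0])
                if set(names) & FORBIDDEN_IMPORTS:
                    return False
            if isinstance(node, ast.Call):
                func = node.func
                name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
                if name in FORBIDDEN_CALLS:
                    return False
        return True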
It'd be great if it could describe the performance of code in detail, but for now just adding a skill to detect if a bit of code has any infinite loops would be a quick and easy hack to be going on with.
Is reliably detecting if code has any infinite loops feasible? Sounds like the halting problem.
It is exactly the halting problem. Finding some infinite loops is possible, there are even some obvious cases, but finding "any" infinite loops is not. In fact, even the obvious cases are not if you take interrupts into account.
I think that's the joke. In a sci-fi story, that would make the computer explode.
It depends how you define reliably.
The halting problem isn't so relevant in most development, and nothing stops you from having a classifier that says "yes", "no" or "maybe". You can identify code that definitely finishes, and you can identify code that definitely doesn't. You can also identify some risky code that might: under condition X it would go into an infinite loop, even if you're not sure whether condition X can ever be met.
Not in the general case, but you could detect specific common patterns.
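For example, one obvious pattern is a 'while True:' whose body contains no break, return or raise. A rough sketch; it both over- and under-flags in corner cases (e.g. sys.exit() inside the loop, or a break that only exits a nested inner loop), so everything it doesn't flag stays in the "maybe" bucket:

    import ast

    def obviously_never_halts(code: str) -> bool:
        """Flag one trivial pattern: 'while True:' with no break/return/raise in its body."""
        for node in ast.walk(ast.parse(code)):
            if (isinstance(node, ast.While)
                    and isinstance(node.test, ast.Constant)
                    and node.test.value is True):
                escapes = (ast.Break, ast.Return, ast.Raise)
                if not any(isinstance(inner, escapes)
                           for child in node.body
                           for inner in ast.walk(child)):
                    return True   # definitely does not halt
        return False              # everything else is a "maybe"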
I believe some platforms like bolt.new do run generated code and even automatically detect and attempt to fix runtime errors.
Ideally you could take this one step further and feed production logs, user session replays and feedback into the LLM. If the UX is what I'm optimizing for, I want it to have that context, not for it to speculate about performance issues that might not exist.
I think the GPT models have been able to run Python (albeit limited) for quite a while now. Expanding that to support a variety of programming languages that exist though? That seems like a monumental task with relatively little reward.
Pretty sure this is done client-side by one of the big LLM companies, so there's virtually no risk for them.
I know of at least one mainstream LLM that can write unit tests and run them right in the chat environment.
godbolt exists and can run code, so surely similar principles could be used here.
ChatGPT runs code. o1 even checks for runtime problems and fixes them "internally".
ChatGPT has a Code Interpreter tool that can run Python in a sandbox, but it's not yet enabled for o1. o1 will pretend to use it though; you have to watch very carefully to check whether that actually happened.
Example transcript here (also showing that o1 can't search but will pretend it can): https://chatgpt.com/share/677420e4-8854-8006-8940-9bc30b7088...
That's a bit like saying the drawback of a database is that it doesn't render UIs for end-users: they are two different layers of your stack, just as evaluation of code and generation of text should be.