Comment by simonw
4 days ago
Opus 4.5 really is something else. I've been having a ton of fun throwing absurdly difficult problems at it recently and it keeps on surprising me.
A JavaScript interpreter written in Python? How about a WebAssembly runtime in Python? How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?
And these are mostly just casual experiments, often run from my phone!
> A JavaScript interpreter written in Python?
I'm assuming this refers to the Python port of Bellard's MQJS [1]? It's impressive and very useful, but leaving out the "based on mqjs" part is misleading.
[1] https://github.com/simonw/micro-javascript?
That's why I built the WebAssembly one - the JavaScript one started with MQJS, but for the WebAssembly one I started with just a copy of the https://github.com/webassembly/spec repo.
I haven't quite got the WASM one into a shareable shape yet, though - the performance is pretty bad, which makes the demos not very interesting.
Isn’t that telling though?
> How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?
How did it do? :-)
Alarmingly well! https://gisthost.github.io/?1bf98596a83ff29b15a2f4790d71c41d...
It couldn't quite beat the Rust implementation on everything, but it managed to edge it out on at least some of the benchmarks it wrote for itself.
(Honestly it feels like a bit of an affront to the natural order of things.)
That said... I'm most definitely not a Rust or C programmer. For all I know it cheated at the benchmarks and I didn't spot it!
Nice. Yeah I'd have to actually look at what it did. For the task of substring search, it's extremely easy to fall into a local optimum. The `memchr` crate has oodles of benchmarks, and some of them are very much in tension with others. It's easy to do well on one at the expense of others.
But still, very neat.
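For anyone curious what that tension looks like, here's a toy Python sketch - completely invented, nothing to do with the memchr crate's actual code - showing how the same skip heuristic wins one benchmark and loses another:

    # Hypothetical illustration: a naive searcher that hops between occurrences
    # of one chosen needle byte, then verifies the full needle at each candidate.
    import timeit

    def find_skip_on(haystack, needle, which):
        """Skip on needle[which]: fast when that byte is rare in the haystack."""
        probe, i = needle[which], 0
        while True:
            j = haystack.find(probe, i + which)
            if j == -1:
                return -1
            start = j - which
            if haystack[start:start + len(needle)] == needle:
                return start
            i = start + 1

    needle = "ab"
    hay_rare_a = "b" * 200_000 + "ab"  # 'a' is rare: skipping on byte 0 is nearly free
    hay_rare_b = "a" * 200_000 + "ab"  # 'b' is rare: skipping on byte 1 wins instead

    for name, hay in [("rare 'a'", hay_rare_a), ("rare 'b'", hay_rare_b)]:
        t0 = timeit.timeit(lambda: find_skip_on(hay, needle, 0), number=3)
        t1 = timeit.timeit(lambda: find_skip_on(hay, needle, 1), number=3)
        print(f"{name}: skip-on-byte-0 {t0:.3f}s, skip-on-byte-1 {t1:.3f}s")

Tune the skip byte for one of those haystacks and you regress on the other - presumably a much cruder version of the trade-offs the real benchmarks exercise.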
What are you using to easily share the conversation as its own webpage? Very nice and tidy.
I have tried giving it extreme problems, like creating a slime mold pathing algorithm or inventing completely new shoe-lacing patterns, and it starts struggling with problems that require visual reasoning and have very little consensus on how to solve them.
I'm not super surprised that these examples worked well. They are complex and a ton of work, but the problems are relatively well defined, with tons of documentation online. Sounds ideal for an LLM, no?
Yes, that's a point I've been trying to emphasize: if a problem is well specified, a coding agent can crunch on it for hours to get to a solution.
Even better if there's an existing conformance suite to point at - like html5lib-tests or the WebAssembly spec tests.
One of my first tests with it was "Write a Python 3 interpreter in JavaScript."
It produced tests, then wrote the interpreter, then ran the tests and worked until all of them passed. I was genuinely surprised that it worked.
There are multiple Python 3 interpreters written in JavaScript that were very likely included in the training data. For example: [1], [2], [3].
I once gave Claude (Opus 3.5) a problem that I thought was for sure too difficult for an LLM, and much to my surprise it spat out a very convincing solution. The surprising part was I was already familiar with the solution - because it was almost a direct copy/paste (uncredited) from a blog post that I read only a few hours earlier. If I hadn't read that blog post, I would have been none the wiser that copy/pasting Claude's output would be potential IP theft. I would have to imagine that LLMs solve a lot of in-training-set problems this way and people never realize they are dealing with a copyright/licensing minefield.
A more interesting and convincing task would be to write a Python 3 interpreter in JavaScript that uses register-based bytecode instead of stack-based, supports optimizing the bytecode by inlining procedures and constant folding, and never allocates memory (all work is done in a single user-provided preallocated buffer). That would require integrating multiple disparate coding concepts rather than regurgitating prior art from the training data. (A rough sketch of the stack-vs-register distinction follows the links below.)
[1] https://github.com/skulpt/skulpt
[2] https://github.com/brython-dev/brython
[3] https://github.com/yzyzsun/PyJS
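To make that concrete, here's a toy sketch (in Python for brevity; everything here is invented rather than taken from any real interpreter) of the same expression compiled to stack bytecode vs. register bytecode, plus a trivial constant-folding pass over the register form:

    # Toy illustration: a + b * c compiled two ways, plus trivial constant folding.

    def run_stack(code, env):
        """Stack bytecode: operands live on an implicit value stack."""
        stack = []
        for op, *args in code:
            if op == "LOAD":
                stack.append(env[args[0]])
            elif op == "MUL":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
        return stack.pop()

    def run_regs(code, regs):
        """Register bytecode: each instruction names its destination and sources."""
        for op, dst, a, b in code:
            regs[dst] = regs[a] * regs[b] if op == "MUL" else regs[a] + regs[b]
        return regs[0]

    def fold_constants(code, known):
        """If both sources of an instruction are known constants, evaluate it now."""
        out = []
        for op, dst, a, b in code:
            if a in known and b in known:
                known[dst] = known[a] * known[b] if op == "MUL" else known[a] + known[b]
            else:
                out.append((op, dst, a, b))
        return out, known

    env = {"a": 2, "b": 3, "c": 4}
    stack_code = [("LOAD", "a"), ("LOAD", "b"), ("LOAD", "c"), ("MUL",), ("ADD",)]
    reg_code = [("MUL", 3, 1, 2), ("ADD", 0, 4, 3)]  # r4=a, r1=b, r2=c

    print(run_stack(stack_code, env))                          # 14
    print(run_regs(reg_code, {0: 0, 1: 3, 2: 4, 3: 0, 4: 2}))  # 14
    # If b and c are compile-time constants, the MUL folds away entirely:
    print(fold_constants(reg_code, {1: 3, 2: 4}))  # ([('ADD', 0, 4, 3)], {1: 3, 2: 4, 3: 12})

One reason the register form is the more interesting target: every instruction names its operands explicitly, so passes like constant folding and inlining can be expressed as simple rewrites over instructions.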
Its ability to test/iterate and debug issues is pretty impressive.
Though it seems to work best when context is minimized. Once the code passes a certain complexity/size, it starts making very silly errors quite often - the exact same code it wrote in a smaller context will come out with random obvious typos, like missing spaces between tokens. At one point it started writing the code backwards (first line at the bottom of the file, last line at the top) :O.
On the other hand when I tried it just yesterday, I couldn't really see a difference. As I wrote elsewhere: same crippled context window, same "I'll read 10 irrelevant lines from a file", same random changes etc.
Meanwhile, half a year to a year ago I could already point whatever model was the flavor du jour at the time at pychromecast, tell it repeatedly "just convert the rest of the functionality to Swift", and it did it. No idea about the quality of the code, but it worked, along with implementations for mDNS and SwiftUI; see the gif/video here: https://mastodon.nu/@dmitriid/114753811880082271 (doesn't include chromecast info in the video).
I think agents have become better, but the models themselves have likely all but plateaued.
Insanely difficult to you, maybe, because you stopped learning. What you cannot create, you don't understand.
Are you honestly saying that building a new spec-compliant WebAssembly runtime from scratch isn't an absurdly difficult project?