Comment by simonw
3 days ago
That's why I built the WebAssembly one - the JavaScript one started with MQJS, but for the WebAssembly one I started with just a copy of the https://github.com/webassembly/spec repo.
I haven't quite got the WASM one into a shareable shape yet though - the performance is pretty bad, which makes the demos not very interesting.
Isn’t that telling though?
A good test might be to provide it with only about a third of the tests, then when it says it's done, run it against the held-out two-thirds and see how well it did. Of course it may have already seen the other tests during training, but that's not relevant here, since the goal is to find out whether it's just "brute force bumbling" its way through the task, relying heavily on the test suite as bumper rails for feedback, or whether it's actually writing generalizable, bug-free code with active awareness of pitfalls and corner cases. (Then again, the result might be invalidated if this specific project was part of the RL training process. Which it may well have been - it's low-hanging fruit to convert any repo with a comprehensive test suite into training data.)
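The holdout idea above is easy to mechanize. Here is a minimal sketch (my own illustration, not from either commenter) that splits a list of test identifiers into a visible third shown to the model and a held-out two-thirds run only after it claims to be done; the `split_tests` name and the fixed seed are assumptions for reproducibility:

```python
import random

def split_tests(test_ids, visible_fraction=1 / 3, seed=0):
    """Split a test suite into a visible set (given to the model)
    and a holdout set (run only after the model says it is done)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(test_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * visible_fraction)
    return shuffled[:cut], shuffled[cut:]

# e.g. 90 tests -> 30 visible, 60 held out
visible, holdout = split_tests([f"test_{i:03d}" for i in range(90)])
print(len(visible), len(holdout))  # 30 60
```

The pass rate on `holdout` alone is then the signal: a model that only pattern-matched against the visible tests should do markedly worse there than on the tests it could see.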
Either way, most tasks don't have the luxury of a thorough test suite, as the test suite itself is the product of arduous effort in debugging and identifying corner cases.