Comment by curioussquirrel

11 hours ago

Why test for it? Because I find it fascinating when something starts being good at a task it is "explicitly not designed for" (a framing I don't necessarily agree with - it's more of a side effect of the architecture).

I also don't agree that nobody is using it for this - there are real-life use cases today, such as people trying to find the meaning of misspelled words.

On a side note, I remember testing Claude 3.7 with the classic "how many R's are in the word strawberry" question through their chat interface. Given that it's really good at tool calls, it actually created a website to a) count the R's with JavaScript and b) visualize the result on a page. The other models I tested for the blog post also gave me Python code for solving the problem. This is definitely already a thing, and it works well for some isolated problems.
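
For illustration, the Python these models typically emit looks something like the sketch below (a hypothetical example, not a transcript of any particular model's actual output):

```python
# Hypothetical sketch of the kind of Python a model emits for the
# letter-counting task; not taken from any specific model's output.
word = "strawberry"
letter = "r"

# str.count() counts non-overlapping occurrences of the substring.
count = word.lower().count(letter)
print(f"There are {count} '{letter}'s in '{word}'.")  # -> 3
```

The point is that offloading the count to actual string operations sidesteps the tokenization issue entirely, which is why this approach works reliably where direct "reading" of the word does not.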

> such as people trying to find the meaning of misspelled words.

That has worked just fine for quite a while. There's apparently enough misspelling in the training data that we don't need precise spelling for this. You can literally write drunken gibberish and it will still work.