Comment by _heimdall

9 hours ago

That's a huge resource cost though, and simply unnecessary. We should be building semantically valid HTML from the beginning rather than leaning on a GPU cluster to parse the function based on the entire HTML, CSS, and JS on the page (or a screenshot requiring image parsing by a word predictor).

That's the point of solving problems with LLMs. We pay a large resource cost, but in return we get general intelligence to understand things.