
Comment by jchw

5 months ago

I've had a pretty similar outlook, and mostly still do, but I think I understand the hype a little better now: I've found that Claude and Gemini 2.0 Pro (experimental) are sometimes able to do things I genuinely didn't expect them to be capable of. That was already true before, to a lesser extent, and I know that capability alone doesn't necessarily translate into usefulness.

So I have been trying Gemini 2.0 Pro, mainly because I have free access to it for now, and I think it lands a bit above merely interesting and into the territory of being useful. It has the same failure modes LLMs have always had, but it has honestly managed to generate code and answer questions that Google was definitely not helping with. When it wasn't tripping over hallucinations or knowledge gaps, the resulting code was shockingly decent: depending on what you asked, it could generate hundreds of lines without an obvious error or bug. The main issues were occasionally missing an important detail or overcomplicating some aspect. The unit tests it generated were subpar, though: it often produced tests that strongly overlapped with each other without adding much value, and, come to think of it, they rarely worked out of the box anyway.

When trying to use it for real-world tasks where I genuinely don't know the answers, I've had mixed results. On a couple of occasions it got me to the right place when Google searches were going absolutely nowhere, so there is clearly value in there somewhere. It was good at generating mundane code (bash scripts, CMake, Bazel, etc.) that looked reasonably well written to me, though I am not yet confident enough to actually use its output. Once it suggested a non-existent linker flag to solve an issue, but in the process it inadvertently suggested a change that actually did fix my problem: it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error in a very old, non-standard build environment, which got my DeaDBeeF plugin building with their upstream apbuild-based system.
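For context on that flag: _GNU_SOURCE is a feature-test macro that tells glibc headers to expose GNU extensions that plain ISO C compilation hides. A minimal sketch of what it gates (strcasestr() here is just an illustrative extension, not anything from the actual plugin):

    /* Defining _GNU_SOURCE before any includes has the same effect as
     * compiling with -D_GNU_SOURCE: glibc headers then declare GNU
     * extensions such as strcasestr(). Without it, this call would not
     * be declared, and on old toolchains the fallout can surface as
     * confusing errors much later in the build. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* strcasestr() is a GNU extension: case-insensitive substring search */
        const char *hit = strcasestr("DeaDBeeF plugin", "deadbeef");
        printf("%s\n", hit ? hit : "not found");
        return 0;
    }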

But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically pull in context, run web searches, try compiling the code or running the tests, and maybe even have other LLMs "review" the work to iterate toward a better state. I'm sure all of that exists, but I'm not really a huge LLM person, so I haven't kept up with it. Given the state frontier models are in, though, I'd like to try that sort of system if it does exist, just to see what the state of the art is capable of.

Even setting that aside, I can see this being useful, especially since Google Search is increasingly unusable.

I do worry, though. If these technologies get better, a lot of engineers will probably struggle to develop deep problem-solving skills, since you will need them far less to get started. Learning to RTFM, dig into code, and generally do research is valuable, and having a bot you can use as an infinite lazyweb may not be the greatest thing for building those skills.