
Comment by jhancock

5 days ago

I'm not a skeptic, but I keep LLMs on a short leash.

This is a thoughtful article. Thanks tptacek.

My LLM use falls into two buckets: (1) tedious stuff, like web pages interacting with a domain back end, and (2) domain discovery.

In a recent adventure, I used Claude 4 to tease out parameters in a large graph schema. This is a combination of tedium and domain discovery (it's not my graph and I'm not a domain expert). On the first day, Claude uncovered attributes and relations no other LLM or Google search had surfaced. And it worked!! The next day, I allowed it to continue. After a bit, the results didn't pass the sniff test.

I checked into the details of Claude's thinking: it had decided to start making up schema attributes and inventing fallback queries on error with more made-up attributes. It was "conscious" of its decision to do so. By the time I caught this, Claude had polluted quite a bit of code. Sure, plenty of well-placed git commits helped in rolling back code... but it's not quite that simple. Sprinkled across those many commits were plenty of learnings I didn't want to toss. It took another two days of carefully going through the code to pull out the good stuff and then roll things back. So now I'm at day five of this adventure with cleaned-up code and notes on what we learned.

I suspect continual improvements on tooling will help. Until then, it's a short leash.

Domain discovery is so underrated. LLMs remove so much friction, making everything incredibly accessible.

If LLMs couldn't do anything else, that alone would still warrant an "invention of the century" sticker.

  • Couldn't agree more. The first time I used Claude Code was for something very much like this. We had a PDF rendering issue with Unicode characters in one of our libraries. We ultimately needed to implement a sort of bespoke font fallback system.

    With the help of the agent, I was able to iterate through several potential approaches and find the gaps and limitations within the space of an afternoon. By the time we got to the end of that process the LLM wrote up a nice doc of notes on the experiments, and *I* knew what I wanted to do next. Knowing that, I was able to give a more detailed and specific prompt to Claude which then scaffolded out a solution. I spent probably another day tweaking, testing, and cleaning up.
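    A bespoke font fallback system of the kind described above can be sketched roughly like this. Everything here is hypothetical for illustration: the font names, the coverage sets, and the helper names are all made up, and a real implementation would read coverage from the fonts' cmap tables rather than hard-coding it.

```python
# Hypothetical sketch of character-level font fallback for text rendering.
# Font names and coverage sets are invented; a real system would derive
# coverage from each font file's cmap table.

FONT_COVERAGE = {
    "BodyFont": set(map(chr, range(0x20, 0x7F))),  # ASCII only
    "CJKFallback": {"漢", "字"},
    "SymbolFallback": {"→", "✓"},
}
FALLBACK_ORDER = ["BodyFont", "CJKFallback", "SymbolFallback"]

def pick_font(ch, default="LastResort"):
    """Return the first font in the chain that covers ch."""
    for font in FALLBACK_ORDER:
        if ch in FONT_COVERAGE[font]:
            return font
    return default  # nothing covers it; use a last-resort font

def segment_runs(text):
    """Split text into (font, substring) runs so the renderer only
    has to switch fonts at run boundaries."""
    runs = []
    for ch in text:
        font = pick_font(ch)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs
```

    The run-segmentation step matters for PDF output in particular, since each text-showing operation is tied to a single embedded font.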

    Overall I think it's completely fair to say that Claude saved me a week of dev time on this particular task. The reading, learning, and iterating I'd have had to do to get the same result would have taken 3-4 days of work on its own (not to mention the hours I might have wasted when I got stuck and scrolled HN for an hour or whatever).

    So it still needed my discernment and guidance - but there's no question that I moved through the process much quicker than I would have unassisted.

    That's worth the $8 in API credit ten times over, and no amount of parroting the "stochastic parrot" phrase (see what I did there?) would change my mind.

    • Yes, that’s true and very cool, but you’re an expert. Where does the next generation of you come from? The ones who never did the weeks of dead-end research that built the resilience, skill, and experience needed to tell that Claude now saves them time? You cannot skip that admittedly tedious part of life for free.

      I think pro-AI people sometimes forget/ignore the second order effects on society. I worry about that.


  • > Domain discovery is so underrated. LLMs remove so much friction that makes everything so incredibly accessible.

    And, unfortunately, they also remove the friction that came from not having access to the information in the first place. I've read a bunch of docs from people talking to Glean to explore a new topic; if it's a topic I'm actually very familiar with, then four out of five times the result is somewhere between misleading and catastrophically wrong. Any internal terminology that doesn't match common usage outside our organization poisons the whole session: it will make things up to join the meanings together, and the prompter is none the wiser.

    I trust AI only as a gap filler in domains where I'm already an expert, or where there's little internal context; anything else is intellectual suicide.

    • I feel like if you're using it in the right way (asking the AI to write code, or to give domain context in a specific form), the answers it gives are easily verifiable, and it's domain knowledge you wouldn't have gotten easily through a series of Google searches. LLMs as a kind of search can work great.

  • No doubt. The tedious stuff is just a time saver. The domain discovery is often the difference between making progress and giving up.

One question is whether, even with all that backpedaling, you feel you could've achieved the same or a similar result in those five days working unassisted. My experience has been that it's a net plus for productivity, but I'm less sure whether I prefer the way the work feels when so much of it is going back and cleaning up after the growth. (Of course, that sounds like a familiar statement for a lot of engineers before LLMs, too.)

  • This is why agents suck.

    Backpedaling is a massive inefficiency.

    A better way is the single clean step approach.

    Use the largest LLM you can. Have it generate a single output for one update.

    If that update has logical errors or drops anything you asked for, restart, refine, and narrow until it gets it right.

    It's quite hard to plan each step right, but the level of complexity you can reach is far higher than with an agent.

    Agents are much better at the shallow/broad problems.

    Large LLMs are exponentially better at deep/narrow problems.

Yeah, I'm impressed with its ability to do stuff, but not quite with its results. We have been working on more AI-assistance adoption, so I asked it to do some decently complex things with JSON/YAML schema definitions and validations (beyond typical JSON Schema, we wanted things like conditional validation, etc.). It wrote a LOT of code, took a long time, and kept telling me it would work, and it didn't. I finally stepped in, eliminated roughly 75% of the code in about 10 minutes, and got it working. It's great at tedious stuff, but outside of that, I'm skeptical.
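For what it's worth, "conditional validation" of the kind mentioned above can often be done in a few lines rather than a pile of generated code. This is a minimal sketch, not the commenter's actual code: the rule format, field names, and `check_conditionals` helper are all hypothetical.

```python
# Hand-rolled conditional validation on top of plain JSON/YAML data:
# "if field A equals X, then fields B, C... are required".
# Rule shape and field names are invented for illustration.

def check_conditionals(doc, rules):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    for rule in rules:
        if doc.get(rule["if_field"]) == rule["equals"]:
            for required in rule["then_require"]:
                if required not in doc:
                    errors.append(
                        f"{required!r} is required when "
                        f"{rule['if_field']!r} == {rule['equals']!r}"
                    )
    return errors

RULES = [
    {"if_field": "type", "equals": "s3", "then_require": ["bucket", "region"]},
]

# A config of type "s3" missing "region" fails; other types pass untouched.
```

(Standard JSON Schema can also express this directly with the `if`/`then` keywords from draft-07 onward, which may be all that's needed before reaching for custom code.)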

  • IMO, you just noted it’s great at creating tedious (but pointless) stuff?