Thoughts on a month with Devin

21 hours ago (answer.ai)

I'm one of the creators of OpenHands (fka OpenDevin). I agree with most of what's been said here, wrt software agents in general.

We are not even close to the point where AI can "replace" a software engineer. Their code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp. I've talked to companies who went all in on AI engineers, only to realize two months later that their codebase was rotting because no one was reviewing the changes.

But once you develop some intuition for how to use them, software agents can be a _massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself. I especially love asking it to do simple, tedious things like fixing merge conflicts or failing linters. It's great at getting an existing PR over the line.

It's also important to keep in mind that these agents are literally improving on a _weekly_ basis. A few weeks ago we were at the top of the SWE-bench leaderboard; now there are half a dozen agents that have pulled ahead of us. And we're one launch away from leapfrogging back to the top. Exciting times!

https://github.com/All-Hands-AI/OpenHands

  • > code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp

    > ..._massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself.

    I'm having trouble reconciling these statements. Where does the productivity boost come from since that reviewing burden seems much greater than you'd have if you knew commits were coming from a competent human?

    • There are often a lot of small fixes that are not time-efficient to do by hand, but where the solution is not much code and is quick to verify.

      If the cost of setting a coding agent (e.g. aider) on a task is small, you can see if it reaches a quick solution and just abort if it spins out. That way you can solve a subset of these types of issues very quickly, instead of leaving them in issue tracking to grow stale. That lets you up the polish on your work.

      That's still quite a different story to having it do the core, most important part of your work. That feels a little further away. One of the challenges is the scout rule, the refactoring alongside change that makes the codebase nicer. I feel like today it's easier to get a correct change that slightly degrades codebase quality, than one that maintains it.
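
The "try it, abort if it spins out" workflow described above can be sketched as a time-boxed subprocess call. This is only a minimal sketch: the command passed in is a stand-in, not a real aider or OpenHands invocation, and the timeout value is an arbitrary assumption.

```python
import subprocess
import sys


def timeboxed_attempt(command: list[str], timeout_seconds: float) -> bool:
    """Run `command`; return True only if it exits cleanly within the time box."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical stand-in for an agent run: a Python one-liner that finishes quickly.
    quick_task = [sys.executable, "-c", "print('patch ready')"]
    print(timeboxed_attempt(quick_task, timeout_seconds=5))
```

In practice you would pair this with something like a throwaway git branch, so an aborted attempt leaves nothing behind to clean up.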

    • I haven't started doing this with agents, but with autocomplete models I know exactly what OP is talking about: you stop trying to use models for things that models are bad at. A lot of people complain that Copilot is more harm than good, but after a couple of months of using it I figured out when to bother and when not to bother and it's been a huge help since then.

      I imagine the same thing applies to agents. You can waste a lot of time by giving them tasks that are beyond them and then having to review complicated work that is more likely to be wrong than right. But once you develop an intuition for what they can and cannot do you can act appropriately.

    • I suspect that many engineers do not expend significant energy on reviewing code; especially if the change is lengthy.

  • We've seen exponential improvements in LLMs' coding abilities. Went from almost useless to somewhat useful in like two years.

    Claude 3.5 is not bad really. I wanted to do a side project that has been on my mind for a few years, and Claude coded it in like 30 seconds.

    So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.

    • > So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.

      These sorts of things can’t be extrapolated. It could be 6 months, or it could be a local maximum / dead end that’ll take another breakthrough in 10 years like transformers were. See self-driving cars.

  • What does the cost look like for running OpenHands yourself? From your docs, it looks like you recommend Sonnet @ $3 / million tokens. But I could imagine this can add up quickly if you are sending large portions of the repository at a time as context.
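
To put rough numbers on the cost question above, here is a back-of-the-envelope sketch. It assumes the quoted $3 per million tokens applies to input tokens only and that the full context is resent on every agent step; the context size and step count are made-up illustrative values, not measurements of OpenHands.

```python
# Back-of-the-envelope input-token cost for one agent session.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, the Sonnet figure quoted above


def session_input_cost(context_tokens: int, steps: int,
                       price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost in USD if `context_tokens` are resent on each of `steps` model calls."""
    total_tokens = context_tokens * steps
    return total_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical session: 50k tokens of repository context, 40 agent steps.
    print(f"${session_input_cost(50_000, 40):.2f}")
```

Output tokens and any prompt caching are ignored here, so treat the result as an order-of-magnitude guide only.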

As someone who uses AI coding tools daily and has done a fair amount of experimentation with different approaches (though not Devin), I feel like this tracks pretty well. The problem is that Devin and other "agentic" approaches take on more than they can handle. The best AI coders are positioned as tools for developers, rather than replacements for them.

GitHub Copilot is "a better tab complete". Sure, it's a neat demo that it can produce a fast inverse square root, but the real utility is that it completes repetitive code. It's like having a dynamic snippet library always available that I never have to configure.

Aider is the next step up the abstraction ladder. It can edit in more locations than just the current cursor position, so it can perform some more high-level edit operations. And although it also uses a smarter model than Copilot, it still isn't very "smart" at the end of the day, and will hallucinate functions and make pointless changes when you give it a problem to solve.

  • When I tried Copilot the "better tab complete" felt quite annoying, in that the constantly changing suggested completion kept dragging my focus away from what I was writing. That clearly doesn't happen for you. Was that something you got used to over time, or did that just not happen for you? There were elements of it I found useful, but I just couldn't get over the flickering of my attention from what I was doing to the suggested completions.

    Edit: I also really want something that takes the existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions. Does Copilot do that now?

    • I have the automatic suggestions turned off. I use a keybind to activate it when I want it.

      > existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions

      What are you actually looking for? Copilot uses "all of github" via training, and your current project in the context.

    • I tried to get used to the tab completion tools a few times but always found it distracting like you describe. Often I’d have a complete thought, start writing the code, get a suggested completion, start reading it, realize it was wrong, but then I’d have lost my initial thought, or at least have to pause and bring myself back to it.

      I have, however, fully adopted chat-to-patch style workflows like Aider, I find it much less intrusive and distracting than the tab completions, since I can give it my entire thought rather than some code to try to complete.

      I do think there’s promise in more autonomous tools, but they still very much fall into the compounding-error traps that agents often do at the present.

    • For cursor you can chat and ask @codebase and it will do rag (or equivalent) to answer your question

    • I would try cursor. It’s pretty good at copy pasting the relevant parts of the codebase in and out of the chat window. I have the tab autocomplete disabled.

    • Cursor tab does that. Or at least, it takes other open tabs into account when making suggestions.

    • I’ve been very impressed with the Gemini autocomplete suggestions in Google Colab, and it doesn’t feel more or less distracting than any IDE’s built-in tab suggestions

  • > The best AI coders are positioned as tools for developers, rather than replacements for them.

    I agree with this. However, we must not delude ourselves and understand that corporate is pushing for replacement. So there will be a big push to improve on tools like Devin. This is not a conspiracy theory, in many companies (my wife's, for example) they are openly stating this: we are going to reduce (aka "lay off") the engineering staff and use as many AI solutions as possible.

    I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert in a unique field, but many of us don't have that luxury. Not everyone can be a cream-of-the-crop specialist. And it'll be used to drive down salaries, too.

    • I remember when I was first getting started in the industry the big fear of the time was that offshoring was going to take all of our jobs and drive down the salaries of those that remained. In fact the opposite happened: it was in the next 10 years that salaries ballooned and tech had a hiring bubble.

      Companies always want to reduce staff and bad companies always try to do so before the solution has really proven itself. That's what we're seeing now. But having deep experience with these tools over many years, I'm very confident that this will backfire on companies in the medium term and create even more work for human developers who will need to come in and clean up what was left behind.

      (Incidentally, this also happened with offshoring— many companies ended up with large convoluted code bases that they didn't understand and that almost did what they wanted but were wrong in important ways. These companies needed local engineers to untangle the mess and get things back on track.)

    • > I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. And it'll be used to drive down salaries, too.

      Yeah it's maddening.

      The cope is bizarre too: "writing code is the least important part of the job"

      Ok then why does nearly every company make people write code for interviews or do take home programming projects?

      Why do people list programming languages on their resumes if it's "least important"?

      Also bizarre to see people cheering on their replacements as they use all this stuff.

  • It's weird to talk about aider hallucinating.

    That's whatever model you chose to use with it. Aider can use any model you like.

I think one of the big problems with Devin (and AI agents in general) is that they're only ever as good as they are. Sometimes their intelligence feels magical and they accomplish things within minutes that even mid level or senior software engineers would take a few hours to do. Other times, they make simple mistakes and no matter how much help you give, they run around in circles.

A big quality that I value in junior engineers is coachability. If an AI agent can't be coached (and it doesn't look like it right now), then there's no way I'll ever enjoy using one.

  • My first job I spent so much time reading Python docs, and the ancient art of Stack Overflow spelunking. But I could intuitively explain a solution in seconds because of my CS background. I used to encounter a certain kind of programmer often, who did not understand algorithms well but had many years of experience with a language like Ruby, and thus was faster in completing tasks because they didn't need to do the reference work that I had to do. Now I think these kinds of programmers will slowly disappear and only the ones with the fast CS intuition will remain.

    • I disagree. If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).

      If anything, my gut says that the CS concepts are very easy for LLMs to recall and will be the first things replaced (if ever) by AI. Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for a long time.

      There's also the meme in the industry that self-taught, non-CS degree engineers are potentially of the most capable group. Though this is anecdotal.

  • I completely agree with you. More precisely, I feel they are useful when you have specific tasks with limited scope.

    For instance, just yesterday I was battling with a complex SQL query and I got halfway there. I gave our bot the query and a half-assed description of what I wanted/what was missing and it got it right on the first try.

  • And when working with people it's fairly easy to intervene and improve when needed. I think the current working model with LLMs is definitely suboptimal when we cannot confine their solution space AND where they should apply a solution precisely, and timely.

    • It’s also often possible to know what a human will be bad at before they start. This allows you to delegate tasks better or vary the level of pre-work you do before getting started. This is pretty unpredictable with LLMs still.

One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?

One of the more important features of agents is supposedly that they can stop and ask for human input when necessary. It seems it does do this for "hard stops" - like when it needed a human to set up API keys in their cloud console - but for "soft stops" it wouldn't.

By contrast, a human dev would probably throw in the towel after a couple of hours and ask a senior dev for guidance. The chat interface definitely supports that with this system but apparently the agent will churn away in a sort of "infinite thinking loop". (This matches my limited experience with other agentic systems too.)

  • LLMs can create infinite worlds in the error messages they’re receiving. It probably needs some outside signal to stop and re-assess. I don’t think LLMs have any ability to reason if they’re lost in their own world on their own. They’ll just keep creating new, less and less coherent context for themselves.

    • If you correct an LLM based agent coder, you are always right. Often, if you give it advice, it pretends like it understands you, then goes on to do something different from what it said it was going to do. Likewise, it will outright lie to you telling you it did things it didn't do. (In my experience)

    • For sure - but if I'm paying for a tool like Devin then I'd expect the infrastructure around it to do things like stop it if it looks like that has happened.

      What you often see with agentic systems is that there's an agent whose role is to "orchestrate", and that's the kind of thing the orchestrator would do: every 10 minutes or so, check the output and elapsed time and decide if the "developer" agent needs a reality check.
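
The periodic orchestrator check described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name, the two-hour limit, and the "same output three checks in a row" loop heuristic are all hypothetical, not how Devin or any particular agent framework works.

```python
from collections.abc import Sequence


def needs_reality_check(elapsed_seconds: float,
                        recent_outputs: Sequence[str],
                        max_seconds: float = 2 * 60 * 60,
                        repeat_limit: int = 3) -> bool:
    """Decide whether a worker agent should be paused so a human can step in."""
    # Assumed limit: roughly when a human dev would ask a senior for guidance.
    if elapsed_seconds > max_seconds:
        return True
    # Treat the same output several checks in a row as a sign of a loop.
    tail = list(recent_outputs)[-repeat_limit:]
    return len(tail) == repeat_limit and len(set(tail)) == 1


if __name__ == "__main__":
    # Ten minutes in, but the last three checks all saw the same error.
    print(needs_reality_check(600, ["ImportError", "ImportError", "ImportError"]))
```

An orchestrator would call something like this on a timer (every 10 minutes or so, per the comment above) and escalate whenever it returns True.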

  • I think training it to do that would be the hard part.

    - stopping is probably the easy part

    - I assume this happens during RLHF phase

    - Does the model simply stop or does it ask a question?

    - You need a good response or interaction, depending on the query? So probably sets or decision trees of them, or agentic even? (chicken-egg problem?)

    - This happens 10s of thousands of times, having humans do it, especially with coding, is probably not realistic

    - Incumbents like M$ with Copilot may have an advantage in crafting a dataset

  • > One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?

    You are over-estimating the sophistication of their platform and infrastructure. Everyone was talking about Cursor (or maybe was it astroturfing?) but once I checked it out, it was not far from avante on neovim.

    • Cursor isn't designed to do long running tasks. As someone mentioned in another comment it's closer to a function call than a process like Devin.

      It will only do one task at a time that it's asked to do.

  • Devin does ask for help when it can't do something. I think I had it ask me how to use a testing suite it had trouble running.

    The problem is it really, really hates asking for help if it has a skill issue; it would rather run in circles than admit it just can't do something.

  • If they had built that from the beginning people would have said "every other task it asks me for help, how is it a developer then if I have to assist it all the time?"

    But now since you are okay with that, I think it's the right time to add that feature.

  • You can set a "max work time" before it pauses so it won't go for days endlessly spending your credits. By default it's set to 10 credits.

    So I'm not sure how the author got it to go for days.

  • There should be an energy coefficient to problems. You only get a set amount of energy to spend per issue. When the energy runs out, a human must help.
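
The "energy per issue" idea above (and the "max work time" credit cap mentioned earlier) can be sketched as a simple budget that every attempt draws from. All names here are hypothetical, and the unit of "energy" is left abstract; it could stand for credits, tokens, or minutes.

```python
class EnergyBudget:
    """Fixed allowance per issue; each attempt spends some of it."""

    def __init__(self, units: int) -> None:
        self.remaining = units

    def spend(self, cost: int = 1) -> bool:
        """Deduct `cost`; return False once the budget cannot cover it."""
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True


def solve_with_budget(attempt, budget: EnergyBudget) -> str:
    """Call `attempt()` until it succeeds or the budget runs out."""
    while budget.spend():
        if attempt():
            return "solved"
    return "needs human help"


if __name__ == "__main__":
    # An attempt that never succeeds exhausts the budget and escalates.
    print(solve_with_budget(lambda: False, EnergyBudget(3)))
```

The point of the design is that escalation to a human is the default outcome when the budget is spent, rather than something the agent has to decide to do.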

I'm sure a lot of folks in these comments predicted these sorts of results with surprising accuracy.

Stuff like this is why I scoff when I hear about CEOs freezing engineering hiring or saying they just don't need mid-level engineers anymore because they have AI.

I'll start believing that when I see it happening, and see actual engineers saying that AI can replace a human.

I am long AI, but I think the winning formula is small, repetitive tasks with a little too much variation to make it worth it (or possible) to automate procedurally. Pulling data from Notion into Google sheets, like these folks did initially, is probably fine. Having it manage your infrastructure and app deployments, likely not.

This feels a bit like AI image generation in 2022. The fact that it works at all is pretty mindblowing, and sometimes it produces something really good, but most of the time there are obvious mistakes, errors, etc. Of course, it only took a couple more years to get photorealistic image outputs.

A lot of commenters here seem very quick to write off Devin / similar ideas permanently. But I'd guess in a few years the progress will be remarkable.

One stubborn problem – when I prompt Midjourney, what I get back is often very high-quality, but somehow different than what I expected. In other words, I wouldn't have been able to describe what I wanted, but once I see the output I know it's not quite right. I suspect tools like this will run into similar issues. Maybe there will be features that can help users 'iterate' quickly.

  • > Of course, it only took a couple more years to get photorealistic image outputs.

    "Photorealistic" is a pretty subjective judgement, whereas "does this code produce the correct outputs" is an objective judgement. A blurry background character with three arms might not impact one's view of a "photorealistic" image, but a minor utility function returning the wrong thing will break a whole program.

Those “how I feel about Devin after using it” comments at the bottom are damning, when you compare them to the user testimonials of people using cursor.

Seems to me that agents just aren’t the answer people want them to be, just a hype wave obscuring real progress in other areas (eg. MCST) because they’re easy to implement.

…but really, if things are easy to implement, at this point, you have to ask why they haven’t been done yet.

Probably, it seems, because it’s harder to implement in a way that’s useful than it superficially appears…

Ie. If the smart folk working on Devin can only do something of this level, anyone working on agentic systems should be worried, because it’s unlikely you can do better, without better underlying models.

  • How is Devin different from cursor?

    I recently used cursor and it has felt very capable in implementing tasks across files. I get that cursor is an IDE but its AI functionality feels very agentic. Where do you draw the line?

    • Cursor Composer (both "normal" and "agent" mode) fit the colloquial definition of agent, for sure.

    • I had to look up MCST: it means Model-Centric Software Tools, as opposed to autonomous agents.

      Devin is closer to a long-running process that you can interact with as it is processing tasks, whereas Cursor is closer to a function call: once you've made the call, the only thing you can do is wait for the result.
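
The function-call versus long-running-process distinction above can be illustrated with a plain function next to a generator that accepts feedback mid-run. This is a toy analogy only; neither function reflects how Cursor or Devin is actually implemented, and both names are made up.

```python
def one_shot_edit(request: str) -> str:
    """Function-call style: hand over the request, get one result back."""
    return f"patch for: {request}"


def long_running_session(request: str):
    """Process style: yields progress and accepts guidance between steps."""
    plan = request
    for step in range(1, 4):
        guidance = yield f"step {step}: working on {plan}"
        if guidance:
            # Feedback sent mid-run changes what the remaining steps do.
            plan = guidance


if __name__ == "__main__":
    print(one_shot_edit("add login page"))

    session = long_running_session("add login page")
    print(next(session))                             # first progress report
    print(session.send("add login page with SSO"))   # steer it while it runs
```

With the generator you can inspect each progress report and redirect the work before it finishes; with the plain function, all you can do is wait for the return value.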

  • Agents are really new and would solve plenty of annoying things.

    When I code with Claude, I have to copy paste files around.

    But everything we do in AI is new and outdated a few weeks ago.

    Claude is really good but blocks you in 1-3h for a bit due to context length.

    That type of issues will be solved.

    And local coding models are super fast on a 4090 already. Imagine a small Project Digits on your desktop where you also allow these models more thinking. But the thinking style models again are super new.

    Things probably are not done yet because we humans are the bottleneck right now. Getting enough chips, energy, standards, training time, doing experiments with tech a while tech b starts to emerge from another corner of ai.

    The 5090 was just announced and, depending on benchmarks, it might be 1.x-3 times faster. If it's more than 1.5 times faster, that would again be huge.

Disclosure: Working on a company in the space and have recently been compared to Devin in at least one public talk.

Devin has tried to do too much. There is value in producing a solid code artifact that can be handed off for review to other developers in limited capacities like P2s and minor bugs which pile up in business backlogs.

Focusing on specific elements of the development loop such as fix bugs, add small feature, run tests, produce pull request is enough.

Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.

  • Not to take away from your opinion, but I guess time will tell? As models get better, it's possible that wide tools like Devin will work better and swallow tools that do one thing. I think companies would much rather have an AI solution that works like what they already know (developers) than one that works in the IDE, another that watches GitHub issues, another that reviews PRs, and one that hangs out on Slack and makes small fixes.

    > Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.

    Interest isn't what tools like Devin are lacking, (un)fortunately.

    To be clear, I do share a lot of scepticism regarding all the businesses working around AI code generation. However, that isn't because I think they'll never be able to figure it out, but because I think they are all likely to figure it out at the end, at the same time, when better models come out. And none of them will have a real advantage over the other.

    • I've recently had several enterprise level conversations with different companies and what we're being asked for is specifically the simpler approach. I think that is the level of risk they're willing to tolerate and it will still ameliorate a real issue for them.

      The key here is my product is no worse positioned to do more things if and when the time comes, but building a solid foundation and trust, and not having the quiet part be (which I heard as early as several months ago) that your product doesn't work means we'll hopefully still have the customer base to roll that out to.

      I've talked to Devin's CEO once at Swyx's conference last June, they're very thoughtful and very kind so this must be very rough but between when they showed their demo then and what I'm hearing now the product has not evolved in a way where they are providing value commensurate with their marketing or hype.

      I'm a fan of Guillermo Rauch's (Vercel CEO) take on these things. You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.

      Devin's investment was fueled by hyperspeculation early on when no one knew what the shape of the game was. In many ways we still don't, but if you burn your reputation before we get there you may not be able to capitalize on it.

      To be completely fair to them, taking the long view and the bank account to go with it they may still be entirely fine.

  • You can get a much higher hit rate with more constrained agents, but unfortunately if it's too constrained it just doesn't excite people as much.

    Ex. the Grit agent (my company) is designed to handle larger maintenance tasks. It has a much higher success rate, with <5% rejected tasks and 96% merged PRs (including some pretty huge repos).

    It's also way less exciting. People want the flashy tool that can solve "everything."

Also trialed Devin, it's quite impressive when it understands the code formatting and local test setup, producing well formatted and test case passing code, but it seems to always add extraneous changes beyond the task that can break other things. And it can't seem to undo those changes if you ask. So everything requires more cleanup. Devin opened my eyes to the power of agentic workflows with closed loop feedback, and the coolness of a slack interface, but I am gonna recommend cancelling it because it's not actually saving time and it's quite expensive.

I’ve used Cursor a lot and the conclusion doesn’t surprise me. I feel like I’m the one *forcing* the system in a certain direction and sometimes an LLM gives a small snippet of useful code. Sometimes it goes in the wrong direction and I have to abort the suggestion and force it into another direction. For me, the main benefit is having a typing assistant which can save me from typing one line here and there. Especially refactorings is where Cursor shines. Things like moving argument order around or adding/removing a parameter at function callsites is great. Saved me a ton of typing and time already. I’m way more comfortable just quickly doing a refactoring when I see one.

  • Weird. I have such a different experience with Cursor.

    Most changes occur with a quick back and forth about top level choices in chat.

    Followed with me grabbing appropriate interfaces and files for context so Sonnet doesn't hallucinate API, and then code that I'll glance over and around half the time suggest one or more further changes.

    It's been successful enough I'm currently thinking of how to adjust best practices to make things even smoother for that workflow, like better aggregating package interfaces into a single file for context, as well as some notes around encouraging more verbose commenting in a file I can provide as context as well on each generation.

    Human-centric best practices aren't always the best fit, and it's finally good enough to start rethinking those for myself.
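
The "aggregate package interfaces into a single file for context" idea above could be sketched for a Python codebase as follows. This is one possible approach, not the commenter's actual setup: the function names are hypothetical, only top-level public functions and classes are listed, and `ast.unparse` needs Python 3.9 or later.

```python
import ast
from pathlib import Path


def summarize_interfaces(source: str) -> list[str]:
    """Return one signature line per public top-level function or class."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not node.name.startswith("_"):
                prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
                returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
                lines.append(f"{prefix} {node.name}({ast.unparse(node.args)}){returns}")
        elif isinstance(node, ast.ClassDef) and not node.name.startswith("_"):
            lines.append(f"class {node.name}")
    return lines


def build_context_file(package_dir: str, output_path: str) -> None:
    """Write the public interface of every .py file under package_dir to one file."""
    sections = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        signatures = summarize_interfaces(path.read_text(encoding="utf-8"))
        if signatures:
            sections.append(f"# {path}\n" + "\n".join(signatures))
    Path(output_path).write_text("\n\n".join(sections) + "\n", encoding="utf-8")
```

The resulting file is small enough to paste into a prompt, which is the whole point: the model sees real signatures instead of guessing at them.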

    • This! I've been using Cursor regularly since late 2023. It's all about building up effective resources to tactfully inject into prompts as needed. I'll even give it sample API responses in addition to API docs. Sometimes I'll have it first distill API docs down into a more tangible implementation guide and then save that as a file in the codebase.

      I think I'm just a naturally verbose person by default, and I'm starting to think that has been very helpful in me getting a lot out of my use of LLMs and various LLM tools over the past 2+ years.

      I treat them like the improv actors they are and always do the up front work to create (with their assistance) the necessary broader context and grounding necessary for them to do their "improv" as accurately as possible.

      I honestly don't use them with the immediate assumption I'll save time (although that happens almost all the time), I use them because they help me tame my thoughts and focus my efforts. And that in and of itself saves me time.

    • This is what’s needed to get the most out of these tools. You understand deeply how the tool works and so you’re able to optimize its inputs in order to get good results.

      This puts you in the top echelon of developers using AI assisted coding. Most developers don’t have this deep of an understanding and so they don’t get results as good as yours.

      So there’s a big question here for AI tool vendors. Is AI assisted coding a power tool for experts, or is it a tool for the “Everyman” developer that’s easy to use?

      Usage data shows that the most adopted AI coding tool is still ChatGPT, followed by Copilot (even if you’d think it’s Cursor from reading HN :-))

  • I'll add a few things at which Cursor with Claude is better than us (at least in time/effort):

    - explaining code. Take some legacy part of your code nobody understands: LLMs aren't limited to keeping a few things in memory like we are. Even if the code is very obfuscated and poorly written it can understand what it does and the purpose and suggest refactors to make it understandable

    - explaining and fixing bugs. Just the other day Antirez posted about debugging a Redis segfault in some C code by providing context and a stack trace. This might be hit or miss at times, but more often than not it saves you hours

    - writing tests. It often comes up with many more examples and edge cases than I thought of. If it doesn't, you can always ask it to.

    In any case I want to stress that LLMs are only as good as your data and prompts. They lack the nuance of understanding lots of context, yet I see people talking to them like humans that understand the business, best practices and others.

    • That first one has always felt super crazy to me, I've figured out what lots of "arcane magic, don't touch" type of functions genuinely do since LLMs have become a thing.

      Even if it's slightly wrong it's usually at least in the right ballpark so it gives you a very good starting point to work from. Almost everything is explainable now.

  • I think the .cursorrules and .cursorignore files might be useful here.

    Especially the .cursorrules file, as you can include a brief overview of the project and ground rules for suggestions, which are applied to your chat sessions and Cmd/Ctrl K inline edits.

So for anyone who doubted SWE-bench's relevance to typical tasks, it seems that its stated 13.86% almost exactly matches this pilot outcome of 3 successes out of 20.
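
For reference, 3 out of 20 is 15%. A quick sketch of how much uncertainty a 20-task sample carries, using the standard Wilson score interval; the choice of interval and the 95% level are my assumptions, not anything stated in the article.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(3, 20)
    print(f"observed {3 / 20:.0%}, interval {low:.1%} to {high:.1%}")
```

The interval comes out at roughly 5% to 36%, so the pilot is consistent with 13.86% but also with a fairly wide range of other success rates.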

We're not quite there yet, but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction. I would now expect pretty much the textbook disruptive innovation process over the next decade or so, until the typical human dev role is pushed to something more akin to the responsibilities of current day architects and product managers. QA engineering though will likely see a big resurgence.

  • >> but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction.

    Can you explain why you think this? From what I gather from other comments it seems like if we continue on the current trajectory, at best you'd still need a dev who understands the project's context to work in tandem w/ the agent so the code doesn't devolve into slop.

    • > so the code doesn't devolve into slop

      As I see it, this is pretty much a given across all codebases, with a natural tendency of all projects to become balls of mud if the developer(s) don't actively strive to address technical debt and continuously refactor to address the new needs. But having said that, my experience is that for a given task in an unfamiliar codebase, an AI agent is already better at maintaining consistency than a typical junior developer, or even a mid-level developer who recently joined the team. And when explicitly given the task of refactoring the codebase while keeping the tests passing, the AI agents are already very capable.

      The biggest issue, which is what you may be alluding to, is that AI agents are currently very bad at recognizing the limits of their capabilities and continue trying an approach when a human dev would have long since given up and gone to their lead to ask for help or for the task specification to be redefined. That's definitely an issue, but I see it as something addressable via an engineering effort rather than a fundamental technological limitation.

      In general, I've seen so many benchmarks fall to AI over the past decade (including SWE-BENCH) that I'm now quite confident that if a task being performed by humans can be defined with clear numerical goals, then it's achievable by AI.

      And another way I'm looking at it is that for any specific knowledge work competency, it seems to already be much easier and more time-effective to train an AI to do well on it than to create a curriculum for humans to learn it and then have every single human go through it.

This only reinforces my bias against AI agents. At this point, they are mostly just hype. I believe that for AI to replace a junior, we would need to achieve at least near-AGI, and we are far from that.

  • If by hype you mean that there isn't extreme real world value right here and right now, then I very much disagree.

    Closing in on 20 years since I left school and for me AI is absolutely useful, right here and right now. It is really a bicycle for the mind:

    It allows me to get much faster to where I want. (And like bicycles you will get a few crashes early on and possibly later as well, depending on how fast you move and how careful you are.)

    I might be in some sweet spot where I am both old enough to know what is going on without using an AI but also young enough to pick up the use of AI relatively effortlessly.

    If, however, by hype you mean that people still have overhyped expectations about the near future, then yes, I agree more and more.

    • I feel AI can also do simple monotonous coding tasks, but I don't think programming is something it's currently very good at. Samples, yes, trivial programs, sure, but anything non-trivial and it's rarely useful.

      Where it really shines today is getting humans up to speed with new technologies, things that are well understood in general but maybe not well understood by you.

      Want to, say, build a window manager in X11 despite never having worked with X11 before? Sure, Claude will point you in the right direction and give you a simple template to work with in 30 seconds. Enormous time saver compared to figuring out how to do that from scratch.

      Never touched Node in your life but want to build a simple Electron app? Sure, here's how you get started. A few hours and several follow-up questions later, you're comfortable and productive in the environment.

      Getting off the ground with new technologies is so much easier with AI it's kind of ridiculous. The revolutionary part of AI coding is how it makes it much easier to be a true generalist, capable of working in any environment with any technology, whatever is appropriate.

  • Exactly. LLMs are gullible. They will believe anything you tell them, including incorrect things they have told themselves. This amplifies errors greatly, because they don't have the capacity to step back and try a different approach, or introspect why they failed. They need actual guidance from somebody with much common sense; if let loose in the world, they mostly just spin around in circles because they don't have this executive intelligence.

    • A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish about DeepMind, once they fully enter this race.

      1 reply →

What model does Devin use? How would it change if it used o1 or even o3 for times when it gets stuck?

I.e., generate the initial code using GPT-4o/Claude 3.5, then start testing the code; when it gets stuck, use o1/o3 to help.
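
The workflow proposed above could be sketched roughly as follows. This is a sketch under stated assumptions, not how Devin works: `Model` and `TestRunner` are hypothetical callables standing in for real model calls and a real test harness, and the retry counts are arbitrary.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: a "model" maps a prompt to code, and a "test runner"
# maps code to (passed, feedback). Neither corresponds to a real API.
Model = Callable[[str], str]
TestRunner = Callable[[str], Tuple[bool, str]]


def solve_with_escalation(
    task: str,
    fast_model: Model,
    strong_model: Model,
    run_tests: TestRunner,
    fast_attempts: int = 3,
    strong_attempts: int = 2,
) -> Optional[str]:
    """Try the cheaper model first; hand over to the stronger one if it stays stuck."""
    prompt = task
    for model, attempts in ((fast_model, fast_attempts), (strong_model, strong_attempts)):
        for _ in range(attempts):
            code = model(prompt)
            passed, feedback = run_tests(code)
            if passed:
                return code
            # Feed the failure back so the next attempt can react to it.
            prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nTest feedback:\n{feedback}"
    return None  # Both tiers exhausted: time to ask a human.


# Stand-in callables so the sketch runs without any model access.
def always_wrong(prompt: str) -> str:
    return "return 41"


def uses_feedback(prompt: str) -> str:
    return "return 42" if "Test feedback" in prompt else "return 41"


def run_tests(code: str) -> Tuple[bool, str]:
    return code == "return 42", "expected 42"


result = solve_with_escalation("return the answer", always_wrong, uses_feedback, run_tests)
print(result)
```

Returning `None` once both tiers are exhausted gives the loop an explicit stopping point instead of letting it keep retrying indefinitely.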

  • Yeah, this is what I was wondering as well. I have o1, not o1 pro, but I am gathering from Reddit/YouTube that o1 pro, if used correctly, is superior for coding tasks.

The assumption about low-code tooling was that AI is so good at writing actual code that it will make low-code tools redundant. After spending time with Windsurf, Cursor, and a bunch of VSCode extensions, I found that while it was impressive to see new projects being created autonomously, asking for new requirements or bug fixes after >10 iterations was more complex.

I had to audit the code and give specific directions on how to restructure the code to avoid getting stuck when the project gets more complex. That makes me think autonomous agents will do much better on low-code tools, as their restrictions ensure the agent is on track. The problem with low-code tools is that they also get more complicated to scale after maybe >200 iterations (for a medium-sized project, about 6 months on average).

The whole idea of Devin is pointless and doomed to fail, in my humble opinion. Big tech will be quite capable of delivering AI agents / assistants very soon. I don't think wrappers over other people's LLMs like Devin make a lot of sense. Can someone help me understand what's the value proposition / moat of this company?

  • I'm confused here, aren't agents/assistants basically wrappers over LLMs or tools that interact with them as well? Devin seems to be in this category.

    • I recommend you look at tools like Aider or Codebuff... sure, they need to call some LLM at some point (could be your own, could be external), but the key thing is that they are doing complex modifications of source code using things like tree-sitter, i.e. you don't rely directly on the LLM modifying code, but on the LLM using trees to modify the code. See Aider's source code: https://github.com/Aider-AI/aider/tree/main/aider/queries

      Simple copy-pasting of "here's my prompt, give me code" was never going to be perfect every time, and DEFINITELY won't work for an agent. We need to start thinking about how to use these LLMs in smarter ways (like the above-mentioned tools).

      2 replies →

"Even more telling was that we couldn’t discern any pattern to predict which tasks would work."

I think this cuts to the core of the problem for having a human in the loop. If we cannot learn how to best use the tool from repeated use and discern some kind of patterns of best and worst practices then it isn't really a tool.

Sounds exactly like my experience with the “agents” about a year ago. AutoGPT or whatever it was called. Works great 1% of the time, and the rest of the time it gets stuck in the wrong places, completely unable to back out.

I’m now using o1 or Claude Sonnet 3.5 and usually one of them gets it right.

  • The current frontier models are all neocortex. They have no midbrain or crocodile brain to reconcile any physical, legal or moral feedback. The current state of the art is to preprocess all LLM responses with a physical/legal/moral classifier and respond with a generic "I'm sorry Dave, I'm afraid I can't do that."

    We are fooled into thinking these golems have a shred of humanity, but their method of processing information is completely backward. Humans begin with a fight/flight classifier, then a social consensus regression, and only after this do we start generating tokens ... and we do this every moment of every day of our lives, uncountably often, the only prerequisite being the calories in an occasional slice of bread and butter.

The thing with AI agents I tend to find is they reveal how much heavy lifting the dev is actually doing.

A personal example: my best use out of AI so far has been cases where documentation was poor to nonexistent, and Claude was able to give me a solution. But the thing is, it wasn't a working solution, nowhere close, but it was enough for me to extrapolate and do my own research based on the structure, classes and functions it used. Basically, it gave me somewhere to start from. Whether that's worth the social, economic and environmental problems is another story.

I'm working on an AI assistant in a Python notebook. It aims to help with data science tasks. I'm not using it to do a full analysis; it would fail. What I ask for is a code snippet for the next step in my analysis. Many times I need to change the code manually, but that is fine because the LLM speeds up my coding a lot. And it is really fantastic at writing matplotlib code for visualization. I don't remember all the matplotlib syntax for changing axis labels, adding annotations or changing the style, and the LLM handles it really well, at impressive speed.

I remain sceptical about the "Planet Tracker" task. The task was to debunk claims about historical positions of Jupiter and Saturn. If the task was to find that those planets were NOT in a certain (claimed) position, an erroneous program would still appear to "debunk" the claims. Did they check whether the positions calculated by Devin's code were actually correct? Did they check against some NASA database? If Devin gave arbitrary positions for the planets, they would most likely differ from any claim and so appear to debunk it.
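
The concern above can be made concrete with a small sketch: a "debunked" verdict only means something if the calculator has first been checked against trusted reference values. All names are hypothetical and the numbers are placeholders invented for illustration, not real ephemeris data.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def calculator_is_trustworthy(computed: dict, reference: dict, tolerance_deg: float = 1.0) -> bool:
    """True only if every computed value matches its trusted reference value."""
    return all(
        angular_difference(computed[date], reference[date]) <= tolerance_deg
        for date in reference
    )


def verdict(claimed_deg: float, computed_deg: float, trusted: bool, tolerance_deg: float = 1.0) -> str:
    if not trusted:
        # An arbitrary result disagrees with almost any claim, so it proves nothing.
        return "inconclusive"
    if angular_difference(claimed_deg, computed_deg) <= tolerance_deg:
        return "claim consistent"
    return "claim debunked"


# Placeholder longitudes in degrees (made up for this sketch).
reference = {"date_a": 120.0, "date_b": 305.5}   # would come from a trusted source
good_run = {"date_a": 120.3, "date_b": 305.1}    # calculator agrees with the reference
bad_run = {"date_a": 12.0, "date_b": 200.0}      # calculator is simply wrong

print(verdict(45.0, 210.0, calculator_is_trustworthy(good_run, reference)))
print(verdict(45.0, 210.0, calculator_is_trustworthy(bad_run, reference)))
```

Both runs disagree with the claimed position, but only the first one is evidence against the claim.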

  • I was able to read the code it wrote, and check that (as hoped) it was using a good existing library to do the heavy lifting. And I had it make plots that I could visually use to check that the values were 'reasonable'. The value in that case was simply that I didn't have to leave the couch and write the code myself (although if the result was actually needed for anything more important than a smug 'I thought so' confirmation I would still have taken over and validated it more carefully).

At some point people are going to realize that using these LLM AIs is a communications problem. By that I mean the reason various attempts to use them fail is that they are not being effectively told what to do: vague and implied requests are not enough for an inhuman statistical construct to grasp what you're asking without more detail and more specific instructions.

I also wrote my first impressions on Devin, more focused on the user experience and analysis of its capabilities (with lots of screenshots):

https://thegroundtruth.substack.com/p/devin-first-impression...

  • Your take seems much more positive than theirs. What do you think the key differences are between your experience and the one here?

    • One possible reason is that I'm using popular tech stacks (Next.js, HTML/JS for demo website and SDK). No niche frameworks or tools like nbdev (I've never heard of that).

      Also I've been prompting ChatGPT and Claude for over a year, that might help with communicating with Devin.

> Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible. (...)

> Devin spent over a day attempting various approaches and hallucinating features that didn’t exist.

One of the big problems of GenAI is its inability to know what it doesn't know.

Because of that, they don't ask clarifying questions.

Humans, in the same situation, would spend a lot of time learning before they could be truly productive.

  • Your statement is factually wrong, Claude 3.5v2 asks clarifying questions when needed "natively", and you can add similar instructions in your prompt for any model.

    • The default system prompts are tuned for the naive case. LLMs, being all-purpose text-handling tools, can be reprogrammed for any behavior you wish. This is the crux of skilled use of LLMs.

      The better the LLMs get, the worse the average prompt quality.

      1 reply →

I've been experimenting with code gen on and off for the last 18 months, and find this exactly in line with my experience.

Do you have good references about using AI coding assistants?

Prompt engineering techniques help a lot, but I really think a body of knowledge will emerge about how to use these tools, which contexts suit them, and which heuristics work. They are a valuable tool, but I feel it's possible to extract more value.

Most of the problems you mentioned will likely be solved with the next iterations of Devin or a similar product.

I can say that because I work daily with Claude as an agent over MCP, and the problems you mentioned feel very familiar.

Based on the type of issues you mentioned, Devin isn't likely using o1 yet. A workflow like o1 for planning, Claude for coding, o1 for review, etc., would work better.

The problems you mentioned: ssh-key issue unrelated to script, code not following existing patterns or themes, instructions not being followed, extra abstractions, etc., fall into that category.

Some of the issues are likely due to context-length problems. For example, LLMs don't work well with Jupyter notebooks because of the extra junk in .ipynb files, which will likely remain a problem.
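
One common way to reduce that junk is to keep only the cell sources before passing a notebook to a model. The sketch below assumes the nbformat 4 JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`); the function name and the sample notebook are made up for illustration, and whether this would have helped Devin is untested.

```python
import json


def notebook_to_prompt_text(ipynb_json: str) -> str:
    """Keep only cell sources; drop outputs, execution counts and metadata."""
    notebook = json.loads(ipynb_json)
    parts = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            parts.append(f"# --- {cell.get('cell_type', 'unknown')} cell ---\n{source}")
    return "\n\n".join(parts)


# A tiny synthetic notebook, built inline so the sketch is self-contained.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data\n"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["import math\n", "print(math.pi)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["3.141592653589793\n"]}
            ],
        },
    ],
})

print(notebook_to_prompt_text(example))
```

Embedded images and long outputs usually account for most of a notebook's size, so dropping them tends to shrink the prompt considerably.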

No matter what happens with Devin specifically, I think this is a really important topic and I enjoy reading updates on this kind of review every time.

Please keep them coming.

An engineer that thinks it knows everything (but doesn't) and can't self-correct is about the worst combo I can think of.

  • Well, having read too much sci-fi, I am more afraid of an AI engineer that really does know everything.

I saw a few people around me testing it, and it is quite disappointing. Sometimes a task might take forever and deliver a bad result or fail completely.

It seems it is targeting a few specific problems and whatever else is just too hard. I also think that, though it is expensive, it is cheap for the technology behind it, and it won't be able to keep that price for long.

Honestly, I have been bitten so many times by LLM hallucinations when working in parallel with the LLM that I wouldn't trust it to run anything autonomously at all. If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean.

  • > If you have tried to use imaginary APIs, imaginary configuration and imaginary cli arguments, you know what I mean

    I see this comment a lot but I can't help but feel it's 4 weeks out of date. The version of o1 released on 2024-12-17 so rarely hallucinates when asked code questions of basic to medium difficulty and provided with good context and a well written prompt, in my experience. If the context window is sub-10k tokens, I have very high confidence that the output will be correct. GPT-4o and o1-mini, on the other hand, hallucinate a lot and I have learned to put low trust in the output.

  • I have been feeling LLM burnout and favoring coding it all myself after a year of LLM assistance. When it gets things wrong it is too annoying. Like, I would get mad and start to curse it, shouting out loud and in the chat.

    • Exactly this. At first I started verbally abusing it until it conformed, but I quickly realised that once the context gets very long it simply discards earlier instructions and abuse alike. So I get frustrated and toxic AND don't get my job done.

Now is the time for us to hold seemingly contradictory propositions: A child born today will live to see 99% of all computer code written by artificial intelligence, but the current AI boom is massively overcapitalized.

  • I'd argue that software is being written (either by humans or AI) in an order in which each new piece adds progressively less marginal value (if we define value in the capitalistic sense).

    Most of the value that software will ever create has already been created.

    The only truly valuable missing things are those whose value is not easy to convey to capitalists, or that need some visionary work.

  • That's already the case if you call compilers/interpreters "AI". Just a new higher level abstraction for code.