The Deep Research problem

1 year ago (ben-evans.com)

I did a trial run with Deep Research this weekend to do a comparative analysis of the comp packages for Village Managers in suburbs around Chicagoland (it's election season, and our VM's comp had become an issue).

I have a decent idea of where to look to find comp information for a given municipality. But there are a lot of Chicagoland suburbs and tracking documents down for all of them would have been a chore.

Deep Research was valuable. But it only did about 60% of the work (which, of course, it presented as if it was 100%). It found interesting sources I was unaware of, and it assembled lots of easy-to-get public data that would have been annoying for me to collect, which made spot-checking easier (for instance, basic stuff like the name of every suburban Village Manager). But I still had to spot check everything myself.

The premise of this post seems to be that material errors in Deep Research results negate the value of the product. I can't speak to how OpenAI is selling this; if the claim is "subscribe to Deep Research and it will generate reliable research reports for you", well, obviously, no. But as with most AI things, if you get past the hype, it's plain to see the value it's actually generating.

  • >>The premise of this post seems to be that material errors in Deep Research results negate the value of the product

    No it’s not. It’s that it’s oversold from a marketing perspective and comes with some big caveats.

    But it does talk about big time savings for the right contexts.

    Emphasis from the article:

    “these things are useful”

  • I'm just realizing this might finally be something that helps me get past the analysis paralysis I have before committing to so many decisions online. I always feel like without doing my research, I'll get scammed. Maybe this will help give me a bit more confidence.

    • On the flipside, you might end up getting scammed even worse because of incorrect analysis. For example if ChatGPT hallucinates some data/features through faulty research then you might be surprised when you actually make the decision.

      4 replies →

    • I have found it to be exactly this in a lot of cases. It helps answer or synthesize the data that answers questions I had that are good to know but not critical for me to understand.

  • It's what one imagines the first cars were like - if you were mechanically inclined, awesome. If not, screwed. If you know LLMs and how a basic RAG pipeline works, deep research is wonderful. If not, screwed.

    • I can't help but feel there's a difference between a car that runs 90% of the time but breaks down 10% of the time, and one that turns the direction you tell it 90% of the time but the opposite direction the other 10%.

      2 replies →

Similar to what Derek Lowe found with a pharma example: https://www.science.org/content/blog-post/evaluation-deep-re...

> As with all LLM output, all of these things are presented in the same fluid, confident-sounding style: you have to know the material already to realize when your foot has gone through what was earlier solid flooring.

  • Of which, people will surely die (when it is used to publish medical research by those wishing not to lose at publish-or-perish)

    • I think the "delve" curve shows we're already well into the "AI papers" stage of civilization. It has probably been tamped down now; the last thing I heard using it like a tell was NotebookLM.

      Deep dive.

      "Yes, that's exactly right"

I think of it as the return to 10 blue links. It searches the web, finds stuff and summarizes it so I can decide which links to click. I ignore the narrative it constructs because it’s probably wrong. Which I forgive because it’s the hazmat suit for the internet I’ve always dreamed of.

It gets in the trenches, braves the cookie popups, email signups, inline ads and overly clever web design so I don’t have to. That’s enough to forgive its attempts to create research narratives. But, I hope we can figure out a way to train a heaping spoonful of humility into future models.

  • > Which I forgive because it’s the hazmat suit for the internet I’ve always dreamed of.

    You dreamed of this? Why not dream of a web where you don’t have to brave a veritable ocean of crap to get what you want? It may surprise you to learn such a web existed in the not too distant past.

  • Agreed. Maybe we're moving toward a world where LLMs do all the searching, and "websites" just turn into data-only endpoints made for AI to consume. That'll have other big implications... Interesting times ahead.

    • Interesting times, for sure.

      > and "websites" just turn into data-only endpoints made for AI to consume.

      As is already the case with humans, that only serves users to the extent that the websites' veracity is within the intelligence's ability to verify — all the problems we've had with blogspam etc. have been due to the subset of SEO where people abuse* the mechanisms to promote what they happen to be selling at the expense of genuine content.

      AI generated content is very good at seeming human, at seeming helpful. A "review website" which is all that, but with fake reviews that promote one brand over the others… a chain of such websites that link to each other to boost PageRank scores… which are then cross-linked with a huge number of social media bots…

      Will lead to a lot of people who think they're making an informed choice, but who were lied to about everything from their cornflakes to their president.

      * Tautologically, when it's not "abuse", SEO is only helping the search engine find the real content. I've seen places fail to perform any SEO including the legitimate kind.

      2 replies →

    • Interesting idea. AI can't look at ads, so in the long run ads on informational material might die and you're going back to paying outright. I like it.

      5 replies →

  • If you ignore the narrative and only look at the links then you're just describing a search engine with an AI summarization feature. You could just use Kagi and click "summarize" on the interesting results and then you don't have to worry that the sources themselves are hallucinations.

    The summaries are probably still wrong, but you do you; at least this would save you the step of reading bullshit and boiling a pond to generate a couple of links.

When ChatGPT came out, one of the things we learned is that human society generally assumes a strong correlation between intelligence and the ability to string together grammatically correct sentences. As a result, many people assumed that even GPT-3.5 was wildly more "intelligent" than it actually was.

I think Deep Research (and tools like it) offer an even stronger illustration of that same effect. Anything that can produce a well-formatted multi-page report with headings and citations surely must be of PhD-level intelligence, right?

(Clearly not.)

Research skills involve not just combining multiple pieces of data, but also applying very subtle judgment: determining whether a source is trustworthy, cross-checking numbers where their accuracy is important (and deciding when it is "important"), and engaging in some back and forth to work out which data actually applies to the research question being asked. In this sense, "deep research" is a misleading term; the output is really more akin to a probabilistic "search" over the training data, where the result may or may not be accurate and requires you to spot-check every fact. It is probably useful for surfacing new sources or making syntactic conjectures about how two pieces of data may fit together, but checking all of those sources for existence, let alone validity, still needs to be done by a person, and the output, in its polished form today, doesn't compel users to take sufficient responsibility for its factuality.

Deep Research is in its "ChatGPT 2.0" phase. It will improve, dramatically. And to the naysayers: when OpenAI released its first models, many doubted they would ever be good at coding. Now, after two years, look at Cursor, Aider, and all the LLMs powering them, and at what you can do with a few prompts and iterations.

Deep research will dramatically improve as it’s a process that can be replicated and automated.

  • This is like saying: y=e^-x+1 will soon be 0, because look at how fast it went through y=2!

    • Many past technologies have defied “it’s flattening out” predictions. Look at personal computing, the internet, and smartphone technology.

      By conflating technology’s evolving development path with a basic exponential decay function, the analogy overlooks the crucial differences in how innovation actually happens.

      2 replies →

    • Tony Tromba (my math advisor at UCSC) used to tell a low key infuriating, sexist and inappropriate story about a physicist, a mathematician, and a naked woman. It ended with the mathematician giving up in despair and a happy physicist yelling "close enough."

      5 replies →

  • Disagree - I actually think all the problems the author lays out about Deep Research apply just as well to GPT-4o / o3-mini-whatever. These things are just absolutely terrible at precision and recall of information.

    • I think Deep Research shows that these things can be very good at precision and recall of information if you give them access to the right tools... but that's not enough, because of source quality. A model that has great precision and recall but uses flawed reports from Statista and Statcounter is still going to give you bad information.

      6 replies →

  • Unfortunately that's not how trust works. If someone comes into your life and steals $1,000, and then the next time they steal $500, you don't trust them more, do you?

    Code is one thing, but if I have to spend hours checking the output, then I'd be better off doing it myself in the first place, perhaps with the help of some tooling created by AI, and then feeding that into ChatGPT to assemble into a report. By showing off a report about smartphones that is total crap, I can't remotely trust the output of deep research.

  • > Now, after two years, look at Cursor, Aider, and all the LLMs powering them, and at what you can do with a few prompts and iterations.

    I don't share this enthusiasm. Things are better now because of better integrations and better UX, but the LLM improvements themselves have been incremental lately, with most of the gains coming from layers around them (e.g. you can easily improve code generation if you add an LSP in the loop / ensure the code actually compiles instead of trusting the output of the LLM blindly; a minimal sketch of that kind of check-and-retry loop follows this thread).

  • I agree, they are only starting the data flywheel there. And at the same time making users pay $200/month for it, while the competition is only charging $20/month.

    And note, the system is now directly competing with "interns". Once the accuracy is competitive (is it already?) with an average "intern", there'd be fewer reasons to hire paid "interns" (more expensive than $200/month). Which is maybe a good thing? Fewer kids wasting their time/eyes looking at the computer screens?
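
To make the compile-check idea from the reply above concrete, here is a minimal sketch; `generate_code` is a hypothetical stand-in for whatever LLM call is in use, and the loop simply rejects output that does not parse and feeds the error back for another attempt.

```python
def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that returns Python source."""
    raise NotImplementedError

def generate_until_it_compiles(prompt: str, max_attempts: int = 3) -> str:
    """Ask the model for code, syntax-check it, and feed any error back
    into the next attempt instead of trusting the output blindly."""
    feedback = ""
    for _ in range(max_attempts):
        source = generate_code(prompt + feedback)
        try:
            compile(source, "<generated>", "exec")  # syntax check only; nothing is executed
            return source  # still needs tests/review, but at least it parses
        except SyntaxError as err:
            feedback = f"\n\nThe previous attempt had a syntax error ({err}); please fix it."
    raise RuntimeError("model never produced code that compiles")
```

The same shape works with a linter, LSP diagnostics, or a test suite in place of the syntax check; cheap mechanical verification catches a lot before a human has to look.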

> Are you telling me that today’s model gets this table 85% right and the next version will get it 85.5 or 91% correct? That doesn’t help me. If there are mistakes in the table, it doesn’t matter how many there are - I can’t trust it. If, on the other hand, you think that these models will go to being 100% right, that would change everything, but that would also be a binary change in the nature of these systems, not a percentage change, and we don’t know if that’s even possible.

Of course, humans also make mistakes. There is a percentage, usually depending on the task but always below 100%, where the work is good enough to use, because that's how human labor works.

  • If I'm paying a human, even a student working part-time or something, I expect "concrete facts extracted from the literature" to be correct at least 99.99% of the time.

    There is a huge gap from 85% to 99.99%.

    • You can expect that from a human, but if you don't know their reputation, you'd be lucky with the 85 percent. How do you even know if they understood the question correctly, used trusted sources, correct statistical models etc?

    • This does not at all resonate with my experience with human researchers, even highly paid ones. You still have to do a lot of work to verify their claims.

  • humans often don't trust (certain) other humans either.

    but if replacing that with a random number(/token) generator feels more reliable to someone, then more power to them.

    there is value to be had in the output of this tool. but personally i would not trust it without going through the sources and verifying the result.

  • A human WILL NOT make up non-existent facts, URLs, libraries and all other things. Unless they deliberately want to deceive.

    They can make mistakes in understanding something and will be able to explain those mistakes in most cases.

    LLM and human mistakes ARE NOT the same.

    • > A human WILL NOT make up non-existent facts

      Categorically not true, and there are so many examples of this in everyday practice that I can’t help but feel you’re saying this to disprove your own statement.

      5 replies →

I urge anyone to do the following: take a subject you know really really well and then feed it into one of the deep research tools and check the results.

You might be amazed, but most probably you'll be very shocked.

  • In my experience, Perplexity and OpenAI's deep research tools are so misleading that they are almost worthless in any area worth researching. This becomes evident if one searches for something they know or tries to verify the facts the models produce. In my area of expertise, video game software engineering, about 80% of the insights are factually wrong cocktail-party-level thoughts.

    The "deep research" features were much more effective at getting me to pay for both subscriptions than in any valuable data collection. The former, I suspect, was the goal anyway.

    It is very concerning that people will use these tools. They will be harmed as a result.

    • > “They will be harmed as a result.”

      Compared to what exactly? The ad-fueled, SEO-optimized nightmare that is modern web search? Or perhaps the rampant propaganda and blatant falsehoods on social media?

      Whoever is blindly trusting what ChatGPT is spitting out is also falling for whatever garbage they’re finding online. ChatGPT is not very smart, but at least it isn’t intentionally deceptive.

      I think it’s an incredible improvement for the low information user over any current alternatives.

      6 replies →

  • Yup, none of these tools are actually anywhere close to AGI or "research". They are still a much better search engine and, of course, spam generator.

  • I tried to get it to research the design of models for account potential in B2B sales. It went to the shittiest blogspam sites and returned something utterly unimpressive. Instacancelled the $200 sub. Will try it a few more times this month but my expectations are very low.

  • In my case, very "not useful". Background: I write a Substack with "deep research" articles on autonomous agent tech, and I explored several of these tools to understand the risks to my workflow, but none of them can replace me as of now.

Everyone who has been working on RAG is aware of how important controlling your sources is. Simply directing your agent to fetch keyword-matching documents will lead to inaccurate claims.

The reality is that, for now, it is not possible to leave the human out of research. The best an LLM can do is help curate sources and synthesize them; it cannot reliably write sound conclusions.

Edit: this is something elicit.com recognized quite early. But even when I was using it, I was wishing I had more control over the space over which the tool was conducting search.
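
As a minimal sketch of what that curation can look like in code (the `search` function here is a hypothetical stand-in for whatever retrieval backend is in use): keep a human-maintained allow-list and filter what reaches the model, rather than letting keyword matching decide.

```python
from urllib.parse import urlparse

# Hypothetical, human-curated allow-list; this is where the "control over
# the search space" lives.
TRUSTED_DOMAINS = {"sec.gov", "census.gov", "pubmed.ncbi.nlm.nih.gov"}

def search(query: str) -> list[dict]:
    """Hypothetical stand-in for a web/RAG retrieval call.
    Each result is assumed to look like {"url": ..., "text": ...}."""
    raise NotImplementedError

def retrieve_curated(query: str, max_docs: int = 10) -> list[dict]:
    """Drop any result whose domain is not on the allow-list, so the
    synthesis step never sees keyword-matched junk from unknown sites."""
    curated = []
    for result in search(query):
        host = urlparse(result["url"]).hostname or ""
        if host.removeprefix("www.") in TRUSTED_DOMAINS:
            curated.append(result)
    return curated[:max_docs]
```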

Not to take away from the main point of the article, which is true but:

It seems to be at intern level according to the author - not bad if you ask me, no?

Did he try to proceed as he would with an intern? I.e., was it a dialogue? Did he try dropping this article into the prompt and seeing what comes out?

For skeptics my best advice is – do your usual work and at the end drop your whole work in with a prompt to find issues etc. – it will be a net positive result, I promise.

And yes, they do get better, and that shouldn't be dismissed – the most fascinating part is precisely that – not even their current state, but how fast they keep improving.

One part which always bothers me a bit with this type of argument – why on earth are we assuming that a human does it 100% correctly? Aren't humans also making similar mistakes?

IMHO there is some similarity with young geniuses – they get tons of stuff right and it's impressive; however, total, unexpected failures occur which feel weird – in my opinion it's a matter of focused training, similar to how you'd teach a genius.

It's worth taking a step back and recognizing in how many diverse contexts we're using (like now, today, not in 5 years) models like Grok 3 or Claude 3.7 – the goalposts seem to have moved to "beyond any human expert on all subjects".

I always wondered: if deep research has an X% chance of producing errors in its report, and you have to double-check everything, visit every source, and potentially correct it yourself, does it really save time in getting research done (outside of coding and marketing)?

  • It might depend on how much you struggle with writers block. An LLM essay with sources is probably a better starting point than a blank page. But it will vary between people.

This article covers something early on that makes the question of “will models get to zero mistakes” pretty easy to answer: No.

Even if they do the math right and find the data you ask for and never make any “facts” up, the sources of the data themselves carry a lot of context and connotation about how the data is gathered and what weight you can put on it.

If anything, as LLMs become a more common way of ingesting the Internet, the sources of data themselves will start being SEOed to get chosen more often by the LLM purveyors. Add in paid sponsorship, and if anything, trust in the data from these sorts of Deep Research models will only get worse over time.

"Deep research" is super impressive, but so far is more "search the web and surf pages autonomously to aggregate relevant data"

It is in many ways a workaround to Google's SEO poisoning.

Doing very deep research requires a lot of context, cross-checking data, resourcefulness in sourcing and taste. Much of that context is industry specific and intuition plays a role in selecting different avenues to pursue and prioritisation. The error rates will go down but for the most difficult research it will be one tool among many rather than a replacement for the stack.

  • > It is in many ways a workaround to Google's SEO poisoning.

    But the article describes exactly how Deep Research fell for the same SEO traps.

Watched recent Viva La Dirt League videos on how trailers lie and make false promises. Now I see LLMs as that marketing guy. Even if he knows everything, he can't help lying. You can't trust anything he says no matter how authoritative he sounds; even if he is telling the truth, you have no way of knowing.

These deep research things are a waste of time if you can't trust the output. Code you can run and verify. How do you verify this?

These days I'm feeling like GenAI is basically at an accuracy rate of 95%, maybe 96%. Great at boilerplate, great at stuff you want an intern to do or maybe to outsource... but it really struggles with the valuable stuff. The errors are almost always in the most inconvenient places and they are hard to see... So I agree with Ben Evans on this one: what is one to do? The further you lean on it, the worse your skills and specializations get. It is invaluable for some kinds of work, greatly speeding you up, but then some of the things you would have caught take you down rabbit holes that waste so much time. The tradeoffs here aren't great.

  • I think it's not the valuable stuff, though. The valuable stuff is all the boilerplate, because I don't want to do it. The rest I actually have a stake in - not only that it's done, but how it's done. And I'd rather be hands-on doing that and thinking about it as I do it. Having an AI do that isn't that valuable, and in fact robs me of the learning I acquire by doing it myself.

I'll share my recipe for using these products on the off chance it helps someone.

1. Only do searches that result in easily verifiable results from non-AI sources.

2. Always perform the search in multiple products (Gemini 1.5 Deep Research, Gemini 2.0 Pro, ChatGPT o3-mini-high, Claude 3.7 w/ extended thinking, Perplexity)

With these two rules I have found the current round of LLMs useful for "researchy" queries. Collecting the results across tools and then throwing out the 65-75% slop results in genuinely useful information that would have taken me much longer to find.

Now the above could be seen as a harsh critique of these tools, as in the kiddie pool is great as long as you're wearing full hazmat gear, but I still derive regular and increasing value from them.
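
A rough sketch of the same cross-checking habit, driven through APIs rather than the web products listed above (this assumes the official OpenAI and Anthropic Python SDKs; the model names and the question are illustrative, not a recommendation): ask each model the same question, then compare the answers side by side before trusting any single one.

```python
from openai import OpenAI
from anthropic import Anthropic

QUESTION = "List the three largest Chicagoland suburbs by population, with sources."

def ask_openai(question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def ask_anthropic(question: str) -> str:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-7-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    answers = {"OpenAI": ask_openai(QUESTION), "Anthropic": ask_anthropic(QUESTION)}
    # Where the answers disagree is exactly what to verify by hand (rule 1 above).
    for name, answer in answers.items():
        print(f"--- {name} ---\n{answer}\n")
```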

  • Good advice.

    My current research workflow is:

    * Add sources to NotebookLM

    * Create a report outline with NotebookLM

    * Get Perplexity and/or Chatgpt to give feedback on report outline, amend as required.

    * Get NotebookLM and Perplexity to each write their own versions of the report one section at a time.

    * Get Perplexity to critique each version and merge the best bits from each.

    * Get Chatgpt to periodically provide feedback on the growing document.

    * All the while acting myself as the chief critic and editor.

    This is not a very efficient workflow but I'm getting good results. The trick to use different LLMs together works well. I find Perplexity to be the best at writing engaging text with nice formatting, although I haven't tried Claude yet.

    By choosing the NotebookLM sources carefully you start off with a good focus, it kind of anchors the project.

    • I should also mention that this more 'hands on' technique is good for learning a subject because you have to make editorial assessments as you go.

      Maybe good for wider subject areas, longer reports, or where some editorial nuance helps.

  • > ... perform the search in multiple products

    I do that a lot, too, not only for research but for other tasks as well: brainstorming, translation, editing, coding, writing, summarizing, discussion, voice chat, etc.

    I pay for the basic monthly tiers from OpenAI, Anthropic, Google, Perplexity, Mistral, and Hugging Face, and I occasionally pay per-token for API calls as well.

    It seems excessive, I know, but that's the only way I can keep up with what the latest AI is and is not capable of and how I can or cannot use the tools for various purposes.

  • This makes sense. How many of those products do you have to pay for?

    • I'm not OP but I do similar stuff. I pay for Claude's basic tier, OpenAI's $200 tier, and Gemini ultra-super-advanced I get for free because I work there.

      I combine all the 'slop' from the three of them into Gemini (1 or 2 M context window) and have it distill the valuable stuff in there into a good final-enough product.

      Doing so has got me a lot of kudos and applause from those I work with.

      7 replies →

Indeed, the main drawback of the various Deep Research implementations is that the quality of sources is determined by SEO, which is often sketchy. Often the user has insight into what the right sources are, and they may even be offline on your computer.

We built an alternative to do Deep Research (https://radpod.ai) on data you provide, instead of relying on Web results. We found this works a lot better in terms of quality of answers as the user can control the source quality.

Deep Research, as it currently stands, is a jack of all trades but a master of none. Could this problem be mitigated by building domain-specific profiles curated by experts? For example, specifying which sources to prioritize, what to avoid, and so on. You would select a profile, and Deep Research would operate within its specific constraints, supplemented by expert knowledge.
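
One way to picture such a profile, purely as a hypothetical sketch: a small, expert-maintained config object that the research agent consults before searching and writing (the excluded domain echoes the Statista complaint elsewhere in this thread).

```python
from dataclasses import dataclass, field

@dataclass
class ResearchProfile:
    """Hypothetical expert-curated constraints for a deep-research run."""
    name: str
    prefer_domains: list[str] = field(default_factory=list)   # search these first
    exclude_domains: list[str] = field(default_factory=list)  # never cite these
    guidance: str = ""  # prepended to the system prompt

PHARMA_PROFILE = ResearchProfile(
    name="pharma-literature",
    prefer_domains=["pubmed.ncbi.nlm.nih.gov", "clinicaltrials.gov"],
    exclude_domains=["statista.com"],
    guidance="Prefer primary literature; flag any figure you cannot trace to a cited paper.",
)
```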

The problem with tools like deep research is that they imply good reasoning skills on the part of the underlying technology. Artificial reasoning clearly exists, but it is not refined enough to build this kind of technology on top of. Reasoning is the foundation of such a system, and everything on top of it gets very unstable.

"I can say that these systems are amazing, but get things wrong all the time in ways that matter, and so the best uses cases so far are things where the error rate doesn’t matter or where it’s easy to see."

That's probably how we should all be using LLMs.

This is such embarrassing marketing from an organization (OpenAI) which presents itself to the world as a "research" entity.. They could have at least photoshopped the screenshot to make it look like it emitted correct information.

> they don’t really have products either, just text boxes - and APIs for other people to build products.

Isn't this a very valuable product in itself? Whatever happened to the phrase "When there is a gold rush, sell shovels"?

Two factors to consider: human performance and cost.

Plenty of humans regularly make similar mistakes to the one in the Deep Research marketing, with more overhead than an LLM.

The thing is, if you look at all the "Deep Research" benchmark scores, they never claim to be perfect. The problem was plain to see.

Yes, the confidence, tbh, is getting a bit out of hand. I see the same thing with coding on our SaaS: once the problems get bigger, I find myself more often than not starting to code the old way, even over "fixing the AI's code", because the issues are often too much.

I think more communication of certainty could help, especially when they talk about docs or third-party packages etc. Regularly even Sonnet 3.7 just invents stuff...

I, for one, have it in my prompt that GPT should end every message with a note about how sure it is of the answer, and a rating of "Extremely sure", "Moderately sure", etc.

It works surprisingly well. It also provides its reasoning about the quality of the sources, etc. (This is using GPT-4o of course, as it's the only mature GPT with web access)

I highly recommend adding this to your default prompt.
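
For anyone who wants to try this, here is a minimal sketch of that kind of default prompt, using the OpenAI Python SDK; the exact wording, the model name, and the question are just examples.

```python
from openai import OpenAI

# Example wording for the "confidence footer" instruction; tune to taste.
CONFIDENCE_FOOTER = (
    "End every answer with a line of the form "
    "'Confidence: Extremely sure / Moderately sure / Unsure', followed by one "
    "sentence on why, including how much rests on memory versus cited sources."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": CONFIDENCE_FOOTER},
        {"role": "user", "content": "How are suburban village managers' salaries typically set?"},
    ],
)
print(resp.choices[0].message.content)
```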

  • > It works surprisingly well

    What do you mean by this exactly? That it makes you feel better about what it's said, or that its assessment of its answer is actually accurate?

    • That its assessment gives me a good picture of which ways to push with my next question, and which things to ask it to use its web search tool to find more information on.

      It's a conversation with an AI; it's good to know its thought process and how certain it is of its conclusions, as it isn't infallible, nor is any human.

One other existential question is Simpson's paradox, which I believe is exploited by politicians to support different policies from the same underlying data. I see this as a problem for government, especially if we have liberal- or conservative-trained LLMs. We expect the computer to give us the correct answer, but when the underlying model is trained one way by RLHF or by systemic/weighted bias in its source documents -- imagine training a libertarian AI on Cato papers -- you could have highly confident pseudo-intellectual junk. Economists already deal with this problem daily, since their field was heavily politicized. Law is another one.
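
For readers who haven't run into it, the classic kidney-stone data (Charig et al., 1986) shows the mechanism in a few lines of Python: the same numbers support opposite conclusions depending on whether you aggregate, which is exactly the lever described above.

```python
# Treatment A wins inside every subgroup, yet B looks better when pooled.
data = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

for group, arms in data.items():
    for arm, (s, n) in arms.items():
        print(f"{group:12s} {arm}: {s}/{n} = {s / n:.0%}")  # A beats B in both groups

for arm in ("A", "B"):
    s = sum(data[g][arm][0] for g in data)
    n = sum(data[g][arm][1] for g in data)
    print(f"overall      {arm}: {s}/{n} = {s / n:.0%}")     # but B beats A overall
```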

  • I've never thought of Simpson's Paradox as a political problem before, thanks for sharing this!

    Arguably this applies just as well to Bayesian vs Frequentist statisticians or Molecular vs Biochemical Biologists.

I am currently in India in a big city doing yoga for the first time as a westerner.

I don't Google anything. Google Maps, yeah. Google, no.

Everything I want to know is much better answered by ChatGpt Deep Research.

Ask a Question, Drink a Chai, Get a Great, Prioritised, Structured Answer without spam or sifting through ad-ridden SEO pages.

It is a game changer, and at some point they will get rid of the "drink a chai" wait and it will kill the Google we know now.

I used deep research with o1-pro to try to fact/sanity check a current events thing a friend was talking about, read the results and followed the links it provided to get further info, and ended up on the verge of going down a rabbit hole that now looks more like a leftist(!) conspiracy theory.

  • I didn't want to bring in specifics because I didn't feel like debating the specific thing, so I guess that made this post pretty hard to parse and I should have added more info.

    I was trying to convey that it had found some sources that, if I came across them naturally, I probably would have immediately recognized as fringe. The sources were threading together a number of true facts into a fringe narrative. The AI was able to get other sources on the true facts, but has no common sense, and I think ended up producing a MORE convincing presentation of the fringe theory than the source of the narrative. It sounded confident and used a number of extra sources to check facts even though the fringe narrative that threaded them all together was only from one site that you'd be somewhat apt to dismiss just by domain name if it was the only source you found.