Comment by imiric

7 months ago

I applaud the effort. We need human-friendly CAPTCHAs, as much as they're generally disliked. They're the only solution to the growing spam and abuse problem on the web.

Proof-of-work CAPTCHAs work well for making bots expensive to run at scale, but they still rely on accurate bot detection. Avoiding both false positives and negatives is crucial, yet no existing approach is reliable enough.

One comment re:

> While AI agents can theoretically simulate these patterns, the effort likely outweighs other alternatives.

For now. Behavioral and cognitive signals seem to work against the current generation of bots, but they will likely also be defeated as AI tools become cheaper and more accessible. It's only a matter of time until attackers can train a model on real human input and run inference cheaply enough, or until the benefit of using a bot against a specific target simply outweighs the costs.

So I think we will need a different detection mechanism. Maybe something from the real world, some type of ID, or even micropayments. I'm not sure, but it's clear that bot detection is on the opposite, and currently losing, side of the AI race.

> So I think we will need a different detection mechanism. Maybe something from the real world, some type of ID, or even micropayments. I'm not sure, but it's clear that bot detection is on the opposite, and currently losing, side of the AI race.

I think the most likely long-term solution is something like DIDs.

https://en.wikipedia.org/wiki/Decentralized_identifier

A small number of trusted authorities (e.g. governments) issue IDs. Users can identify themselves to third-parties without disclosing their real-world identity to the third-party and without disclosing their interaction with the third-party to the issuing body.

The key part of this is that the identity is persistent. A website might not know who you are, but they know when it’s you returning. So if you get banned, you can’t just register a new account to evade the ban. You’d need to do the equivalent of getting a new passport from your government.
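
To make the "persistent but pseudonymous" part concrete, here's a toy sketch (not any particular DID method; the names and the derivation are mine) of how a wallet could hand each site a stable per-site pseudonym. Real schemes also let the site verify that the pseudonym is backed by an issued credential, which is the part this leaves out:

    import hashlib
    import hmac

    # Hypothetical wallet-held secret, bound to the government-issued credential.
    # It never leaves the user's device.
    wallet_secret = b"derived-from-the-issued-credential"

    def pairwise_pseudonym(site: str) -> str:
        # Stable per-site identifier: the same site always sees the same value,
        # but two sites can't link their values to each other or to the
        # real-world identity behind them.
        return hmac.new(wallet_secret, site.encode(), hashlib.sha256).hexdigest()

    print(pairwise_pseudonym("example.com"))     # same every visit -> bans stick
    print(pairwise_pseudonym("other-site.org"))  # unrelated value -> no cross-site linking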

  • But this means that a SaaS banning you from your account for spurious reasons can now be a serious problem.

    • You could roll a new ID to replace the previous one; each user would still have only one at a time. If that isn't acceptable, a service could ask to have the feature disabled for clear mission-critical reasons and/or for a fee.

  • https://www.wired.com/story/worldcoin-sam-altman-orb/

  • It also allows automated software to act on behalf of a person, which is excellent for assistive technologies and something most current bot detection leaves behind.

    • I think this will be a positive effect of the rise of AI agents. We’re going to have a much different distribution of automated vs. human traffic, and authentication methods will have to be more robust than they are now.

  • On the one hand, yes, this might work, but I'm concerned that it will inevitably require loss of anonymity and be abused by companies for user tracking. I suppose any type of user identification or fingerprinting is at the expense of user privacy, but I hope we can come up with solutions that don't have these drawbacks.

    • The benefit of majorly reducing fraud could create an ecosystem where the trade-off is worth it for users. For example, generous free plans or trials could exist without companies needing to invest so much in anti-fraud measures to support them.

  • If this gets implemented, the next thing the govt will do is require all websites to store DIDs of visitors for at least 10 years and not accept visitors without them.

    • This makes no sense at all. If a government wanted to pass a law to force logins and track people, they could do that today without using an identifier that is worthless for that purpose.

  • I have not heard about DIDs at all before. How does this really work? They are Government-issued? I am not sure I would trust that though.

> They're the only solution to the growing spam and abuse problem on the web

They're the only solution that doesn't require a pre-existing trust relationship, but the web is more of a dark forest every day and captchas cannot save us from that. Eventually we're going to have to buckle down and maintain a web of trust.

If you notice abuse, you see which common node caused you to trust the abusers, and you revoke trust in that node (and, transitively, everything that it previously caused you to trust).
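
A toy sketch of those mechanics (my own structure, not a real protocol): keep a local record of who introduced whom, and revoking one node wipes out everyone you only trusted because of it:

    # My local view: who vouched for whom. None = I trust them directly.
    introduced_by = {
        "alice": None,
        "bob": "alice",    # alice vouched for bob
        "carol": "bob",    # bob vouched for carol
        "dave": None,
    }

    def revoke(node, graph):
        # Revoke `node` and, transitively, everyone whose trust chain runs through it.
        revoked = {node}
        changed = True
        while changed:
            changed = False
            for person, sponsor in graph.items():
                if sponsor in revoked and person not in revoked:
                    revoked.add(person)
                    changed = True
        return {p: s for p, s in graph.items() if p not in revoked}

    # carol turns out to be a spammer: go one hop back and revoke bob,
    # which also drops carol and anyone else bob brought in.
    print(revoke("bob", introduced_by))   # {'alice': None, 'dave': None}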

  • That might be the way to go. Someone else in the thread mentioned a similar reputation system.

    The problem is that such a system could be easily abused or misused. A bad actor could intentionally or mistakenly penalize users, which would have global consequences for those users. So we need a web of trust for the judges as well, and some way of disputing and correcting the mistake.

    It would be interesting to prototype it, though, and see how it could work at scale.

    • > we need a web of trust for the judges as well

      I don't think there should be any judges (or to put it differently, I think every user should be a judge), nor any centralized database, no roots of trust at all. That way it doesn't present any high value targets for corruption to focus on.

      The trustworthiness of a user in some domain (won't-DOS-your-page could be a trust domain, writes-honest-product-reviews could be a domain, not-a-scammer, etc.) as evaluated by some other individual would have to do with some aggregation of the shortest paths (and their associated trust scores) between those two users on the trust graph.

      There is no trust score for user foo, only a trust score for user foo according to user bar. User baz might see foo differently.

      If you get scammed, you don't revoke trust in the scammer. Well, you do, but you also go one hop back and revoke trust in whoever caused you to trust the scammer. This creates incentives towards trust hygiene. If you don't want people to stop trusting you, then you have to be careful about who you trust. It's a protocol-level proxy for a skill we've been honing for millennia: looking out for each other.

      But it doesn't work if there's just a single company that tracks your FICO score or something like that. Either that company ends up being too juicy of a target and ends up itself becoming corrupt, or people attack the weak association between user and company such that the company can't actually tell the difference between a scammer and a legit user (the latter is the case for the credit score companies, hence: identity fraud).

      Attacks like that are much harder to pull off if the source of truth isn't some remote database somewhere and is instead based on the set of people you see every day in meatspace.

    • Hyphanet (formerly Freenet) uses a similar Web of Trust, if you want to see a real-life example in action. Maybe Freenet still uses a WoT as well, I'm not sure.

Everything on the web is a robot, every client is an agent for someone somewhere, some are just more automated.

Distinguishing them en masse seems like a waste to me. Deal with the actual problems, like resource abuse.

I think part of the issue is that a lot of people are lying to themselves that they "love the public" when in reality they really don't and want nothing to do with them. They lack the introspection to untangle that though and express themselves with different technical solutions.

  • I do think the answer is two-pronged: roll out the red carpet for "good bots", add friction for "bad bots".

    I work for Stytch and for us, that looks like:

    1) make it easy to provide Connected Apps experiences, like OAuth-style consent screens: "Do you want to grant MyAgent access to your Google Drive files?"

    2) make it easy to detect all bots and shift them towards the happy path. For example, "Looks like you're scraping my website for AI training. If you want to see the content easily, just grab it all at /LLMs.txt instead."

    As other comments mention, bot traffic is overwhelmingly malicious. Being able to cheaply distinguish bots and add friction makes your life as a defending team much easier.

    • IMO if it looks like a bot and doesn't follow robots.txt you should just start feeding it noise. Ignoring robots.txt makes you a bad netizen.

1. Create a website with a series of tasks to capture this data.

2. Send link to coworkers via Slack so they can spend five minutes doing the tasks.

3. Capture that data and create thousands of slight variations, saved to a DB as profiles.

4. Bypass bot protections.

There is nothing anyone can do to prevent bots.
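
Step 3 is about as simple as it sounds; a rough sketch of the "slight variations" part (field names invented for illustration):

    import random

    # One recorded human session: keystroke intervals (ms) and a mouse path.
    recorded = {
        "key_intervals_ms": [112, 95, 140, 88, 203, 97],
        "mouse_path": [(10, 400), (180, 355), (420, 310), (640, 298)],
    }

    def jitter(profile, n=1000):
        # Generate n slightly-perturbed copies of one real human trace.
        variants = []
        for _ in range(n):
            variants.append({
                "key_intervals_ms": [max(30, int(t * random.gauss(1.0, 0.08)))
                                     for t in profile["key_intervals_ms"]],
                "mouse_path": [(x + random.randint(-3, 3), y + random.randint(-3, 3))
                               for x, y in profile["mouse_path"]],
            })
        return variants

    profiles = jitter(recorded)  # thousands of plausible "humans" from five minutes of data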

  • > There is nothing anyone can do to prevent bots.

    Are you sure about this?

    • I was part of the team managing tens of millions of dollars’ worth of NFL event-ticket inventory, which meant I had to automate the Ticketmaster UI to delist any ticket that was put into checkout or sold on a secondary market like StubHub. For legal reasons, Ticketmaster wouldn’t grant us direct access to their private API while they were still building out the developer API (which our backend team actually helped design), so I spent about half my time reverse-engineering and circumventing the bot protections on Ticketmaster, SeatGeek, StubHub, etc. I made it very clear that anyone caught using my code to automate ticket purchases would face serious consequences.

      At the time, Ticketmaster’s anti-bot measures were the gold standard. They gave us fair warning that they planned to implement Mastercard’s SaaS-based solution (same as described in OP’s article), so I had everyone on the team capture keyboard-typing cadence, mouse movements, and other behavioral metrics. I used that as the excuse to build a Chrome extension that handled all of those tasks, and I leaned on the backend team to stop procrastinating and integrate the new API endpoints that Ticketmaster was rolling out. For about a week, that extension managed millions of dollars in inventory—until I got our headless browsers back up and running.

      In the end, any lock can be picked given enough time; its only real purpose is to add friction until attackers move on to an easier target. But frankly, nobody can stop me from scraping data or automating site interactions if it’s more profitable than whatever else I could be working on. I have some ideas about how to stop people like me from using automated bots, but none of the companies I've applied to over the years ever respond -- that's on them.

I run a company that relies on bots getting past captchas. It's not hard to get past captchas like this. Anyone with even a medium-sized economic incentive will figure it out. There'll probably be free open-source solutions soon.

> We need human-friendly CAPTCHAs, as much as they're generally disliked. They're the only solution to the growing spam and abuse problem on the web.

This is wrong, badly wrong.

CAPTCHA stood for “Completely Automated Public Turing test to tell Computers and Humans Apart”. And that’s how people are using such things: to tell computers and humans apart. But that’s not the right problem.

Spam and abuse can come from computers, or from humans.

Productive use can come from humans, or from computers.

Abuse prevention should not be about distinguishing computers and humans: it should be about the actual usage behaviour.

CAPTCHAs are fundamentally solving the wrong problem. Twenty years ago, they were a tolerable proxy for the right problem: imperfect, but generally good enough. But they have become a worse proxy over time.

Also, “human-friendly CAPTCHAs” are just flat-out impossible in the long term. As you identify, it’s only a “for now” thing. Once it’s a target, it ceases to be effective. And the range in humans is so broad that it’s generally distressingly easy to make a bot exceed the lower reaches of human performance.

> Proof-of-work CAPTCHAs work well for making bots expensive to run at scale, but they still rely on accurate bot detection. Avoiding both false positives and negatives is crucial, yet no existing approach is reliable enough.

Proof-of-work is even more obviously a temporary solution, security by obscurity: it relies upon symmetry in computation power, an assumption that is just wildly incorrect. And all of the implementations I know of have made the bone-headed decision to start with SHA-256 hashing, which amplifies this asymmetry to a ludicrous degree (factors of tens of thousands with common hardware, to tens of millions with Bitcoin mining hardware). At that point, forget choosing different iteration counts based on bot detection; it doesn’t even matter.
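
Back-of-the-envelope, with assumed hash rates (orders of magnitude only, not measurements):

    # Assumed SHA-256 hash rates, orders of magnitude only:
    rates = {
        "browser JS on a low-end phone": 1e6,   # ~1 MH/s
        "desktop CPU, native code":      3e7,   # ~30 MH/s
        "single GPU":                    1e10,  # ~10 GH/s
        "one Bitcoin ASIC":              1e14,  # ~100 TH/s
    }

    work = 2**22  # hashes demanded by a hypothetical PoW challenge

    for who, rate in rates.items():
        print(f"{who:32s} {work / rate:10.3e} s")

    # The cost that is seconds for the phone is effectively zero for mining
    # hardware; no per-client iteration count can bridge that gap.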

—⁂—

The inconvenient truth is: there is no Final Ultimate Solution to the Spam Problem (FUSSP).

  • > Proof-of-work is even more obviously a temporary solution, security by obscurity: it relies upon symmetry in computation power, an assumption that is just wildly incorrect. And all of the implementations I know of have made the bone-headed decision to start with SHA-256 hashing, which amplifies this asymmetry to a ludicrous degree (factors of tens of thousands with common hardware, to tens of millions with Bitcoin mining hardware). At that point, forget choosing different iteration counts based on bot detection; it doesn’t even matter.

    It takes a long time and enormous amounts of money to make new chips for a specific proof of work. And sites can change their algorithm on a dime. I don't think this is a big issue.

    • Even disregarding the SHA-256 thing, there is unavoidable, significant asymmetry and range that renders proof of work unviable. One legitimate user may use a low-end phone, another may have a high-end desktop that can work a hundred or more times as fast with whatever technique you use, and an attacker may have a botnet.

      It’s important to assume, in security and security-adjacent things, that the attacker has more compute power than the defender. You cannot win in this way.

      Proof-of-work is bad rate limiting that relies upon the server having a good estimate of the capabilities of the client. No more, no less.

      I bring up the SHA-256 thing as an argument that none of the players in the space are competent. None of them. If you exclude hand-rolled cryptography or known-bad techniques like MD5, SHA-256 is very literally the worst choice remaining: its use in Bitcoin and the rewards available have utterly broken it for this application. If you intend proof of work to actually be the line of defence, you start with something like Argon2d instead. I honestly think that, at this stage, these scripts could replace their proof of work with a “sleep for one second” (maybe adding “or two if I think you’re probably a bot”) routine and have the server trust that they had done so, without compromising their effectiveness.

  • > Spam and abuse can come from computers, or from humans.

    > Productive use can come from humans, or from computers.

    I agree in principle, but the reality is that 37% of all internet traffic originates from bots[1]. The overwhelming majority of that traffic (89% according to Fastly) can be described as abusive. In turn, the abusive traffic from humans likely pales in comparison. It's vastly cheaper to set up bot farms than Mechanical Turk farms, and it's only getting cheaper.

    Identifying the source of the traffic, while difficult, is a generalizable problem, whereas tracking specific behavior depends on each site and will likely require a custom implementation for each type of service. Or it requires invasive tracking of users throughout the duration of their session, as many fraud prevention systems do.

    Both approaches can be deployed at the same time. A CAPTCHA is not meant to be the only security solution anyway, but as a first layer of defense that is generally simple to deploy and maintain.

    That said, I concede that the sentence "[CAPTCHAs] are the only solution" is wrong. :)

    > Proof-of-work is even more obviously a temporary solution, security by obscurity

    I disagree, and don't see how it's security by obscurity. It's simply a method of increasing the access cost for abusive traffic. The more signals are gathered that identify the user as abusive, the higher the "price" they're required to pay to access the service. Whether the user is a suspected bot or not could just be one type of signal. Behavioral and cognitive signals as mentioned in TFA can be others. Yes, these methods aren't perfect, and can mistakenly penalize human users and be spoofed by bots, but it's the best we currently have. This is what I'd like to see improved.

    Still, even with all their faults, I think PoW CAPTCHAs offer a much better UX than traditional CAPTCHAs ever did. Yes, telling humans apart from computers is getting more difficult, but it doesn't mean that the task is pointless.
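
    A rough sketch of the "higher price for more suspicious signals" idea (the signal names and weights are invented, purely illustrative):

        def pow_difficulty(signals: dict) -> int:
            # Map abuse signals to a proof-of-work cost in leading zero bits.
            # Signals and weights are illustrative only.
            score = 0
            score += 3 if signals.get("datacenter_ip") else 0
            score += 2 if signals.get("headless_hints") else 0
            score += 2 if not signals.get("plausible_pointer_motion") else 0
            score += 1 if signals.get("new_session") else 0
            return min(12 + 2 * score, 28)  # everyone pays a small base cost, capped

        print(pow_difficulty({"plausible_pointer_motion": True}))               # 12 bits
        print(pow_difficulty({"datacenter_ip": True, "headless_hints": True}))  # 26 bits, ~16000x the work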

    [1]: https://learn.fastly.com/rs/025-XKO-469/images/Fastly-Threat...

> Proof-of-work CAPTCHAs work well for making bots expensive to run at scale

“Expensive” depends on the value of what you do behind the captcha.

There are human-solving captcha services that charge USD 1 for 1k captchas solved (0.1 cents per captcha).

So as long as you can charge more than what solving the captchas costs, you are good to go.

Unfortunately, for a lot of tasks, humans are currently cheaper than AI.

Exactly. If the financial incentive is there, they'll add sufficient jitter to trick the detector, and eventually train an ML model to make it even more realistic.

  • Yes and no. Traditional CAPTCHAs didn't cause bot farms to advance computer vision

    • I don't see how that contradicts the parent post. Computer vision wasn't as good when reCAPTCHA was still typing out books, but machine learning has (per my expectation, having worked with it since ~2015, but the proof would be in the pudding) likely been good enough for mimicking e.g. keystroke timings for decades. It hasn't been needed until now. That doesn't mean they won't use it now that it is needed. It's a different situation from one where the tech did not yet exist.

    • > Traditional CAPTCHAs didn't cause bot farms to advance computer vision

      Are you sure? And how do you know?

      There are a lot of CAPTCHA-cracking services. Given the prices, they are hardly sustainable even at developing-country wage levels. I believe they actually solve the easy ones automatically, and humans are only involved for the harder ones.

I think we'll have to go with an ID connected to a real human eventually, tbh.

Y'all will balk at that, but in a decade or so I think we'll have no other choice.

But even that will fail, since certain countries will likely be less precious about their system for this and spammers will still get fake IDs. Same problem as now with phone numbers/RCS spam.

> but [PoWs] still rely on accurate bot detection.

No they don't; that's the point: you can serve everyone a PoW and don't have to discriminate against and ban real people. The system you're enthusiastic about is what tries to do this "accurate bot detection" (scratch the first word).
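
Something like this, a minimal sketch with the same fixed price for every client and no classification step at all (ignoring the SHA-256 complaints upthread; the point is just that no bot detection is needed):

    import hashlib
    import secrets

    DIFFICULTY_BITS = 16  # identical cost for everyone: human, bot, whatever

    def issue_challenge() -> str:
        return secrets.token_hex(16)

    def verify(challenge: str, nonce: str) -> bool:
        digest = hashlib.sha256((challenge + nonce).encode()).digest()
        # Accept if the hash starts with DIFFICULTY_BITS zero bits.
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    # Client side: grind nonces until one verifies, then attach it to the request.
    challenge = issue_challenge()
    nonce = 0
    while not verify(challenge, str(nonce)):
        nonce += 1
    print("solved with nonce", nonce)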