This would have been flagged by Clippy lints `let_underscore_untyped` or `let_underscore_must_use`, which sadly are not enabled by default.
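For anyone who hasn't used these lints, here is a minimal sketch of what `let_underscore_must_use` reacts to. `try_flush` and both `shutdown_*` functions are hypothetical names made up for illustration, not hyper's API, and the attribute only has an effect when the code is run through `cargo clippy`.

```rust
// Hypothetical stand-in for a flush call; not hyper's real API.
#[must_use = "false means data is still buffered"]
fn try_flush(pending_bytes: usize) -> bool {
    pending_bytes == 0
}

// Opt in to the restriction lint for this item (only checked by `cargo clippy`).
#[warn(clippy::let_underscore_must_use)]
fn shutdown_carelessly(pending_bytes: usize) -> usize {
    // Clippy would flag this line: a `#[must_use]` value is dropped via `let _ =`.
    let _ = try_flush(pending_bytes);
    pending_bytes // bytes that never reached the peer
}

// One way to respond to the warning: look at the value and report leftovers.
fn shutdown_carefully(pending_bytes: usize) -> Result<(), usize> {
    if try_flush(pending_bytes) {
        Ok(())
    } else {
        Err(pending_bytes)
    }
}
```

The lint does not say what the right handling is; it only forces the discard to be looked at again.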
Or just by not writing let _ =
All recurrent people problems are system problems.
Ehh, easy fix
I said ‘flagged’, not ‘fixed’ :)
You can always write the wrong code if you want it enough. But hopefully a warning would have prompted someone to think harder about this flow.
And this is why you should warn on `clippy::allow_attributes_without_reason` in your projects.
You can set the lints to `forbid` instead of `deny`, which means they can't be `allowed` like that.
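A rough sketch of the `deny` versus `forbid` difference, assuming the lint is set on a module. `flush_done` and `dispatch::drive_once` are hypothetical names for illustration only.

```rust
// Hypothetical stand-in for a flush call; not hyper's real API.
#[must_use]
fn flush_done() -> bool {
    false
}

#[deny(clippy::let_underscore_must_use)]
mod dispatch {
    // A local `allow` overrides the outer `deny`, so this compiles and lints clean.
    #[allow(clippy::let_underscore_must_use)]
    pub fn drive_once() -> &'static str {
        let _ = super::flush_done();
        "shutdown"
    }
}

// If `mod dispatch` were marked `#[forbid(...)]` instead of `#[deny(...)]`,
// the inner `#[allow(...)]` would be rejected at compile time.
```

So `forbid` closes the local escape hatch, at the cost of having no escape hatch when the discard really is intended.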
Yeah, but you must know about them and the possible bug first in order to allow them...
[flagged]
Cloudflare does not notice (until a customer complains) that they are sending broken responses at scale? I would have thought they would notice this from sampling and linting a few replies, just in case they did something like Cloudbleed again.
Can you get reasonable results without exposing sensitive info? I'm asking because I genuinely have no idea what it's like at their scale
> We spent six weeks chasing a nearly invisible bug — a race condition that occurred only under specific conditions — in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it.
That's a long time, must be frustrating.
It is a long time, and it gets frustrating when there are long stretches of flailing with no visible progress.
I have had long bug hunts (~a month each) and witnessed ones that took much, much longer. But the longest one I witnessed was drawn out because reproduction was initially unreliable and could take weeks to months. Thankfully, reproducing it meant letting a box sit in a corner while the people involved moved on to other tasks. This kept everybody sane.
Would using Rust have prevented this?
I get that it’s fun to dunk on Rust when a Rust bug surfaces. But is it a bit petty to bring this out when there’s any type of bug of any severity in any Rust software?
In this case a small minority of requests were getting truncated responses.
No one said Rust software is bug free. If someone thinks that, they’ve been seriously misled.
Agree. This is a warning to people who thought Rust is optional at cloud scale.
Isn't this already Rust?
That was obviously a joke question, pointing out that Rust isn't the solution for everything.
Woosh :-)
No. Anyone expecting that hasn't read the No Silver Bullet essay.
Actually I suspect that Rust is a Silver Bullet in that sense. That essay seems to be a case where people know of the essay but haven't read it. Normally in English a "Silver Bullet" is something much bigger, a panacea or cure-all which entirely solves a problem, but in his essay Brooks is talking about order-of-magnitude improvements, and that looks a lot like Rust.
Brooks was expecting such "Silver Bullet" improvements as often as every few decades, so we're arguably significantly overdue. He cites Ada as an example of where such an improvement might come from. Rust isn't Ada, but a lot of the same ideas about correctness are present.
Google reports order-of-magnitude changes from their Rust work, for example.
The Hyper library in question is a Rust library.
Did you read the article, or are you a "use rust" parrot / bot based on titles?
Sarcasm. (I guess)
So “fearless concurrency” still only happens when one just decides to not be afraid… :)
This does not appear to be a concurrency bug though?
Of course it's a concurrency bug. It races sending data to the kernel against the kernel sending data to the network. If the wrong one wins the bug occurs.
“a race condition that occurred only under specific conditions — in the hyper library”
Nice writeup, but I don't understand how `curl` didn't trigger the bug for them (or for any other hyper HTTP server out there), given the explanation in the article.
`curl --http1.1` sends `Connection: Close`, so the sender (hyper) must attempt to shut down the connection after sending the whole body. Surely any network is slower than a memory copy into the socket's kernel buffers, so it must reliably trigger the condition "buffer flush can't be done in one go" and thus trigger an early TCP shutdown.
> The failure was caused by a timing-dependent race condition in hyper’s HTTP/1 connection handling. When the reader was slower and the socket buffer filled, poll_flush returned Poll::Pending, but the dispatch loop discarded that result. Hyper then treated the response as complete and shut down the socket while data remained buffered internally, causing the client to receive an EOF before the full body arrived.
https://github.com/hyperium/hyper/issues/4022
Saved you 3000 words
Reminds me of another “slow client”-related bug in gunicorn: https://github.com/benoitc/gunicorn/issues/3334
That's not even a bug. That's how TCP works. If you keep sending data to a socket the other side has closed, you get RST.
Hey, you have to justify three engineers full time's worth of salary.
[dead]
[flagged]
This is the Rust idiom for “I am intentionally ignoring this return value”. The linter would have caught
  self.poll_read()?;
and in fact one of the options the linter itself suggests in this case is exactly this “let underscore equals” idiom. (Arguably, this code exists because of the linter, not due to its absence!)
In any case, the return value is being “handled” - the question mark examines the result and breaks the loop if the result is not `Ok(…)`, i.e. if the call is not successful.
Intentionally ignoring the successful return value isn’t necessarily terrible, either - you could be calling the function for its side effect, and you don’t care what the specific result of that effect is, just as long as there is some effect. E.g. maybe you have a state machine, and this is the code that repeatedly drives it.
(Not coincidentally, polling is what you do to Futures, and Futures are state machines that you need to repeatedly drive…)
In conclusion, I do not think this is prima facie terrible code, nor is it an obvious bug. Async Rust is subtle and complicated, and not always fully understood by those who nevertheless have to use it.
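To make the idiom above concrete, here is a simplified sketch of the shape being discussed. `Conn`, its fields, and both `finish_*` methods are hypothetical; the real hyper code takes a `Context` and looks different, and `finish_checked` is just one way to surface `Pending`, not the fix hyper shipped.

```rust
use std::task::Poll;

/// Hypothetical stand-in for a buffered connection; not hyper's real type.
struct Conn {
    buffered: usize,    // bytes still waiting in our internal buffer
    socket_room: usize, // bytes the simulated socket accepts per flush attempt
}

impl Conn {
    /// Writes as much as the socket accepts; `Pending` means data is left over.
    fn poll_flush(&mut self) -> Poll<Result<(), String>> {
        let n = self.buffered.min(self.socket_room);
        self.buffered -= n;
        if self.buffered == 0 {
            Poll::Ready(Ok(()))
        } else {
            Poll::Pending
        }
    }

    /// The discussed shape: `?` propagates an `Err`, then `let _ =` drops the
    /// remaining `Poll<()>`, so `Pending` goes unnoticed.
    fn finish_discarding(&mut self) -> Result<usize, String> {
        let _ = self.poll_flush()?;
        Ok(self.buffered) // bytes that would be lost if we shut down now
    }

    /// Alternative shape: keep reporting `Pending` until the buffer is empty.
    fn finish_checked(&mut self) -> Poll<Result<(), String>> {
        match self.poll_flush()? {
            Poll::Ready(()) => Poll::Ready(Ok(())),
            Poll::Pending => Poll::Pending,
        }
    }
}
```

With a 10-byte buffer and a socket that takes 4 bytes per attempt, `finish_discarding` returns successfully while 6 bytes are still buffered, whereas `finish_checked` has to be polled three times before it reports ready.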
>This is the Rust idiom for “I am intentionally ignoring this return value”.
That doesn't make the code any less awful, it just makes idiomatic Rust sound awful. Discarding a return value without even a comment to explain why shouldn't be allowed in any critical project, and the linter should be perfectly capable of ensuring that a comment accompanies the discard and complaining loudly when it doesn't.
It is an explicit way to discard return values; `self.poll_read(cx)?` etc. alone would warn. In this case, `Poll<Result<(), Error>>` is unwrapped once by `?`, which propagates the error, and the remaining `Poll<()>` is what gets discarded. The decision to discard it should have been intentional, although that turned out not to be the case everywhere.
If they're not going to handle the return values, they should change the function signature to reflect this aspirational contract, that that function "never fails".
I see in the article they did change the poll_flush to run just-in-time at poll_shutdown. So they definitely can make a "best effort" poll_flush version that just does not return any errors for use in that loop.
But all in all? Amateur hour.
Assigning to `_` in Rust specifically means that you intentionally want to discard the value, and the Clippy linter and the Rust compiler both know that.
[flagged]
LLM?
why?
So much for Rust forcing you to handle errors.
Go does force you too, but it also supports _ as a bypass - because sometimes you do know better. Just not in this case.
Rust never promised it'll let programmers turn off their brain, that's what LLMs are for.
You could argue the bug happened exactly because hyper's poll_flush treats flushing some but not all data as a successful return, not an error case.
There's a hidden equivocation there. "Handling" errors, as far as the language is concerned, means you do something with them, but explicitly discarding them is most definitely a "something".
From a human perspective we can consider that not handling the error.
But the language has no mechanism for "knowing" that discarding the error is wrong. Discarding errors is a fully valid mechanism that we must be able to use in a program because it is sometimes correct. There really isn't even a sensible way to define a way to "force" a user to "handle" errors. The language can only be designed to make it hard to forget to "handle" them somehow in the way the language sees, but it is always possible for the user to handle them incorrectly. Discarding them when they shouldn't have is only one particularly cognitively-available option and is hardly the full scope of possibilities. It probably isn't even the most common mistake to make; I would imagine there are far more errors that are not handled "correctly" than ones that are spuriously discarded.
Note I keep saying "language" rather than Rust. All a language can do is surface the issue, and Rust does that. It can't force good code. No language can.
You could say the exact same thing about safety belts and airbags in cars after someone has died in a crash.
Why even bother with measures that prevent many problems if they won't prevent all of them, right?
This is the argument I like too.
It's the same argument anti-vaxers love to make. "Well you can still get covid after getting the shot", which is something I read and heard quite a lot. That doesn't make the thing useless.
Humans are really dumb.
I wonder if this bug was found via project glasswing
> I wonder if this bug was found via project glasswing
Did you read how they said it took weeks? Would run out of tokens at that rate...
Yet Cloudflare relies on bugs in browsers to "verify" you.