Comment by kentonv

1 day ago

> it feels like it's essentially Cloudflare employees improving Cap'n Proto for Cloudflare.

That's correct. At present, it is not anyone's objective to make Cap'n Proto appeal to a mass market. Instead, we maintain it for our specific use cases in Cloudflare. Hopefully it's useful to others too, but if you choose to use it, you should expect that if any changes are needed for your use case, you will have to make those changes yourself. I certainly understand why most people would shy away from that.

With that said, gRPC is arguably weird in its own way. I think most people assume that gRPC is what Google is built on, therefore it must be good. But it actually isn't -- internally, Google uses Stubby. gRPC is inspired by Stubby, but very different in implementation. So, who exactly is gRPC's target audience? What makes Google feel it's worthwhile to have 40ish(?) people working on an open source project that they don't actually use much themselves? Honest questions -- I don't know the answer, but I'd like to.

(FWIW, the story is a bit different with Protobuf. The Protobuf code is the same code Google uses internally.)

(I am the author of Cap'n Proto and also was the one who open sourced Protobuf originally at Google.)

While you are correct the majority of Google services internally are Stubby-based, it isn't correct to say gRPC is not utilized inside Google. Cloud and external APIs are an obvious use case but also internal stuff that are also used in open source world use gRPC even when deployed internally. TensorFlow etc comes to mind...

So "not used much" at Google scale probably justifies 40 people.

My most vivid gRPC experience is from 10 years or so ago, so things have probably changed. We were heavily Go and micro-services. Switched from, IIRC, protobuf over HTTP, to gRPC "as it was meant to be used." Ran into a weird, flaky bug--after a while we'd start getting transaction timeouts. Most stuff would get through, but errors would build and eventually choke everything.

I finally figured out it was a problem with specific pairs of servers. Server A could talk to C, and D, but would timeout talking to B. The gRPC call just... wouldn't.

One good thing is you do have the source to everything. After much digging through amazingly opaque code, it became clear there was a problem with a feature we didn't even need. If there are multiple sub-channels between servers A and B. gRPC will bundle them into one connection. It also provides protocol-level in-flight flow limits, both for individual sub-channels and the combined A-B bundle. It does it by using "credits". Every time a message is sent from A to B it decrements the available credit limit for the sub-channel, and decrements another limit for the bundle as a whole. When the message is processed by the recipient process then the credit is added back to the sub-channel and bundle limits. Out of credits? Then you'll have to wait.

The problem was that failed transactions were not credited back. Failures included processing time-outs. With time-outs the sub-channel would be terminated, so that wasn't a problem. The issue was with the bundle. The protocol spec was (is?) silent as to who owned the credits for the bundle, and who was responsible for crediting them back in failure cases. The gRPC code for Go, at the time, didn't seem to have been written or maintained by Google's most-experienced team (an intern, maybe?), and this was simply dropped. The result was the bundle got clogged, and A and B couldn't talk. Comm-level backpressure wasn't doing us any good (we needed full app-level), so for several years we'd just patch new Go libraries and disable it.

> What makes Google feel it's worthwhile to have 40ish(?) people working on an open source project that they don't actually use much themselves? Honest questions -- I don't know the answer, but I'd like to.

It's at least used for the public Google Cloud APIs. That by itself guarantees a rather large scale, whether they use gRPC in prod or not.