Comment by socketcluster

1 month ago

I've also become something of a text maximalist. It is the natural meeting point in human-machine communication. The optimal balance of efficiency, flexibility and transparency.

You can store everything as a string; base64 for binary, JSON for data, HTML for layout, CSS for styling, SQL for queries... Nothing gets closer to the mythical silver-bullet that developers have been chasing since the birth of the industry.

The holy grail of programming has been staring us in the face for decades and yet we still keep inventing new data structures and complex tools to transfer data... All to save like 30% bandwidth; an advantage which is almost fully cancelled out anyway after you GZIP the base64 string which most HTTP servers do automatically anyway.

Same story with ProtoBuf. All this complexity is added to make everything binary. For what goal? Did anyone ever ask this question? To save 20% bandwidth, which, again is an advantage lost after GZIP... For the negligible added CPU cost of deserialization, you completely lose human readability.

In this industry, there are tools and abstractions which are not given the respect they deserve and the humble string is definitely one of them.

27 comments

socketcluster

astrobe_ 1 month ago

> The optimal balance of efficiency, flexibility and transparency.

You know the rule, "pick 2 out of 3". For a CPU, converting "123" would be a pain in the arse if it had one. Oh, and hexadecimal is even worse BTW; octal is the most favorable case (among "common" bases).

Flexibility is a bit of a problem too - I think people generally walked back from Postel's law [1], and text-only protocols are big "customers" of it because of its extreme variability. When you end-up using regexps to filter inputs, your solution became a problem [2] [3]

30% more bandwidth is absolutely huge. I think it is representative of certain developers who have been spoiled with grotesquely overpowered machines and have no idea any idea of the value of bytes, bauds and CPU cycles. HTTP3 switched to binary for even less than that.

The argument that you can make up for text's increased size by compressing base64 is erroneous; one saves bandwidth and processing power on both sides if you can do away without compression. Also, with compressed base64 you've already lost the readability on the wire (or out of the wire since comms are usually encrypted anyway).

[1] https://en.wikipedia.org/wiki/Robustness_principle

[2] https://blog.codinghorror.com/regular-expressions-now-you-ha...

[3] https://en.wikipedia.org/wiki/ReDoS

yegle 1 month ago

As someone who's daily job is to move protobuf messages around, I don't think protobuf is a good example to support your point :-)

AFAIKT, binary format of a protobuf message is strictly to provide a strong forward/backward compatibility guarantee. If it's not for that, the text proto format and even the jaon format are both versatile, and commonly used as configuration language (i.e. when humans need to interact with the file).

socketcluster 1 month ago
You can also provide this with JSON and API versioning. Also with JSON, you can add new fields to requests and responses, it's only deleting fields which breaks compatibility.
- bccdee 1 month ago
  
  There's no simple replacement for what Protobuf does. Forwards and backwards compatibility is well-specified across clients in all languages. I can write a v2 message with new fields and pass it to a service written in a different language based on the v1 message schema. That service can modify the message using only its v1 schema, but when it re-emits the modified message, its original v2 fields will be intact.
  You may think, "I don't need that," but once you've got more than a couple microservices, you'd be surprised how many headaches this type of compatibility issue can cause. You may think, "I can do that with json," but can you do exactly the same version of it across 4 or 5 different languages while maintaining a single source of truth for each message type's schema? At that point, you're just rebuilding Protobuf.
  Afaik the only other tool that does what Protobuf does is Avro, though I haven't used it. I have tried to use json-schema for this, but that's not what it was made for. The schema evolution story is worse, and the codegen isn't as good.

beej71 1 month ago

I've moved away from DOCish or PDF for storage to text (usually markdown) with Makefiles to build with Typst or whatever. Grep works, git likes it, and I can easily extract it to other formats.

My old 1995 MS thesis was written in Lotus Word Pro and the last I looked, there was nothing to read it. (I could try Wine, perhaps. Or I could quickly OCR it from paper.) Anyway, I wish it were plain text!

cricalix 1 month ago

I poked this - the 96 installer from Archive didn't play nice with wine. However, dosbox plus win3.11 and some ingmount commands worked just fine. So yes, you could export to plain text or similar.

bccdee 1 month ago

> For the negligible added CPU cost of deserialization, you completely lose human readability.

You could turn that around & say that, for the negligible human cost of using a tool to read the messages, your entire system becomes slower.

After all, as soon as you gzip your JSON, it ceases to be human-readable. Now you have to un-gzip it first. Piping a message through a command to read it is not actually such a big deal.

naniwaduni 1 month ago
The human cost becomes negligible once the tooling is already integrated. You don't get to call it negligible until after the integration has been done.
- bccdee 1 month ago
  
  Sure I do. The integration looks like this:
  jmsg, _ := protojson.Marshal(msg) fmt.Println(jmsg)
  That's negligible.

whatevermom5 1 month ago

Base64 and JSON takes a lot of CPU to decode; this is where Protobuf shines (for example). Bandwidth is one thing, but the most expensive resources are RAM and CPU, and it makes sense to optimize for them by using "binary" protocols.

For example, when you gzip a Base64-encoded picture, you end up 1. encoding it in base64 (takes a *lot* of CPU) and then, compressing it (again! jpeg is already compressed).

I think what it boils down to is scale; if you are running a small shop and performance is not critical, sure, do everything in HTTP/1.1 if that makes you more productive. But when numbers start mattering, designing binary protocols from scratch can save a lot of $ in my experience.

socketcluster 1 month ago
Maybe for some kind of multiplayer game which has massive bandwidth and CPU usage requirements and has to be supported by paper-thin advertising profit margins... When tiny performance improvements can mean the difference between profitable and unprofitable, then it might make sense to optimize but this... But for the vast majority of software, the cost of serializing JSON is negligible and not worth thinking about.
For example, I've seen a lot of companies obsess over minor stuff like shaving a few bucks off their JSON serialization or using a C binding of some library to squeeze every drop of efficiency out of those technologies... While at the same time letting their software maintenance costs blow out of control... Or paying astronomical cloud compute bills when they could have self-hosted for 1/20th of the price...
Also, the word scale is overused. What is discussed here is performance optimization, not scalability. Scalability doesn't care for fixed overhead costs. Scalability is about growth in costs as usage increases and there is no difference in scalability if you use ProtoBuf or JSON.
The expression that comes to mind is "Penny-wise, pound-foolish." This effect is absolutely out of control in this industry.
- whatevermom5 1 month ago
  
  Replying late, but yes I agree. What matters is the bottom line, and the vast majority of apps should be using JSON because this is the most economical when it comes to engineering time.
- panstromek 1 month ago
  
  If you deploy on phones, CPU and memory is a major problem. Pick a median Android and lots of websites consisently fail to deliver good experience on it and it's very common to see them bottlenecked on CPU. JSON is massively innefficient, it's foolish think it won't have any effect.

8n4vidtmkvmk 1 month ago

The value of protobuf is not to save a few bytes on the wire. First, it requires a schema which is immensely valuable for large teams, and second, it helps prevent issues with binary skew when your services aren't all deployed at the same millisecond.

handfuloflight 1 month ago

I marvel at the constraint and freedom of the string.

smj-edison 1 month ago

Just go full Tcl, where instead of shunning stringly typed data structures, the only data structure available is a string :)

makeitdouble 1 month ago

The text based side of protobuf is not base64 or json. We'd be looking at either CSV or length delimited fields.

Many large scale systems are on the same camp as you as their text files flow around their batch processors like crazy, but there's absolutely no flexibility or transparency.

Json and or base64 are more targeted as either low volume or high latency systems. Once you hit a scale where optimizing a few bits straight saves a significant amount of money, self labeled fields are just out of question.

TimByte 1 month ago

I think some of the binary tooling exists less because engineers hate strings and more because humans aren't the primary consumers anymore

the8472 1 month ago

shipping base64 in json instead of a multipart POST is very bad for stream-processing. In theory one could stream-process JSON and base64... but only the json keys prior would be available at the point where you need to make decisions about what to do with the data.

socketcluster 1 month ago
Still, at least it's an option to put base64 inline inside the JSON. With binary, this is not an option and must send it separately in all cases, even small binary...
You can still stream the base64 separately and reference it inside the JSON somehow like an attachment. The base64 string is much more versatile.
- zzo38computer 1 month ago
  
  Even with binary, you can store a binary inline inside of another one if it is a structured format with a "raw binary data" type, such as DER. (In my opinion, DER is better in other ways too, and (with my nonstandard key/value list type added) it is a superset of the data model of JSON.)
  Using base64 means that you must encode and decode it, but binary data directly means that is unnecessary. (This is true whether or not it is compressed (and/or encrypted); if it is compressed then you must decompress it, but that is independent of whether or not you must decode base64.)
- dwattttt 1 month ago
  
  > Still, at least it's an option to put base64 inline inside the JSON. With binary, this is not an option and must send it separately in all cases, even small binary...
  There's nothing special about "text" or binary here. You can absolutely put binary inside other binary; you use a symbol that doesn't appear inside the binary, much like you do for text.
  You use a divider, like " is for json, and a prearranged way to avoid that symbol from appearing inside the inner binary (the same approach that works for text works here).
  What do you think a zip file is? They're not storing compressed binary data as text, I can tell you that.
  
  3 replies →
- makeitdouble 1 month ago
  
  I don't get why using a binary protocol doesn't allow handling strings. What's the limitation ?

ozim 1 month ago

I think you want ZSTD instead of GZIP nowadays.