Comment by Ferret7446

13 hours ago

Text is just bytes, and bytes are just text. I assume this is talking about human readable ASCII specifically.

I think the obsession with text comes down to two factors: conflating binary data with closed standards, and poor tooling support. Text implies a baseline level of acceptable mediocrity for both. Consider a CSV file with millions of base64-encoded columns and no column labels. That would really not be any friendlier than a binary file with an openly documented format and a suitable editing tool, e.g. SQLite.
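The CSV point can be sketched concretely. A minimal Python example (the cell values are hypothetical, chosen just for illustration): the file is technically text, but the cells are meaningless until you apply the right decoding step, exactly as with an undocumented binary format.

```python
import base64
import csv
import io

# A hypothetical CSV with unlabeled, base64-encoded cells: valid "text",
# but opaque until you know the encoding scheme used for each column.
raw = "aGVsbG8=,d29ybGQ=\n"

for row in csv.reader(io.StringIO(raw)):
    # Without this decode step, the cells are as unreadable as raw binary.
    decoded = [base64.b64decode(cell).decode("utf-8") for cell in row]
    print(decoded)  # → ['hello', 'world']
```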

Maybe a lack of fundamental technical skills is another culprit, but binary files really aren't that scary.

> Text is just bytes, and bytes are just text. I assume this is talking about human readable ASCII specifically.

Text is human readable writing (not necessarily ASCII). It is most certainly not just any old bytes the way you are saying.

  • I agree, but binary is exactly the same. You use a different tool to view it, and maybe you don't have that tool, and that's the problem. But it's a matter of having a way to interpret the data: trivially, base64-encoding readable text gives you text, and if you can't decode it, it's as meaningless as binary you can't decode.

    It makes more sense to consider readability or comprehensibility of data in an output format; text makes sense for many kinds of data, but given a graph, I'd rather view it as a graph than as a readable text version.

    And if you have a way to losslessly transform data between an efficient binary form, readable text, or some kind of image (or other format), that's the best of all.

    • And it's funny to think about how many different incompatible text standards there were for the first 30ish years of computers. Each vendor had their own encoding, and it took until UTF-8 to even agree on text (let alone the legacy of UTF-16). If it took that long to agree on text, I have a bad feeling it'll take even longer to agree on anything else.

      I suppose open standards have slowly been winning with Opus and AV1, but there are still so many forms of interaction that have proprietary or custom interfaces. It seems like anything with a stable standard has to be at least 20 years old, lol.

  • And machine readable. You can parse a CSV file more or less easily, but try the same with some forgotten software-specific binary format.
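The vendor-encoding point above can be demonstrated in two lines: the same byte is different text under different legacy encodings, so the bytes alone never carried the meaning. A small Python sketch:

```python
# One byte, two legacy encodings, two different characters: before UTF-8
# won out, "text" depended entirely on which encoding you assumed.
data = bytes([0xA4])

as_latin1 = data.decode("latin-1")     # '¤' currency sign in ISO 8859-1
as_latin9 = data.decode("iso8859_15")  # '€' euro sign in ISO 8859-15
print(as_latin1, as_latin9)
```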

Text is bytes accompanied by a major constraint on which byte sequences are permitted (a useful compression into principal axes that emerged over thousands of years of language evolution), along with a natural connection to human semantics due to universal adoption of the standard (allowing correlations to be modelled).
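That constraint is easy to observe in practice: a UTF-8 decoder rejects most arbitrary byte sequences, which is what makes valid text a small, shared subset of all possible bytes. A minimal sketch:

```python
# Valid text occupies a constrained subset of byte sequences: UTF-8
# rejects bytes like 0xFF outright, so arbitrary bytes are not text.
valid = "héllo".encode("utf-8")
invalid = bytes([0xFF, 0xFE, 0xFD])

print(valid.decode("utf-8"))  # → héllo
try:
    invalid.decode("utf-8")
except UnicodeDecodeError:
    print("not a permitted byte sequence")
```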

Text is like a complexity funnel (analogous to a tokenizer) that everyone shares. Its utility is derived from its compression and its standardization.

If everyone used binary data with their own custom interpretation schema, it might work better for that narrow vertical, but it would not have the same utility for LLMs.