Comment by Ferret7446

13 hours ago

Text is just bytes, and bytes are just text. I assume this is talking about human readable ASCII specifically.

I think the obsession with text comes down to two factors: conflating binary data with closed standards, and poor tooling support. Text implies a baseline level of acceptable mediocrity for both. Consider a CSV file with millions of base64-encoded columns and no column labels. That would really not be any friendlier than a binary file with an openly documented format and a suitable editing tool, e.g. SQLite.
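The CSV point can be sketched concretely. A minimal Python example (the cell values are hypothetical, chosen just for illustration): the file is technically text, but the cells are meaningless until you apply the right decoding step, exactly as with an undocumented binary format.

```python
import base64
import csv
import io

# A hypothetical CSV with unlabeled, base64-encoded cells: valid "text",
# but opaque until you know the encoding scheme used for each column.
raw = "aGVsbG8=,d29ybGQ=\n"

for row in csv.reader(io.StringIO(raw)):
    # Without this decode step, the cells are as unreadable as raw binary.
    decoded = [base64.b64decode(cell).decode("utf-8") for cell in row]
    print(decoded)  # → ['hello', 'world']
```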

Maybe a lack of fundamental technical skills is another culprit, but binary files really aren't that scary.

> Text is just bytes, and bytes are just text. I assume this is talking about human readable ASCII specifically.

Text is human readable writing (not necessarily ASCII). It is most certainly not just any old bytes the way you are saying.

  • I agree, but binary is exactly the same. You use a different tool to view it, and maybe you don't have that tool, and that's the problem. But it's a matter of having a way to interpret the data: trivially, base64-encoding readable text gives you text, and if you can't decode it, it's as meaningless as binary you can't decode.

    It makes more sense to consider readability or comprehensibility of data in an output format; text makes sense for many kinds of data, but given a graph, I'd rather view it as a graph than as a readable text version.

    And if you have a way to losslessly transform data between an efficient binary form, readable text, or some kind of image (or other format), that's the best of all.

    • And it's funny to think about how many different incompatible text standards there were for the first 30ish years of computers. Each vendor had their own encoding, and it took until UTF-8 to even agree on text (let alone the legacy of UTF-16). If it took that long to agree on text, I have a bad feeling it'll take even longer to agree on anything else.

      I suppose open standards have slowly been winning with Opus and AV1, but there are still so many forms of interaction that have proprietary or custom interfaces. It seems like anything with a stable standard has to be at least 20 years old, lol.

  • And machine readable. You can parse a CSV file more or less easily, but try the same with some forgotten software-specific binary format.
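The vendor-encoding point above can be demonstrated in two lines: the same byte is different text under different legacy encodings, so the bytes alone never carried the meaning. A small Python sketch:

```python
# One byte, two legacy encodings, two different characters: before UTF-8
# won out, "text" depended entirely on which encoding you assumed.
data = bytes([0xA4])

as_latin1 = data.decode("latin-1")     # '¤' currency sign in ISO 8859-1
as_latin9 = data.decode("iso8859_15")  # '€' euro sign in ISO 8859-15
print(as_latin1, as_latin9)
```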

Text is bytes accompanied by a major constraint on which byte sequences are permitted (a useful compression into principal axes that emerged over thousands of years of language evolution), along with a natural connection to human semantics due to universal adoption of the standard (allowing correlations to be modelled).
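That constraint is easy to observe in practice: a UTF-8 decoder rejects most arbitrary byte sequences, which is what makes valid text a small, shared subset of all possible bytes. A minimal sketch:

```python
# Valid text occupies a constrained subset of byte sequences: UTF-8
# rejects bytes like 0xFF outright, so arbitrary bytes are not text.
valid = "héllo".encode("utf-8")
invalid = bytes([0xFF, 0xFE, 0xFD])

print(valid.decode("utf-8"))  # → héllo
try:
    invalid.decode("utf-8")
except UnicodeDecodeError:
    print("not a permitted byte sequence")
```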

Text is like a complexity funnel (analogous to a tokenizer) that everyone shares. Its utility is derived from its compression and its standardization.

If everyone used binary data with their own custom interpretation schema, it might work better for that narrow vertical, but it would not have the same utility for LLMs.