
Comment by DenisM

5 months ago

tldr

The proliferation of opensource file formats (i.e., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today.

Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.

I know sandboxes for wasm are very advanced, but decades of file formats with built-in scripting are flashing the danger lights at me.

Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode data on the platform it's running on.

And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.

  • > Data files are useless without a program that knows how to utilize the data.

    As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.

    Consider a program that reads CSV files and processes the data in them. The first column contains a timestamp, the second column contains a filename, and the third column contains a size.

    As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.

    Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.

    • I think the counterargument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking about, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.

      How many different storage format implementations will there realistically be?

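The row-extraction interface described in the thread can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the F3 or Arrow API: the consumer only sees typed rows, so a CSV backend could be swapped for any other encoding without changing the consuming code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator
import csv
import io


@dataclass
class Row:
    # The schema from the CSV example: timestamp, filename, size.
    timestamp: datetime
    filename: str
    size: int


class RowSource(ABC):
    """Anything that can yield typed rows, regardless of on-disk encoding."""

    @abstractmethod
    def rows(self) -> Iterator[Row]: ...


class CsvRowSource(RowSource):
    """One possible backend; a 7zip-compressed JSON backend would expose
    the exact same rows() interface."""

    def __init__(self, text: str):
        self.text = text

    def rows(self) -> Iterator[Row]:
        for ts, name, size in csv.reader(io.StringIO(self.text)):
            yield Row(datetime.fromisoformat(ts), name, int(size))


def total_size(source: RowSource) -> int:
    # The consumer touches only the interface, never the file format.
    return sum(row.size for row in source.rows())


data = "2024-01-01T00:00:00,a.txt,100\n2024-01-02T00:00:00,b.txt,250\n"
total = total_size(CsvRowSource(data))
```

The design choice under debate is where the decoder lives: in this sketch it ships with the program, whereas F3 would embed a Wasm decoder in the file itself as a fallback.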