Comment by umairnadeem123
I think the key difference is making that metadata first-class and queryable across the whole system (lineage, stats, access patterns), not just information_schema / catalog tables. Most RDBMSs expose schema metadata, but not things like which queries produced which rows, freshness, or cost/latency signals, unless you bolt it on with tracing. Curious whether floe is treating metadata as data (versioned, joinable) or as observability sidecars?
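For concreteness, here's a rough sketch of what "metadata as data" could mean; the table and column names are hypothetical, not anything floe actually exposes:

    -- Hypothetical: if per-query operator stats were stored as ordinary,
    -- versioned rows, you could join them against catalog metadata.
    SELECT t.table_name,
           MAX(q.executed_at)  AS last_read,        -- freshness signal
           SUM(q.row_count)    AS total_rows_read,  -- access pattern
           AVG(q.latency_ms)   AS avg_latency_ms    -- cost signal
    FROM   query_operator_log q                     -- hypothetical log table
    JOIN   information_schema.tables t
           ON q.table_name = t.table_name
    GROUP  BY t.table_name;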
> which queries produced which rows
I doubt many real-world applications could tolerate the data volume and performance degradation this implies. If you need this (and I can't think why you would), then I think writing your own logging code is the answer, rather than lumbering everyone else with it.
To clarify a point: we store the row count of each operator in the query, not the actual rows (that would indeed be madness!). We do have tracing controls that let you capture actual rows for very specific diagnostics, but that stream is massive and you need to opt in for it.
With that clarified, the logging is not as large as you might think (see my other response).
Think of web servers: they routinely store much larger log streams than this, with metadata about each hit. You rely on that stream for various forms of web analytics, yet you would not dream of rolling your own, nor do you worry about the small overhead you are already paying.
Example:

    SELECT x, y FROM foo JOIN bar USING (k)

In a query like this, you will have these operators:

SCAN of foo
SCAN of bar
JOIN (on k between foo and bar)

That's three operators, and we log the row count of each, which in turn lets you answer questions about your data model and how it is being used. Hope that clarifies the point.
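As an illustration (the log schema here is hypothetical, not floe's actual one), per-operator row counts let you ask which operators dominate a given query shape:

    -- Hypothetical operator log: one row per operator per query execution.
    -- The query above would contribute three rows: two scans and one join.
    SELECT operator_type,            -- e.g. 'SCAN' or 'JOIN'
           target,                   -- e.g. 'foo', 'bar', or join key 'k'
           AVG(row_count) AS avg_rows
    FROM   query_operator_log
    WHERE  query_text LIKE '%foo JOIN bar%'
    GROUP  BY operator_type, target
    ORDER  BY avg_rows DESC;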
If they did it efficiently, then they might have a game changer.
Well, you can do it as efficiently as you like, but if the data is to be historically accurate on a writable table, then recording it is likely to require a copy of foo and bar as they were at the time of the query, or doing amazing things with the transaction log. I agree this would be a programming tour de force, but I question its utility.
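For what it's worth, the "amazing things with the transaction log" flavor already exists in standard form as system-versioned tables (SQL:2011; the syntax below is the MariaDB dialect). This is a sketch of that general idea, not a claim about how floe does it:

    -- System-versioned table: the engine keeps row history itself,
    -- so no application-level copy of the table is needed.
    CREATE TABLE foo (
        k INT PRIMARY KEY,
        x INT,
        sys_start TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
        sys_end   TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
        PERIOD FOR SYSTEM_TIME (sys_start, sys_end)
    ) WITH SYSTEM VERSIONING;

    -- Reconstruct exactly what a past query saw, without a table copy:
    SELECT x FROM foo
    FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00';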
That's quite expensive. Most systems that need this sort of data will instead implement some form of audit log or audit table. Which is still quite expensive.
At the record level, I've seldom seen more than an add timestamp, an add user id, a last-change timestamp, and a last-change user id. Even then, it covers any change to the whole row, not every field. It's still relatively expensive.
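The record-level bookkeeping described above typically looks like this (table and column names are illustrative):

    -- Typical row-level audit columns: they record who touched the row
    -- and when, for the whole row only; only the latest change survives.
    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY,
        amount      NUMERIC(12, 2) NOT NULL,
        created_at  TIMESTAMP NOT NULL,
        created_by  BIGINT    NOT NULL,
        updated_at  TIMESTAMP,   -- overwritten on every UPDATE
        updated_by  BIGINT       -- so no per-field history is kept
    );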
> Which is still quite expensive.
OTOH, if they managed to do that in an efficient way, they'd have something really interesting.
Writing to disk is never free.