Comment by richhickey
10 years ago
It contravenes the common and historical use of the word 'data' to imply undifferentiated bits/scribbles. It means facts/observations/measurements/information and you must at least grant it sufficient formatting and metadata to satisfy that definition. The fact that most data requires some human involvement for interpretation (e.g. pointing the right program at the right data) in no way negates its utility (we've learned a lot about the universe by recording data and analyzing it over the centuries), even though it may be insufficient for some bootstrapping system you envision.
I think what Alan was getting at is that what you see as "data" is in fact, at its basis, just signal, and only signal; a wave pattern, for example, but even calling it a "wave pattern" suggests interpretation. What I think he's trying to get across is there is a phenomenon being generated by something, but it requires something else--an interpreter--to even consider it "data" in the first place. As you said, there are multiple ways to interpret that phenomenon, but considering "data" as irreducible misses that point, because the concept of data requires an interpreter to even call it that. Its very existence as a concept from a signal presupposes an interpretation. And I think what he might have been getting at is, "Let's make that relationship explicit." Don't impose a single interpretation on signal by making "data" irreducible. Expose the interpretation by making it explicit, along with the signal, in how one might design a system that persists, processes, and transmits data.
If we can't agree on what words mean we can't communicate. This discussion is undermined by differing meanings for "data", to no purpose. You can of course instead send me a program that (better?) explains yourself, but I don't trust you enough to run it :)
The defining aspect of data is that it reflects a recording of some facts/observations of the universe at some point in time (this is what 'data' means, and meant long before programmers existed and started applying it to any random updatable bits they put on disk). A second critical aspect of data is that it doesn't and can't do anything, i.e. have effects. A third aspect is that it does not change. That static nature is essential, and what makes data a "good idea", where a "good idea" is an abstraction that correlates with reality - people record observations and those recordings (of the past) are data. Other than in this conversation apparently, if you say you have some data, I know what you mean (some recorded observations). Interpretation of those observations is completely orthogonal.
Nothing about the idea of 'data' implies a lack of formatting/labeling/use of common language to convey the facts/observations, in fact it requires it. Data is not merely a signal and that is why we have two different ideas/words. '42' is not, itself, a fact (datum). What constitutes minimal sufficiency of 'data' is a useful and interesting question. E.g. should data always incorporate time, what are the tradeoffs of labeling being in- or out-of-band, per datom or dataset, how to handle provenance etc. That has nothing to do with data as an idea and everything to do with representing data well.
But equating any such labeling with more general interpretation is a mistake. For instance, putting facts behind a dynamic interpreter (one that could answer the same question differently at different times, mix facts with opinions/derivations or have effects) certainly exceeds (and breaks) the idea of data. Which is precisely why we need the idea of data, so we can differentiate and talk about when that is and is not happening - am I dealing with facts, an immutable observation of the past ("the king is dead") or just a temporary (derived) opinion ("there may be a revolt"). Consider the difference between a calculation involving (several times) a fact (date-of-birth) vs a live-updated derivation (age). The latter can produce results that don't add up. 'date-of-birth' is data and 'age' (unless temporally-qualified, 'as-of') is not.
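To make the date-of-birth vs age distinction concrete, here is a minimal Python sketch. The names `Person`, `age_as_of` and `live_age` are hypothetical and chosen only for illustration; nothing here comes from a particular library or system discussed in the thread.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Person:
    # A recorded fact: frozen, so it cannot be changed after it is recorded.
    date_of_birth: date


def age_as_of(person: Person, as_of: date) -> int:
    """Derive age from the fact, qualified by an explicit 'as-of' date."""
    dob = person.date_of_birth
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)


def live_age(person: Person) -> int:
    """A live-updated derivation: the answer depends on when you ask."""
    return age_as_of(person, date.today())
```

A calculation that calls `age_as_of` several times with the same `as_of` always sees the same number. One that calls `live_age` several times can straddle a birthday and see two different numbers, which is the "results that don't add up" case.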
When interacting with an ambassador one may or may not get the facts, and may get different answers at different times. And one must always fear that some question you ask will start a war. Science couldn't have happened if consuming and reasoning about data had that irreproducibility and risk.
'Data' is not a universal idea, i.e. a single primordial idea that encompasses all things. But the idea that dynamic objects/ambassadors (whatever their other utility) can substitute for facts (data) is a bad idea (does not correspond to reality). Facts are things that have happened, and things that have happened have happened (are not opinions), cannot change and cannot introduce new effects. Data/facts are not in any way dynamic (they are accreting, that's all). Sometimes we want the facts, and other times we want someone to discuss them with. That's why there is more than one good idea.
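One possible reading of "accreting", sketched in Python under my own assumptions: facts are only ever appended, never updated or removed, and any question is asked relative to a point in time. `Fact`, `FactLog`, `record` and `as_of` are hypothetical names for illustration, not the API of any real database.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Fact(NamedTuple):
    # An immutable record of something that was observed.
    entity: str
    attribute: str
    value: object
    recorded_at: datetime


class FactLog:
    """Append-only store: facts accrete, nothing is updated or deleted."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def record(self, fact: Fact) -> None:
        # The only write operation is adding a new fact.
        self._facts.append(fact)

    def as_of(self, when: datetime) -> Tuple[Fact, ...]:
        # Everything that had been recorded up to and including `when`.
        return tuple(f for f in self._facts if f.recorded_at <= when)
```

Because nothing is overwritten, asking `as_of` the same moment gives the same answer no matter how many facts are recorded later, and reading has no effects.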
Data is as bad an idea as numbers, facts and record keeping. These are all great ideas that can be realized more or less well. I would certainly agree that data (the maintenance of facts) has been bungled badly in programming thus far, and lay no small part of the blame on object- and place-oriented programming.
Why do you limit the meaning of 'data' to facts and/or observations?
I think in the Science of Process that is being related as a desirable goal, everything would necessarily be a dynamic object (or perhaps something similar to this but fuzzier or more relational or different in some other way, but definitely dynamic) because data by itself is static while the world itself is not.
Your selection of data is arbitrary.
Not only is your perception based on an interpreter, but how can you be sure that you were even given all of the relevant bits? Or even what the bits really mean?
Of course the selection of data is arbitrary -- but Rich gives us a definition, which he makes abundantly clear and uses consistently. All definitions can be considered arbitrary. He's not making any claim that we have all the relevant bits of data or that we can be sure what the data really means or represents.
But we can expound on this problem in general. In any experiment where we gather data, how can we be sure we have collected a sufficient quantity to justify conclusions (and even if we are using statistical methods that our underlying assumptions are indeed consistent with reality) and that we have accrued all the necessary components? What you're really getting at is an __epistemological__ problem.
My school of thought is that the only way to proceed is to do our best with the data we have. We'll make mistakes, but that's better than the alternative (not extrapolating from data at all).
I hope we can do our best, I'm just not sure there is really a satisfactory way to define/measure/judge that we have actually done so....