Comment by Chris_Newton

10 years ago

The tricky part is that smartness of data structures is context-sensitive.

One of the most common design errors in OO systems seems to be building systems that beautifully encapsulate a single object’s state… and then finding that the access patterns you actually need involve multiple objects, but that efficient implementations of those algorithms are impossible because the underlying data points are all isolated within separate objects, and often not stored in a cache-friendly way either.
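
To make the cache-friendliness point concrete, here is a minimal C++ sketch (the Particle/Particles names are mine, purely illustrative): the object-per-entity layout encapsulates each particle's state neatly, but a bulk pass over one field then drags whole objects through the cache, whereas regrouping the same data as a structure of arrays keeps that field contiguous.

  #include <vector>

  // Object-per-entity layout: each particle encapsulates its own state.
  struct Particle {
      double x, y, z;     // position
      double vx, vy, vz;  // velocity
      double mass;
  };

  double total_x_aos(const std::vector<Particle>& ps) {
      double sum = 0.0;
      for (const auto& p : ps) sum += p.x;  // pulls entire Particles through cache
      return sum;
  }

  // Structure-of-arrays: the same data, regrouped around the access pattern.
  struct Particles {
      std::vector<double> x, y, z;
      std::vector<double> vx, vy, vz;
      std::vector<double> mass;
  };

  double total_x_soa(const Particles& ps) {
      double sum = 0.0;
      for (double v : ps.x) sum += v;  // streams only the contiguous x array
      return sum;
  }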

Another common design problem seems to be sticking with a single representation of important data even though it’s not a good structure for all of the required access patterns. I’m surprised by how often it does make sense to invest a bit of run time converting even moderately large volumes of data into some alternative or augmented structure, if doing so then sets up a more efficient algorithm for the expensive part of whatever you need to do. Again, though, it can be difficult to employ such techniques if all your data is hidden away within generic containers of single objects, and if the primary tools you have for building your algorithms are generic algorithms over those containers plus methods on each object that operate only on its own data in isolation.
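
A minimal sketch of that invest-a-little-run-time-up-front idea (the Record and by_id names here are hypothetical): one pass builds an index keyed on the field the expensive phase actually needs, so the hot loop does O(1) lookups instead of repeated linear scans.

  #include <string>
  #include <unordered_map>
  #include <vector>

  struct Record {
      int id;
      std::string payload;
  };

  // One O(n) pass to build an augmented structure...
  // (the index holds raw pointers, so it is only valid while `records` is not resized or destroyed)
  std::unordered_map<int, const Record*> by_id(const std::vector<Record>& records) {
      std::unordered_map<int, const Record*> index;
      index.reserve(records.size());
      for (const auto& r : records) index.emplace(r.id, &r);
      return index;
  }

  // ...so each subsequent lookup is O(1) instead of another scan of the vector.
  const Record* find_record(const std::unordered_map<int, const Record*>& index, int id) {
      auto it = index.find(id);
      return it == index.end() ? nullptr : it->second;
  }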

The more programming experience I gain, the less frequently I seem to find single objects the appropriate granularity for data hiding.

Well said.

The exercise of compartmentalising data into atomic islands of objects that dutifully encapsulate it becomes difficult during reassembly, simply because we recreate the need for a declarative style of data access in an imperative (OO) world. It's ye olde object-relational impedance mismatch.

A (relational) data model is a single unit; it has to be seen this way. Creating imperative sub-structures (like encapsulating data into objects) breaks this paradigm, with serious consequences when you attempt to rejig the object-wrapped data into an on-demand style of architecture. The whole model (database?) must be seen as a single design construct, and all operations against the entire model must be sensitive to this notion, even if we access one table at a time. Yes, at specific times we may be interested in the contents of a single table, or a few tables joined together declaratively for a particular use case, but the entire data model is a single atomic structure "at rest".

When paradigmatic lines like this are drawn, I side with the world-view that getting the data model "right" first is the way to go.

Fred Brooks and Linus Torvalds speak from experience in the trenches.

This also comes up in relational databases. There might be a nice, canonical way to represent what the data really is (i.e. what it represents), but then the access patterns for how it is used mean that a different representation is better (usually somewhat de-normalized). Fortunately, relational algebra enables this (one of Codd's main motivations).
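
Transposed into in-memory C++ for the sake of a concrete sketch (the Customer, Order and OrderReport names are hypothetical): the normalized structures remain the canonical statement of what the data is, and the denormalized, read-optimised rows are derived from them, much as a relational view or query would derive them.

  #include <string>
  #include <unordered_map>
  #include <vector>

  // Canonical, normalized representation of what the data really is.
  struct Customer { int id; std::string name; };
  struct Order    { int id; int customer_id; double total; };

  // Denormalized row shaped for one read-heavy access pattern.
  struct OrderReport { int order_id; std::string customer_name; double total; };

  // Derive the denormalized form from the normalized one (a join, in effect),
  // rather than making it the primary representation.
  std::vector<OrderReport> build_report(const std::vector<Customer>& customers,
                                        const std::vector<Order>& orders) {
      std::unordered_map<int, const Customer*> by_id;
      for (const auto& c : customers) by_id.emplace(c.id, &c);

      std::vector<OrderReport> report;
      report.reserve(orders.size());
      for (const auto& o : orders) {
          auto it = by_id.find(o.customer_id);
          if (it != by_id.end())
              report.push_back({o.id, it->second->name, o.total});
      }
      return report;
  }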

A programming language is even more about data processing than a database is. But it still seems that data structures/objects represent something. I recently came up with a good way to resolve what that is:

  In a data processing (i.e. programming) language what you are
  modelling/representing is not entities in the world, but computation.
  Therefore, choose data structures that model your data processing. 

This definition allows for the messy, denormalized-like data structures you get when you optimize for performance. It also accounts for elegant, algebra-like operators that can be easily composed to model different computations (like the +, . and * of regular expressions).
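
As a minimal sketch of such composable, algebra-like operators (not a serious regex engine; the lit, seq, alt and star names are my own): each combinator builds a matcher out of other matchers, mirroring the concatenation, alternation and Kleene-star operators of the regular-expression algebra.

  #include <algorithm>
  #include <cassert>
  #include <functional>
  #include <string_view>
  #include <vector>

  // A matcher maps a start position to every position it can reach in the input.
  using Matcher = std::function<std::vector<size_t>(std::string_view, size_t)>;

  // Match a single literal character.
  Matcher lit(char c) {
      return [c](std::string_view s, size_t pos) -> std::vector<size_t> {
          if (pos < s.size() && s[pos] == c) return {pos + 1};
          return {};
      };
  }

  // Concatenation: the '.' of the algebra.
  Matcher seq(Matcher a, Matcher b) {
      return [a, b](std::string_view s, size_t pos) {
          std::vector<size_t> out;
          for (size_t mid : a(s, pos))
              for (size_t end : b(s, mid)) out.push_back(end);
          return out;
      };
  }

  // Alternation: the '+' (or '|') of the algebra.
  Matcher alt(Matcher a, Matcher b) {
      return [a, b](std::string_view s, size_t pos) {
          auto out = a(s, pos);
          auto more = b(s, pos);
          out.insert(out.end(), more.begin(), more.end());
          return out;
      };
  }

  // Kleene star: zero or more repetitions of a.
  Matcher star(Matcher a) {
      return [a](std::string_view s, size_t pos) {
          std::vector<size_t> out{pos};
          for (size_t i = 0; i < out.size(); ++i)
              for (size_t next : a(s, out[i]))
                  // skip empty matches and duplicates to guarantee termination
                  if (next > out[i] &&
                      std::find(out.begin(), out.end(), next) == out.end())
                      out.push_back(next);
          return out;
      };
  }

  int main() {
      // (a|b)c* against "bcc": some match should consume the whole input.
      Matcher m = seq(alt(lit('a'), lit('b')), star(lit('c')));
      std::string_view input = "bcc";
      auto ends = m(input, 0);
      assert(std::find(ends.begin(), ends.end(), input.size()) != ends.end());
  }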