← Back to context

Comment by immanuwell

9 hours ago

The normalization analogy is genuinely clever as a teaching tool, but it quietly papers over the fact that normalization is a logical design concept while columnar storage is a physical one - treating them as the same thing can mislead more than it clarifies, I think

I've always preferred to think of normalization as more about "removing redundancy" than in the frame it is normally presented. Or, to put it another way, rather than "normalizing" which has as a benefit "removing redundancy", raise the removing of redundancy up to the primary goal which has as a side benefit "normalization".

A nice thing about that point of view is that it fits with your point; redundancy is redundancy whether you look at it with a column-based view or a row-based view.

Oh man, thank you for saying this. The difference between logical and physical goes over so many people's heads. It's a little unnerving at times how much people resist it.

Definitely agree with what you said - if we treat them as the same thing that's going to mislead some folks.

Yeah I feel like papering over the physical aspect actually misses the main motivation for columnar storage in the first place, which is to more efficiently store some types of data and perform OLAP queries on it.

Theoretically I would agree, but practically I still wonder why we need different database engines for row and columnar storage if supporting different types of indices is trivial(TM) for Postgres?

  • Microsoft SQL Server can use both row and column store for tables, in various combinations such as row store for the table with a columnar index, or vice versa.

  • In theory, you don't. In practice, it's because the major SQL DBMS were architected around row-oriented storage and the technical effort to implement hybrid storage is large.

    There are columnar storage engine extensions for many of the popular databases, though.

    • Interesting: I transferred the idea of a matrix being stored row or column wise to the database world and assumed this was a more physical than theoretical feature (not a native speaker here)?

      Looking forward to check out `pg_duckdb`, yes.

    • there are hybrid engines too, so-called HTAP (as opposed to OLAP or OLTP) notable efforts are: SingleStore (commercial) and OceanBase (foss)

Fair, but one of the big benefits of normalization was the benefit on storage and memory back in the day which was tiny comparatively.

There's always a reason for a dev to ship something shitty but when you show you can use 80% less storage for the same operation you can make the accountants your lever.

  • The purpose of normalization is not to save storage. In fact it might often require more storage, since it involves introducing a foreign-key column. It really depends on the data in question whether it saves storage or require more.

    • Fair, I said one of the big benefits, not the purpose - in some cases it can require more storage (but that storage is often more amenable to compression) -but generally deduplicating your data doesn't increase your storage needs.

    • That forign key column is saving duplicating multiple columns

      But I don’t think that’s the top 5 reasons of normalization

  • Nonsense. See Codd’s first paper.

    1NF removes repeating groups, putting for example data for each month in its own row, not an array of 12 months in 1 row.

    Storage efficiency was never the point. IMS had that locked down. Succinctness of expression and accuracy of results was the point. And is: normalization prevents anomalous results.

    • Normalizing repeating groups doesn't offer significant savings when they are completely populated (e.g. each entity has the full 12 monthly values per year), but other types of normalization do. For example dependent data are actually redundant.

    • I think parent was saying it’s a benefit, not the original purpose. If I store a FK to a table containing my company’s corporate address, that is a tremendous savings in storage (and memory pressure), and it also eliminates update anomalies.

      1 reply →