Comment by mcv
11 hours ago
Is everything becoming columnar? Parquet stores data per column instead of per row because it improves compression. I get that. Arrow apparently is columnar, and now DuckDB also gets its efficiency by treating data as columns instead of rows?
I still need to wrap my head around how that works, but it's a fascinating development.
It depends on your task. In analytics, where you need to scan lots of data points within a few columns, columnar storage is very much the best fit. But for transactional workloads, where you have to deal with specific entities, row-based storage is more advantageous. There are hybrid systems that try to be both at the same time, but in my experience they end up not doing either very well.
Some day we'll get CREATE TABLE ... ( ... STORAGE ORDER COLUMN MAJOR) to have our transactional cake on the tables that need it and eat our analytics cake on the tables that need that.
But until then, separate tools for separate purposes isn't a bad place to be when those tools are both fantastic.
Those hybrid systems often used to be referred to as HTAP. And yeah, most data engineering is moving things from OLTP to OLAP forms, and OLAP pretty much always benefits from columnar compression for aggregations and rollups.
Compression is a side effect but not really the goal. To simplify, analytical queries often filter on a specific column's values, and if those values are laid out contiguously, disk-level reads are much faster than with rows, which involve read-skip-read-etc. In transactional systems data is typically written as whole rows though, so writes are likely slower in a columnar system. As a general rule, read-heavy workflows with known access patterns are going to benefit from a columnar layout.
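To make the read-skip-read point concrete, here is a rough sketch rather than how any real engine does it. The table (id, price, qty), the packed 16-byte record format, and the helper names scan_rows and scan_column are all made up for illustration. It counts how many bytes a filter on price has to touch in each layout.

```python
import struct

# Hypothetical table: (id: int32, price: float64, qty: int32), 4 rows.
rows = [(1, 9.5, 3), (2, 20.0, 1), (3, 9.5, 7), (4, 5.25, 2)]

# Packed little-endian record with no padding: 4 + 8 + 4 = 16 bytes.
ROW = struct.Struct("<idi")

# Row-major layout: complete records stored one after another.
row_buf = b"".join(ROW.pack(*r) for r in rows)

# Column-major layout: the price column stored contiguously on its own.
_ids, prices, _qtys = zip(*rows)
price_buf = struct.pack(f"<{len(rows)}d", *prices)


def scan_rows(buf, target):
    """Filter on price in the row layout; every full record is read."""
    hits, touched = [], 0
    for offset in range(0, len(buf), ROW.size):
        _id, price, _qty = ROW.unpack_from(buf, offset)
        touched += ROW.size  # id and qty are read too, then thrown away
        if price == target:
            hits.append(offset // ROW.size)
    return hits, touched


def scan_column(buf, target):
    """Filter on price in the column layout; only price bytes are read."""
    values = struct.unpack(f"<{len(buf) // 8}d", buf)
    hits = [i for i, v in enumerate(values) if v == target]
    return hits, len(buf)


row_hits, row_bytes = scan_rows(row_buf, 9.5)
col_hits, col_bytes = scan_column(price_buf, 9.5)
print(row_hits, row_bytes)
print(col_hits, col_bytes)
```

Both scans find the same rows, but the row layout drags the id and qty bytes along for every record. With wider tables the gap grows, since the column scan still reads only the one column.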
BTW, columnar is very similar to struct of arrays (SOA) and some of the reasons it works well overlap with SOA.
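For anyone who hasn't run into SOA before, a minimal sketch of the two shapes. The Trade and Trades types and their fields are invented for the example, and pure Python won't show the cache benefits, only the layout difference.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Trade:
    """One record: the array-of-structs (row-like) shape."""
    price: float
    qty: int


@dataclass
class Trades:
    """One typed array per field: the struct-of-arrays (column-like) shape."""
    price: array
    qty: array


# Array of structs: each element carries all of its fields together.
aos = [Trade(9.5, 3), Trade(20.0, 1), Trade(5.25, 2)]

# Struct of arrays: each field lives in its own contiguous typed buffer.
soa = Trades(
    price=array("d", [t.price for t in aos]),
    qty=array("q", [t.qty for t in aos]),
)

# Summing one field walks every record in AoS, but a single buffer in SoA.
total_aos = sum(t.qty for t in aos)
total_soa = sum(soa.qty)
print(total_aos, total_soa)
```

The SoA version is basically a tiny column store: an aggregate over qty never looks at price at all.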
Those three things you mentioned kind of live in the same niche - offline data storage and querying. In that world yes everything has become columnar since it’s just better. Row-oriented is still the solution for online streaming use cases.