
Comment by zjaffee

17 hours ago

It's not only about how much data you have, but also about the sorts of things you are running on it. Joins and group-bys scale much faster than any aggregation. Additionally, you get a unified platform where large teams can share code in a structured way across all data-processing jobs. In that sense it's similar to how companies use k8s as a way to manage the human side of software development.
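To make the workload distinction concrete, here is a minimal PySpark sketch (the dataset paths and column names are hypothetical) of a shuffle-heavy join-plus-group-by job next to a plain aggregation over the same input:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("workload-shape-demo").getOrCreate()

# Hypothetical inputs: events(user_id, value), users(user_id, country).
events = spark.read.parquet("s3://bucket/events/")
users = spark.read.parquet("s3://bucket/users/")

# Join + group-by: both sides are shuffled by user_id, then again by country.
per_country = (
    events.join(users, on="user_id")
          .groupBy("country")
          .agg(F.sum("value").alias("total_value"))
)

# Plain aggregation: a single scan with partial aggregates and no join shuffle.
overall = events.agg(F.sum("value").alias("total_value"))

per_country.show()
overall.show()
```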

I can, however, say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key areas where we saw rapid improvement was simply using fewer machines with vertically scaled hardware, which almost always outperformed any sort of distributed setup (albeit not always from a price-performance perspective).

The real value often comes from the ability to do retries, to leverage leftover, underutilized hardware (e.g. spot instances, or your own data center at times when load is lower), to handle hardware failures, etc., all while the full suite of tools described above keeps working.
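For concreteness, the sketch below shows the kind of Spark settings that lean on that resilience. The property names are standard Spark configuration, but the values and application name are purely illustrative.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; dynamic allocation also needs the external
# shuffle service (or shuffle tracking) enabled on the cluster.
conf = (
    SparkConf()
    .set("spark.task.maxFailures", "8")              # retry tasks lost to spot reclaims or bad nodes
    .set("spark.speculation", "true")                # re-launch unusually slow tasks elsewhere
    .set("spark.dynamicAllocation.enabled", "true")  # grow into leftover capacity, shrink when idle
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "50")
)

spark = SparkSession.builder.config(conf=conf).appName("resilient-job").getOrCreate()
```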

Other way around. Aggregation is usually faster than a join.

  • Disagree, though in practice it depends on the query, the cardinality of the various columns across tables, the indices, and the RDBMS implementation (so, everything).

    A simple equijoin on high-cardinality, indexed columns will usually be extremely fast. The same join in a 1:M relationship might be fast, or it might result in a massive fanout. In the latter case, if your RDBMS uses a clustering index and you've designed your schemata to exploit this fact (e.g. a table called UserPurchase with a PK of (user_id, purchase_id)), the join can still be quite fast (a minimal SQLite sketch of this layout follows the thread below).

    Aggregations often imply large amounts of data being retrieved, though this is not necessarily true.

    • That level of database optimization is rare in practice. As soon as a non-database person gets decision-making authority, there goes your data model and disk layout.

      And many important datasets never make it into any kind of database like that. Very few people provide "index columns" in their CSV files. Or they use long variable length strings as their primary key.

      OP pertains to that kind of data. Some stuff in text files.


    • Unconvinced. Any join needs some kind of seek on the secondary relation's index, or a bunch of state if you're stream-joining, building a temporary index of size O(n) until the end of the batch. On the other hand, summing N numbers needs O(1) memory, and if your data is column-shaped it's like one CPU instruction to process 8 rows. In a “big data” context there's usually no traditional B-tree index to join against either. For jobs that process every row in the input set, Mr Join is horrible for perf, to the point that people end up with a dedicated join job/materialized view so downstream jobs don't have to redo the work.

    • An aggregation is less work than a join. Under ideal conditions you are segmenting the data in basically the same way for a join as for an aggregation; think of an aggregation as an inner join against a table of buckets, except you update a single value instead of keeping a number of copies around. In practice this holds, with the aggregation being linearly faster than a join over the same data. That delta is the extra work the join needs to do to keep around a list of rows rather than a single value that gets updated (and stays in cache) repeatedly. Depending on the data, that delta might be quite small. But short of a very obtuse aggregation function (kurtosis, perhaps), the aggregation will be faster: it's updating a single value versus appending to a list, with the extra memory overhead that introduces (a small sketch of this contrast follows the thread below).
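
To make the clustered-PK point earlier in the thread concrete, here is a small sketch using SQLite's WITHOUT ROWID tables, which store rows in primary-key order; the UserPurchase schema is the hypothetical example from that comment, and the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (user_id INTEGER PRIMARY KEY, name TEXT);

    -- WITHOUT ROWID makes (user_id, purchase_id) the clustering key,
    -- so each user's purchases are stored contiguously.
    CREATE TABLE UserPurchase (
        user_id     INTEGER,
        purchase_id INTEGER,
        amount      REAL,
        PRIMARY KEY (user_id, purchase_id)
    ) WITHOUT ROWID;

    INSERT INTO User VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO UserPurchase VALUES (1, 10, 9.99), (1, 11, 4.50), (2, 12, 20.00);
""")

# The 1:M equijoin becomes, per user, a range scan over a primary-key prefix.
for row in conn.execute("""
    SELECT u.name, p.purchase_id, p.amount
    FROM User u JOIN UserPurchase p ON p.user_id = u.user_id
"""):
    print(row)
```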
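And to illustrate the memory argument in the last two comments, here is a small, engine-agnostic Python sketch (the data is made up) contrasting a hash aggregation, which keeps one running value per key, with a hash join, which buffers every row on the build side and fans out on the probe side.

```python
from collections import defaultdict

rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3)]
purchases = [("alice", "p1"), ("alice", "p2"), ("bob", "p3")]

# Hash aggregation: one running value per distinct key, updated in place.
totals = defaultdict(int)
for user, amount in rows:
    totals[user] += amount            # O(#distinct keys) state

# Hash join: the build side must buffer every row for each key.
build = defaultdict(list)
for user, purchase_id in purchases:
    build[user].append(purchase_id)   # O(n) state on the build side

joined = [
    (user, amount, purchase_id)
    for user, amount in rows          # probe side streams through
    for purchase_id in build.get(user, [])
]

print(dict(totals))  # {'alice': 17, 'bob': 5, 'carol': 3}
print(joined)        # fanout: one output row per (row, matching purchase) pair
```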