Comment by fifilura

1 month ago

No joins in that article?

The comments here smell of "real engineers use command line". But I am not sure they ever actually worked with analysing data more than using it as a log parser.

Yes, Hadoop is 2014.

These days you obviously don't set up a Hadoop cluster. You use a managed service from your cloud provider (BigQuery or AWS Athena, for example).

Or map your data into DuckDB or use polars if it is small.
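
To make the "joins" point concrete, here is a minimal sketch of the kind of query that is awkward with grep/awk pipelines but trivial in an embedded SQL engine. It uses Python's built-in `sqlite3` as a stand-in for DuckDB (a substitution made only so the example runs without extra installs). The `users`/`orders` tables, their columns, and the `revenue_by_country` helper are all hypothetical.

```python
import sqlite3

def revenue_by_country(users, orders):
    """Join orders to users and sum order amounts per country."""
    con = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
    con.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
    con.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO users VALUES (?, ?)", users)
    con.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = con.execute(
        """
        SELECT u.country, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN users AS u ON u.user_id = o.user_id
        GROUP BY u.country
        ORDER BY u.country
        """
    ).fetchall()
    con.close()
    return rows

# Hypothetical sample data: (user_id, country) and (user_id, amount)
users = [(1, "SE"), (2, "US"), (3, "SE")]
orders = [(1, 10.0), (2, 5.0), (3, 2.5), (1, 1.5)]
result = revenue_by_country(users, orders)
print(result)
```

The same SQL should carry over to DuckDB more or less unchanged; only the connection setup would differ.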

6 comments

christophilus 1 month ago

It depends. I’ve done plenty of data processing, including at large Fortune 10 companies. Most of the big data could be shrunk to small data if you understood the use case: pre-aggregating, filtering to smaller datasets based on known analysis patterns, etc.

Now, you could argue that that’s cheating a bit and introduces preprocessing that is as complex as running Hadoop in the first place, but I think it depends.

In my experience, though, most companies really don’t have big data, and many that do don’t really need to.

Most companies aren’t Fortune 500s.

I used to work at Elastic, and I noticed that most (not all!) of the customers who walked up to me at the conferences were there to ask about datasets that easily fit into memory on a cheap VPS.
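
As a rough illustration of the "shrink big data to small data" idea above, the sketch below filters raw events down to the ones a known analysis cares about and pre-aggregates them to one count per day. The event shape `(day, status)`, the `preaggregate` helper, and the sample values are assumptions for illustration, not anything from a real pipeline.

```python
from collections import Counter

def preaggregate(events, keep_status):
    """Filter raw (day, status) events, then count the kept ones per day."""
    counts = Counter()
    for day, status in events:
        if status in keep_status:  # filter: keep only what the analysis needs
            counts[day] += 1       # pre-aggregate: one number per day
    return dict(counts)

# Hypothetical raw log events: (day, HTTP status)
raw_events = [
    ("2024-01-01", 200),
    ("2024-01-01", 500),
    ("2024-01-01", 500),
    ("2024-01-02", 200),
    ("2024-01-02", 503),
]
errors_per_day = preaggregate(raw_events, keep_status={500, 503})
print(errors_per_day)
```

The output has one entry per day no matter how many raw events came in, which is why the downstream dataset can end up fitting in memory.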

fifilura 1 month ago

Let your analysts use DuckDB or pandas/polars then instead of quirky command line tools.

ziml77 1 month ago

> But I am not sure they ever actually worked with analysing data more than using it as a log parser.

It really feels that way. Real data analysis involves a lot more than just grepping logs. And the reason to be wary of starting out unprepared for that kind of analysis is that migrating to a better solution later is a nightmare.

noo_u 1 month ago

In many ways HN is Reddit in denial at this point :) Comments and upvotes that are based mostly on vibes, with depth and discussion usually happening somewhere towards the middle of the comment tree.

dapperdrake 1 month ago

Where else would you JOIN in?