Comment by mbb70
7 hours ago
The bigness of your data has always depended on what you are doing with it.
Consider the following table of medical surgeries: date, physician_name, surgery_name, success.
"What are the top 10 most common surgeries?" - easy in bash
"Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash
"Which surgeries are most affected by physician experience?" - very hard in bash, requires calculating for every surgery how many times that physician had performed that surgery on that day, then compare low and high experience outcomes.
A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity. A 50GB dataset might be 'bigger' than a 2TB one if you are asking tough questions.
It's easier for a business to say "we use Spark for data processing" than "we build bespoke processing engines on a case-by-case basis".
50GB and 2TB are both sizes that SQLite supports and could handle. You could probably solve all of the problems you mentioned with simple tools on a single server, in the language of your choice.
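For example, the "experience" computation above fits in a single SQLite window-function query (window functions shipped in SQLite 3.25). The table name, the 0/1 success encoding, and the cutoff of 10 prior surgeries are assumptions for the sketch:

    # Sketch: push the prior-experience count into SQLite itself.
    # Assumes a table surgeries(date, physician_name, surgery_name, success)
    # with success stored as 0/1; the threshold of 10 is arbitrary.
    import sqlite3

    conn = sqlite3.connect("surgeries.db")
    query = """
    SELECT surgery_name,
           AVG(CASE WHEN prior_count >= 10 THEN success END)
             - AVG(CASE WHEN prior_count < 10 THEN success END) AS experience_gap
    FROM (
        SELECT surgery_name,
               success,
               COUNT(*) OVER (
                   PARTITION BY physician_name, surgery_name
                   ORDER BY date
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
               ) AS prior_count
        FROM surgeries
    ) AS s
    GROUP BY surgery_name
    ORDER BY experience_gap DESC
    LIMIT 10;
    """
    for row in conn.execute(query):
        print(row)

No Spark, no bespoke engine, just one query on a single box.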