Comment by eggy
9 years ago
Date on article: Sat 25 January 2014
I am not a Big Data expert, but does that change any of the comments below with regard to large datasets and available memory?
I use J and Jd for fun with great speed on my meager datasets, but others have used it on billion-row queries [1]. Along with q/kdb+, it was faster than Spark/Shark last I checked; however, I see Spark has made some advances recently that I have not looked into.
J is interpreted and can be run from the console, from a Qt interface/IDE, or in a browser with JHS.
There isn't exactly a direct relationship between the size of the data set and the amount of memory required to process it. It depends on the specific reporting you are doing.
In the case of this article, the output is just 4 numbers, so processing 10 items takes the same amount of memory as processing 10 billion items.
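A minimal sketch of that point in Python (not the article's actual pipeline; the PGN-style "Result" lines on stdin are an assumption): four counters over a stream, so memory stays constant no matter how many lines go by.

    # Constant-memory aggregation: four counters, regardless of input size.
    # Assumes PGN-style result lines like [Result "1-0"] arriving on stdin.
    import sys

    games = white = black = draw = 0
    for line in sys.stdin:
        if "Result" not in line:
            continue
        games += 1
        if "1-0" in line:
            white += 1
        elif "0-1" in line:
            black += 1
        elif "1/2-1/2" in line:
            draw += 1

    print(games, white, black, draw)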
If the data set in this case was 50TB instead of a few GB, it would benefit from running the processing pipeline across many machines to increase the IO performance. You could still process everything on a single machine, it would just take longer.
Some other examples of large data sets + reports that don't require a large amount of memory to process: reports that require no grouping (like this chess example), or that group things into buckets of a known, fixed size (e.g. ports in the range 1-65535). Those are easy to process on a single machine with simple data structures, as in the sketch below.
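A sketch of the fixed-bucket case, assuming a hypothetical log format where the port is the second whitespace-separated field: the counter table is sized by the number of possible ports, not by the number of rows.

    # Grouping into buckets with a known, fixed size: one counter per
    # possible port, so memory is bounded by 65536 slots no matter how
    # many log lines are read. The field layout here is hypothetical.
    import sys

    counts = [0] * 65536  # fixed-size table, independent of input size
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            port = int(fields[1])
        except ValueError:
            continue
        if 1 <= port <= 65535:
            counts[port] += 1

    for port, n in enumerate(counts):
        if n:
            print(port, n)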
Now, as soon as you start reporting over more dimensions, things become harder to process on a single machine, or at least harder to process using simple data structures.
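And a sketch of why extra dimensions hurt, with made-up field names: the table is keyed by each distinct combination of dimensions, so it grows with the data's cardinality rather than staying at a fixed size.

    # Multi-dimensional grouping: one counter per distinct combination of
    # dimensions, so memory grows with cardinality (unbounded in the worst
    # case). Field names and positions are hypothetical.
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 3:
            continue
        src_ip, dst_port, user_agent = fields[0], fields[1], fields[2]
        counts[(src_ip, dst_port, user_agent)] += 1

    for key, n in counts.most_common(20):
        print(key, n)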
I kinda forget what point I was trying to make.. I guess.. Big data != Big report.
I generated a report the other day from a few TB of log data, but the report was basically