Comment by eggy
9 years ago
Date on article: Sat 25 January 2014
I am not a Big Data expert, but does that change any of the comments below with regard to large datasets and available memory?
I use J and Jd for fun with great speed on my meager datasets, but others have used it on billion-row queries [1]. Along with q/kdb+, it was faster than Spark/Shark last I checked; however, I see Spark has made some advances recently that I have not looked into.
J is interpreted and can be run from the console, from a Qt interface/IDE, or in a browser with JHS.
There isn't exactly a direct relationship between the size of the data set and the amount of memory required to process it. It depends on the specific reporting you are doing.
In the case of this article, the output is just 4 numbers, so processing 10 items takes the same amount of memory as processing 10 billion items.
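A minimal sketch of that point in Python (not the article's actual pipeline; the PGN-style "Result" lines on stdin are an assumption): four counters over a stream, so memory stays constant no matter how many lines go by.

    # Constant-memory aggregation: four counters, regardless of input size.
    # Assumes PGN-style result lines like [Result "1-0"] arriving on stdin.
    import sys

    games = white = black = draw = 0
    for line in sys.stdin:
        if "Result" not in line:
            continue
        games += 1
        if "1-0" in line:
            white += 1
        elif "0-1" in line:
            black += 1
        elif "1/2-1/2" in line:
            draw += 1

    print(games, white, black, draw)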
If the data set in this case was 50TB instead of a few GB, it would benefit from running the processing pipeline across many machines to increase the IO performance. You could still process everything on a single machine, it would just take longer.
Some other examples of large data sets + reports that don't require a large amount of memory to process: reports that require no grouping (like this chess example), or that group things into buckets of a known, fixed size (e.g. ports in the range 1-65535). Those are easy to process on a single machine with simple data structures, as in the sketch below.
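A sketch of the fixed-bucket case, assuming a hypothetical log format where the port is the second whitespace-separated field: the counter table is sized by the number of possible ports, not by the number of rows.

    # Grouping into buckets with a known, fixed size: one counter per
    # possible port, so memory is bounded by 65536 slots no matter how
    # many log lines are read. The field layout here is hypothetical.
    import sys

    counts = [0] * 65536  # fixed-size table, independent of input size
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            port = int(fields[1])
        except ValueError:
            continue
        if 1 <= port <= 65535:
            counts[port] += 1

    for port, n in enumerate(counts):
        if n:
            print(port, n)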
Now, as soon as you start reporting over more dimensions, things become harder to process on a single machine, or at least harder to process using simple data structures.
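And a sketch of why extra dimensions hurt, with made-up field names: the table is keyed by each distinct combination of dimensions, so it grows with the data's cardinality rather than staying at a fixed size.

    # Multi-dimensional grouping: one counter per distinct combination of
    # dimensions, so memory grows with cardinality (unbounded in the worst
    # case). Field names and positions are hypothetical.
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 3:
            continue
        src_ip, dst_port, user_agent = fields[0], fields[1], fields[2]
        counts[(src_ip, dst_port, user_agent)] += 1

    for key, n in counts.most_common(20):
        print(key, n)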
I kinda forget what point I was trying to make.. I guess.. Big data != Big report.
I generated a report the other day from a few TB of log data, but the report was basically