Comment by llimllib
9 years ago
Here's a Google BigQuery query that lists the most common PDFs referenced in the GitHub sample dataset, along with the top 100 results: https://gist.github.com/llimllib/3f1877eab06208958060f491cf3...
It's possible to run this query against the full GitHub dataset, but I couldn't figure out how to pay for it, so if somebody wants to do that, it would be excellent.
Just a note: it's bizarre that I cannot find any way to determine a) how much it would cost to run or b) how I would pay for it if I wanted to run it.
I changed it to query [bigquery-public-data:github_repos.contents] instead, and before I execute the query it says "Valid: This query will process 1.68 TB when run."
Queries are $5/TB [0].
So a bit less than 10 bucks. :)
Edit: brb, that's totally worth it.
[0]: https://cloud.google.com/bigquery/pricing
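For anyone who wants to check the arithmetic behind that estimate, here is a minimal sketch. It assumes the flat $5/TB on-demand rate quoted in this thread (which may not match current pricing) and ignores any free tier or trial credit; `query_cost_usd` is a made-up helper name, not part of any BigQuery API.

```python
def query_cost_usd(terabytes_processed: float, usd_per_tb: float = 5.0) -> float:
    """Estimate the on-demand cost of a query from the data volume it scans.

    usd_per_tb defaults to the $5/TB figure quoted in the thread; this is an
    assumption and should be checked against the pricing page.
    """
    return terabytes_processed * usd_per_tb


# 1.68 TB is the size reported by the query validator in the comment above.
cost = query_cost_usd(1.68)
print(f"Estimated cost: ${cost:.2f}")
```

Plugging in a different size or rate is just a matter of changing the two arguments.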
https://docs.google.com/spreadsheets/d/1zjJLsCS5d3Mv22D5k0Ap...
Free trial money is great. :D
Weird! Mine just says "Quota exceeded..." without ever saying how big the query will be. Where do I find that info?
(http://i.imgur.com/3EkPYIY.png is what I see)