Comment by random3
8 hours ago
I built a Flash crawler to index all Flash content while at Adobe. It started with the Alexa top 1M, I think, and crawled from there. This was 2008-2010, so we had to do a lot of custom stuff, but we basically crawled, then ran a headless Firefox with a custom headless Flash player that dumped a ton of data, so we also analyzed every Flash file at runtime and indexed all of that.
We built a dedicated cluster in a colocation center in Bucharest to handle all of this. Had issues with max floor weights and whatnot. Then had to upgrade the RAM on the cluster. No remote hands. Every operation was a trip to a really cold place.
Used a lot of early-stage stuff like Nutch, Hadoop, HBase etc. Everything was then processed and dumped into an SQL database with a nice UI on top. It took a few weeks to set up the pipeline, then we passed it to a team of interns who built the SQL database and the UI. They learned a ton. Some are now in the Bay Area.
The tool uncovered a ton of security issues.
It was fun building it. I wonder if Adobe kept the data. It could be useful and/or good donation for the Computer History Museum.
Thanks for sharing. Stories like these, which I've read since childhood, are what got me into this. Those little adventures into remote places to work on some computers. This was my version of Indiana Jones.
But everyone's in an AWS world right now.
It looks like there's a bit of a reversal in some areas (e.g. ML), and it may make sense to have more geographically distributed (edge) compute, so maybe we'll get more diversity in the currently cloud-dominated space.
That said, it was always cool when we could control the entire stack, but once we scaled things up, we had to throw things over the fence to IT, DevOps, SRE, or whatever the name evolution of the day was, and the reality is that AWS/GCE/Azure made things easier than dealing with those teams internally.
Very interesting. What was the objective?
This was around when we were trying to get Flash to work on the first iPhone, so we had a hackathon for a week. Since I was a distributed systems "hacker", I ended up doing what was needed :) and there were lots of questions related to the sizing of Flash on web pages and whatnot. That's what started it - a simple Python script that I refined during the hackathon to get the embed parameters etc.
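Not the original script, but a minimal sketch of what that embed-parameter extraction could look like with only the standard library; the sample page, class name, and attributes here are illustrative:

```python
from html.parser import HTMLParser


class SwfEmbedParser(HTMLParser):
    """Collect attributes of <embed>/<object> tags that reference .swf files."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Flash was typically embedded via <embed src="...swf"> or <object data="...swf">
        if tag == "embed" and attrs.get("src", "").endswith(".swf"):
            self.embeds.append(attrs)
        elif tag == "object" and attrs.get("data", "").endswith(".swf"):
            self.embeds.append(attrs)


page = """
<html><body>
<embed src="game.swf" width="640" height="480" quality="high">
<object data="banner.swf" width="728" height="90"></object>
</body></html>
"""

parser = SwfEmbedParser()
parser.feed(page)
for e in parser.embeds:
    # e.g. print the URL and on-page sizing, the kind of data the hackathon questions were about
    print(e.get("src") or e.get("data"), e.get("width"), e.get("height"))
```

From attributes like these you can aggregate stats on Flash sizing and embed parameters across crawled pages.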
But once I started processing the data, it became a thing, and we formed a small cross-team group to get this going. We eventually expanded the effort in a few different directions and wanted to do Flash analytics, but ended up with the internal tool only, due to privacy concerns.
I remember using that tool internally! Personally I think I only used it to get stats of which features/APIs were popular. But I think other teams used it for QA/conformance, like finding content that occurred in the wild but wasn't covered by test cases.