Comment by semi-extrinsic

9 years ago

Is there anything in your Hadoop that's not "business evidence", financials, user acquisition etc?

My point is that there are many, many business decisions driven by analysing non-financial big data sets that physically cannot be made with data crunched out in five hours. These may even require physical testing or new data collection to validate your analysis.

Like I mentioned, anyone doing proper Engineering (as in, professional liability) will have the same level of confidence in a number coming out of your Hadoop system as they would in a number their colleague Charlie calculated on a napkin at the bar after two beers. Same goes for people in the pharma/biomolecular/chemical industries, oil and gas, mining etc etc.

What are you talking about?

I personally know people working in mining, oil and gas, and automotive engineering (which you mentioned previously), all of whom rely on Hadoop. I'm sure I could find you some in the other fields too.

Do you seriously think Hadoop isn't used outside web companies or something?

Teradata sells Hadoop now, because people are migrating data warehouses off their older systems. This isn't web stuff; it's everything the business owns.

One of the developments we're after is a radical improvement in data quality and in standards of belief (provenance, verification, completeness).

A huge malady that has sometimes afflicted business is decisions made on the basis of spreadsheets of data from unknown sources, contradicted left, right and centre, and full of holes.

A single infrastructure helps us do this because we can establish KPIs on the data as it arrives and monitor them centrally (rather than relying on each unit to provide summaries or delayed updates). We know when data has gone missing and have often been able to do something about it. In the past it was simply gone, and by the time anyone noticed there was no chance of recovery.
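To make that concrete, here is a minimal sketch of the kind of completeness KPI I mean, written against PySpark. The path and the column names (event_time, source_system) are hypothetical stand-ins for whatever the actual feed carries; the point is only that a centrally landed feed can be checked for "data went missing" in a few lines.

```python
# Sketch of a completeness KPI on a centrally landed feed.
# Assumes PySpark; path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feed-completeness-kpi").getOrCreate()

feed = spark.read.parquet("/datalake/raw/sensor_feed/")  # hypothetical path

# Count records per source system per hour; a sudden drop to zero, or far
# below the usual volume, usually means an upstream system stopped sending.
hourly = (
    feed
    .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
    .groupBy("source_system", "hour")
    .agg(F.count("*").alias("rows"))
)

# Flag hours whose volume is under half that source's median hourly volume:
# a crude but useful "data has gone missing" alarm.
medians = hourly.groupBy("source_system").agg(
    F.expr("percentile_approx(rows, 0.5)").alias("median_rows")
)
suspect = (
    hourly.join(medians, "source_system")
          .filter(F.col("rows") < 0.5 * F.col("median_rows"))
)
suspect.orderBy("source_system", "hour").show(truncate=False)
```

In practice a check like this runs on a schedule and pages someone while the gap is still recoverable, which is exactly the difference from the old world where the loss surfaced weeks later in a summary.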

Additionally, we are able to cross-reference data sources and run our own sanity checks. We have found several huge issues this way: systems reporting garbage, systems introducing systematic errors.
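A sketch of what that cross-referencing looks like, again in PySpark: once two systems' outputs land in the same place, comparing them is just a join. The table names, keys (site_id, day, volume) and the 1% tolerance here are hypothetical, chosen only to illustrate the shape of the check.

```python
# Sketch of a cross-source sanity check between two systems reporting the
# same quantity. Assumes PySpark; names and tolerance are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cross-source-sanity-check").getOrCreate()

billing = spark.read.parquet("/datalake/raw/billing/daily_volumes/")    # hypothetical
metering = spark.read.parquet("/datalake/raw/metering/daily_volumes/")  # hypothetical

compared = (
    billing.alias("b")
    .join(metering.alias("m"), ["site_id", "day"], "full_outer")
    .withColumn(
        "status",
        F.when(F.col("b.volume").isNull(), "missing_in_billing")
         .when(F.col("m.volume").isNull(), "missing_in_metering")
         .when(F.abs(F.col("b.volume") - F.col("m.volume"))
                 > 0.01 * F.col("m.volume"), "disagree_by_more_than_1pct")
         .otherwise("ok"),
    )
)

# Anything that isn't "ok" is a candidate for the garbage reporting or
# systematic error described above.
compared.filter(F.col("status") != "ok").show(truncate=False)
```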

I totally agree: if you need to take new readings, then you have to wait for them to come in before making a decision. That is true no matter what data infrastructure you are using.

On the other hand, there is no reason to view data coming out of Hadoop as any less trustworthy than data coming from any other system, apart from the assertion that Hadoop system X is not being well run, and that is more a diagnosis of something that needs fixing than anything else, I think.

There are several reasons (outlined above) to believe that a well-run data lake can produce high-quality data. If an engineer ignored, for example, a calculation showing that a bridge was going to fail, simply because the data that hydrated it came out of my system, and instead waited a couple of days for data to arrive from the stress analysis, metallurgy and traffic analysis groups, would they be acting professionally?

Having said all that, I do believe there are issues with running Hadoop data lakes that are not well addressed and that stand in the way of delivering value in many domains. Data audit, the ethical challenges of recombination and inference, and the security challenges created by super-empowered analysts all need to be sorted out. Additionally, we are only scratching the surface of processes and approaches for managing data quality and detecting data issues.