“90% of all data was generated in the last two years.”
–SINTEF, 2013
What could that possibly even mean?
There's a lot of hype going around about “big data”, “analytics”, and “data science”.
A lot of it “isn't even wrong”.
Here's what I do see:
We have a convergence of computing power, data availability and digitization, and algorithmic understanding...
...which lets us address problems statistically and empirically which were formerly “physics problems”, or “economics problems”, and so on...
...which is good for our industry because...
Unconventional resource exploitation and a challenging price environment demand that we solve ever more challenging problems with fewer resources.
So, what is “data science”?
For that matter, what is “analytics”? What is “AI”? What is “machine learning”? What is, well, plain old “statistics”?
Sorry to disappoint...
These are all vaguely defined, overlapping disciplines.
But here's some flavor for the connotations of each:
The use of empirical, data-driven techniques to understand patterns, model relationships, and make better decisions.
Data science is multidisciplinary: it subsumes elements from all of the above.
Data science is complementary to domain knowledge: it requires context for responsible use.
Data science is computational: it demands proficient application of techniques from computer science for organizing and manipulating data.
Data science is driven by, guided with, and assessed in context of our engineering and geologic domain knowledge.
The fastest way to lose internal (or external) customers' trust is to miss the obvious here.
Names have been omitted to protect the guilty:
A reservoir engineer friend was approached by their employer's internal data science team...
... with a proposal to forecast well production ...
... using exponential smoothing.
Data science necessarily involves processing data...
...and repeated procedures...
...and interfacing to other digital systems
And all of that means: you don't all have to be great programmers, but you need at least a couple of pretty good ones.
Vendors promise “visual tools” for “drag and drop” data science.
These work well for workflows that fit the pattern, but fail in opaque ways when pushed...
...and usually require falling back to a programming/scripting language to implement complex tasks anyway:
See a theme?
R, Python, SQL, JavaScript: there is no escape!
(and proprietary toolchains are learning dead-ends)
Hot topics sell SPE talks and whitepapers:
But little wins make your team heroes for your customers...
“Does our competitor's completion strategy make a difference?”
“How do geologic properties impact performance?”
Bluntly: they're better, they're cheaper, and they're everywhere
True debugging story: working with a “BI”/visualization platform which will remain nameless.
We'd built a visualization tool for an underlying proprietary Bayesian modeling technique.
Users were reporting crashes in the (built-in functionality) data filtering step.
After much head-scratching, and judicious application of a decompiler (two cheers for bytecode VMs!), here's what we found:
The vendor had gotten too clever by half: they stored “filter sets” internally as SQL WHERE clauses, and had written their own SQL parser.
But the (filter (expressions (were (deeply (nested))))) for no good reason, and they'd written a recursive descent parser.
Didn't get a stack overflow though: they'd hard-coded a recursion limit of 800.
So our users blew up after filtering 800 records.
We called up and yelled at the vendor.
At first they didn't believe us...
...a month later the newest hotfix changed the hard-coded limit to 2^31 - 1.
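The failure mode is easy to reconstruct. Here's a toy sketch of the pattern, not the vendor's actual code (the grammar, names, and magic number's placement are all my reconstruction): a recursive descent parser that burns one stack frame per level of nesting, with a hard-coded bail-out.

```python
import sys
sys.setrecursionlimit(5000)  # keep Python's own limit out of the demo's way

DEPTH_LIMIT = 800  # the vendor's magic number

def parse_expr(tokens, pos=0, depth=0):
    """Toy recursive descent parser for nested filter groups.
    One recursion level per parenthesized group, with a hard-coded bail-out."""
    if depth > DEPTH_LIMIT:
        raise RuntimeError("filter too complex")  # what users saw as a crash
    if tokens[pos] == "(":
        inner, pos = parse_expr(tokens, pos + 1, depth + 1)
        assert tokens[pos] == ")"
        return ("group", inner), pos + 1
    return ("leaf", tokens[pos]), pos + 1  # base case: one comparison

def nested_filter(n):
    """A filter set that (needlessly) nests one group per selected record."""
    return ["("] * n + ["id=0"] + [")"] * n

parse_expr(nested_filter(800))      # fine
try:
    parse_expr(nested_filter(801))  # one more record: boom
except RuntimeError as e:
    print(e)
```

The fix isn't a bigger magic number (2^31 - 1 included); it's either an iterative parser or, better, not generating gratuitously nested filter expressions in the first place.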
Open-source tool workflow: read the code, find the problem, patch, submit pull request if you're feeling generous.
Tool breakage: not if, but when. Choose wisely...
“Big data” (the phrase) is everywhere. Big data (the thing) is not.
Some people define “big data” in terms of the “3 (or more) Vs” — volume, velocity, variety, and counting (like string theory, someone adds a new dimension every month):
We have some actual big data!
You'll know it when you see it.
We also have a lot of “small, weird data”.
Many of our biggest wins come from feature extraction from small, weird data and from statistical integration across the “resolution gap”.
I implement, train, and coach this stuff for a living. I'd love to work with you and your team!
Write me a note at info@terminusdatascience.com or connect on LinkedIn or GitHub.