First presented at the February 2013 Research Triangle Analysts meeting, this presentation discusses the technical side of making data science a field that's here to last. It focuses on the "science" aspect of data science and how it delivers value to an organization.
Data Science Isn't a Fad: Let's Keep it That Way
1. Data Science Isn’t a Fad
Let’s Keep It That Way
Presentation to Research Triangle Analysts
February 21, 2013
www.rtpanalysts.org
2. Data Science: Buyer Beware
Forbes article, “Data Science: Buyer Beware”:
“This is a management fad.”
Me: I’ve been doing this for 16 years. It
isn’t a fad. You keep renaming it.
Result: Great conversation, and another Forbes article.
3. Obligatory Definition
Wikipedia: Data science is a novel term that is often
used interchangeably with competitive intelligence or
business analytics, although it is becoming more
common. Data science seeks to use all available and
relevant data to effectively tell a story that can be easily
understood by non-practitioners.
Sexiest job of the 21st century. --Thomas H. Davenport
and DJ Patil
Pseudo science performed by rock-star unicorns. --
The Internet
4. Data SCIENCE
Data: emphasizes the transformation of raw
information into actionable results.
Science: emphasizes the commitment to verifiable and
repeatable process.
Data Science: The discipline of transforming raw
information into actionable results in a manner that is
verifiable and repeatable.
“Information is cheap. Meaning is expensive.”
--George Dyson, 2011
5. Data Science Is....
Google’s Search Engine
Fraud Framework
Spotfire Operations Analytics
Analytics in Production
6. Once upon a time...
Information was VERY expensive.
7. Data Science and Statistics
The statistical methods you learn as an undergraduate
were optimized to make efficient use of small data
samples.
Data is a unique resource: The more you have, the
more valuable each individual piece becomes.
Provided you can extract meaning from the
information.
8. “Big Data” = New Problems
Dynamic environment: relationships change.
Constant sampling means you will have false positives.
Large numbers of variables and data points means you
have to rely on automated tools.
Not all automated tools are created equal.
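The point about constant sampling and false positives can be illustrated with a small simulation. This is a minimal sketch, not part of the original talk: it repeatedly compares two samples drawn from the same distribution, so every "significant" result is a false positive by construction. The function names, the z-test with a known standard deviation of 1, and the sample sizes are all assumptions chosen for illustration.

```python
import math
import random


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test for equal means, assuming a known common sigma."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_tests=2000, n_per_group=50, alpha=0.05, seed=0):
    """Fraction of tests flagged as significant when no real effect exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups come from the same distribution: there is nothing to find.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(a, b) < alpha:
            hits += 1
    return hits / n_tests


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests is flagged."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    print(f"Observed false positive rate: {false_positive_rate():.3f}")
    print(f"Chance of at least one in 20 tests: "
          f"{chance_of_any_false_positive(20):.3f}")
```

At a 5% threshold, roughly one test in twenty is flagged even though nothing is going on, and the chance of seeing at least one flag grows quickly as more tests are run.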
9. Cue Shameless Plug....
John Sall
Co-Founder & EVP of SAS Institute
Director of JMP
“From Big Data to Big Statistics”
March 21, 6:30pm
Louie and Charlies
www.louieandcharlies.com
10. Raw Information to Actionable
Results
The results of the analysis must answer the business
question(s).
The results of the analysis must provide a course of
action.
11. Actionable
Click on this link.
Check this person’s file.
Stop/encourage this activity.
Look at this pattern.
12. Verifiable
The assumptions from the underlying methods must be
stated and shown to be true.
Outlier cases must be documented and handled
effectively.
For example: a different analysis, an error table, or an excluded point.
13. Y = 3.0017 + 0.499X
Corr = 0.8199
Anscombe’s Quartet
Linear regression assumes a straight line
relationship and normally distributed errors.
14. Y = 3.0017 + 0.499X
Corr = 0.8199
Anscombe’s Quartet
This line has the same statistics as the one
before. But the relationship is not a straight line.
15. Y = 3.0017 + 0.499X
Corr = 0.8199
Anscombe’s Quartet
An outlier is affecting the equation.
16. Y = 3.0017 + 0.499X
Corr = 0.8199
Anscombe’s Quartet
One outlier drives the entire relationship.
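The four slides above can be reproduced numerically. The sketch below uses the published Anscombe (1973) values and a plain least-squares fit written out by hand; the names `ANSCOMBE`, `fit_line`, and `summarize` are illustrative choices, not anything from the talk. It shows that the fitted line and correlation agree across all four datasets to about two decimal places, which is why summary statistics alone cannot verify the assumptions.

```python
from math import sqrt

# Anscombe's quartet: datasets I-III share the same x values.
_X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (_X123,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (_X123,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (_X123,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (intercept, slope, correlation) from ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    corr = sxy / sqrt(sxx * syy)
    return intercept, slope, corr


def summarize():
    """Fit every dataset in the quartet and collect the results by name."""
    return {name: fit_line(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}


if __name__ == "__main__":
    for name, (a, b, r) in summarize().items():
        print(f"Dataset {name}: Y = {a:.4f} + {b:.4f}X, corr = {r:.4f}")
```

Plotting each dataset, or inspecting the residuals, is what reveals the curve in the second dataset and the outliers in the third and fourth.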
17. Repeatable
When I do this again with data that meets the stated
assumptions, I should get the same answers.
Small changes in the data should NOT break the
algorithm.
Easier said than done.
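One way to probe whether small changes in the data break a result is to drop each point in turn and refit. The sketch below is an illustration under assumptions, not the speaker's method: it applies a leave-one-out check to the first and fourth Anscombe datasets, and the helper names `slope` and `leave_one_out_slopes` are hypothetical.

```python
X1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def slope(xs, ys):
    """Least-squares slope, or None when x has no variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # every x is identical, so no line can be fit
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def leave_one_out_slopes(xs, ys):
    """Refit the slope once per point, with that single point removed."""
    return [
        slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


if __name__ == "__main__":
    for label, xs, ys in (("I", X1, Y1), ("IV", X4, Y4)):
        slopes = leave_one_out_slopes(xs, ys)
        shown = ["undefined" if s is None else f"{s:.3f}" for s in slopes]
        print(f"Dataset {label}: {', '.join(shown)}")
```

For the first dataset the slope moves modestly whichever point is removed. For the fourth, removing the single point at x = 19 leaves no variation in x at all, so the fit cannot be computed: the whole relationship rests on one observation.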
18. Making Results Repeatable
Automated verification of assumptions.
Good coding practices (no matter the language).
Out of sample testing.
Do the same analysis with similar data.
Failure conditions
Document what should happen when bad data goes
into the algorithm.
Run the algorithm with bad data.
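The failure-condition advice above can be sketched as a validation step with documented error messages, plus a small harness that deliberately feeds it bad data. Everything here is a hypothetical illustration: the function `validate_xy`, the `BAD_INPUTS` cases, and the minimum of three points are assumptions, and a real project would choose its own rules.

```python
import math


def validate_xy(xs, ys, min_points=3):
    """Check inputs for a straight-line fit.

    Documented failure conditions (each raises ValueError):
      - x and y differ in length
      - fewer than min_points observations
      - any NaN or infinite value
      - x is constant, so the slope is undefined
    """
    if len(xs) != len(ys):
        raise ValueError("x and y have different lengths")
    if len(xs) < min_points:
        raise ValueError("too few points")
    if not all(math.isfinite(v) for v in list(xs) + list(ys)):
        raise ValueError("non-finite value in data")
    if len(set(xs)) == 1:
        raise ValueError("x is constant, slope is undefined")


# Deliberately bad inputs, one per documented failure condition.
BAD_INPUTS = {
    "length mismatch": ([1, 2, 3], [1, 2]),
    "too few points": ([1, 2], [1, 2]),
    "missing value": ([1, 2, 3], [1.0, float("nan"), 3.0]),
    "constant x": ([8, 8, 8, 8], [6.6, 5.8, 7.7, 8.8]),
}


def run_failure_cases():
    """Run every bad input; return the error message, or None if it passed."""
    results = {}
    for name, (xs, ys) in BAD_INPUTS.items():
        try:
            validate_xy(xs, ys)
            results[name] = None  # bad data slipped through unnoticed
        except ValueError as err:
            results[name] = str(err)
    return results


if __name__ == "__main__":
    for name, message in run_failure_cases().items():
        print(f"{name}: {message or 'NOT CAUGHT'}")
```

Keeping the bad inputs next to the validation code means the documented behavior is exercised every time the check is run, rather than only described.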
19. This is the endpoint of the analysis.
Companies that hire data scientists use the results
to make decisions.
20. Repeatable: Closing the
Loop With Users
It is the data scientist’s responsibility to make sure the
results are used effectively.
Involve users at the beginning of the process.
Use iterative feedback to make sure results are:
Actionable
Verifiable
Repeatable.
21. Why Bother?
“Beware the Big Errors of Big Data”
“Big Data is Falling into the
Trough of Disillusionment”
“If you asked me to describe the rising
philosophy of the day, I would say it’s
data-sim...”
22. Really, Then, Why Bother?
“...the Oakland A's' front
office ...fielded a team that could
compete successfully against
richer competitors in Major
League Baseball (MLB).”
23. Because What We Do Matters
“Refugees United...uses mobile and
web technologies to help refugees find
their missing loved ones.”
--datakind.org
“Predictive analytics is saving lives and
taxpayer dollars in New York City.”
--Alex Howard, Michael Flowers interview
24. That’s Enough From Me
What do you think about me?
mthielbar@gmail.com
melindathielbar.wordpress.com
info@rtpanalysts.org
THANK YOU!
All photos the property of their respective owners.