4. The Two Kinds of Data Scientists
• The Lab
• Statisticians who got really good at programming
• Neuroscientists, geneticists, etc.
• The Factory
• Software engineers who were in the wrong place at the wrong time
5. The Lab and The Factory
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
20. Python UDFs for Impala
• github.com/cloudera/impyla
• Already There
• Numeric and boolean types (as native python objects)
• In Progress
• String support
• C/C++ function integration
• Planned
• Struct/tuple and array types
• UDAFs
• Include support for PyData stack (scikit-learn, NLTK)
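As a rough illustration of the "Already There" bullet, the sketch below shows the kind of function that fits the numeric and boolean subset: plain Python that takes and returns only native numbers and booleans. The function names `clamp` and `is_outlier` are hypothetical, and the step that compiles and registers such a function with Impala is deliberately left out, since the slide does not show the impyla API for it.

```python
# Hypothetical UDF bodies: illustrative names, not part of impyla.
# They use only native Python numbers and booleans, matching the
# "numeric and boolean types" subset listed on the slide.


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def is_outlier(value: float, mean: float, stddev: float, threshold: float) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    if stddev <= 0:
        # A non-positive spread gives no meaningful z-score; treat as not an outlier.
        return False
    return abs(value - mean) / stddev > threshold


if __name__ == "__main__":
    print(clamp(12.5, 0.0, 10.0))
    print(is_outlier(42.0, 10.0, 4.0, 3.0))
```

Functions that need strings, arrays, or libraries such as scikit-learn fall under the "In Progress" and "Planned" bullets, so they are avoided here.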
My major contribution to Western civilization. See also: http://www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician/answer/Josh-Wills
Curt Monash makes a distinction between investigative analytics (which he defines here: http://www.dbms2.com/2011/03/03/investigative-analytics/ ) and operational analytics that I like, and I expanded it into my own set of differences that I want to walk through here.
Investigative analytics is what we think of when we think of traditional BI: there’s an analyst or an executive who is searching for previously unknown patterns in a data set, either by looking at a series of visualizations mediated by database queries, or by applying some statistical models to a prepared data set to tease out some deeper explanations. This is where the vast majority of the BI market is focused right now.
Operational analytics, on the other hand, is a nascent market, and I don’t believe the existing BI tools have done a good job of supporting companies that want to start leveraging their modeling and analytical prowess to make better decisions in real time. I’d like to shift some of the conversation and the focus in the market from the lab to the factory.
The tip of the iceberg metaphor. This has been a useful metaphor for me throughout my career: I feel like I am constantly exploring the tip of the iceberg, from the theory of model building to the practice of model building to operational model building. There is a ton of stuff I don’t know, but I hope that I can provide a useful sort of commentary on the culture of credit scoring from the perspective of an outsider, kind of like Alexis de Tocqueville or Borat.