This document discusses the three key skills of data scientists and data-driven startups: data munging, statistical modeling, and data storytelling. It notes that data munging involves cleaning and preparing raw data. Statistical modeling involves using the cleaned data to build predictive models. And data storytelling involves visually exploring and communicating insights from the data through narratives and visualizations. Examples of startups in each area are also provided.
1. The Three Sexy Skills of Data Scientists (& Data-Driven Startups) Michael Driscoll | Metamarkets IA Ventures Big Data Conference | Oct 2010 For print version: http://www.dataspora.com/blog
I’ve added an addendum to this talk: these skills aren’t just sexy for individuals. Start-ups with these skills in-house are also sexy investments; we wouldn’t be meeting here today if that weren’t the case. The motivation for this talk was Hal Varian’s quip that statistician is the sexy profession of the next decade. I thought about how I could mash up data with sexy, and this is what I got.
Let’s set the stage. Joe Hellerstein has said that we’re living in the Industrial Revolution of Data. Big Data means.
An important note: big data is not just about volume, it’s about velocity. Systems must be dramatically re-architected when they shift from monolithic to modular: unicellular to multicellular. Most of the additional complexity goes into interfaces between the pieces. Regardless, I define Big Data as data that is distributed. Transition: how did we get here, to a world chock full of exabytes?
Attack of the exponentials.
This is what’s happened in the last four decades. These four factors also happen to be inputs for data generation processes. So what happen
Kurzweil reference: I call this the data singularity. CPU and storage costs have fallen faster than network and disk I/O have risen, meaning more data can be stored and processed locally than can be shipped around. This has strategic implications: data is heavy, and hard to move once it lands somewhere. This puts Amazon, for instance, at an enormous competitive advantage over its cloud computing peers. Data is heavy. Strategic implications. Things can explode.
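To make "data is heavy" concrete, here is a rough back-of-the-envelope sketch in Python. The dataset size, link speed, and disk throughput are hypothetical round numbers chosen for illustration, not figures from the talk.

```python
def transfer_hours(size_bytes, rate_bytes_per_sec):
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_sec / 3600


# All three figures below are illustrative assumptions.
DATASET_BYTES = 1e12        # hypothetical 1 TB dataset
NETWORK_RATE = 100e6 / 8    # hypothetical 100 Mbit/s link, in bytes per second
DISK_RATE = 100e6           # hypothetical 100 MB/s local disk read

network_hours = transfer_hours(DATASET_BYTES, NETWORK_RATE)
disk_hours = transfer_hours(DATASET_BYTES, DISK_RATE)

print(f"ship over the network: {network_hours:.1f} h")
print(f"read from local disk:  {disk_hours:.1f} h")
```

Under these assumed rates, shipping the data takes several times longer than reading it where it already sits, which is the sense in which data is hard to move once it lands.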
Athabasca Sands of Canada. There are parallels: mining value from these tar sands illustrates the point that such efforts became worthwhile only once the value of the oil extracted exceeded the cost of extraction. The same holds true for data. Where are the Athabasca tar sands of data? (Graphic showing value > cost threshold with example data.) The economics of data aggregation and analysis have shifted dramatically, compelling (i) new categories of data to be collected and stored, and (ii) re-examination of data that is already collected but frequently discarded. In either case, the criterion is the same: economic value > cost of analysis. But the process of capitalizing on these emerging opportunities, of converting data volumes into value, requires a unique skill set. When concentrated in a single individual or within a start-up, these skills are a powerful cocktail, sexy to employers and investors alike. These are the three sexy skills I discuss next. Not all data is worth keeping, aggregating, or analyzing. Rehabilitate data that formerly wasn’t meritorious. Amazon stock chart as punchline. So few people had access to these tools. The scientist moniker is almost counter to what we traditionally think of as a scientist. Call out the hacker ethos of the data scientist.
Few individuals have all these skills concentrated in one person. That is, after all, the advantage of a start-up, where talents can complement one another.
It is a painful process. Transition: most of us are used to confronting files that look like:
Grab a screendump from the Oracle database scrape from 10 years of advertising data from a London publishing partner of ours.
Data munging is a labor-intensive and painful process; often 80% of the time in an analysis project is spent on this piece. The tools used are typically high-level scripting languages like Python, Ruby, and Perl. If you want to know more about munging, two world-class data mungers are here with us today, Pete Skomoroch and Flip Kromer. Pete built a site that mines Wikipedia’s edit logs for trending news topics, and Flip is the force behind InfoChimps and has written more parsers than almost the rest of us combined.
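As a minimal sketch of what munging looks like in one of those scripting languages, the Python below cleans a small pipe-delimited extract. The sample rows, the column names (`date`, `advertiser`, `spend`), and the two date formats are invented for illustration; they are not the advertising data mentioned in the talk.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw extract: mixed date formats, stray whitespace,
# inconsistent capitalization, currency symbols, and missing values.
RAW = """date|advertiser|spend
2009-03-01| Acme Ltd |£1,200.50
01/04/2009|ACME LTD|1300
2009-05-01||n/a
"""

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats for this sample


def parse_date(text):
    """Try each known format; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Strip the currency symbol and thousands separators; None if not numeric."""
    cleaned = text.strip().replace("£", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def munge(raw):
    """Return only the rows that parse cleanly, with normalized fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw), delimiter="|"):
        date = parse_date(rec["date"])
        name = rec["advertiser"].strip().title()
        spend = parse_amount(rec["spend"])
        if date is None or not name or spend is None:
            continue  # drop records that cannot be repaired
        rows.append({"date": date, "advertiser": name, "spend": spend})
    return rows


clean_rows = munge(RAW)
for row in clean_rows:
    print(row)
```

Even in this toy case most of the code handles exceptions to the expected format, which is roughly why munging eats so much of a project's time.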
Abstraction, symbology, ontology…
Statistics is the grammar of data science. For those who feel that stats is dominated by old white dead men…
That’s because it is. But these old dead white men have some powerful ideas.
Statistics allows us to provide reduced descriptions of the world, in the form of models. In this way, models are reductive: they capture only the essential features of the data.
Statistical or machine-learning-based data products are a staple of nearly every data-driven start-up in town. Here are just a few. In the process of developing a data product, data visualization also plays an important role.
Our eyes are the highest-bandwidth channel we have. Visualization is a means of surfacing otherwise intangibly large data sets. Two broad classes. Exploratory: audience of 1 or 2, characterized by rapid iterations, local development, not in print. Narrative: a point of view has been established and the viz is supposed to help drive the story forward.
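One way to see a model as a reduced description is an ordinary least-squares line, which summarizes a set of points with just two numbers. The sketch below uses made-up data points and is only an illustration of the idea, not a method from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that roughly follow y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
print(f"y is approximately {slope:.2f} * x + {intercept:.2f}")
```

The five observations collapse into a slope and an intercept; the small deviations from the line are what the model chooses to ignore.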
Tukey
Wattenberg stream graphs
Storytelling. Human-size for human decision makers – telling stories with the data, through visualization, to communicate massive scales to people that execute and make decisions.
Good luck. Tableau is desktop.
This is an open source stack, and a vibrant big data hacker community is actively building these tools. Specifically how it’s manifesting that we’re using in our country; here’s where we’re paying and here’s where we’re not. Here’s the interim solution. The stack is loosely coupled: right tool for the right job. The need for a dedicated analytics RDBMS. You know who sits on the top of that stack? We do. That’s why storytelling is such an important skill. Commoditization moves from the bottom up.
I’m defining data science as: applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger, and our tools are getting better. (Suffice to say, the questions we ask have been around since time immemorial: who Another word for questions is hypotheses.
There’s been a lot of talk about Big Data in the past year. Articles and conferences.