Talk at Hackbright Academy on September 9, 2014
hackbright-data-scripting.eventbrite.com
VIDEO: http://vimeo.com/105781767
NOTES: https://github.com/clarecorthell/data-prototyping-talk
About the Talk: Data is deeply integrated into decision making in modern companies. But businesses have difficulty understanding the glut of data they now create and consume. What is all this data, and how do we use it?
This talk will cover how to get started with data scripting, a powerful tool for prototyping data work. Data skills are crucial as engineering teams are responsible for discovering, housing, and extracting business-critical data. We’ll discuss how to find meaningful data, develop heuristics for normalizing and slicing data, and define useful data structures. We’ll utilize the pythonic Data Scientists’ favorite tools, numpy, pandas, and iPython. Finally, we’ll talk about data work in industry, and why your scripting skills will be a superpower in a data-rich world.
About the Speaker: Clare Corthell is a Data Scientist and Designer at Mattermark, a data-driven deal intelligence platform, where she builds technologies that quantify the growth of private companies. She is the originator of The Open Source Data Science Masters, a curriculum for learning Data Science. A Stanford-trained product designer and engineer, she's founded and worked with early-stage companies in the US, Europe, and East Africa. She’s up early pondering discovery algorithms, information design, diglossia, and education systems. Follow her at @clarecorthell.
4. Open Source Data Science Masters
datasciencemasters.org
Mattermark
Data Scientist
Machine Learning Engineer
about me
5. Mattermark
Private Company Deal Intelligence Platform
or, a huge spreadsheet full of live data about private companies
of which you can ask questions
my company
13. Data Scientists turn data into knowledge
by answering ambiguous questions
such as
How do we bucket companies by industry?
What are those industries?
Can we predict whether someone will start a company?
Are there patterns that computers can see that humans can’t?
What do Data Scientists do?
14. Turning Data into Knowledge
How Data Scientists spend their time:
• 80% on Cleaning > Munging > Exploration
• 20% Experiments / Analysis / Machine Learning
Exploration is important because it lets us determine
what questions we might be able to answer with the data.
Only then can we run experiments, analyze, and finally
begin to fundamentally understand and model the world.
What do Data Scientists do?
15. Exploration results in Prototypes
definition
When you explore and ask questions,
you create a knowledge prototype.
(a first, probably incomplete version that leads to knowledge)
Knowledge prototypes answer questions.
(they might not perfectly model the world, but they’re a useful start)
Questions lead to more questions,
and subsequently more knowledge.
16. “All prototypes are wrong,
but some prototypes are useful.”
— blatant misquoting of George E. P. Box
17. by exploring data
we start to answer questions
by building knowledge prototypes
lemme show you what I mean
18. What do we need to explore data?
• Tools for working with that data (python!)
• A data structure to make the data usable
• Data
• Questions we want to answer
(we’ll make them up as we go today)
19. toolkit
• numpy: multi-dimensional container of data
• pandas: data structures & analysis tools
• iPython: browser-based code notebook / IDE
(run blocks of code, not the whole program)
python
20. the data structure: DataFrame
and you thought you hated excel,
but you actually don’t
21. dataframe
• records are rows
• columns are values across those rows
• basic actions: filtering, sorting, slicing
(paradigmatically not a far cry from excel)
basic data structure
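The three basic actions can be sketched on a toy DataFrame. This is only an illustration: the company names, column names, and amounts below are made up and are not the Mattermark data from the talk.

```python
import pandas as pd

# Hypothetical toy data: every name and number here is invented.
df = pd.DataFrame({
    "company": ["Acme", "Birch", "Cobalt", "Dune"],
    "stage": ["seed", "a", "seed", "b"],
    "amount": [500000, 4000000, 750000, 12000000],
})

# Filtering: keep only the rows where a condition holds.
seed_rounds = df[df["stage"] == "seed"]

# Sorting: order the rows by a column, largest first.
by_amount = df.sort_values("amount", ascending=False)

# Slicing: take the first two rows and two of the columns.
first_two = df.iloc[0:2][["company", "amount"]]

print(seed_rounds)
print(by_amount)
print(first_two)
```

Each action returns a new DataFrame and leaves df unchanged, which makes it cheap to try one question after another.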
22. The Data (from Mattermark)
• Categorical (industry)
• Continuous (uniques)
• Binary (mobile app)
• Dates (date of funding)
Company funding events
in New York City
from the last 5 years
data types (examples)
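As a rough sketch, the four kinds of data above might map onto pandas column types as follows. The column names and values are assumptions made up for illustration; the real dataset's schema is not shown in the slides.

```python
import pandas as pd

# Hypothetical rows: names and values are invented for illustration.
df = pd.DataFrame({
    "industry": ["fintech", "media", "fintech"],        # categorical
    "uniques": [12000.0, 350000.0, 8100.0],             # continuous
    "mobile_app": [True, False, True],                  # binary
    "funding_date": ["2012-03-01", "2013-07-15", "2014-01-20"],  # dates, still text
})

# Tell pandas which columns are categories and which are dates.
df["industry"] = df["industry"].astype("category")
df["funding_date"] = pd.to_datetime(df["funding_date"])

# Inspect the resulting type of each column.
print(df.dtypes)
```

Getting the types right early matters because later steps (grouping, plotting, date arithmetic) behave differently depending on the column type.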
23. Initial Questions of Exploration
• What’s in here?
• Are there patterns?
• What might we find out if we investigate further?
Exploration
24. From questions
come more questions
And eventually, you find something very, very
interesting (and probably valuable!)
28. What’s in here?
(summary across columns)
df.describe()
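A minimal sketch of what df.describe() gives back, using made-up numbers rather than the talk's data. By default it summarizes only the numeric columns, and the result is itself a DataFrame that can be indexed.

```python
import pandas as pd

# Hypothetical toy data: invented amounts and headcounts.
df = pd.DataFrame({
    "company": ["Acme", "Birch", "Cobalt", "Dune"],
    "amount": [500000, 750000, 4000000, 12000000],
    "employees": [3, 8, 40, 150],
})

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
mean_amount = summary.loc["mean", "amount"]
print(mean_amount)
```

Note that the text column "company" is left out of the summary, so describe() is a quick way to see which columns pandas treats as numeric.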
29. What’s in here?
(sort by round size)
…
df.sort_values('colname', ascending=False)
30. What’s not in here?
(null or missing values)
In the column, is the value at a given index null? (true or false)
df['colname'].isnull()
…
Count the number of null values in the column
len(np.where(df['colname'].isnull())[0])
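Here is a small, self-contained sketch of both null checks, on an invented column. It also shows an equivalent shorter count using sum(), which is offered as an alternative and is not from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values (numbers are made up).
df = pd.DataFrame({"amount": [500000.0, np.nan, 750000.0, np.nan, 4000000.0]})

# True/False per row: is the value null?
is_null = df["amount"].isnull()
print(is_null)

# Count the nulls the way the slide does: positions where the mask is True.
null_count_where = len(np.where(is_null)[0])

# Equivalent shortcut: True counts as 1 when summed.
null_count_sum = int(is_null.sum())

# Null counts for every column at once.
nulls_per_column = df.isnull().sum()
print(null_count_where, null_count_sum)
print(nulls_per_column)
```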
31. Question:
What is the most common stage for funding?
to get a quick idea of scale…
df['colname'].value_counts()
df['colname'].value_counts().plot(kind='bar')
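A sketch of the counting step on made-up stage labels. The "stage" column name and its values are assumptions; the plotting call is left out here so the example runs without matplotlib.

```python
import pandas as pd

# Hypothetical funding stages, invented for illustration.
df = pd.DataFrame({"stage": ["seed", "a", "seed", "b", "seed", "a"]})

# How many rows fall into each stage, most frequent first.
counts = df["stage"].value_counts()
print(counts)

# The most common stage is the first label in the result.
most_common = counts.index[0]

# The same counts as shares of the total, for a sense of scale.
shares = df["stage"].value_counts(normalize=True)
print(most_common, shares[most_common])
```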
32. Leads to Question:
What is the typical funding amount by round?
Further questions:
• What kind of companies
raised at each stage?
• How much variability is
there in the amount raised
at each stage?
• Is this different from other
geographies?
groupby_var = df.groupby('colname')
print(groupby_var['colname'].mean().astype(int))
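As a sketch of the grouping step, assuming hypothetical "stage" and "amount" columns with invented values. The second aggregation goes a little beyond the slide to touch the variability question: it adds the median, standard deviation, and count per stage.

```python
import pandas as pd

# Hypothetical funding events: stages and amounts are made up.
df = pd.DataFrame({
    "stage": ["seed", "seed", "a", "a", "b"],
    "amount": [500000, 1000000, 4000000, 6000000, 12000000],
})

# Group the rows by stage, then look at the amount column within each group.
by_stage = df.groupby("stage")["amount"]

# Typical (mean) amount per stage, rounded down to whole dollars.
mean_by_stage = by_stage.mean().astype(int)
print(mean_by_stage)

# Several statistics at once; std is NaN for a stage with a single event.
summary = by_stage.agg(["mean", "median", "std", "count"])
print(summary)
```

With skewed data such as funding amounts, comparing the mean with the median is a quick check on whether a few large rounds are pulling the average up.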
33. Question:
How many of these are mobile companies?
df.shape
Further questions:
• Do mobile companies have lots of employees?
• Do mobile companies typically have revenue?
• Do mobile companies raise less or more than other
companies?
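One way to get from df.shape to an answer, sketched on invented data: filter to the mobile rows first, then compare row counts. The "mobile_app" column name is an assumption, and each row here stands for a funding event, so a company that raised twice would be counted twice.

```python
import pandas as pd

# Hypothetical rows: names and flags are made up.
df = pd.DataFrame({
    "company": ["Acme", "Birch", "Cobalt", "Dune", "Ember"],
    "mobile_app": [True, False, False, True, False],
})

# df.shape is (rows, columns); the first entry is the number of rows.
total_rows = df.shape[0]

# Keep only rows flagged as having a mobile app, then count them.
mobile = df[df["mobile_app"]]
mobile_rows = mobile.shape[0]

# Share of rows that are mobile: the mean of a True/False column.
mobile_share = df["mobile_app"].mean()
print(mobile_rows, total_rows, mobile_share)
```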
35. Our prototypes of knowledge:
With regard to private companies in NYC
that raised capital in the last 5 years:
• ~10% have mobile applications
• Most funding events were at the seed stage
• The average seed round was $839k
In total:
• There were 3209 reported funding events
what we discovered
36. Why it’s a prototype
(i.e., why we’re not done yet)
• The data isn’t completely clean
• We haven’t accounted for null, missing, or zero values
• We haven’t connected directly to a business
question
• We aren’t working in production (just locally)
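To illustrate why null and zero values matter, here is a small sketch on invented numbers. It assumes, purely for illustration, that a zero amount means "not reported" and not a real $0 round; whether that holds for any real dataset has to be checked.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts: one zero and one missing value (all numbers made up).
df = pd.DataFrame({
    "stage": ["seed", "seed", "seed", "a"],
    "amount": [500000.0, 0.0, np.nan, 4000000.0],
})

# Naive mean: pandas skips the NaN but still counts the zero as a real value.
naive_mean = df["amount"].mean()

# Assumption: zero means "unknown", so treat it as missing too.
cleaned = df["amount"].replace(0, np.nan)
clean_mean = cleaned.mean()

# How many rows actually carry a usable amount?
usable = int(cleaned.notnull().sum())
print(naive_mean, clean_mean, usable)
```

The two means differ noticeably even on four rows, which is one reason a figure computed before cleaning is best treated as a prototype.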
37. by exploring data
we start to answer questions
with knowledge prototypes
38. Why does this matter?
• Exploration lets us build prototypes of knowledge
that start to answer real questions.
• One question paves the road to another.
• Answering questions leads to knowledge.
• People who have knowledge understand more
about the world.
39. Why does this matter?
There aren’t enough people who do this with code.
40. Why does this matter?
People who can code in the world of technology
companies are a dime a dozen and get no respect.
People who can code in biology, medicine,
government, sociology, physics, history, and
mathematics are respected and can do amazing
things to advance those disciplines.
- Zed Shaw (Learn Python the Hard Way)