Prototyping in the Data World - Data Scripting with Python

Prototyping in the Data World
Clare Corthell
Data @ Mattermark
@clarecorthell

weirdest food
you’ve ever eaten?

whale blubber ice cream,
with blueberries.
(mine)

Open Source Data Science
Masters
datasciencemasters.org
Mattermark
Data Scientist
Machine Learning Engineer
about me

Mattermark
Private Company Deal Intelligence Platform
or, a huge spreadsheet full of live data about private companies
of which you can ask questions
my company

ask questions of data
(what we think about all the time at Mattermark)

Why do we ask questions
of our data?

Because we want to gain
knowledge from data

creating knowledge,
understanding,
& more data
(after exploration)

Data Scientists turn data into knowledge
by answering ambiguous questions
such as
How do we bucket companies by industry?
What are those industries?
Can we predict whether someone will start a company?
Are there patterns that computers can see that humans can’t?
What do Data Scientists do?

Turning Data into Knowledge
How Data Scientists spend their time:
• 80% on Cleaning > Munging > Exploration
• 20% Experiments / Analysis / Machine Learning
Exploration is important because it lets us determine
what questions we might be able to answer with the data.
Only then can we run experiments, analyze, and finally
begin to fundamentally understand and model the world.
What do Data Scientists do?

Exploration results in Prototypes
definition
When you explore and ask questions,
you create a knowledge prototype.
(a first, probably incomplete version that leads to knowledge)
Knowledge prototypes answer questions.
(they might not perfectly model the world, but they’re a useful start)
Questions lead to more questions,
and subsequently more knowledge.

“All prototypes are wrong,
but some prototypes are useful.”
— blatant misquoting of George E. P. Box

by exploring data
we start to answer questions
by building knowledge prototypes
lemme show you what I mean

What do we need to explore data?
• Tools for working with that data (python!)
• A data structure to make the data usable
• Data
• Questions we want to answer
(we’ll make them up as we go today)

toolkit
numpy: multi-dimensional container of data
pandas: data structures & analysis tools
iPython: browser-based code notebook / IDE
(run blocks of code, not the whole program)
python

the data structure: DataFrame
and you thought you hated excel,
but you actually don’t

dataframe
• records are rows
• columns are values across those rows
• basic actions: filtering, sorting, slicing
(paradigmatically not a far cry from excel)
basic data structure
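A minimal sketch of the three basic actions, assuming a tiny made-up table. The company names, column names, and amounts are invented for illustration and are not Mattermark data.

```python
import pandas as pd

# Hypothetical toy data: each record (funding event) is one row.
df = pd.DataFrame({
    "company": ["Acme", "Birch", "Cobalt", "Dune"],
    "stage": ["seed", "a", "seed", "b"],
    "amount": [500000, 4000000, 750000, 12000000],
})

# Filtering: keep only the rows where a condition holds.
seed_rounds = df[df["stage"] == "seed"]

# Sorting: order the rows by the values in one column.
largest_first = df.sort_values("amount", ascending=False)

# Slicing: take a positional range of rows and a subset of columns.
first_two = df.iloc[0:2][["company", "amount"]]

print(seed_rounds)
print(largest_first)
print(first_two)
```

Each action returns a new DataFrame and leaves df unchanged, which makes it safe to try things out while exploring.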

The Data (from Mattermark)
• Categorical (industry)
• Continuous (uniques)
• Binary (mobile app)
• Dates (date of funding)
Company funding events
in New York City
from the last 5 years
data types (examples)
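One way these four kinds of data might look inside a DataFrame, as a sketch. The column names and values below are assumptions made up for this example, not the real Mattermark schema.

```python
import pandas as pd

# Hypothetical columns, one per data type listed on the slide.
df = pd.DataFrame({
    "industry": ["finance", "media", "finance"],                  # categorical
    "uniques": [12000.0, 850.5, 43000.0],                         # continuous
    "mobile_app": [True, False, True],                            # binary
    "funding_date": ["2013-04-01", "2012-11-15", "2014-01-20"],   # dates, read as text
})

# Tell pandas what the text columns really are.
df["industry"] = df["industry"].astype("category")
df["funding_date"] = pd.to_datetime(df["funding_date"])

print(df.dtypes)
```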

Initial Questions of Exploration
• What’s in here?
• Are there patterns?
• What might we find out if we investigate further?
Exploration

From questions
come more questions
And eventually, you find something very, very
interesting (and probably valuable!)

What’s in here?
(sample 10 rows)
iPython code block
df = pd.read_csv(csvfilename)
df.head(10)
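A self-contained sketch of loading a CSV and sampling 10 rows. An in-memory string stands in for the real file so the example runs anywhere; the column names and rows are invented.

```python
import io

import pandas as pd

# Stand-in for the CSV file: 25 invented rows built in memory.
csv_text = "company,stage,amount\n" + "\n".join(
    f"company_{i},seed,{100000 * (i + 1)}" for i in range(25)
)
df = pd.read_csv(io.StringIO(csv_text))

first_ten = df.head(10)                      # the first 10 rows, in file order
random_ten = df.sample(10, random_state=0)   # 10 rows picked at random, reproducibly

print(first_ten)
```

head() is the quickest look, while sample() can be a better check when the file happens to be sorted in some way.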

What’s in here?
(sample 1 row)
df.iloc[index_int]

What’s in here?
(sample & describe 1 column)
…
df['colname']
df['colname'].describe()

What’s in here?
(summary across columns)
columns cont —>
df.describe()
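The row, column, and whole-table views can be tried together on a toy table. This is a sketch with invented values; "stage" and "amount" are hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "stage": ["seed", "a", "seed", "b"],
    "amount": [500000, 4000000, 750000, 12000000],
})

one_row = df.iloc[2]                     # a single row, selected by integer position
col_summary = df["amount"].describe()    # count, mean, std, min, quartiles, max
all_summary = df.describe()              # the same statistics for every numeric column

print(one_row)
print(col_summary)
print(all_summary)
```

By default df.describe() on a mixed table summarizes only the numeric columns, so "stage" does not appear in all_summary.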

What’s in here?
(sort by round size)
…
df.sort_values('colname', ascending=False)

What’s not in here?
(null or missing values)
In the column, is the value at a given index null? (true or false)
df['colname'].isnull()
…
Count the number of null values in the column
len(np.where(df['colname'].isnull())[0])
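A small runnable sketch of checking for missing values, using an invented "amount" column with two gaps. It also shows a shorter way to get the count: summing the boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values (NaN and None both count as null).
df = pd.DataFrame({"amount": [500000, np.nan, 750000, None]})

is_missing = df["amount"].isnull()    # boolean Series, True where the value is missing
n_missing = int(is_missing.sum())     # each True counts as 1

print(is_missing.tolist())
print(n_missing)
```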

Question:
What is the most common stage for funding?
to get a quick idea of scale…
df['colname'].value_counts()
df['colname'].value_counts().plot(kind='bar')
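A sketch of counting categories, assuming a made-up "stage" column. The bar chart call is left in a comment because it needs matplotlib installed.

```python
import pandas as pd

# Hypothetical funding stages, one per funding event.
stages = pd.Series(["seed", "a", "seed", "b", "seed", "a"], name="stage")

counts = stages.value_counts()   # occurrences per value, most frequent first
most_common = counts.idxmax()    # the label with the highest count

print(counts)
print(most_common)

# With matplotlib available, the same counts can be drawn as a bar chart:
# counts.plot(kind="bar")
```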

Leads to Question:
What is the typical funding amount by round?
Further questions:
• What kind of companies
raised at each stage?
• How much variability is
there in the amount raised
at each stage?
• Is this different from other
geographies?
groupby_var = df.groupby('colname')
print(groupby_var['colname'].mean().astype(int))
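A sketch of grouping by stage on invented data. Alongside the mean, it computes the median and standard deviation per group, which is one possible starting point for the variability question; the column names and amounts are assumptions.

```python
import pandas as pd

# Hypothetical funding events.
df = pd.DataFrame({
    "stage": ["seed", "seed", "a", "a", "b"],
    "amount": [500000, 700000, 4000000, 6000000, 12000000],
})

by_stage = df.groupby("stage")

mean_by_stage = by_stage["amount"].mean().astype(int)   # typical round size per stage
median_by_stage = by_stage["amount"].median()           # less sensitive to outliers
std_by_stage = by_stage["amount"].std()                 # spread; NaN for single-row groups

print(mean_by_stage)
print(median_by_stage)
print(std_by_stage)
```

A mean can be pulled far upward by a few very large rounds, so comparing it with the median is a cheap sanity check.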

Question:
How many of these are mobile companies?
df.shape
Further questions:
• Do mobile companies have lots of employees?
• Do mobile companies typically have revenue?
• Do mobile companies raise less or more than other
companies?
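One way to count a subset is to filter and read the row count from .shape. The sketch below uses a made-up boolean "mobile_app" column; the ten rows are arbitrary toy data, not the talk's dataset.

```python
import pandas as pd

# Hypothetical data: ten companies, one of which has a mobile app.
df = pd.DataFrame({
    "company": [f"company_{i}" for i in range(10)],
    "mobile_app": [True] + [False] * 9,
})

n_total = df.shape[0]                     # .shape is (rows, columns)
n_mobile = df[df["mobile_app"]].shape[0]  # rows left after filtering
share = n_mobile / n_total

print(n_mobile, n_total, share)
```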

Our prototypes of knowledge:
With regard to private companies in NYC
that raised capital in the last 5 years:
• ~10% have mobile applications
• Most funding events were at the seed stage
• The average seed round was $839k
In total:
• There were 3209 reported funding events
• There were 3209 reported funding events
what we discovered

Why it’s a prototype
(i.e., why we’re not done yet)
• The data isn’t completely clean
• We haven’t accounted for null, missing, zero
values
• We haven’t connected directly to a business
question
• We aren’t working in production (just locally)

by exploring data
we start to answer questions
with knowledge prototypes

Why does this matter?
• Exploration lets us build prototypes of knowledge
that start to answer real questions.
• One question paves the road to another.
• Answering questions leads to knowledge.
• People who have knowledge understand more
about the world.

There aren’t enough people who do this with code.

People who can code in the world of technology
companies are a dime a dozen and get no respect.
People who can code in biology, medicine,
government, sociology, physics, history, and
mathematics are respected and can do amazing
things to advance those disciplines.
- Zed Shaw (Learn Python the Hard Way)

Thank You!
Best way to reach me?
Twitter @clarecorthell
psst — Mattermark is hiring!
Come talk to me!
