Talk Abstract
Many companies have "big data", but not every company has the resources (or need) for a big data team. In this talk we will discuss lessons I've learned from working as part of a small team within a fast-moving mobile start-up and techniques for getting the most out of your data on a budget.
Speaker Bio:
Nicholas is Senior Data Scientist at FitnessKeeper, makers of the RunKeeper and Breeze mobile and web apps for health and fitness tracking and guidance. Before joining FitnessKeeper he spent 10 years as a research staff member at MIT Lincoln Laboratory, working in the Cyber Security and Information Sciences and the Ballistic Missile Defense Divisions. While at Lincoln Laboratory he also received his Ph.D. in Applied Mathematics from Harvard University, where his research focused on numerical methods for the analysis of massive datasets. His areas of expertise and interest include statistical modeling, data mining, machine learning, data visualization, network science, and statistical signal processing.
Data Science on a Budget: Maximizing Insight and Impact - Nicholas Arcolano PhD
1. Data Science on a Budget:
Maximizing Insight and Impact
Nicholas Arcolano, Ph.D.
Senior Data Scientist
@arcolano
Photo by giuseppemilo / CC BY
2. A little background…
• Spent 10 years at MIT Lincoln Laboratory
working in ballistic missile defense and cyber
security research
• Areas of interest: statistics, machine learning,
parallel computing, “big data”
• Realized these things had been collectively
re-branded as “data science”
• Started calling myself a “data scientist” and
joined a start-up
Nicholas Arcolano – Data Science on a Budget – November 2014
3. What does a data scientist do?
4. What does a data scientist do?
• Something that happens at the intersection of
statistics, machine learning, and computer science
• Usually involves data (typically lots of it)
• Actually, this isn’t the most critical question to be
worrying about
5. A better question…
• What does a data team do?
• Basically, two things:
1. Use data to help the rest of the company understand what our
users are doing
2. Help the rest of the company use this information to improve our
product and our business
6. The Company
• Started in 2008
• Based in Boston
• About 50 people
• 4-person data team
Our Product
• RunKeeper app for GPS and manual tracking of running, walking, cycling, other activities
• Long-term fitness goals, training plans, and performance insights
• iOS, Android, web, 3rd party devices
The Data
• 37 million users
• 450 million fitness activities
• 200 billion GPS points
• 17 billion interactions and events
7. “DATA SCIENCE”
[Diagram labels: PRODUCT SYSTEMS, DATA, MARKETING, EXECUTIVE, BUSINESS DEVELOPMENT, USER EXPERIENCE, QUALITY ASSURANCE, SUPPORT]
• analytics and business intelligence
• modeling and forecasting
• data systems and archiving
• user research and testing
• data-driven features
• data stories and visualizations
8. How can we accomplish all this, quickly and
with a small team?
It’s hard… but here are some steps to
making it easier
9. Step 1: Communicate. A lot.
10. Step 1: Communicate. A lot.
11. Step 1: Communicate. A lot.
• You have a lot to learn about the rest of the company
– Every part of the company has its own blend of tools, systems, processes,
environments
– Every part has data it understands and cares about
– Every part knows things that affect the data that you won’t see—
user interviews, support feedback, product bugs, system failures
• You also have a lot to teach people
– What data we have
– What it can—and can’t—do
– Empower people to “think with data”
12. Step 1: Communicate. A lot.
• Be patient—sometimes you
have to say the same things
many times
• You may be the only one
looking at certain data—if you
see something, say something!
13. Setting expectations
Anticipated impact of data exploration:
Things our data team will discover
• exciting new things
• things we already knew
Actual impact of data exploration:
Things our data team will discover
• bugs, missing data, and bad data
• things we already knew
• exciting new things
14. Step 2: Move quickly but carefully.
“Wisely and slow. They stumble that
run fast.”
– Friar Laurence, from
Shakespeare’s Romeo and Juliet
15. Step 2: Move quickly but carefully.
• On moving fast…
– Data science can work well in an agile framework
– Make assumptions, but understand them
– Don’t be afraid to provide caveats
• On being cautious…
– Bad analysis is worse than no analysis
– Make time for data QA
– Use common sense: if it seems too good (or bad) to be true, it usually is
16. Step 3: Keep it simple.
• Go for lots of small, quick wins
• Learn and iterate
• Resist the urge to show everyone
how smart you are by doing
something super complicated
17. Step 3: Keep it simple.
• Do the “stupid thing” first
– It helps build understanding
– It helps uncover issues with the data
– It may turn out that you’re not even solving the right problem
– It may actually work pretty well
• When in doubt, favor a simpler method that you understand better
over a more complex one
– Easier to implement
– Easier to debug
– Easier to explain to others
18. You don’t have to use all the data
• Sometimes, using all the data is the right thing to do:
SELECT COUNT(userid) FROM rk_user;
• Sometimes, though, you can solve your problem entirely with a
small data set
• Benefits
– Easier computation and data wrangling means faster results
– “Curse of dimensionality” is a real thing
– Mitigate bad assumptions (lack of stationarity, different product versions,
changing environments, regional and seasonal effects, etc.)
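As a rough illustration of the "small data set" point, the sketch below estimates a rate from a simple random sample instead of scanning every row. The data is synthetic, and the names (`estimate_rate_from_sample`, `users`, the 30% activity rate) are hypothetical and not taken from the talk.

```python
import math
import random


def estimate_rate_from_sample(population, predicate, sample_size, seed=0):
    """Estimate the fraction of items satisfying `predicate` from a simple random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    hits = sum(1 for item in sample if predicate(item))
    p_hat = hits / sample_size
    # Half-width of an approximate 95% confidence interval (normal approximation).
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, margin


# Synthetic stand-in for a user table: 1 means "active this week" (assumed ~30% rate).
population_rng = random.Random(42)
users = [1 if population_rng.random() < 0.3 else 0 for _ in range(200_000)]

# Look at 10,000 users instead of all 200,000.
p_hat, margin = estimate_rate_from_sample(users, lambda u: u == 1, 10_000)
print(f"estimated active rate: {p_hat:.3f} +/- {margin:.3f}")
```

With 10,000 sampled users the interval half-width is already below one percentage point, which is often enough to answer the question at hand.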
19. Step 4: Use the right tools.
• In any given scenario, the “right
tool” is one of the following:
– The tool you already know and are
comfortable with
– Something you don’t know but
suspect would work really well
– Something that doesn’t exist yet
• It’s up to you to figure out which
one it is
20. Step 4: Use the right tools.
Languages and technologies I used during 10 years at my last job
Languages and technologies I’ve used during 1 year at my current job
• Be comfortable using a variety of tools
• Make time to learn new ones
• Build your own tools for repeatable
analysis—once you know it’s worth it
• Open source: take advantage of the hard
work of others, but make sure you
understand what you’re using
• Give back
21. Step 4: Use the right tools.
• Many of the same principles apply to your “analytical toolkit”
• Try to learn when to stick with a well-worn approach and when
to try something new
• Be skeptical of the conventional wisdom
– Just because a metric or analytical approach is common doesn’t mean it’s
the right thing to do for your situation
– Typical example: A/B testing
22. Hypothesis testing (“A/B testing”)
USERS are split into two groups:
• GROUP A (“Control”), 90% of users: standard flow → # of successes, failures
• GROUP B (“Treatment”), 10% of users: experimental flow → # of successes, failures
Counts from both groups → test statistic → DECISION (“reject/accept null hypothesis”)
“Null hypothesis”: treatment has no effect
“Alternate hypothesis”: treatment has some effect
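One common way to turn the success and failure counts into a test statistic is a two-proportion z-test. The slide does not say which test was used, so treat this as a minimal sketch under that assumption; the function name and the counts are made up for illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in success rates between groups A and B."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis both groups share one rate, so pool the counts.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts with a 90% / 10% split of 10,000 users.
z, p_value = two_proportion_z_test(successes_a=900, n_a=9000,
                                   successes_b=120, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Rejecting the null hypothesis when the p-value falls below a threshold chosen in advance (0.05 is a common convention) corresponds to the DECISION step.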
23. Thoughts about A/B testing
• A/B testing is hard to do well
– Need lots of data and good estimates of baseline rates to have a chance at significance
– Need lots of data infrastructure to do it quickly on a large scale
– Need to manage variables such as multiple testing, changes in product and environment, interactions between tests, subjects
– Need to make sure tests align with high-level vision and learning goals
• An A/B test can help with one very specific decision, but typically will not...
– Help you understand how multiple different factors interact
– Predict long-term reactions (the “taste test” phenomenon)—need longitudinal study
– Always give you the answer you want—results may be null or inconclusive
– Tell you anything of any value whatsoever if you did it wrong
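To make "need lots of data and good estimates of baseline rates" concrete, the sketch below uses the standard normal-approximation formula for the per-group sample size of a two-proportion test with equal group sizes. The baseline and target rates are hypothetical, and `required_sample_size` is an illustrative name, not something from the talk.

```python
import math
from statistics import NormalDist


def required_sample_size(p_baseline, p_target, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_baseline -> p_target (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_target - p_baseline) ** 2
    return math.ceil(n)


# Hypothetical rates: detecting a lift from 10% to 12% versus from 10% to 11%.
n_large_effect = required_sample_size(0.10, 0.12)
n_small_effect = required_sample_size(0.10, 0.11)
print(n_large_effect, n_small_effect)
```

Halving the effect you want to detect roughly quadruples the required sample, which is why a poor estimate of the baseline rate can leave a test badly underpowered.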
24. Thoughts about A/B testing
Even when performed “correctly”, an A/B
test may not tell you what you think it does
25. Step 5: Have faith and have fun
• Don’t try to understand everything all at once—keep looking from multiple
angles and trust that more understanding will come in time
26. Step 5: Have faith and have fun
• Working with data from millions of engaged users is awesome
• Helping your company have a real impact on their lives is even
more awesome
• All the tools are available to do truly amazing things
• Make sure everyone knows how much you love the data, and
they will grow to love it too
27. Things we’re still working on
• Synthesizing knowledge and communicating results
• Data-driven products and features
• Analytics and instrumentation
• Giving back (open source, blogging, tutorials, talks)
28. Thanks for listening! Questions?
nicholas.arcolano@runkeeper.com
http://arcolano.com
@arcolano
http://www.runkeeper.com