Generally speaking, big data and data science originated in the west and are coming to Europe with a bit of a delay. There is at least one exception though: The London-based music discovery website Last.fm is a data company at heart and has been doing large-scale data processing and analysis for years. It started using Hadoop in early 2006, for instance, making it one of the earliest adopters worldwide. When I left Last.fm to join Massive Media, the social media company behind Netlog.com and Twoo.com, I basically moved from a data science forerunner to a newcomer. Massive Media had at least as much data to play with and tremendous potential, but they were not doing much with it yet. The data science team had to be built from the ground up, and every step had to be argued for and justified along the way. Having re-evaluated everything I learned at Last.fm and started over with a clean slate at Massive Media, I developed a pretty clear perspective on how to find good data scientists, what they should be doing, what tools they should be using, and how to organize them to work together efficiently as a team, which is precisely what I would like to share in this talk.
2. MY CAREER PATH SO FAR
2007: Began working with big data as PhD student
2009: Embarked on a data science career at Last.fm
2011: Joined Massive Media as Lead Data Scientist
Last.fm: data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised the first “NoSQL” meetup in SF.
Massive Media: huge audience and tremendous potential, but a data science newcomer at the time.
3. MY TEAM AT MASSIVE MEDIA
Currently 4 permanent people (+ interns!), so not huge just yet
Relatively big and growing faster than anticipated though
6. OUR MISSION IS HELPING THE COMPANY...
MEASURE: metrics dashboards
EVALUATE: data-driven testing
DECIDE: ad hoc data insights
IMPROVE: e.g. abuse detection
EXTEND: new product features
PROMOTE: PR via data porn
Very wide range of tasks
Higher risk but bigger returns
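For the EVALUATE item, assuming "data-driven testing" refers to A/B tests, the underlying check could be sketched as a two-proportion z-test on conversion counts. The function name and the numbers below are hypothetical, and a real evaluation would also need to consider sample size planning and repeated testing.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical experiment: variant B converts 260 of 10000 users, A 200 of 10000
z, p_value = two_proportion_z_test(200, 10000, 260, 10000)
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the difference is unlikely to be noise, which is the kind of evidence a data-driven test is meant to provide before a change is rolled out.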
8. BOOTSTRAP BY SAVING OR GAINING MONEY
You need to get some capital to get started
Saving money tends to be easier in practice
Real-world example:
• Analyzing CDN logs unveiled abuse
• Stopping the abuse greatly reduced the bills
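As an illustration of the kind of analysis behind this example, here is a minimal sketch that totals the bytes served per referrer from CDN access logs, so that unusually heavy consumers stand out. The log format (whitespace-separated client IP, referrer and byte count), the sample lines and the function name are assumptions made for illustration, not the actual logs or code.

```python
from collections import Counter


def bytes_per_referrer(log_lines):
    """Sum bytes served per referrer; assumes 'ip referrer bytes' per line."""
    totals = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip malformed lines instead of failing
        _ip, referrer, size = parts
        totals[referrer] += int(size)
    return totals


# Hypothetical sample data
sample_log = [
    "10.0.0.1 mysite.example 1200",
    "10.0.0.2 hotlinker.example 900000",
    "10.0.0.3 mysite.example 800",
    "10.0.0.4 hotlinker.example 850000",
    "garbage line",
]

totals = bytes_per_referrer(sample_log)
for referrer, total in totals.most_common(2):
    print(referrer, total)
```

On real log volumes the same aggregation would run as a Hadoop job rather than in memory, but the logic stays the same: group by a suspect dimension, sum the cost, and look at the top of the list.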
10. HADOOP
Not the holy grail, but deserves a central role
It has a vibrant community and is proven to be:
ECONOMICAL: runs on commodity hardware
SCALABLE: smart distributed processing
MAINTAINABLE: very robust and fault-tolerant
FLEXIBLE: predefined schemas not required
14. STATS PIPELINE BASED ON HADOOP
[Diagram: Log collector, HDFS, MapReduce, HBase and Dashboards, extended with Realtime processing and Ad-hoc results; flows are labelled “in batches” and “continuous”.]
Cf. “lambda architecture”, coined by @nathanmarz
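The general idea behind the lambda architecture mentioned on this slide can be sketched in a few lines: a dashboard query combines a precomputed batch view with a small realtime view covering events the last batch run has not processed yet. The dictionaries below stand in for HBase tables, and all names and numbers are hypothetical; this illustrates the concept only, not the actual pipeline.

```python
def query(metric, batch_view, realtime_view):
    """Serve a metric as the batch total plus the realtime increment.

    Assumes the realtime view only holds events that are not yet
    reflected in the batch view, so nothing is counted twice.
    """
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)


# Hypothetical views: the batch view is recomputed periodically (e.g. by
# MapReduce), the realtime view is updated continuously in between runs.
batch_view = {"signups": 1000, "logins": 5000}
realtime_view = {"logins": 42, "messages": 7}

for metric in ("signups", "logins", "messages"):
    print(metric, query(metric, batch_view, realtime_view))
```

The appeal of this split is that the batch side can stay simple and recompute everything from raw logs, while the realtime side only has to be correct for the short window between batch runs.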
15. PYTHON IS AN AWESOME JACK OF ALL TRADES
It is great for building dashboards:
• Hadoop support: Dumbo, Python UDFs for Pig, ...
• Several amazing web frameworks, e.g. Flask
• Likewise for drawing graphs, e.g. PyCairo
And it covers many other data science needs as well:
• Scripting, prototyping and full-blown programming
• NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
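To give a flavour of the Hadoop side, here is a minimal sketch of a mapper and a reducer written as Python generator functions, the style Dumbo uses, counting events per day for a dashboard. The tab-separated log format and the `run_locally` helper are assumptions made for illustration; the helper only imitates Hadoop's sort-and-group step in memory so that the example runs without a cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit (day, 1) per log line; assumes 'YYYY-MM-DD<TAB>event' lines."""
    day, _event = value.split("\t", 1)
    yield day, 1


def reducer(key, values):
    """Sum the counts emitted for one day."""
    yield key, sum(values)


def run_locally(lines, mapper, reducer):
    """Hypothetical stand-in for Hadoop: map, sort by key, group, reduce."""
    mapped = [pair for offset, line in enumerate(lines)
              for pair in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results


# Hypothetical sample data
log = ["2013-01-01\tlogin", "2013-01-01\tsignup", "2013-01-02\tlogin"]
daily_counts = dict(run_locally(log, mapper, reducer))
print(daily_counts)
```

On a cluster, the same two functions would be handed to the framework instead of `run_locally`, and the resulting counts could feed a dashboard built with a web framework such as Flask.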
17. THE SECRET IS IN THE MIX
Hadoop’s tricks also apply to data science teams
• Avoid specialisation to allow easy distribution and scaling
• Exploit data locality by hiring people with a wide skill set
Great Data Scientists have the right mix of skills
• Hackers with solid technical background
• Analytical mind that knows statistics and machine learning
• Clever and creative in everything they do
19. SOME TIPS AND TRICKS
Dare to fail and/or start from estimates
Introduce data exploration/innovation days
• Basically 20% time devoted to playing with data
• Incorporate brainstorming
• Encourage collaboration
Communicate findings to the rest of the company
• Fun and silliness are allowed
• Prototype early and often
20. FIVE SIMPLE STEPS IS ALL IT TAKES
1 FOLLOW THE MONEY
2 EMBRACE HADOOP
3 BUILD DASHBOARDS
4 ASSEMBLE A TEAM
5 EXPLORE & INNOVATE
Thanks!
Questions?