2014.07.01 - New Technologies, New Roles, New Architectures - Singapore Management University - BigData SG

© 2014 MapR Technologies
Contact info
• Slides will be online later: relax, enjoy, ask questions, participate
• allenday@mapr.com
• @allenday
• slideshare.net/allenday
• github.com/allenday
• …etc

Allen’s Scorecard – Data Science Roles
• Domain Expertise – Genetics, geospatial, advertising
• Data Science – Biostatistics, recommendation systems, persuasion
• App Development – R (13 yr), Hadoop (7 yr); before that: web apps
• Operations – Horizontal scaling (e.g. web apps), automation

Message
How do emerging technologies…
…change our roles and
…change the way we design systems?

Example: Sensor Data from Drilling Rigs
Real-time + long-time data use case

Powerful Combination: RT Sensor Data + Histories
• Internet of Things is resulting in huge quantities of sensor data
• New opportunity for a fine-grained view: keep years of data instead of months
• Also analyze in real time for short-term reporting, dashboards, anomaly detection, and predictive modeling

Maintenance Database
• Which part?
• When was maintenance performed?
• Why was it repaired?
• If it malfunctioned, what were the details?

Real-Time Sensor Data
• What is the current status of the part?
• What are the current conditions?
• Where is it located?
• How much stress is it under?

Real-Time Sensor Data + Maintenance Database + Machine Learning => Data Models
• Analyze maintenance records
• Predict maintenance needs
• Schedule repairs to reduce costs
• Reduce damage from unexpected failures
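
A minimal sketch, in Python, of how the maintenance records and the real-time readings might be combined to decide which repairs to schedule. All part names, dates, stress values, and thresholds are hypothetical, and the hand-written rule only stands in for a model that would be learned from the maintenance records.

```python
from datetime import date

# Hypothetical maintenance database: part -> date of last maintenance.
maintenance_db = {
    "pump-7": date(2014, 1, 10),
    "valve-2": date(2014, 6, 15),
    "bit-1": date(2013, 12, 1),
}

# Hypothetical latest real-time readings: part -> current stress (0..1).
sensor_now = {"pump-7": 0.91, "valve-2": 0.95, "bit-1": 0.30, "motor-3": 0.85}


def parts_to_schedule(maintenance, sensors, today, max_days=90, max_stress=0.8):
    """Flag parts that are under high stress AND overdue (or never serviced).

    The two thresholds are illustrative assumptions, not values from the talk.
    """
    flagged = []
    for part, stress in sensors.items():
        last = maintenance.get(part)
        overdue = last is None or (today - last).days > max_days
        if overdue and stress > max_stress:
            flagged.append(part)
    return sorted(flagged)


print(parts_to_schedule(maintenance_db, sensor_now, today=date(2014, 7, 1)))
```

Neither source is enough on its own here: the history says which parts are overdue, and the live readings say which of those are under stress right now.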

How can an application be built to do this?

Application: Data Access
[Diagram: New Data flows into Real Time Processing (Spark Streaming, Storm) and Long Term Persistence (Hadoop); a Query reads from both.]

The Challenge: Hadoop is Not Very Real-time
[Timeline figure, axis t ending at "now". Labels: fully processed data; latest full period; unprocessed data; "Hadoop job takes this long for this data".]

Real-time and Long-time together
[Timeline figure, axis t ending at "now". Hadoop works great back in the past; Spark Streaming or Storm work near "now"; a Blended View spans both.]

Lambda Architecture
[Diagram labels: New Data; SPEED LAYER; BATCH LAYER; SERVING LAYER; Query.]

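Query Process
[Diagram: New Data feeds two layers, and a Query is answered by merging their partial results.]
• SPEED LAYER: Real Time Processing (Spark Streaming, Storm) => Partial Query Results
• BATCH LAYER: Long Term Persistence & Batch Processing (Hadoop) => Partial Query Results
• SERVING LAYER: Query => Merge Query Results => Results
• Query tools shown: Drill, Impala, Hive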
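
The merge step in the query process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event format, the `cutoff` timestamp, and the function names are hypothetical, and in-memory counters stand in for what would really be a Hadoop job (batch) and a Storm or Spark Streaming job (speed).

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: counts per sensor over events older than the cutoff."""
    return Counter(e["sensor"] for e in events if e["t"] < cutoff)


def speed_view(events, cutoff):
    """Speed layer: counts per sensor over events at or after the cutoff."""
    return Counter(e["sensor"] for e in events if e["t"] >= cutoff)


def merged_query(batch, speed, sensor):
    """Serving layer: merge the two partial results for one sensor."""
    return batch[sensor] + speed[sensor]


# Hypothetical event stream; "t" is a timestamp, cutoff is the end of the
# last period the batch job has fully processed.
events = [
    {"t": 1, "sensor": "rig-a"},
    {"t": 2, "sensor": "rig-b"},
    {"t": 3, "sensor": "rig-a"},
    {"t": 8, "sensor": "rig-a"},
    {"t": 9, "sensor": "rig-b"},
]
cutoff = 5
batch = batch_view(events, cutoff)
speed = speed_view(events, cutoff)
print(merged_query(batch, speed, "rig-a"))
```

Because the two views cover disjoint time ranges, summing them gives the same answer as counting over all events, which is what makes the blended view consistent.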

New designs benefit from overlapping roles:
Dev + Ops

Production involves real time & long time processing

Ongoing Development

DevOps View

[Timeline figure, axis t ending at "now". Labels: data snapshot for devops and QA; live data for production systems; step forward.]

Recommendation Systems

Recommendations
– Data used to train the model: interactions between users (people taking action) and items
– Goal is to suggest additional interactions
– Example applications: movie, music, or map-based restaurant choices; suggesting sale items for e-stores or via cash-register receipts
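
A minimal Python sketch of the idea in the bullets above: learn from user-item interactions, then suggest additional interactions. It uses plain co-occurrence counting, which is a deliberate simplification and not the method Mahout applies; the user histories and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count how often each pair of items appears in the same user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, top_n=3):
    """Score unseen items by how often they co-occur with the user's items."""
    seen = set(user_items)
    scores = defaultdict(int)
    for item in seen:
        for other, c in counts.get(item, {}).items():
            if other not in seen:
                scores[other] += c
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical interaction data: user -> items they interacted with.
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "d"],
}
counts = cooccurrence(histories)
print(recommend(["a"], counts))
```

The counting step is the offline part that can run over the full history; the scoring step is cheap enough to run per request against a new user's recent activity.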

Recommendation
Behavior of a crowd
helps us understand
what individuals will do

Example: Real-time recommender using MapR data platform
[Diagram. Inputs: User History Log Files and Item Meta-Data, ingested easily via NFS into the MapR Cluster. Offline analysis: Mahout Analysis, Pig, Python (use Python directly via NFS). Real-time recommendations: Search Technology, Web Tier, Recommendations, New User History. Layers shown: Real-time Layer, Batch Layer, Serving Layer.]

Result: System delivers real-time custom recommendations
based on music listening activity

Practical Machine Learning: Free e-books
• Practical Machine Learning series authored by Ted Dunning and Ellen Friedman, published by O’Reilly (2014)
• The books provide innovations and advice that make machine learning more accessible and more successful in real-world settings
• Two titles available now as free e-book downloads from the MapR website: Innovations in Recommendation and A New Look at Anomaly Detection
http://bit.ly/1nI2dyS

Building data science teams

Q: Can I simply hire one rock star data scientist to cover all this kind of work?

A: No, interdisciplinary work requires teams
A: Hire leads who can speak the lingo of each required discipline
A: Hire individual contributors who cover 2+ roles, when possible

Good news: you don’t have to do it all at once
Build in steps and repurpose existing expertise

Team Process = Needs
[Diagram of five needs: apps, discovery, modeling, systems, integration. Captions:]
• help people ask the right questions
• allow automation to place informed bets
• deliver products at scale to customers
• build smarts into product features
• keep infrastructure running, cost-effective
Team Process = Needs
[Diagram of five needs: apps, discovery, modeling, systems, integration]
These are the primary phases of leveraging big data.
Analysts drive from discovery.
Engineers drive from systems.
Both meet at integration.
Effective management of Data Science lives at integration and doesn’t delegate it.

Team Composition = Roles
• business process, stakeholder
• data prep, discovery, modeling, etc.
• software engineering, automation
• systems engineering, availability
Each role brings different disciplines, opportunities, and risks. Pairing people with complementary skills is a powerful technique.
Blurring roles is very effective with great people, e.g. DevOps.
There is danger in blurring boundaries: don’t try to create rockstars (pushing down / overloading stresses teams).

Team Matrix = Needs x Roles
• business process, stakeholder
• data prep, discovery, modeling, etc.
• software engineering, automation
• systems engineering, access

Team Matrix
• business process, stakeholder
• data prep, discovery, modeling, etc.
• software engineering, automation
• systems engineering, access
Conceptual tool for building and managing Data Science teams
Overlay your project requirements (needs) with your team’s strengths (roles)
That will show very quickly where to focus
Bring in individuals who cover 2-3 needs, particularly for Team Leads

Allen’s Overlay
• business process, stakeholder
• data prep, discovery, modeling, etc.
• software engineering, automation
• systems engineering, access

Aggressively Proactive Learning
• Disrupts old learning and management models
– One size fits all
– Specialists
Hire people who learn and re-learn efficiently
Throw Your Life a Curve, Whitney Johnson
blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html

Recap
• Scalable storage allows for huge amounts of data
• Huge data calls for new system designs
Lambda Architecture: conceptual framework for designing systems that combine real-time and long-time data
• New system designs call for new definitions of roles and teams
Building Data Science Teams: conceptual framework for building teams that can effectively work with huge amounts of data

Bonus round:
What’s MapR?
Why care?

MapR Data Platform
Supports Complete Data Science Lifecycle
[Diagram labels: Filesystem, POSIX NFS, HBase, HDFS, MapReduce, SAN Storage]

MapR Data Platform
Architecture in a Nutshell
[Diagram labels: FILESYSTEM, POSIX NFS; HBASE, NoSQL tables API; HADOOP, HDFS API; Apache Hadoop HDFS; Apache HBase. Arrows are labeled IMPLEMENTS and DEPENDS.]

MapR Data Platform
Architecture in a Nutshell
Vertical Integration = High Performance
[Diagram labels: HADOOP, HDFS API; HBASE, NoSQL tables API; FILESYSTEM; POSIX NFS; Apache Hadoop HDFS; Apache HBase. Arrows are labeled IMPLEMENTS and DEPENDS.]
