1. How to Make Hadoop
Easy, Dependable, and Fast
©MapR Technologies - Confidential 1
2. Agenda
Quick MapR overview
Typical and atypical use cases
– restaurant recommendation
– network security
– mega-scale fraud modeling
– log analysis through creative abuse of text retrieval
Lessons Learned
– data import and export techniques
– zen integration
– accessing data from a variety of applications
– how to protect data from the most common cause of data loss
3. MapR’s Complete Distribution for Apache Hadoop
[Architecture diagram]
Apache applications are integrated, tested, hardened, and supported:
– Hive, Pig, Oozie, Sqoop, HBase, Whirr, Mahout, Cascading, Nagios, Ganglia, Flume, ZooKeeper
MapR Control System: Heatmap™, LDAP/NIS integration, quotas, alerts and alarms, CLI, REST API
100% HDFS API compatible: no changes required to Hadoop applications; easy portability/migration between distributions
MapR’s Storage Services™: Direct Access NFS, real-time streaming, volumes, mirrors, snapshots, data placement
No NameNode architecture, high-performance direct shuffle, stateful failover, and self-healing
4. MapR: Lights Out Data Center Ready
Reliable Compute
– automated stateful failover
– automated re-replication
– self-healing from HW and SW failures
– load balancing
– rolling upgrades
– no lost jobs or data
– five nines (99.999%) of uptime
Dependable Storage
– business continuity with snapshots and mirrors
– recover to a point in time
– end-to-end checksumming
– strong consistency
– data kept safe
– mirror across sites to meet Recovery Time Objectives
5. Restaurant Recommendation
Use transaction data to characterize users
Determine restaurant affinities for transactors
On demand, produce geo-local restaurant recommendations
Web or mobile interface
6. Restaurant Recommendation
Training inputs
– transaction data from purchases
– user feedback on recommendations
Training outputs
– large recommendation data files
Online inputs
– user id, current location, recent transaction history
– filters
Online outputs
– restaurant recommendations
7. What Is the Delivery Mechanism?
Database?
– export takes forever
– limited scalability
Key value store?
– export takes forever
– YAWTM (yet another widget to maintain)
Do we really need a mechanism at all?
9. Summary
With mirrors and NFS, no special “deployment” mechanism is
necessary
The user’s browser can do final assembly of the recommendations
Recommendation components are served as static files by the web server
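One way to picture the browser-side assembly: each static file served by the web server carries partial per-restaurant scores, and the client merges them into one ranking. A minimal Python sketch of that merge; the component names, weights, and file contents here are hypothetical, not from the talk:

```python
import json

def assemble(components, weights):
    """Merge per-component restaurant scores into one ranked list.

    components: mapping of component name -> {restaurant_id: score}
    weights: mapping of component name -> blending weight
    """
    totals = {}
    for name, scores in components.items():
        w = weights.get(name, 1.0)
        for rid, s in scores.items():
            totals[rid] = totals.get(rid, 0.0) + w * s
    # highest combined score first
    return sorted(totals, key=totals.get, reverse=True)

# Two "static files" a web server might serve, inlined here as JSON strings.
geo = json.loads('{"r1": 0.9, "r2": 0.2}')
taste = json.loads('{"r1": 0.1, "r2": 0.8, "r3": 0.5}')
ranking = assemble({"geo": geo, "taste": taste}, {"geo": 2.0, "taste": 1.0})
```

Because the components are plain static files, "deployment" is just dropping new files where the web server (or a mirror) can see them.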
10. Mega-scale Fraud Modeling
Why not use the simplest modeling technology around?
– similar folk do similar things!
– just find tens of thousands of similar folk and see what they did
Can we make it a million times faster than the prototype?
– well, yes … we can
And can you deploy that into a live system?
And can sequential and parallel versions co-exist?
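The "similar folk do similar things" idea is plain k-nearest-neighbor scoring: find the k most similar past transactors and vote with their outcomes. A toy sketch under assumed two-dimensional feature vectors; all names and data here are hypothetical:

```python
import math

def knn_fraud_score(query, history, k=3):
    """Score a transaction by the fraud rate among its k nearest neighbors.

    query: feature vector of the new transaction
    history: list of (feature_vector, was_fraud) pairs
    """
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(history, key=lambda h: dist(query, h[0]))[:k]
    return sum(1 for _, fraud in nearest if fraud) / k

history = [
    ((0.0, 0.0), False),
    ((0.1, 0.1), False),
    ((0.2, 0.0), False),
    ((5.0, 5.0), True),
    ((5.1, 4.9), True),
]
score = knn_fraud_score((5.0, 4.8), history, k=3)  # 2 of 3 neighbors are fraud
```

The prototype version is exactly this brute-force search; the million-fold speedup comes from clustering the history so only nearby points are examined.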
12. Speeds and Feeds
Single machine version can cluster at 20 μs per point
– 1 million points in ~20 s
– 100 million points in ~2000 s ≈ 33 minutes
Parallel version can cluster at 20 μs / nodes per point + 30 seconds
– 1 million points in 31 s on 20 nodes (ish)
– 100 million points in 150 s = 2.5 minutes (on 20 nodes)
Really would like interchangeable versions
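The back-of-envelope model behind those numbers can be written down directly (the 150 s figure on the slide evidently includes some per-run overhead beyond this simple formula):

```python
def serial_seconds(points, us_per_point=20):
    """Single-machine clustering time: fixed cost per point."""
    return points * us_per_point * 1e-6

def parallel_seconds(points, nodes, us_per_point=20, startup=30):
    """Parallel clustering: per-point work divided across nodes, plus fixed startup."""
    return points * us_per_point * 1e-6 / nodes + startup

t1 = serial_seconds(1_000_000)        # ~20 s
t2 = serial_seconds(100_000_000)      # ~2000 s, about 33 minutes
t3 = parallel_seconds(1_000_000, 20)  # ~31 s
```

The fixed 30 s startup term is why the parallel version only pays off for large inputs, and why interchangeable serial and parallel versions are worth having.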
13. What About Deployment?
Final matrix size is several GB
Can’t have copy per thread
– can’t even wait to load many copies
What about mmap?
– needs real files, can’t use HDFS
– NFS works great
Need to deploy in map-reduce and real-time environments
– can’t depend on Hadoop features like distributed cache
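A minimal sketch of the mmap approach: the model file lives on a real filesystem (here a temp file stands in for an NFS-mounted MapR path), and every thread or process maps it read-only so the OS shares one copy of the pages. The file path and layout are hypothetical:

```python
import mmap
import os
import struct
import tempfile

# Stand-in for a model file on an NFS-mounted path such as /mapr/.../model.bin
model = [3.14, 2.71, 1.41]
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<%dd" % len(model), *model))

# Each thread/process maps the same file read-only; the OS shares the pages,
# so a several-GB matrix is loaded once, not once per thread.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    values = struct.unpack("<%dd" % len(model), mm[:])
    mm.close()
```

This is why real files matter: mmap needs a file descriptor from the OS, which HDFS cannot provide but an NFS mount can.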
15. Summary
With mirrors and NFS, no special “deployment” mechanism is
necessary
The modeling client can use NFS + mmap to share memory between
threads or processes
Mirrors can stage as many replicas as desired on whichever
machines are specified
16. Network Security
Take an existing network security appliance
Add magical parallel machine learning to find new attacks
But don’t spend time copying data back and forth
And don’t change the legacy code
18. Summary
Legacy code “just works” with MapR’s NFS
Map-reduce programs don’t care where the input comes from
Exposing new control data requires no special mechanism
19. Log Analysis
Receive 200K log lines per second or more
Want to do multi-field search
Want log lines to become searchable with < 30 seconds of delay
20. Solr Based Flexible Analytics
Solr/Lucene can index at 500K small documents per second
Faceting provides simple aggregation
Multi-index search is a given, not a future enhancement
Solr/Lucene has an awesome record of stability
21. Data Ingestion and Indexing
SolR
SolR Solr
Incoming Indexer
Text
Kafka Indexer indexer
Data analysis
Real-time
Raw Live index
Older index shard
documents shards
Time sharded Solr indexes
22. Some Special Points
Textual analysis is done in parallel outside of the indexer
Raw documents are stored outside of Solr to minimize index size
Index hot-spotting is a feature here because it gives time-based
sharding
Indexing into NFS allows legacy code reuse
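Time-based sharding can be as simple as routing each log line to an index named for its time bucket, so only the newest shard is "hot" and older shards become immutable, search-only segments. A sketch; the shard-naming scheme is an assumption for illustration, not MapR's or Solr's convention:

```python
from datetime import datetime, timezone

def shard_for(epoch_seconds, bucket_minutes=60):
    """Name the Solr index (shard) that should receive a log line."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    bucket = t.replace(minute=(t.minute // bucket_minutes) * bucket_minutes,
                       second=0, microsecond=0)
    return "logs_" + bucket.strftime("%Y%m%d_%H%M")

# All lines from the same hour land in the same shard.
s = shard_for(1352386800)  # 2012-11-08 15:00:00 UTC -> "logs_20121108_1500"
```

With this routing, the indexing "hot spot" the slide mentions falls entirely on the current bucket's shard by construction.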
23. Basic Search
[Diagram] A search query enters through the web tier, which fans out Solr searches across the live index shards and the older index shards while the indexers keep writing; raw documents remain available alongside the shards.
24. Additional Points
The number of shards per core can be adjusted easily to match load
Near-real-time indexing is not really required
No transaction logs need be kept by Solr for failure tolerance
– core failure requires that other cores take on the lost shards
– indexer failure requires an indexer restart … Kafka retains unprocessed input
– indexing is idempotent
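Idempotent indexing is what makes the restart-and-replay recovery story safe: replaying the same Kafka input produces the same index. A toy sketch of the idea, keying documents by a stable id so re-indexing is an upsert; names and data are hypothetical:

```python
def index(store, docs):
    """Upsert documents keyed by id; replaying the same batch changes nothing."""
    for doc in docs:
        store[doc["id"]] = doc
    return store

batch = [{"id": "log-1", "msg": "login ok"},
         {"id": "log-2", "msg": "login failed"}]
once = index({}, batch)
# Simulated indexer crash and Kafka replay: the same batch is processed again.
twice = index(dict(once), batch)
```

Since replay converges to the same state, no Solr transaction log is needed; Kafka's retained input is the recovery log.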
25. Secure Search
Auth
data
Solr
Security search
Query Web tier
filter
SolR
SolR
Indexer
Solr
Indexer
search
Raw Live index
Older index shard
documents shards
27. Lessons Learned
Import/export is often a non-issue
– NFS allows processing in place
Legacy access via NFS provides high performance, minimal effort
Interchangeable map-reduce and conventional programs are key
Do simple tasks in simple ways. Save the effort for the big tasks
28. Zen Integration
The student went to the master and asked how to integrate
multiple programs using different models
– The master said, “to do more, do less”
The student went away and came back pointing out that HDFS
allows copying data in and out. He quoted Turing.
– The master said, “to do more, do less”
The student thought about this for many days. In the
meantime, the master installed MapR and deleted all the
integration code.
When the student returned and saw this, he asked where the
integration was.
The master answered “ ” and the student was enlightened.
29. The Cause of Almost All Data Loss
31. The Cause of Almost All Data Loss
And snapshots are the cure (partially)
32. Time for Questions
Download MapR to learn more
– http://mapr.com/download
Send email with questions later
– tdunning@maprtech.com
Tweet as the spirit moves
– @ted_dunning
These slides and other resources
– http://www.mapr.com/company/events/speaking/tableau-11-8-12
Speaker notes: MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested, and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. We have spent over two years in a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. MapR has made significant updates while providing a 100% compatible distribution for Apache Hadoop. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent. With MapR, Hadoop is lights-out data center ready. MapR provides five nines (99.999%) of availability, including support for rolling upgrades, self-healing, and automated stateful failover; MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business-continuity features, including point-in-time recovery to protect against application and user errors. End-to-end checksumming means data corruption is automatically detected and corrected by MapR's self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations: every two weeks an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.