1. How to Make Hadoop
Easy, Dependable, and Fast
©MapR Technologies - Confidential 1
2. Agenda
Quick MapR overview
Typical and atypical use cases
– restaurant recommendation
– network security
– mega-scale fraud modeling
– log analysis through creative abuse of text retrieval
Lessons Learned
– data import and export techniques
– zen integration
– accessing data from a variety of applications
– how to protect data from the most common cause of data loss
3. MapR’s Complete Distribution for Apache Hadoop
[Architecture diagram]
Apache applications are integrated, tested, hardened, and supported:
– Hive, Pig, Oozie, Sqoop, HBase, Whirr, Mahout, Cascading, Nagios, Ganglia, Flume, ZooKeeper
MapR Control System: Heatmap™, LDAP/NIS integration, quotas, alerts and alarms, CLI, REST API
100% HDFS API compatible: no changes required to Hadoop applications; easy portability/migration between distributions
MapR’s Storage Services™: Direct Access NFS, real-time streaming, volumes, mirrors, snapshots, data placement
No NameNode architecture, high-performance direct shuffle, stateful failover, and self-healing
4. MapR: Lights Out Data Center Ready
Reliable Compute
– automated stateful failover
– automated re-replication
– self-healing from HW and SW failures
– load balancing
– rolling upgrades
– no lost jobs or data
– five nines (99.999%) of uptime
Dependable Storage
– business continuity with snapshots and mirrors
– recover to a point in time
– end-to-end checksumming
– strong consistency
– data kept safe
– mirror across sites to meet Recovery Time Objectives
5. Restaurant Recommendation
Use transaction data to characterize users
Determine restaurant affinities for transactors
On demand, produce geo-local restaurant recommendations
Web or mobile interface
6. Restaurant Recommendation
Training inputs
– transaction data from purchases
– user feedback on recommendations
Training outputs
– large recommendation data files
Online inputs
– user id, current location, recent transaction history
– filters
Online outputs
– restaurant recommendations
7. What Is the Delivery Mechanism?
Database?
– export takes forever
– limited scalability
Key value store?
– export takes forever
– YAWTM (yet another widget to maintain)
Do we really need a mechanism at all?
9. Summary
With mirrors and NFS, no special “deployment” mechanism is
necessary
The user’s browser can do final assembly of the recommendations
Recommendation components are served as static files by the web server
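One way to picture the browser-side assembly: each static file served by the web server carries partial per-restaurant scores, and the client merges them into one ranking. A minimal Python sketch of that merge; the component names, weights, and file contents here are hypothetical, not from the talk:

```python
import json

def assemble(components, weights):
    """Merge per-component restaurant scores into one ranked list.

    components: mapping of component name -> {restaurant_id: score}
    weights: mapping of component name -> blending weight
    """
    totals = {}
    for name, scores in components.items():
        w = weights.get(name, 1.0)
        for rid, s in scores.items():
            totals[rid] = totals.get(rid, 0.0) + w * s
    # highest combined score first
    return sorted(totals, key=totals.get, reverse=True)

# Two "static files" a web server might serve, inlined here as JSON strings.
geo = json.loads('{"r1": 0.9, "r2": 0.2}')
taste = json.loads('{"r1": 0.1, "r2": 0.8, "r3": 0.5}')
ranking = assemble({"geo": geo, "taste": taste}, {"geo": 2.0, "taste": 1.0})
```

Because the components are plain static files, "deployment" is just dropping new files where the web server (or a mirror) can see them.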
10. Mega-scale Fraud Modeling
Why not use the simplest modeling technology around?
– similar folk do similar things!
– just find tens of thousands of similar folk and see what they did
Can we make it a million times faster than the prototype?
– well, yes … we can
And can you deploy that into a live system?
And can sequential and parallel versions co-exist?
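The "similar folk do similar things" idea is plain k-nearest-neighbor scoring: find the k most similar past transactors and vote with their outcomes. A toy sketch under assumed two-dimensional feature vectors; all names and data here are hypothetical:

```python
import math

def knn_fraud_score(query, history, k=3):
    """Score a transaction by the fraud rate among its k nearest neighbors.

    query: feature vector of the new transaction
    history: list of (feature_vector, was_fraud) pairs
    """
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(history, key=lambda h: dist(query, h[0]))[:k]
    return sum(1 for _, fraud in nearest if fraud) / k

history = [
    ((0.0, 0.0), False),
    ((0.1, 0.1), False),
    ((0.2, 0.0), False),
    ((5.0, 5.0), True),
    ((5.1, 4.9), True),
]
score = knn_fraud_score((5.0, 4.8), history, k=3)  # 2 of 3 neighbors are fraud
```

The prototype version is exactly this brute-force search; the million-fold speedup comes from clustering the history so only nearby points are examined.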
12. Speeds and Feeds
Single machine version can cluster at 20 μs per point
– 1 million points in ~20 s
– 100 million points in ~2000 s ≈ 33 minutes
Parallel version can cluster at 20 μs / nodes per point + 30 seconds
– 1 million points in 31 s on 20 nodes (ish)
– 100 million points in 150 s = 2.5 minutes (on 20 nodes)
Really would like interchangeable versions
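The back-of-envelope model behind those numbers can be written down directly (the 150 s figure on the slide evidently includes some per-run overhead beyond this simple formula):

```python
def serial_seconds(points, us_per_point=20):
    """Single-machine clustering time: fixed cost per point."""
    return points * us_per_point * 1e-6

def parallel_seconds(points, nodes, us_per_point=20, startup=30):
    """Parallel clustering: per-point work divided across nodes, plus fixed startup."""
    return points * us_per_point * 1e-6 / nodes + startup

t1 = serial_seconds(1_000_000)        # ~20 s
t2 = serial_seconds(100_000_000)      # ~2000 s, about 33 minutes
t3 = parallel_seconds(1_000_000, 20)  # ~31 s
```

The fixed 30 s startup term is why the parallel version only pays off for large inputs, and why interchangeable serial and parallel versions are worth having.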
13. What About Deployment?
Final matrix size is several GB
Can’t have copy per thread
– can’t even wait to load many copies
What about mmap?
– needs real files, can’t use HDFS
– NFS works great
Need to deploy in map-reduce and real-time environments
– can’t depend on Hadoop features like distributed cache
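A minimal sketch of the mmap approach: the model file lives on a real filesystem (here a temp file stands in for an NFS-mounted MapR path), and every thread or process maps it read-only so the OS shares one copy of the pages. The file path and layout are hypothetical:

```python
import mmap
import os
import struct
import tempfile

# Stand-in for a model file on an NFS-mounted path such as /mapr/.../model.bin
model = [3.14, 2.71, 1.41]
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<%dd" % len(model), *model))

# Each thread/process maps the same file read-only; the OS shares the pages,
# so a several-GB matrix is loaded once, not once per thread.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    values = struct.unpack("<%dd" % len(model), mm[:])
    mm.close()
```

This is why real files matter: mmap needs a file descriptor from the OS, which HDFS cannot provide but an NFS mount can.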
15. Summary
With mirrors and NFS, no special “deployment” mechanism is
necessary
The modeling client can use NFS + mmap to share memory between
threads or processes
Mirrors can stage as many replicas as desired on whichever
machines are specified
16. Network Security
Take an existing network security appliance
Add magical parallel machine learning to find new attacks
But don’t spend time copying data back and forth
And don’t change the legacy code
18. Summary
Legacy code “just works” with MapR’s NFS
Map-reduce programs don’t care where the input comes from
Exposing new control data requires no special mechanism
19. Log Analysis
Receive 200K log lines per second or more
Want to do multi-field search
Want log lines to become searchable with < 30 seconds of delay
20. Solr Based Flexible Analytics
Solr/Lucene can index at 500K small documents per second
Faceting provides simple aggregation
Multi-index search is a given, not a future enhancement
Solr/Lucene has an awesome record of stability
21. Data Ingestion and Indexing
SolR
SolR Solr
Incoming Indexer
Text
Kafka Indexer indexer
Data analysis
Real-time
Raw Live index
Older index shard
documents shards
Time sharded Solr indexes
22. Some Special Points
Textual analysis is done in parallel outside of the indexer
Raw documents are stored outside of Solr to minimize index size
Index hot-spotting is a feature here because it gives time-based
sharding
Indexing into NFS allows legacy code reuse
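Time-based sharding can be as simple as routing each log line to an index named for its time bucket, so only the newest shard is "hot" and older shards become immutable, search-only segments. A sketch; the shard-naming scheme is an assumption for illustration, not MapR's or Solr's convention:

```python
from datetime import datetime, timezone

def shard_for(epoch_seconds, bucket_minutes=60):
    """Name the Solr index (shard) that should receive a log line."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    bucket = t.replace(minute=(t.minute // bucket_minutes) * bucket_minutes,
                       second=0, microsecond=0)
    return "logs_" + bucket.strftime("%Y%m%d_%H%M")

# All lines from the same hour land in the same shard.
s = shard_for(1352386800)  # 2012-11-08 15:00:00 UTC -> "logs_20121108_1500"
```

With this routing, the indexing "hot spot" the slide mentions falls entirely on the current bucket's shard by construction.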
23. Basic Search
[Diagram] A search query enters through the web tier, which fans out Solr searches across the live index shards and the older index shards while the indexers keep writing; raw documents remain available alongside the shards.
24. Additional Points
The number of shards per core can be adjusted easily to match load
Near-real-time indexing is not really required
No transaction logs need be kept by Solr for failure tolerance
– core failure requires that other cores take on the lost shards
– indexer failure requires an indexer restart … Kafka retains unprocessed input
– indexing is idempotent
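Idempotent indexing is what makes the restart-and-replay recovery story safe: replaying the same Kafka input produces the same index. A toy sketch of the idea, keying documents by a stable id so re-indexing is an upsert; names and data are hypothetical:

```python
def index(store, docs):
    """Upsert documents keyed by id; replaying the same batch changes nothing."""
    for doc in docs:
        store[doc["id"]] = doc
    return store

batch = [{"id": "log-1", "msg": "login ok"},
         {"id": "log-2", "msg": "login failed"}]
once = index({}, batch)
# Simulated indexer crash and Kafka replay: the same batch is processed again.
twice = index(dict(once), batch)
```

Since replay converges to the same state, no Solr transaction log is needed; Kafka's retained input is the recovery log.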
25. Secure Search
Auth
data
Solr
Security search
Query Web tier
filter
SolR
SolR
Indexer
Solr
Indexer
search
Raw Live index
Older index shard
documents shards
27. Lessons Learned
Import/export is often a non-issue
– NFS allows processing in place
Legacy access via NFS provides high performance, minimal effort
Interchangeable map-reduce and conventional programs are key
Do simple tasks in simple ways. Save the effort for the big tasks
28. Zen Integration
The student went to the master and asked how to integrate
multiple programs using different models
– The master said, “to do more, do less”
The student went away and came back pointing out that HDFS
allows copying data in and out. He quoted Turing.
– The master said, “to do more, do less”
The student thought about this for many days. In the
meantime, the master installed MapR and deleted all the
integration code.
When the student returned and saw this, he asked where the
integration was.
The master answered “ ” and the student was enlightened.
29. The Cause of Almost All Data Loss
31. The Cause of Almost All Data Loss
And snapshots are the cure (partially)
32. Time for Questions
Download MapR to learn more
– http://mapr.com/download
Send email with questions later
– tdunning@maprtech.com
Tweet as the spirit moves
– @ted_dunning
These slides and other resources
– http://www.mapr.com/company/events/speaking/tableau-11-8-12
Speaker notes: MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested, and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, Sqoop, plus additional packages such as Cascading. We have spent over two years in a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. MapR has made significant updates while providing a 100% compatible distribution for Apache Hadoop. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent. With MapR, Hadoop is lights-out data center ready. MapR provides five nines (99.999%) of availability, including support for rolling upgrades, self-healing, and automated stateful failover; MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business-continuity features, including point-in-time recovery to protect against application and user errors. End-to-end checksumming means data corruption is automatically detected and corrected by MapR's self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations: every two weeks an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.