This document discusses how the MapR Distribution for Hadoop can help enterprises integrate Hadoop into their IT environments. It covers three key trends driving adoption of Hadoop: 1) more data beats better algorithms, 2) big data is overwhelming traditional systems, 3) Hadoop is becoming the disruptive technology at the core of big data. It also discusses two realities: Hadoop is moving towards operational applications, and interoperability is key. The document outlines how MapR provides enterprise-grade functionality such as high availability, security, and integration with open standards to help Hadoop succeed in production environments. It shows how MapR enables both operational and analytical workloads on a single consolidated platform.
The MapR Distribution for Hadoop is globally recognized as the technology leader.
Forrester published a Wave for Big Data Hadoop Solutions in which it placed MapR as the highest-ranking product based on both current offering and roadmap.
Cloud: MapR has been selected by two of the companies most experienced with MapReduce technology, which is a testament to the technology advantages of MapR’s distribution. Amazon, whose Elastic MapReduce (EMR) service hosted over 2 million clusters in the past year, selected MapR to complement EMR as the only commercial Hadoop distribution offered, sold, and supported as a service by Amazon to its customers.
Google – the pioneer of MapReduce, and the company whose white paper on MapReduce inspired the creation of Hadoop – has also selected MapR to make our distribution available on Google Compute Engine.
Hadoop is making CIOs rethink their data architecture. It is a fundamental shift in the economics of data storage, processing, and analytics, and it is opening up entirely new business opportunities. Let’s talk about three key trends we are seeing, as well as the realities or implications for your business and your “readiness” to harness the power of big data and Hadoop.
The first trend is that industry leaders have shown how to use big data to compete and win in their markets. It’s no longer a nice-to-have – you need big data to compete.
Google pioneered MapReduce processing on commodity hardware and used it to catapult itself into the leading search engine even though it was 19th in the market.
Yahoo! leveraged these ideas to create Hadoop to keep up with Google, and many mainstream companies have followed with new data-driven applications such as “people you may know” (started by LinkedIn and now used by Facebook, Twitter, and every social application), product recommendation engines, contextual and personalized music services (Beats), measuring digital media effectiveness (comScore), serving more relevant and targeted ads (Comcast, Rubicon Project), fraud and risk detection, healthcare efficacy, and more.
What makes the difference? A lot of attention is given to data science and developing sophisticated new algorithms, but in many cases simply having more data beats better algorithms. (Make the point about collecting consumer interaction data as well as transaction data, as an example.)
In addition, competitive advantage is decided by very small percentages. Just a 1% improvement in fraud detection can mean hundreds of millions of dollars in savings. A ½% lift in advertising effectiveness means millions in new product sales and profitability. The same applies to customer churn, disease diagnosis, and more.
A second trend in enterprise architecture is big data overwhelming the existing workload-specific systems that are in production. (List the requirements for each of these on the side in text.)
People started with mainframes or operational systems which run ERP, finance, CRM and other mission-critical applications. They require… (pick out attributes you want to stress on the left)
You also have data warehouses, marts, data mining, and other analytical systems, which pull data from these operational and other systems to provide insights to the business for decision making.
The amount and variety of data has been overloading these systems. As you try to ingest new types of data, you reach a point where these systems are no longer cost-effective to scale to terabytes or petabytes of data.
Hadoop has become the de facto big data platform, allowing organizations to keep up with big data and feed data-driven applications and processes.
This chart shows the percentage growth of jobs from Indeed.com.
Compared to other popular technologies such as MongoDB and Cassandra, Hadoop is not only the fastest-growing big data technology, it’s one of the fastest-growing technologies, period.
Hadoop has the most robust ecosystem and momentum and is the big data platform of choice for industry-leading, data-driven companies
(Also of interest: Indeed.com – a subsidiary of a Japanese-owned company – is a customer of MapR; they harness and analyze all of this job-trends data using MapR.)
Hadoop is being used in lots of different use cases across a variety of industries
One way to think about this is by functional area of an organization (from left to right: CIO/chief data officer; CMO (marketing); CSO or CRO (chief security or risk officer); and the COO, head of quality, or IT operations).
We have many customers in each of these areas. Here are some example customers of MapR (give example snippets of each)
You can also put different use cases in each column that are relevant for your customer
The first reality is that as people put Hadoop into production to relieve the pressure on other systems in their enterprise architecture, it needs to be reliable. Hadoop needs to be held to the same enterprise standards as your Oracle, SAP, Teradata, NetApp storage, or any other enterprise system.
Many organizations are putting Hadoop into their data center to provide (list of use cases underneath) … it can do all of this and more, but
For Hadoop to act as a system of record, it must provide the same guarantees for SLAs, performance, data protection, and more.
Most importantly, Hadoop has the potential for both analytics AND operations. It can be used to optimize the data warehouse by providing batch data refining or storage. But done right, Hadoop can also handle many operational analytics and database operations/jobs.
Verizon Teradata example
Less than 10% of CDRs analyzed.
Push messaging. Starbucks or ESPN applications, and others.
MapR is the only software they pay for, and they have HBase committers on staff.
They consolidated 8 application clusters into 1 MapR cluster – one cluster with 8 sub-clusters running on different sets of nodes. Data placement control enables this.
They went from 12 CDH servers down to 6, just for HBase tables. (They won’t use M7 since they are HBase committers.)
They ran MinuteSort, a benchmark that measures how much data you can sort in one minute. The MinuteSort world record was set by Yahoo!, sorting 1.6 TB with 2,200 nodes. This MapR customer broke the record by sorting 1.65 TB with just 298 nodes. That’s about 1/7th the hardware, which translates into tremendous cost, space, and management savings.
MapR enables integration by providing industry-standard interfaces.
More third-party solutions work with MapR than with any other distribution.
No proprietary connectors needed.
NFS
All file-based applications can read and write data
Examples: Linux utilities, file browsers, Informatica UltraMessaging
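Because the cluster is exposed over standard NFS, any file-based program can read and write cluster data with ordinary POSIX I/O – no Hadoop client or HDFS API required. A minimal sketch in Python (the mount point and file names below are hypothetical; MapR clusters are conventionally mounted under /mapr/<cluster-name>):

```python
import os

def append_events(base_dir, filename, events):
    """Append newline-delimited events to a file on the NFS-mounted cluster.

    base_dir: where the cluster is mounted, e.g. /mapr/my.cluster.com/apps
    (hypothetical path). This is plain POSIX file I/O, which is exactly
    why unmodified tools and scripts work against the cluster.
    """
    path = os.path.join(base_dir, filename)
    with open(path, "a") as f:  # ordinary append, no special client
        for event in events:
            f.write(event + "\n")
    return path

def read_events(base_dir, filename):
    """Read the same file back with ordinary file I/O."""
    with open(os.path.join(base_dir, filename)) as f:
        return f.read().splitlines()
```

The same mount is what lets Linux utilities (cp, grep, tail) and file browsers operate on cluster data directly, which is the point of this slide.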
ODBC 3.52
All BI applications can leverage Hive
Examples: Excel, Crystal Reports, Tableau, MicroStrategy
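The ODBC path can be sketched the same way. Every BI tool in the list connects by building an ODBC connection string and issuing SQL through the driver; the sketch below does the same from Python. The driver name, host, and the `sales` table are assumptions for illustration – use whatever name your installed Hive ODBC driver registers:

```python
def hive_conn_str(host, port=10000, dsn=None):
    """Build an ODBC-style connection string for a Hive server.

    If a DSN is configured (as a BI tool like Tableau or Excel would use),
    it takes precedence; otherwise connect by driver name directly.
    "Hive ODBC Driver" is a placeholder driver name.
    """
    if dsn:
        return "DSN=%s" % dsn
    return "DRIVER={Hive ODBC Driver};HOST=%s;PORT=%d" % (host, port)

def top_products(conn_str, limit=10):
    """Run an aggregate query through an ODBC bridge (pyodbc here)."""
    import pyodbc  # third-party ODBC bridge; requires an installed Hive driver
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        cur = conn.cursor()
        cur.execute(
            "SELECT product, COUNT(*) AS n FROM sales "
            "GROUP BY product ORDER BY n DESC LIMIT %d" % limit
        )
        return cur.fetchall()
```

Because the interface is standard ODBC, the same DSN serves Excel, Crystal Reports, Tableau, and MicroStrategy without per-tool connectors.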
Linux PAM
Any authentication provider can be used
Examples: LDAP, Kerberos, 3rd party
Because only MapR can reliably run both operational and analytical applications on one platform/cluster, MapR enables a faster closed-loop process between operational applications and analytics. This means:
Interactive marketers and algorithms can update rules engines more quickly and provide more real-time targeting of offers and relevant content to consumers.
Fraud models are kept more up to date with the latest patterns to better detect anomalies and take action more quickly on bad actors
More important than our product is ensuring customer success.
MapR creates a new opportunity for enterprises: the opportunity to revolutionize the enterprise data architecture.
From… ‘redundant processing silos’ and ‘data science experiments’, where you need separate Hadoop clusters for streaming, HDFS/Hive, HBase, and more.
To… a ‘converged data & processing hub’ that provides a true production enterprise data hub.
This allows you to consolidate operational and analytical workloads – not only across Hadoop use cases and applications, but also to optimize your enterprise data architecture.