OOP 2014

Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop

Emil A. Siemes
esiemes@hortonworks.com
Solution Engineer
January 2014

Enable your Modern Data Architecture by
Our Mission: Delivering Enterprise Apache Hadoop

Our Commitment
Headquarters: Palo Alto, CA
Employees: 300+ and growing

Open Leadership
Drive innovation in the open exclusively via the
Apache community-driven open source process

Trusted Partners

Enterprise Rigor
Engineer, test and certify Apache Hadoop with
the enterprise in mind

Ecosystem Endorsement
Focus on deep integration with existing data
center technologies and skills

Page 2

APPLICATIONS

A Traditional Approach Under Pressure
Custom
Applications

Business
Analytics

Packaged
Applications

DATA SYSTEM

2.8 ZB in 2012
85% from New Data Types
RDBMS

EDW

MPP

REPOSITORIES

15x Machine Data by 2020
40 ZB by 2020

SOURCES

Source: IDC

Existing Sources

Emerging Sources

(CRM, ERP, Clickstream, Logs)

(Sensor, Sentiment, Geo, Unstructured)

Page 3

APPLICATIONS

Emerging Modern Data Architecture
Custom
Applications

Business
Analytics

Packaged
Applications
DEV & DATA
TOOLS

SOURCES

DATA SYSTEM

BUILD &
TEST

OPERATIONAL
TOOLS
RDBMS

EDW

MANAGE &
MONITOR

MPP

REPOSITORIES

Existing Sources

Emerging Sources

(CRM, ERP, Clickstream, Logs)

(Sensor, Sentiment, Geo, Unstructured)

Page 4

Drivers of Hadoop Adoption

New Business
Applications
From NEW types of
Data (or existing
types for longer)

Page 5

Most Common NEW TYPES OF DATA
1. Sentiment
Understand how your customers feel about your brand and
products – right now

2. Clickstream
Capture and analyze website visitors’ data trails and
optimize your website

3. Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines

4. Geographic

Value

Analyze location-based data to manage operations where
they occur

5. Server Logs
Research logs to diagnose process failures and prevent
security breaches

6. Unstructured (txt, video, pictures, etc..)
Understand patterns in files across millions of web pages,
emails, and documents

+ Keep existing
data longer!

Drivers of Hadoop Adoption
Architectural
A Modern Data
Architecture

New Business
Applications

Complement your existing data
systems: the right workload in the
right place

Page 7

Let’s build a Data Lake…
Instructions on:
hadoopwrangler.com

Page 8

HDP Data Lake Solution Architecture
Manage Steps 1-4: Data Lifecycle with Falcon
FALCON (data pipeline & flow management)

Downstream
Data Sources

Oozie (Batch scheduler)
Step 4: Schedule and Orchestrate

HIVE

SOURCE
DATA

PIG

Step 3: Transform, Aggregate & Materialize

EDW

ClickStream
Data

HCATALOG
File

Sales
Transaction/
Data
JMS

Ingestion

Step 1:Extract & Load
REST

HTTP

Social Data

Sqoop/Hiv
e

EDW
(Teradata)

Step 2: Model/Apply Metadata
INTERACTIVE

SQOOP

compute
&
storage
.
Storm

Web
HDFS

Query/
Analytics/Repor
ting Tools

Hive Server
(Tez/Stinger)

Tableau/Excel

YARN
.
MR2

.

NFS

Marketing/I
nventory

HBase
Client

OLTP
HBase

Use Case Type 1:
Materialize & Exchange

(table & user-defined metadata)

FLUME

Product
Data

Mahout

(data processing)

Exchange

.

.

.

Elastic
Search

.

TEZ

.

.

.
SAS
compute
&
storage

Use Case Type 2:
Explore/Visualize

Datameer/Platfo
ra/SAP

Stream Processing,
Real-time Search,
MPI

AMBARI
Streaming

YARN
Apps

Data Lake HDP Grid
Knox – Perimeter Level Security

Opens up Hadoop to
many new use cases
Page 9

Hadoop 2: The Introduction of YARN
Store all date in a single place, interact in multiple ways
Single Use System

Multi Use Data Platform

Batch Apps

Batch, Interactive, Online, Streaming, …

1st Gen of
Hadoop

HADOOP 2
Standard Query
Processing

(cluster resource management
& data processing)

HDFS
(redundant, reliable storage)

Real Time Stream
Processing

Hive, Pig

MapReduce

Online Data
Processing
HBase, Accumulo

Storm

Batch

…

Interactive

MapReduce

others

Tez

Efficient Cluster Resource
Management & Shared Services
(YARN)

Redundant, Reliable Storage
(HDFS)

Page 10

Let’s start simple…
• A solution unifying all data sources of a mobile App
– Allowing analytics over all data in one place
– In real time and long term

• Mobile Apps have multiple channels for data:
– Data created on the handset (e.g. geo location)
– Data created on servers accessed by the mobile app (e.g. app
data, logs)
– Data from backend services (e.g. RDBMS)
– Store data (e.g. iTunes Connect, Google Play)
– Social data (Twitter, App Reviews, etc.)

Page 11

Why Should We Care?
• How much revenue did I made?
(Not that easy to answer as one could think)
• Where are my customers now?
• Can you fulfill requirements from the business like: ”Tell me when our
customers are in a coffee shop so we can offer them e.g. Wifi”
• What are my customers thinking about my app/brand?
• Are the ones complaining really using it (correct)?
• How can I support marketing activities?
• How can I evaluate local marketing activities?
• Does positive/negative sentiment effect my downloads?
• Will my servers be able to deal with the load in 3 months
• …

Page 12

Design Goals
• Use as much as we have in our stack as possible
• Minimize dependencies on stacks beyond Hadoop
– Still make it useful and complete

• Make it fit into a 8GB MacBook/Laptop
• Release early & release often

Page 13

Types Of Data For iiCaptain
• Geo location data
• Store Data
• iTunes Connect, Google Play, Amazon via AppAnnie

• Twitter
• RDBMS (Sqoop)
• Logs

Page 15

iiCaptain’s Data Ocean / Data Lake

Page 16

SQL Interactive Query & Apache Hive
Key Services

Apache Hive

Platform, operational and
data services essential for
the enterprise

• The defacto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query

Skills

SQL

Leverage your existing
skills: development,
analytics, operations

Stinger Initiative

Integration
Interoperable with existing
data center investments

Broad, community based effort to deliver the
next generation of Apache Hive
Speed

Scale

SQL

Improve Hive query
performance by 100X to
allow for interactive
query times (seconds)

The only SQL interface
to Hadoop designed for
queries that scale from
TB to PB

Support broadest range
of SQL semantics for
analytic applications
against Hadoop

Page 19

Build Process, Shining With Savanna

Page 20

Roadmap
-

Servlet Engine in YARN
Project Savanna: Continuous Delivery end-2-end
Sentiment Analysis with Flume/Hive and App Reviews
Knox
Falcon
Phoenix

Page 21

HDP 2.0: Enterprise Hadoop Platform
OPERATIONAL
OPERATIONAL
SERVICES
SERVICES
AMBARI
Cluster
AMBARI Dataset
Mgmnt FALCON
FALCON*
Mgmnt
Schedule
OOZIE
OOZIE

Hortonworks
Data Platform (HDP)

DATA
DATA
SERVICES
SERVICES
FLUME
FLUME
Data
Movement

SQOOP
SQOOP
LOAD &
LOAD &
EXTRACT
EXTRACT

NFS
NFS

CORE
CORE SERVICES

WebHDFS

CORE
CORE SERVICES
SERVICES

KNOX*
KNOX*

WebHDFS

HIVE
HBASEData Access HIVE&
PIG
HCATALOG
HBASE

MAP

Process
REDUCE

TEZ
TEZ

ResourceYARN
Management

Cloud

• Integrates full range of
enterprise-ready services

HDFS
Storage
HDFS
Enterprise Readiness
High Availability, Disaster
Recovery, Rolling Upgrades,
Security and Snapshots

HORTONWORKS
DATA PLATFORM (HDP)
OS/VM

• The ONLY 100% open source
and most current platform

• Certified and tested at scale
• Engineered for deep
ecosystem interoperability

Appliance

Page 22

Hortonworks: The Value of “Open” for You
Validate & Try
1. Download the
Hortonworks Sandbox
2. Learn Hadoop using the
technical tutorials

3. Investigate a business
case using the step-bystep business cases
scenarios
4. Validate YOUR business
case using your data in
the sandbox

Engage
1. Execute a Business Case
Discovery Workshop with
our architects
2. Build a business case for
Hadoop today

Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so
that you are represented in the open source community

Avoid Vendor Lock-In
Hortonworks Data Platform remain as close to the open source trunk as
possible and is developed 100% in the open so you are never locked in

The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center
technologies so you can leverage existing skills and investments

Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to
ensure reliability and stability you require for enterprise use

Support from the Experts
We provide the highest quality of support for deploying at scale. You are
supported by hundreds of years of Hadoop experience

Page 23

OOP 2014

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to OOP 2014

Similar to OOP 2014 (20)

Recently uploaded

Recently uploaded (20)

OOP 2014

Editor's Notes