1. Realtime Analytics in Hadoop
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Rommel Garcia – Solution Engineer
October 10, 2014
2. Hadoop
3. Hadoop provides
• Terabytes to Petabytes of storage on commodity hardware (HDFS)
• Massive parallel computation on enormous amounts of data (YARN)
Hadoop is essentially a supercomputer for the masses!
4. HDFS: Scalable, Reliable, Secure Storage Platform
The Storage Platform for the Modern Data Architecture
YARN: Data Operating System
HDFS
(Hadoop Distributed File System)
Reliable
Highly Available & Fault Tolerant
Protects against data loss & corruption
Cost Effective
Horizontally scales on
Commodity Hardware
Secure
Strong access controls, integrated
with authentication mechanisms
Granular data access controls to
datasets across users and groups
Standards-Based Data Interfaces
NFS (Source/Destination)
REST (Source/Destination)
RPC (Source/Destination)
Ingest and store any data in any format
Flexible read access enables a variety of workloads
5. Hadoop 1
Single Use Data Platform
Hive Pig
Batch
HADOOP 1
Mapreduce
Redundant, Reliable Storage
(HDFS)
Java
6. 2006 2009
MR-279: YARN
Hadoop w/ MapReduce
MapReduce
Largely Batch Processing
HDFS
(Hadoop Distributed File System)
Hadoop2 & YARN based Architecture
YARN: Data Operating System
HDFS
(Hadoop Distributed File System)
Silo’d clusters
Largely batch system
Difficult to integrate
Hadoop 2 & YARN
Batch Interactive Real-Time
Enabled the
Modern Data
Architecture
October 23, 2013
7. Hadoop
Multi Use Data Platform
Batch, Interactive, Realtime, Online, Streaming, …
Management & Shared Services
HADOOP 2
Efficient Cluster Resource
(YARN)
Redundant, Reliable Storage
(HDFS)
Standard Query
Processing
Hive
Batch
MapReduce
Online Data
Processing
Interactive
Tez
Real Time Stream
Processing Others
8. Why Are Enterprises Using Hadoop?
9. Traditional systems under pressure
DATA SYSTEM APPLICATIONS
Business
Analytics
Custom
Applications
RDBMS EDW MPP
Packaged
Applications
• Silos of Data
• Costly to Scale
• Constrained Schemas
Clickstream
Geolocation
Sentiment, Web Data
Sensor, Machine Data (IoT)
Unstructured docs, emails
Server logs
SOURCES
Existing Sources
(CRM, ERP,…)
New Data Types
…and difficult to
manage new data
10. Hadoop 2 and YARN enable the Modern Data Architecture
Batch Interactive Real-Time
HDFS
(Hadoop Distributed File System)
Common data set, multiple applications
• Optionally land all data in a single cluster
• Batch, interactive & real-time use cases
• Support multi-tenant access, processing
& segmentation of data
YARN: Architectural center of Hadoop
• Consistent security, governance & operations
• Ecosystem applications run natively in Hadoop
SOURCES
EXISTING
Systems
Clickstream
Web & Social
Geolocation
Sensor & Machine
Server Logs
Unstructured
DATA SYSTEM APPLICATIONS
Business
Analytics
Custom
Applications
Packaged
Applications
RDBMS EDW MPP YARN: Data Operating System
12. Realtime Analytics in…
Financial Services
• Fraud Detection/Prevention
Retail
• Brand Sentiment Analysis
• Localized, Personalized Promotions
Telecom
• Cell tower diagnostics
• Bandwidth Allocation
Manufacturing
• Proactive Maintenance
Healthcare
• Monitor patient vitals
• Patient care and safety
• Reduce re-admittance rates
Utilities, Oil & Gas
• Smart meter stream analysis
• Proactive equipment repair
• Power and consumption matching
Public Sector
• Network intrusion detection and prevention
• Disease outbreak detection
Transportation
• Unsafe driving detection and monitoring
13. Truck Demo: Real-Time Analytics
Problem:
• The only way to measure “safe driving” is through accident occurrences.
• There’s no realtime accident prevention mechanism in place
Solution:
• Use Hadoop to analyze driving violations in real-time
• Provide a UI to view real-time violation alerts
• Provide a dashboard to review violation reports
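The idea of analyzing driving violations in real time can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the event fields (`driver_id`, `event_type`, `timestamp`), the `"Normal"` marker for non-violation events, and the window and threshold values are all assumptions made for the example.

```python
from collections import defaultdict, deque


class ViolationMonitor:
    """Counts recent violations per driver and alerts at a threshold.

    Hypothetical sketch: field names, window, and threshold are assumptions.
    Events are assumed to arrive in timestamp order (seconds).
    """

    def __init__(self, window_seconds=300, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._recent = defaultdict(deque)  # driver_id -> violation timestamps

    def process(self, event):
        """Return an alert dict when a driver reaches the threshold, else None."""
        if event["event_type"] == "Normal":
            return None
        times = self._recent[event["driver_id"]]
        times.append(event["timestamp"])
        # Drop violations that have fallen out of the sliding window.
        while event["timestamp"] - times[0] > self.window_seconds:
            times.popleft()
        if len(times) >= self.threshold:
            return {
                "driver_id": event["driver_id"],
                "violations": len(times),
                "last_event": event["event_type"],
            }
        return None


monitor = ViolationMonitor(window_seconds=300, threshold=3)
events = [
    {"driver_id": 11, "event_type": "Speeding", "timestamp": 0},
    {"driver_id": 11, "event_type": "Normal", "timestamp": 30},
    {"driver_id": 11, "event_type": "Lane Departure", "timestamp": 60},
    {"driver_id": 12, "event_type": "Speeding", "timestamp": 90},
    {"driver_id": 11, "event_type": "Unsafe Following Distance", "timestamp": 120},
]
alerts = [a for a in map(monitor.process, events) if a is not None]
```

Here only the last event produces an alert, because it is driver 11's third violation inside the window. In a real deployment this logic would live inside the stream processor rather than a local loop.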
14. Demo Time !
15. Truck Demo Real-Time Hadoop Architecture
[Architecture diagram: truck events enter through Kafka (high speed ingestion, message queue) and are processed by Storm (distributed processing). Other components shown: HDFS/Hive, HBase, ActiveMQ, Solr (reporting dashboard), and a real-time monitoring app. Flow labels: Truck Event Data, Alerts, Violations, Show Driving Report.]
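The shape of this architecture (a queue feeding a processor that fans out to several sinks) can be mimicked in memory as a rough sketch. Nothing here talks to Kafka, Storm, HBase, or ActiveMQ; the standard-library queue and plain lists are stand-ins, the sink names are hypothetical, and which component receives which stream is an assumption rather than something the slide states.

```python
import queue


def run_pipeline(raw_events):
    """Push events through an in-memory stand-in for the slide's pipeline."""
    ingest = queue.Queue()  # stand-in for the message queue (Kafka on the slide)
    for event in raw_events:
        ingest.put(event)

    # Hypothetical sinks: a store for every event and a channel for alerts.
    sinks = {"event_store": [], "alerts": []}

    # Stand-in for the stream processor (Storm on the slide).
    while not ingest.empty():
        event = ingest.get()
        sinks["event_store"].append(event)  # every event is persisted
        if event["event_type"] != "Normal":  # assumed marker for violations
            sinks["alerts"].append(
                f"driver {event['driver_id']}: {event['event_type']}"
            )
    return sinks


sinks = run_pipeline([
    {"driver_id": 7, "event_type": "Normal"},
    {"driver_id": 7, "event_type": "Speeding"},
    {"driver_id": 8, "event_type": "Normal"},
])
```

The point of the fan-out is that the same event stream serves both long-term reporting and immediate alerting without being ingested twice.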
16. Q&A
17. Hadoop 2.0
Rommel Garcia – Solution Engineer
October 10, 2014
18. Hadoop 2 Becoming A Critical Platform
19. Hadoop 2 delivers a comprehensive data management platform
Hadoop 2 Platform
Provision,
Manage &
Monitor
Ambari
Zookeeper
Scheduling
Oozie
Data Workflow,
Lifecycle &
Governance
Falcon
Sqoop
Flume
NFS
WebHDFS
In-Memory
Spark
YARN: Data Operating System
DATA MANAGEMENT
SECURITY
BATCH, INTERACTIVE & REAL-TIME
DATA ACCESS
GOVERNANCE
& INTEGRATION
Authentication
Authorization
Accounting
Data Protection
Storage: HDFS
Resources: YARN
Access: Hive, …
Pipeline: Falcon
Cluster: Knox
OPERATIONS
Script
Pig
Search
Solr
SQL
Hive
HCatalog
NoSQL
HBase
Accumulo
Stream
Storm
Others
ISV
Engines
HDFS
(Hadoop Distributed File System)
Deployment Choice
Linux, Windows, On-Premise, Cloud
YARN is the architectural
center of Hadoop 2
• Enables batch, interactive
and real-time workloads
• Single SQL engine for both batch
and interactive
• Enable existing ISV apps to plug
directly into Hadoop via YARN
Provides comprehensive
enterprise capabilities
• Governance
• Security
• Operations
The widest range of
deployment options
• Linux & Windows
• On premise & cloud
Tez Tez
20. YARN – Roadmap
21. YARN Development Framework
API
Engine
System
YARN : Data Operating System
HDFS
(Hadoop Distributed File System)
Batch
MapReduce
Real-Time
Slider
Direct
Java
.NET
Scripting
Pig
SQL
Hive
Cascading
Java
Scala
NoSQL
HBase
Accumulo
Stream
Storm
Other
ISV
Other
ISV
Applications
Others
Spark
Other ISV
22. YARN General Store – The Future
• A Data Lake that has a General Store to continually serve you:
– App Store – YARN Ready Applications
– Data Store – Where do I get the interesting data (weather, geo, etc.)?
– View Store – How do I get UIs to the cluster?
– Processing Store – Falcon, Pig, etc. for “standard” data sets or common “processing patterns”
24. Argus: Security needs are changing
Administration
Central management &
consistent security
Authentication
Authenticate users and systems
Authorization
Provision access to data
Audit
Maintain a record of data access
Data Protection
Protect data at rest and in motion
Security needs are changing
• YARN unlocks the data lake
• Multi-tenant: Multiple applications for data access
• Changing and complex compliance environment
• ETL of non-sensitive data can yield sensitive data
Summer 2014
65% of clusters host
multiple workloads
Fall 2013
Largely silo’d deployments
with single workload clusters
5 areas of security focus
25. Security in Hadoop with HDP + Argus (XA Secure)
Centralized Security Administration (Argus)

Authentication: Who am I/prove it?
Hadoop 2
• Kerberos in native Apache Hadoop
• HTTP/REST API secured with Apache Knox Gateway
Argus
• As-is, works with current authentication methods

Authorization: Restrict access to explicit data
Hadoop 2
• HDFS Permissions, HDFS ACL
• Hive ATZ-NG
Argus
• HDFS, Hive and HBase
• Fine grain access control
• RBAC

Audit: Understand who did what
Hadoop 2
• Audit logs in HDFS & MR
Argus
• Centralized audit reporting
• Policy and access history

Data Protection: Encrypt data at rest & in motion
Hadoop 2
• Wire encryption in Hadoop
• Open Source Initiatives
• Partner Solutions
Argus
• Future Integration
26. Hive– SQL In Hadoop & Roadmap
27. Hive: The De-Facto SQL Interface for Hadoop
28. Data Abstractions in Hive
Partitions, buckets and skews facilitate faster, more direct data access. Cube, windowing, and aggregation functions are supported as well.
Database
Table Table
Partition Partition Partition
Bucket
Bucket
Bucket
Optional Per Table
Unskewed Keys Skewed Keys
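To make the hierarchy above concrete, here is a small sketch of how a row could be located: partitions become `column=value` directories under the table directory, and a bucket is chosen by hashing a key modulo the bucket count. The paths, column names, and the use of CRC32 as the hash are assumptions for illustration only; Hive uses its own hash function, so real bucket numbers will differ.

```python
import zlib


def partition_dir(table_root, partition_cols):
    """Build the directory for a partition from (column, value) pairs."""
    parts = [f"{col}={val}" for col, val in partition_cols]
    return "/".join([table_root.rstrip("/")] + parts)


def bucket_for(key, num_buckets):
    """Pick a bucket by hashing the key modulo the bucket count.

    CRC32 is a deterministic stand-in here, not Hive's actual hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# Hypothetical table location and partition columns.
path = partition_dir("/warehouse/truck_events", [("year", 2014), ("month", 10)])
bucket = bucket_for("driver-42", 8)
```

A query that filters on the partition columns only needs to read the matching directory, and a lookup on the bucketing key only needs one bucket inside it, which is why both speed up data access.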
31. Hive Demo Using DBVisualizer or Excel?
33. Data Pipeline Tracing
Data pipeline
dependencies
Customer
feed
Purchase
feed
Product
feed
Store
feed
View dependencies
between clusters, datasets
and processes
Data pipeline
tagging
Sensitive encrypted
Add arbitrary tags to
feeds & processes
Credit
feed
Data pipeline
audits
Know who modified a
dataset when and into
what
Data pipeline
lineage
File-1, File-2, File-3
Analyze how a dataset
reached a particular
state
34. Example: Multi-Cluster Replication
Primary Hadoop Cluster
Raw Data
Presented
Data
Cleansed
Data
Conformed
Data
Staged Data
Presented
Data
Replication
Failover Hadoop Cluster
Replication
BI and Analytic Applications
• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters
..and many more
35. Example: Retention
Staged Data
Retention
Policy
Presented
Data
Cleansed
Data
Conformed
Data
Retain 5
Years
Retain Last
Copy Only
Retain 3
Years
Retain 3
Years
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
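As a rough sketch of what "retention policies expressed in one place" could look like, the snippet below keeps all policies in a single table and applies them with one function. It is plain Python, not Falcon configuration. The pairing of stages to policies is an illustrative guess based on the labels on the slide, and a year is approximated as 365 days.

```python
from datetime import date, timedelta

# Assumed mapping of data stage to retention period (in days).
RETENTION_DAYS = {
    "staged": 5 * 365,
    "cleansed": 3 * 365,
    "conformed": 3 * 365,
}
# Assumed: presented data keeps only its most recent copy.
KEEP_LAST_COPY_ONLY = {"presented"}


def instances_to_delete(stage, instance_dates, today):
    """Return the dataset instance dates that the stage's policy would evict."""
    ordered = sorted(instance_dates)
    if stage in KEEP_LAST_COPY_ONLY:
        return ordered[:-1]  # everything except the newest copy
    cutoff = today - timedelta(days=RETENTION_DAYS[stage])
    return [d for d in ordered if d < cutoff]


today = date(2014, 10, 10)
old_cleansed = instances_to_delete(
    "cleansed", [date(2010, 1, 1), date(2013, 1, 1)], today
)
old_presented = instances_to_delete(
    "presented", [date(2014, 10, 1), date(2014, 10, 9), date(2014, 9, 1)], today
)
```

Keeping the policy table separate from the eviction logic means an auditor can review the retention rules without reading any processing code.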
36. Ambari – Hadoop Cluster Monitoring
38. Ambari 2H 2014
1.7.0 (September)
Features
• Config versioning + history
• Config <final> Properties
• Flume Support
• Ubuntu Support
• ResourceManager HA
• HDFS Rebalance
• Ambari Views Framework
• Slider Support
Tech Preview
• Windows Support
• Ambari Shell
1.8.0 (October)
Features
• ServiceX on YARN via Slider
• Log Access + Search
• Rack Awareness
• Simplified Kerberos Setup
• NameNode SafeMode
• Ambari Shell GA
2.0.0 (December)
Features
• Automated Rolling Upgrades
• Oozie HA
• Ambari Alerts
• Ambari Metrics
• Windows Support GA
40. Efficient Data Lakes can Span to the Cloud
On-Premises
• HDP on Windows / HDP on Linux: full control of HW and software configs
• Analytics Platform System: turnkey Hadoop and relational warehouse appliance
Cloud
• HDP on Windows / HDP on Linux: your deployment of Hadoop hosted as a VM in Azure
• HDInsight: managed Hadoop service built on Azure storage
Enjoy cross-platform interoperability based on 100% open source HDP
41. Q&A
42. Thank You!
Rommel Garcia – Solution Engineer
Twitter: @rommelgarcia
LinkedIn: /rommelgarcia
Editor's Notes
So, where does Hadoop fit in the data center? This picture is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have had some challenges with this architecture all along; however, we are seeing increased pressure to modify and improve this basic blueprint because:
A) this approach created silos of data and it was difficult to either share the data or get a single view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult if not impossible. This limits flexibility and insight.
Finally, NEW types of data are emerging as we digitize the world around us, such as clickstream and machine sensor data, and they are growing at exponential rates. We are all becoming data-driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
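To put the 20X figure in perspective, a quick back-of-the-envelope calculation gives the implied yearly growth. It assumes steady compounding over the seven years from 2013 to 2020, which the notes do not state.

```python
growth_factor = 20       # "20X" from the notes
years = 2020 - 2013      # seven annual steps

# Compound annual growth rate: the yearly multiplier that reaches 20X in 7 years.
annual_rate = growth_factor ** (1 / years) - 1
print(f"Implied compound annual growth: {annual_rate:.1%}")
```

Under that assumption the volume grows by a little over half every year, so capacity planned for today's data is outgrown in well under two years.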
YARN is really the element that enables the modern data architecture, as it turns Hadoop into a truly multi-purpose data platform with batch, interactive and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster into which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop
- it provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point into which all data processing engines integrate – from the open source community but also from the commercial vendor ecosystem