1. Discover HDP 2.2
Data Storage Innovations in Hadoop Distributed File System (HDFS)
Page 1 © Hortonworks Inc. 2014
Hortonworks. We do Hadoop.
2. Speakers
Rohit Bakhshi
Hortonworks Senior Product Manager & PM for Apache Hadoop & Apache Solr in Hortonworks Data Platform
Jitendra Pandey
Hortonworks Senior Architect for HDFS
3. Agenda
• Overview of HDFS
• New HDFS Innovation in HDP 2.2
– Heterogeneous storage
– Encryption
– Operational security enhancements
• Q & A
We’ll move quickly:
• Attendee phone lines are muted
• Text any questions to Jitendra using Webex chat
• Questions will be answered at the end of the call
• Unanswered questions will be answered in an upcoming FAQ/blog post
4. Big Data, Hadoop & Data Center Re-platforming
Business Drivers
• From reactive analytics to proactive interactions
• Insights that drive competitive advantage & optimal returns
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source software
Technical Drivers
• Data is growing exponentially & existing systems overwhelmed
• Predominantly driven by NEW types of data that can inform analytics
There is an inequitable balance between vendor and customer in the market
5. Hadoop Value: New Types of Data
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products, right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
6. A Shift from Reactive to Proactive Interactions
HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
• A shift in Advertising: From mass branding …to 1x1 Targeting
• A shift in Financial Services: From Educated Investing …to Automated Algorithms
• A shift in Healthcare: From mass treatment …to Designer Medicine
• A shift in Retail: From static branding …to Real-time Personalization
• A shift in Telco: From break then fix …to repair before break
7. Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture. SOURCES (existing systems: CRM, ERP, other; clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) feed the DATA SYSTEM (RDBMS, EDW, MPP, and Hadoop with YARN: Data Operating System over HDFS, running batch, interactive and real-time workloads), which serves APPLICATIONS (business analytics, custom applications, packaged applications).]
8. YARN Transformed Hadoop & Opened a New Era
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
[Diagram: batch, interactive & real-time data access engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV engines), with Tez and Slider beneath them, running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).]
9. YARN Extends Hadoop to Other Data Center Leaders
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
[Diagram: batch, interactive & real-time data access engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV engines), with Tez and Slider beneath them, running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).]
10. Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
[Diagram: the data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines, with Tez and Slider) on YARN and HDFS, flanked by three service columns. GOVERNANCE: load data and manage according to policy. OPERATIONS: deploy and effectively manage the platform (provision, manage & monitor: Ambari, Zookeeper; scheduling: Oozie). SECURITY: provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection.]
11. Hortonworks Development Investment for the Enterprise
Vertical Integration with YARN and HDFS
[Diagram: the same Enterprise Hadoop stack as on slide 10, with data access engines on YARN and HDFS flanked by the GOVERNANCE, SECURITY and OPERATIONS columns.]
• Ensure engines can run reliably and respectfully in a YARN based cluster
• Implement features throughout the stack to accommodate
12. Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
[Diagram: the same Enterprise Hadoop stack as on slide 10, with data access engines on YARN and HDFS flanked by the GOVERNANCE, SECURITY and OPERATIONS columns.]
• Ensure consistent enterprise services are applied across the entire Hadoop stack
• Integrate with and extend existing data center solutions for these key requirements
13. HDP Delivers Enterprise Hadoop
Hortonworks Data Platform 2.2
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
[Diagram: HDP 2.2 stack. GOVERNANCE (data workflow, lifecycle & governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS). BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV engines; with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) over HDFS. SECURITY (authentication, authorization, audit, data protection; Storage: HDFS, Resources: YARN, Access: Hive, Pipeline: Falcon, Cluster: Ranger, Cluster: Knox). OPERATIONS (provision, manage & monitor: Ambari, Zookeeper; scheduling: Oozie). Deployment choice: Linux, Windows, on-premises, cloud.]
16. HDFS enables the Common Data Platform
HDFS: Storage Platform for Modern Data Architecture
• Common data platform across multiple application workloads
• Reliable
• Scalable
• Cost Efficient
[Diagram: batch, interactive & real-time data access engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV engines), with Tez and Slider beneath them, running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).]
18. HDFS in HDP 2.2: What’s New
Heterogeneous Storage (theme: Heterogeneous Storage)
• Archive and SSD tiers
• Tech Preview: enable intermediate data to be stored in memory
Encryption (theme: Security)
• Tech Preview: Transparent Data Encryption
DataNode does not require root to start (theme: Security)
• HDFS services in a Kerberized cluster no longer need root to start
19. New in HDP 2.2: Heterogeneous Storage
20. Heterogeneous Storage
Before
• DataNode is a single storage
• Storage is uniform: the only storage type is Disk
• Storage types hidden from the file system
New Architecture
• DataNode is a collection of storages
• Support different types of storages
– Disk, SSDs, Memory
[Diagram: before, all disks on a DataNode appear as a single storage; in the new architecture the DataNode is a collection of tiered storages. S3, Swift, SAN and Filers are also shown as storage labels.]
22. Storage Policies: Archival
• Hot: All replicas on DISK
• Warm: 1 replica on DISK, others on ARCHIVE
• Cold: All replicas on ARCHIVE
[Diagram: HDP cluster whose nodes carry DISK and ARCHIVE storages.]
23. Storage Policy: SSD
• DataSet A: All replicas on SSD
[Diagram: HDP cluster whose nodes carry SSD and DISK storages; the replicas of DataSet A sit on SSD.]
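The placement rules on the two storage-policy slides can be sketched as a small lookup. This is an illustrative model only, not HDFS code: the function name `replica_storage_types` is hypothetical, the policy name `ALL_SSD` is assumed here as a label for the "all replicas on SSD" case, and the fallback behaviour HDFS applies when a preferred tier is full is ignored.

```python
def replica_storage_types(policy: str, replication: int) -> list[str]:
    """Return the preferred storage type for each replica of a block.

    Hypothetical helper: it only encodes the rules stated on the slides.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    policy = policy.upper()
    if policy == "HOT":        # all replicas on DISK
        return ["DISK"] * replication
    if policy == "WARM":       # 1 replica on DISK, others on ARCHIVE
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if policy == "COLD":       # all replicas on ARCHIVE
        return ["ARCHIVE"] * replication
    if policy == "ALL_SSD":    # all replicas on SSD (assumed label)
        return ["SSD"] * replication
    raise ValueError(f"unknown storage policy: {policy}")


for name in ("HOT", "WARM", "COLD", "ALL_SSD"):
    print(name, replica_storage_types(name, 3))
```

With a replication factor of 3, a Warm dataset keeps one replica on the faster tier for reads while the other two move to denser archive storage.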
24. Store Intermediate Data in Memory
Tech Preview feature
For data writes that:
– Need low latency
– Produce data that can be regenerated
[Diagram: the application process writes a block to the memory tier (RAM_DISK); the block is lazily persisted to disk.]
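The write path on this slide (write the block to memory, persist it to disk later) can be sketched as a toy in-process model. Everything here is hypothetical: the class `LazyPersistStore` and its methods are illustration names, not HDFS APIs, and real DataNodes persist blocks asynchronously rather than on an explicit call.

```python
class LazyPersistStore:
    """Toy model of a memory tier whose blocks are persisted to disk lazily."""

    def __init__(self):
        self.memory = {}    # block id -> data held in the memory tier
        self.disk = {}      # block id -> data already persisted
        self.pending = []   # block ids written to memory but not yet on disk

    def write(self, block_id, data):
        # The write returns as soon as the block is in memory (low latency).
        self.memory[block_id] = data
        if block_id not in self.pending:
            self.pending.append(block_id)

    def persist_pending(self):
        # Lazy persist: copy every pending block to disk; return how many.
        count = 0
        while self.pending:
            block_id = self.pending.pop(0)
            self.disk[block_id] = self.memory[block_id]
            count += 1
        return count

    def read(self, block_id):
        if block_id in self.memory:
            return self.memory[block_id]
        return self.disk[block_id]

    def lose_memory(self):
        # Simulate a node failure: blocks not yet persisted are gone.
        lost = list(self.pending)
        self.memory.clear()
        self.pending.clear()
        return lost


store = LazyPersistStore()
store.write("blk_1", b"intermediate result")
store.persist_pending()
store.write("blk_2", b"not yet persisted")
print(store.lose_memory())
print(store.read("blk_1"))
```

The block that was still pending at the simulated failure is lost, which is why the slide restricts this path to data that can be regenerated.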
25. New in HDP 2.2: Encryption
26. HDFS Transparent Data Encryption
• HDFS Encryption: Transparent Encryption in HDFS (HDFS-6134)
– Designate a directory as an encryption zone; all files in the zone are encrypted
– Depends on the Key Management Server
• Key Management Server (HADOOP-10433)
– The custodian for all encryption keys in Hadoop
– REST API for key CRUD operations
• Key Provider API (HADOOP-10141)
– API that lets Hadoop code (NameNode, DataNode, DFS clients) perform CRUD operations on key material
27. HDFS Transparent Data Encryption
[Diagram (HDFS-6134): within HDFS, an Encryption Zone (attributes: EZKey ID, version) contains Encrypted Files (attributes: EDEK, IV). The NameNode and the HDFS Client each use the KeyProvider API (HADOOP-10141) to reach the Key Management System (KMS, HADOOP-10433), which is labeled with DEKs and EZKs. Arrows are labeled EDEK and DEK; the client uses a Crypto Stream (r/w with DEK).]
Acronym  Description
EZ       Encryption Zone (an HDFS directory)
EZK      Encryption Zone Key; master key associated with all files in an EZ
DEK      Data Encryption Key; unique key associated with each file. The EZ key is used to generate the DEK
EDEK     Encrypted DEK; the NameNode only has access to the encrypted DEK
IV       Initialization Vector
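The key hierarchy in the acronym table (EZK protects the DEK, the NameNode stores only the EDEK, the client reads and writes with the DEK) can be walked through in a toy sketch. This is an assumption-heavy illustration, not how HDFS or the KMS is implemented: `ToyKMS`, `keystream_xor` and the key name `zone1_key` are hypothetical, the cipher is a hash-based XOR stream that is NOT secure, and a single IV is reused for both the EDEK and the file data purely to keep the example short.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy stream cipher (NOT secure): XOR data with SHA-256(key, iv, offset) blocks.

    XOR is its own inverse, so the same call encrypts and decrypts.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + iv + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class ToyKMS:
    """Hypothetical stand-in for the key server: EZKs never leave this object."""

    def __init__(self):
        self._ezks = {}  # EZ key name -> key material

    def create_ez_key(self, name):
        self._ezks[name] = secrets.token_bytes(32)

    def generate_edek(self, ez_key_name):
        # Create a fresh DEK and return it only in encrypted form (EDEK).
        dek = secrets.token_bytes(32)
        iv = secrets.token_bytes(16)
        return keystream_xor(self._ezks[ez_key_name], iv, dek), iv

    def decrypt_edek(self, ez_key_name, edek, iv):
        return keystream_xor(self._ezks[ez_key_name], iv, edek)


kms = ToyKMS()
kms.create_ez_key("zone1_key")

# NameNode side: on file creation, obtain an EDEK and store it with the file.
edek, iv = kms.generate_edek("zone1_key")
file_attrs = {"edek": edek, "iv": iv}

# Client side: have the EDEK decrypted, then read/write file data with the DEK.
dek = kms.decrypt_edek("zone1_key", file_attrs["edek"], file_attrs["iv"])
message = b"row 1, row 2, row 3: sensitive records"
ciphertext = keystream_xor(dek, file_attrs["iv"], message)
plaintext = keystream_xor(dek, file_attrs["iv"], ciphertext)
print(plaintext == message)
```

In this sketch the NameNode-side dictionary never holds the DEK itself, which mirrors the table's note that the NameNode only has access to the encrypted DEK.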
28. New in HDP 2.2: Operational Security Enhancements
29. DataNode does not require root
Enables organizations to run services without root privilege
For Kerberized clusters:
• DataNode no longer needs to start as the Linux root user
• DataNode no longer needs to bind to privileged ports
• DataNode uses SASL to authenticate block transfers between HDFS clients and DataNodes
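The conditions on this slide can be turned into a small configuration check. This is a sketch under stated assumptions: the function `rootless_datanode_issues` is hypothetical, and the property names (`dfs.data.transfer.protection`, `dfs.datanode.address`, `dfs.http.policy`) are taken from the Apache Hadoop secure-mode documentation rather than from the slides, so they should be verified against the HDP release in use.

```python
PRIVILEGED_PORT_LIMIT = 1024  # ports below this need root to bind on Linux
SASL_LEVELS = {"authentication", "integrity", "privacy"}


def rootless_datanode_issues(conf):
    """Return reasons why this config would still need a root-started DataNode."""
    issues = []

    # SASL on the data transfer protocol must be enabled.
    protection = conf.get("dfs.data.transfer.protection", "")
    levels = {v.strip() for v in protection.split(",") if v.strip()}
    if not levels or not levels <= SASL_LEVELS:
        issues.append("dfs.data.transfer.protection must list SASL protection levels")

    # The data transfer port must be non-privileged.
    address = conf.get("dfs.datanode.address", "")
    port_text = address.rsplit(":", 1)[-1]
    if not port_text.isdigit():
        issues.append("dfs.datanode.address has no port")
    elif int(port_text) < PRIVILEGED_PORT_LIMIT:
        issues.append("dfs.datanode.address uses a privileged port")

    # Web endpoints are expected to be HTTPS only in this setup.
    if conf.get("dfs.http.policy") != "HTTPS_ONLY":
        issues.append("dfs.http.policy should be HTTPS_ONLY")
    return issues


example = {
    "dfs.data.transfer.protection": "authentication",
    "dfs.datanode.address": "0.0.0.0:10019",  # example port, an assumption
    "dfs.http.policy": "HTTPS_ONLY",
}
print(rootless_datanode_issues(example))
```

An empty list means none of the three conditions checked here would force the DataNode onto a privileged port.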
30. Q & A
31. Thank you!
Learn more at:
hortonworks.com/hadoop/hdfs/
Register for the remaining 4
Discover HDP 2.2 Webinars
Hortonworks.com/webinars