HDFS Tiered Storage: Mounting Object Stores in HDFS
Thomas Demoor (Western Digital)
Virajith Jalaparti (Microsoft)
>id
Thomas Demoor
• PO/Architect @ Western Digital
• S3-compatible object storage
• Hadoop:
– S3a optimizations
• Fast uploader (stream from mem)
• Hadoop2/YARN support
• Coming up: object-store committer
– HDFS Tiered Storage
Virajith Jalaparti
• Scientist @ Microsoft CISL
• Hadoop
– HDFS Tiered Storage
Overview
• HDFS Tiered Storage
– Mount and manage remote stores through HDFS
• Earlier talks
– Hadoop Summit '16, San Jose
– DataWorks Summit '17, Munich
• This talk
– Introduce Tiered Storage in HDFS (design, read path, …)
– Focus on progress since earlier talks (mounting in HDFS, write path, …)
– Demo
[Diagram: an application on a Hadoop cluster, with HDFS backed by a remote store]
Use Case I: Ephemeral Hadoop Clusters
• EMR on S3, HDInsight over WASB, …
• Several workarounds used today
– DistCp
– Use only remote storage
– Explicitly manage local and cloud storage
• Goal: Seamlessly use local and remote (cloud) stores as one instance of HDFS
– Retrieve data to local cluster on demand
– Use local storage to cache data
[Diagram: Hadoop clusters reading/writing data in a cloud store (e.g., S3, WASB)]
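For reference, the DistCp workaround is a one-shot bulk copy between the cluster and the store; a minimal sketch (the cluster address and bucket name are illustrative):
hadoop distcp hdfs://namenode:8020/user/hadoop/data s3a://bucket/data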
Use Case II: Backup data to object stores
• Business value of Hadoop + Object Storage:
– Data retention: very high fault tolerance (erasure coding)
– Economics: cheap storage for cold data
– Business continuity planning: backup, migrate, …
• Public Clouds: Microsoft Azure, AWS S3, GCS, …
• Private Clouds: WD ActiveScale Object Storage
– S3-compatible object storage system
– Linear scalability in # racks, objects, throughput
– Entry level (100's TB) – scale out (5PB+/rack)
– http://www.hgst.com/products/systems
Use Case II: Backup data to object stores (cont.)
• Today: Hadoop Compatible FileSystems (s3a://, wasb://)
– Direct IO between Hadoop apps and object store
– Scalable & resilient: outsourcing NameNode functions
• Compatible does not mean identical
– Most are not even FileSystems (notion of directories, append, …)
– No data locality: lower performance for hot/real-time data
– Hadoop admin tools require HDFS: permissions/quota/security/…
– Workaround: explicitly manage local HDFS and remote cloud storage
• Goal: integrate better with HDFS
– Data locality for hot data + object storage for cold data
– Offer familiar HDFS admin abstractions
[Diagram: an application on a Hadoop cluster reading and writing directly against the object store]
Solution: “Mount” remote storage in HDFS
• Use HDFS to manage remote storage
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
• Details later in the talk
– Set StoragePolicy to move data between the tiers
[Diagram: a remote namespace subtree (d, e, f) is mounted at mount point /c in the HDFS namespace; the application reads/writes through HDFS, which writes through to the remote store and loads data on demand]
Solution: “Mount” remote storage in HDFS (cont.)
• Use HDFS to manage remote storage
– HDFS coordinates reads/writes to remote store
– Mount remote store as a PROVIDED tier in HDFS
• Details later in the talk
– Set StoragePolicy to move data between the tiers
• Benefits
– Transparent to users/applications
– Provides unified namespace
– Can extend HDFS support for quotas, security, etc.
– Enables caching/prefetching
[Diagram: an application on a Hadoop cluster; HDFS backed by the remote store]
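Because the mount is transparent, ordinary HDFS commands work unchanged on mounted paths; an illustrative session against the /c mount point from the diagram above (paths hypothetical):
hdfs dfs -ls /c/d        # lists the remote directory through HDFS
hdfs dfs -cat /c/d/e     # file data is fetched from the remote store on demand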
Challenges
• Synchronize metadata without copying data
– Dynamically page in “blocks” on demand
– Define policies to prefetch and evict local replicas
• Mirror changes in remote namespace
– Handle out-of-band churn in remote storage
– Avoid dropping valid, cached data (e.g., on rename)
• Handle writes consistently
– Writes committed to the backing store must “make sense”
• Dynamic mounting
– Efficient/clean mount-unmount behavior
– One object store mapping to multiple Namenodes
Outline
• Use cases
• Mounting remote stores in HDFS
• Demo
1. Backup from on-prem HDFS cluster to Azure Blob Store
2. Spin up an ephemeral HDFS cluster on Azure
• Types of mounts
• Reads in Tiered HDFS
• Writes in Tiered HDFS
Demo summary
[Diagram: an on-prem HDFS cluster backs up /user/hadoop/workloads/ to Azure blob storage at wasb://container@storageAccount/backup/user/hadoop/workloads/ (-setStoragePolicy PROVIDED); an FSImage generated from the backup is used to start a Hadoop cluster on Azure that exposes /user/hadoop/workloads/]
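A sketch of the backup step as narrated in the demo; the -scheduleBlockMoves flag belongs to the HDFS-10285-based prototype and the exact syntax may differ:
hdfs storagepolicies -setStoragePolicy -path /user/hadoop/workloads -policy PROVIDED -scheduleBlockMoves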
Outline
• Use cases
• Mounting remote stores in HDFS
• Demo
1. Backup from on-prem HDFS cluster to Azure Blob Store
2. Spin up an ephemeral HDFS cluster on Azure
• Types of mounts
• Reads in Tiered HDFS
• Writes in Tiered HDFS
Types of mounts
hdfs dfsadmin -mount <source> <dest> [-ephemeral|-backup]
• Ephemeral mounts
– Access data in remote store using HDFS (Use Case I)
– <source>: remoteFS://remote/path
– <dest>: hdfs://local/path
– Changes are bi-directional
• Backup mounts
– Backup data from HDFS to remote store (Use Case II)
– <source>: hdfs://local/path
– <dest>: remoteFS://remote/path
– Changes are uni-directional
[Diagram: with an ephemeral mount, changes flow both ways between HDFS and the remote store; with a backup mount, changes flow only from HDFS to the remote store]
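Illustrative invocations of the command above (hosts, buckets, and paths are hypothetical):
# Ephemeral mount: expose remote data through HDFS (Use Case I)
hdfs dfsadmin -mount s3a://bucket/datasets hdfs://namenode:8020/datasets -ephemeral
# Backup mount: back up an HDFS subtree to the remote store (Use Case II)
hdfs dfsadmin -mount hdfs://namenode:8020/user/hadoop/workloads wasb://container@account.blob.core.windows.net/backup -backup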
Reads in ephemeral mounts
[Diagram: a client issues read(/d/e) to the HDFS cluster (NN, DN1, DN2); through the mount, HDFS issues read(/c/d/e) against the remote namespace remoteFS:// and streams the file data back to the client]
Enabled using the PROVIDED Storage Type
• Peer to RAM_DISK, SSD, DISK in HDFS (HDFS-2832)
• Data in remote store mapped to HDFS blocks on PROVIDED storage
– Each block associated with a BlockAlias = (REF, nonce)
• Nonce used to detect changes on the external store
• REF = (file URI, offset, length); nonce = GUID
• e.g., REF = (s3a://bucket/file, 0, 1024); nonce = <ETag>
– Mapping stored in an AliasMap
• Can use a KV store, either external to or inside the NN
• PROVIDEDVolume on Datanodes reads/writes data from/to the remote store
[Diagram: in the NN, the FSNamesystem maps /a/foo to blocks b_i…b_j and /remote/bar to blocks b_k…b_l; the BlockManager maps b_i to storages {s1, s2, s3} and b_k to {s_PROVIDED}; the AliasMap maps b_k to Alias_k; Datanodes expose RAM_DISK, SSD, DISK, and PROVIDED storage, the last backed by the remote store]
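To make the mapping concrete, hypothetical AliasMap entries for one remote file split into two blocks (format illustrative; the demo prototype kept these in a plain blocks.csv file, and all values below are invented for exposition):
# blockId → (REF = file URI, offset, length), nonce
b_1001 → (s3a://bucket/file, 0, 1024), nonce = "etag-9f2c"
b_1002 → (s3a://bucket/file, 1024, 1024), nonce = "etag-9f2c"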
Example: Using an immutable cloud store
• Create FSImage and AliasMap
– Block StoragePolicy can be set as required
– e.g.: {rep=2, PROVIDED, DISK}
FSImage:
/d/e → {b_1, b_2, …}
/d/f/z1 → {b_i, b_i+1, …}
…
b_i → {rep = 1, PROVIDED}
…
AliasMap:
b_i → {(remote://c/d/f/z1, 0, L), inodeId1}
b_i+1 → {(remote://c/d/f/z1, L, 2L), inodeId1}
…
[Diagram: the remote namespace remoteFS:// with subtree d (containing e, f, g) under c in the remote store]
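A sketch of generating the FSImage and AliasMap by walking the remote namespace, assuming the image-generation tool from the HDFS-9806 branch (hadoop-fs2img); the class name and flags here are an assumption and may differ:
hadoop org.apache.hadoop.hdfs.server.namenode.FileSystemImage -o file:///tmp/name remoteFS://remote/path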
Example: Using an immutable cloud store (cont.)
• Start NN with the FSImage
• All blocks become reachable once a DN with PROVIDED storage heartbeats in
[Diagram: the NN (BlockManager, FSImage, AliasMap) serves the mounted subtree d/{e, f, g}; DN1 and DN2 report PROVIDED storage; the data itself lives in the remote namespace remoteFS://]
Example: Using an immutable cloud store (cont.)
• DN uses BlockAlias to read from external store
– Data can be cached locally as it is read (read-through cache)
[Diagram: the DFSClient calls getBlockLocation("/d/f/z1", 0, L) on the NN; the NN looks up b_i in the AliasMap and returns LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 then calls open("remote:///c/d/f/z1", GUID1) against the remote store]
Writes in ephemeral mounts
• Metadata operations
– create(), mkdir(), chown, etc.
– Synchronous on remote store
– For FileSystems: Namenode performs the operation on the remote store first
– For blob stores: metadata operations need not be propagated
• Example: clients accessing S3 directly have no notion of directories
• Data operations
– One of the Datanodes in the write pipeline writes to the remote store
– BlockAlias passed along the write pipeline
[Diagram: the DFSClient writes through a DN1 → DN2 → DN3 pipeline, passing the Alias along; one Datanode in the pipeline writes the data to the remote store]
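For instance, a metadata operation issued against a mounted path (path hypothetical):
hdfs dfs -mkdir /c/newdir   # for a FileSystem-backed mount, the NN creates the remote directory first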
Writes in Backup mounts
• Daemon on Namenode backs up metadata/data in the mount
• Delegates work to Datanodes (similar to SPS [HDFS-10285])
• Backup of data based on remote store capabilities
– For FileSystems: write block by block
– For blob stores: multi-part upload to upload blocks in parallel
[Diagram: a coordinator DN drives DN1 and DN2 to upload blocks from HDFS to the remote store]
Writes in Backup mounts (cont.)
• Daemon on Namenode backs up metadata/data in the mount
• Delegates work to Datanodes (similar to SPS [HDFS-10285])
• Backup of data based on remote store capabilities
– For FileSystems: write block by block
– For blob stores: multi-part upload to upload blocks in parallel
• Use snapshots to maintain a consistent view
– Back up a particular snapshot
– Back up changes from the previous snapshot
[Diagram: the application keeps writing to HDFS while snapshots are backed up]
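The snapshot primitives this relies on already exist in HDFS (standard commands; how the backup daemon drives them internally is not shown in the talk):
hdfs dfsadmin -allowSnapshot /user/hadoop/workloads
hdfs dfs -createSnapshot /user/hadoop/workloads s1
# later, after more writes, back up only what changed since s1:
hdfs dfs -createSnapshot /user/hadoop/workloads s2
hdfs snapshotDiff /user/hadoop/workloads s1 s2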
Assumptions
• Churn is rare and relatively predictable
– Analytic workloads, ETL into external/cloud storage, compute in cluster
• Clusters are either consumers or producers for a subtree/region
– The FileSystem API has too little information to resolve conflicts
[Diagram: example pipeline with Ingest and ETL feeding a Raw Data Bucket, and Analytics writing to an Analytic Results Bucket]
Conflict resolution
• Conflicts occur when the remote store is modified directly
• Detection
– On read operations: e.g., using an open-by-nonce operation
– On write operations: e.g., the file to be created is already present
• Pluggable policy to resolve conflicts
– “HDFS wins”
– “Remote store wins”
– Rename files under conflict
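Since the resolution policy is pluggable, one could imagine selecting it through Hadoop configuration; a purely hypothetical sketch (the property name is invented, not an actual HDFS key):
<!-- hypothetical, for illustration only -->
<property>
  <name>dfs.provided.conflict.policy</name>
  <value>remote-store-wins</value> <!-- or: hdfs-wins, rename -->
</property>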
Status
• Read-only ephemeral mounts
– HDFS-9806 branch on Apache Hadoop
• Backup mounts
– Prototype available on GitHub
• Next:
– Writes in ephemeral mounts
– Conflict resolution
– Create mounts in a running Namenode
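To try the read-only ephemeral mounts, the feature branch can be checked out from the Apache Hadoop repository (branch name per the slide; the GitHub location of the backup-mount prototype is not given here):
git clone https://github.com/apache/hadoop.git
cd hadoop
git checkout HDFS-9806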
Resources + Q&A
• HDFS Tiered Storage: HDFS-9806
– Design documentation
– List of subtasks, lots of linked tickets – take one!
– Discussion of scope, implementation, and feedback
• Joint work Microsoft – Western Digital
– {thomas.demoor, ewan.higgs}@wdc.com
– {cdoug, vijala}@microsoft.com
Backup slides
Benefits of the PROVIDED design
• Use existing HDFS features to enforce quotas, limits on storage tiers
– Simpler implementation, no mismatch between HDFS invariants and framework
• Supports different types of back-end storage
– org.apache.hadoop.FileSystem, blob stores, etc.
• Credentials hidden from clients
– Only NN and DNs require credentials for the external store
– HDFS can be used to enforce access controls for the remote store
• Enables several policies to improve performance
– Set replication in FSImage to pre-fetch
– Read-through cache
– Actively pre-fetch while cluster is running
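As one illustration of the last point: with a {PROVIDED, DISK} storage policy, raising a file's replication factor would pull a local replica while the cluster runs; a sketch with standard tooling (that setrep triggers the prefetch is an assumption about the design, not a demonstrated feature):
hdfs dfs -setrep 2 /d/f/z1   # second replica materialized on local DISK alongside the PROVIDED copy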
Editor's Notes
  1. Welcome. Thanks for coming. We’re discussing a proposal for implementing tiering in HDFS, building on its support for heterogeneous storage.
  2. Thomas… Hi, I am Virajith, and I am currently working as a Scientist at the Microsoft Cloud and Information Services Lab (CISL), where I started the project on HDFS Tiered Storage about a year ago with Chris Douglas.
  3. • Started the effort almost a year ago now. • Chris and Virajith posted a design doc; Ewan and I were trying to solve the same problem, so we joined forces. • In those talks, we discussed the design. Today, we will of course reintroduce that. • But we want to focus on the progress we have made on mounting and the write path. • And there is a demo. No tricks!
  4. • Ephemeral aka short-lived Hadoop clusters • EMR, HDInsight, whatever custom env you have with k8s, … • Persistent data lives in a remote store outside of the Hadoop cluster • Need to load in data at start-up and back up data before the cluster is spun down • Several workarounds: DistCp, sacrificing perf by using remote only, or explicitly managing both remote and local in the app • To address this use case, the goal is for our proposed solution to present a single HDFS instance that abstracts away the underlying topology and retrieves/stores data on demand • Local storage can be seen as a temporary cache
  5. • Every year at Hadoop Summit, interacting with public cloud object stores gains more attention • Why would one want to use object stores with Hadoop? • Data is stored very efficiently at low cost • Enables lots of data movement workflows • Some people have / need private cloud (scale, compliance, …) • Install an object store into your DC next to your on-prem Hadoop cluster and get high performance. We happen to make one of these; there are others as well.
  6. • The proposed solution allows mounting remote stores into HDFS • This can be another HDFS cluster or object stores or … • Mounting is a well-known abstraction for any (storage) admin • We leverage HDFS Heterogeneous Storage by adding a new type, PROVIDED • Data can be moved by setting the StoragePolicy: <PROVIDED> or <SSD, PROVIDED>
  7. • The main benefit is that we transparently abstract away the underlying storage. • The user/app does not know whether data is local or not; HDFS handles this completely, offering a unified namespace, and all the regular HDFS admin tools “just work” • Furthermore, there are interesting caching / load-on-demand opportunities
  8. There are a few challenges. These can broadly be grouped into the read path and the write path. In the read path, we are mostly focused on caching and synchronizing changes to the object storage. In the write path, we are concerned with writing new blocks and dynamically mounting object stores. We consider this phase 2.
  9. Before we go into the technical details of how we make all this work, let's look at a demo. In this demo, I will show you how we can back up data in an on-prem HDFS cluster to Azure blob store, and once the data is backed up, show that we can spin up HDFS clusters in Azure that can consume this data. So, we will illustrate the ability to both write and read data to remote stores. [Show local cluster HDFS page] Here we have an on-prem HDFS cluster, which [show directories in UI] contains two directories under /user/hadoop. [show hdfs-site.xml] For backup to work, we specify the backup path in the HDFS configuration file. In this case, it is this URL in Azure blob store. [start running the setStoragePolicy command] Now suppose you want to back up the workloads directory. For this, in the current prototype, we just set the storage policy of the directory to PROVIDED, and use the -scheduleBlockMoves flag to start the storage policy satisfier. We built this prototype on top of the SPS work from Intel that is happening in HDFS-10285. [run the command] Once the command has run, let's go to the Azure portal to verify that we see it. [show that the directory appears] [switch between HDFS web page and Azure web page] Now we can see that all the files under the workloads directory are backed up to Azure. Now suppose we want to back up another directory. Let's do it for the bin directory under YCSB. [run backup command for YCSB] [go back to Azure and show that we have it backed up] See, we now have the YCSB/bin directory backed up. It is as simple as that: just set the storage policy and data will be backed up to the configured location. Now let's see how we can mount this data in Azure blob store on a cluster in Azure. I have already started a few VMs on Azure that will serve as the hosts for HDFS. [start creating FSImage] First, for the mount to work, we have to create an FSImage that describes what files are stored on the blob store. [show blocks.csv] This creates a block map, which I will describe later – it is essentially a mapping from block IDs to the paths on the remote store. For this demo, we just use a text file, but this can be in a KV store. [run command] Now let's start HDFS on the cloud. [Show the UI of the HDFS cluster] Here is the web page for HDFS running on Azure. [show the URL] This machine is on Azure. [now go to the file browser] Here we can see that the data all appears in the cluster on the cloud. Now, isn't that cool?
  10. So, what have we seen in this demo? We started off with an on-prem cluster. -> We were able to back up data to Azure blob store by setting the storage policy on the data. -> Then we created an FSImage of this backup to describe what the blob store contains. -> And finally we were able to start an HDFS cluster on Azure that reads this FSImage and mounts the data in the blob store. -> So a particular location on the on-prem cluster is -> mapped to a corresponding blob on the blob store -> and is eventually accessible on the cluster in the cloud.
  11. Now let's go into the technical details of how all this works in HDFS.
  12. We define two kinds of mounts in our work, one for each of the two use cases we aim to address. The first are ephemeral mounts, where we use HDFS to access data in remote stores. Here the source of the data is the remote store, and the mount destination is in HDFS. -> The change propagation is bi-directional: any changes in the remote store are propagated to HDFS, and any changes in HDFS are propagated to the remote store. -> The second kind of mounts we define are backup mounts. -> These are used to back up data in HDFS to remote stores, so the source here is HDFS and the destination is the remote store. -> The change propagation is one-directional: only changes in HDFS are transferred to the remote store. We define two kinds of mounts to simplify how we reason about their semantics. Another option is to define merge mounts, where we merge the contents of the source and destination; however, the semantics of such mounts can get complicated.
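As a tiny illustration of this data model (the names are ours, not from the design doc), the two mount kinds differ only in which side is the source and which way changes flow:

    // Illustrative-only sketch of the two mount kinds described above.
    enum MountKind {
      EPHEMERAL,  // source: remote store, destination: HDFS; changes flow both ways
      BACKUP      // source: HDFS, destination: remote store; changes flow one way
    }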
  13. Now let's look at how these mounts work in practice. To start off, I will talk about how reads work in ephemeral mounts. -> Suppose this is the part of the remote namespace we want to -> mount in HDFS. If the mount is successful, we should be able to access data in the cloud through HDFS. That is, -> if a client requests a particular file, say /d/e, from HDFS, then HDFS should be -> able to read the file from the external store, -> get the data back from the external store, and -> stream the data back to the client. This is what we enable in this work.
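The point is that the client's view is unchanged. A minimal sketch (the path /d/e is the example from the slide): reading a file under a mount point is an ordinary HDFS read, and whether the bytes come from local disks or the remote store is decided inside HDFS.

    // Minimal sketch of the client side: a plain HDFS read of a mounted file.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadThroughMount {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The client neither knows nor cares that /d/e is backed remotely.
        try (FSDataInputStream in = fs.open(new Path("/d/e"))) {
          IOUtils.copyBytes(in, System.out, 4096, false);
        }
      }
    }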
  14. For this, we introduce a new storage type called PROVIDED, which is a peer to the existing storage types. The PROVIDED storage type is used to refer to data in the remote store. -> So Datanodes can now support four kinds of storage types. -> Data in the remote store is mapped to HDFS blocks on PROVIDED storage. In HDFS today, the NN is partitioned into a namespace (FSNamesystem), which maps files to a sequence of block IDs, and the BlockManager, which is responsible for block lifecycle management and for maintaining the locations of the blocks of any file. In this example, file /a/foo is mapped to blocks with IDs b_i to b_j. Each block ID is mapped to a list of replicas resident on a storage attached to a Datanode; for example, here we have block b_i mapped to storages s1, s2 and s3. -> As HDFS understands blocks, we use a similar mapping for files on PROVIDED storage: a file /remote/bar is mapped to blocks b_k to b_l, and each of these blocks is mapped to a PROVIDED storage. -> However, this is not sufficient to locate the data in the remote store; we need a mapping between these blocks and how the data is laid out remotely. For this, every block on PROVIDED storage is mapped to an alias. An alias is simply a tuple: a reference, which is something resolvable in the namespace of the remote store, and a nonce, which verifies that the reference still locates the data matching that block. For example, if the remote store is another FileSystem, the reference may be a (URI, offset, length) and the nonce can be a GUID such as an inode or fileID. If the remote store is a blob store like S3, the reference can be a (blob name, offset, length) and the nonce can be an ETag. -> We also maintain an AliasMap, which contains the mapping between block IDs and their aliases; this can live in the NN or in an external KV store. -> Finally, we have provided volumes in Datanodes, which are used to read and write data from the external store. A provided volume essentially implements a client capable of talking to the external store. In summary: the AliasMap helps us map HDFS metadata to metadata on the remote store, and the PROVIDED storage type helps HDFS understand that the data is actually remote.
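As an illustrative sketch of the alias just described (field and class names are ours; the design doc may differ), the tuple bundles a resolvable reference with a nonce for staleness detection:

    // Illustrative sketch: a reference resolvable in the remote namespace
    // plus a nonce to detect that the remote data has changed underneath us.
    final class BlockAlias {
      final java.net.URI uri;   // e.g., an s3a:// or wasb:// object path
      final long offset;        // where this block's bytes start in the object
      final long length;        // this block's length
      final String nonce;       // e.g., a fileId/inode for HDFS, an ETag for S3
      BlockAlias(java.net.URI uri, long offset, long length, String nonce) {
        this.uri = uri; this.offset = offset;
        this.length = length; this.nonce = nonce;
      }
    }
    // The AliasMap is then conceptually a Map<Long /*blockId*/, BlockAlias>,
    // kept either in the NN or in an external KV store.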
  15. Let's drill down into an example and walk through how an ephemeral mount would work. Assume we want to mount this remote subtree in HDFS. -> For this we generate two things: the FSImage and the AliasMap. -> The FSImage is a mirror of the metadata. Every file in this image is partitioned into a sequence of blocks; the image contains only the block IDs and the storage policy for each block. Along with the FSImage, we also generate the AliasMap, -> which stores, for each block ID, the block's alias on the external store. Each alias points to the file on the remote store that the block references, the offset of the block, the length of the block, and a nonce (inodeId, LMT) sufficient to detect inconsistency.
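A hedged sketch of the alias-generation half of this step: walk the remote FileSystem and carve each file into fixed-size chunks, reusing the BlockAlias sketch above. The class name, the 128MB block size, and using the modification time as the nonce are our assumptions; a real tool would also emit the FSImage itself.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AliasMapBuilder {
      static final long BLOCK_SIZE = 128L * 1024 * 1024;

      public static Map<Long, BlockAlias> build(FileSystem remote, Path root)
          throws Exception {
        Map<Long, BlockAlias> aliasMap = new HashMap<>();
        walk(remote, root, aliasMap, new AtomicLong(1));
        return aliasMap;
      }

      private static void walk(FileSystem remote, Path dir,
          Map<Long, BlockAlias> map, AtomicLong nextId) throws Exception {
        for (FileStatus st : remote.listStatus(dir)) {
          if (st.isDirectory()) {
            walk(remote, st.getPath(), map, nextId);  // recurse into subtrees
            continue;
          }
          // One alias per fixed-size chunk of the remote file; the last
          // modification time stands in for the nonce in this sketch.
          for (long off = 0; off < st.getLen(); off += BLOCK_SIZE) {
            long len = Math.min(BLOCK_SIZE, st.getLen() - off);
            map.put(nextId.getAndIncrement(), new BlockAlias(
                st.getPath().toUri(), off, len,
                String.valueOf(st.getModificationTime())));
          }
        }
      }
    }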
  16. The FSImage and AliasMap can now be used to start up an HDFS Namenode. If we set the replication factor to be > 1, we can load the data into the cluster before any clients read it. -> When a DN configured with a provided storage volume reports in, the NN assumes that all blocks in the AliasMap are reachable through this Datanode, and it marks all of these blocks as available. There are no individual block reports for provided blocks.
  17. -> So, when a client calls getBlockLocations() for a provided file z1, -> the BlockManager resolves the composite DN to a -> physical DN that is configured with a provided volume. The DN can be chosen based on a pluggable policy; for example, we can resolve the location to the DN closest to the client. -> Now, when the client goes to the DN to read the provided block, the DN knows only that the block is provided; it doesn't have the block locally. So -> it goes to the AliasMap to resolve the block ID to its alias on the remote store. -> The DN uses this alias to open the corresponding file on the remote store, reads it, and passes the data along to the client. -> Because the block is read through the DN, we can also cache the data as a local block.
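A hedged sketch of what a provided volume on the DN does for such a read, under the BlockAlias sketch above: resolve the block to its alias, open the remote file, seek to the block's offset, and relay exactly `length` bytes to the client. The class and method names are ours, not the actual DN internals.

    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class ProvidedVolumeReader {
      static void readBlock(BlockAlias alias, OutputStream clientOut,
                            Configuration conf) throws Exception {
        FileSystem remote = FileSystem.get(alias.uri, conf);
        try (FSDataInputStream in = remote.open(new Path(alias.uri))) {
          in.seek(alias.offset);          // jump to the block's start
          byte[] buf = new byte[64 * 1024];
          long remaining = alias.length;
          while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) break;             // remote file shrank: stale alias
            clientOut.write(buf, 0, n);
            remaining -= n;
            // A real implementation would also check the nonce first and
            // spill these bytes into a local replica to cache the block.
          }
        }
      }
    }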
  18. I will next briefly talk about how writes work with PROVIDED storage. First, let's look at ephemeral mounts. When the remote store for an ephemeral mount is a FileSystem, metadata operations are first performed on the remote store by the NN and then performed locally. This ensures that if the remote operation fails, the NN can fail the client without having to revert any local state. For remote stores that are blob stores, or that do not support metadata such as permissions or directories, metadata operations need not be propagated to the remote store. For operations that involve writing to files on the remote store, we plug into the existing write pipeline in HDFS: the BlockAlias is passed along to the DN that writes the provided replica, and the DN then uses the information in the alias to figure out where to write in the remote store. Any failures in writing to the remote store can be recovered from in the same way as failures in the existing write pipeline.
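A hedged sketch of the remote-first ordering just described, using mkdir as the example (the wrapper class is ours; the real logic lives inside the NN):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class RemoteFirstMkdir {
      static boolean mkdir(FileSystem remote, FileSystem local, Path p)
          throws Exception {
        // If the remote side throws or refuses, fail the client without
        // having touched, and thus without having to roll back, local state.
        if (!remote.mkdirs(p)) {
          return false;
        }
        return local.mkdirs(p);  // remote succeeded; now mirror it locally
      }
    }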
  19. As opposed to ephemeral mounts, where write operations are initiated by the client, for backup mounts writes should happen without any continuous user/client interaction. For this, we have a daemon in the NN that backs up the data under the mount. Whenever a subtree is set up for backup, the backup daemon goes over all the files in that directory and backs them up. It delegates the work of backing up individual files to Datanodes, similar to how SPS works, which is what we used for the prototype in our demo. The backup can happen based on the capabilities of the remote store: for FSes, …; for blob stores, …. As we back up, the files in HDFS might change. To maintain a consistent view on the remote store, we use snapshots in HDFS. When a backup is initiated, we take a snapshot of the subtree being backed up and copy the metadata and data of that snapshot. During this time the subtree will have evolved, so once the first snapshot has been copied, we take a second snapshot, compute the deltas between the snapshots, and copy these deltas over. We continue like this, moving from snapshot to snapshot, until the backup is unmounted.
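A hedged sketch of this snapshot-to-snapshot loop. The snapshot and diff calls are real DistributedFileSystem APIs; the copy steps are elided stubs, since that work actually runs on the DNs as described above.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.SnapshotDiffReport;

    class BackupLoop {
      static void run(DistributedFileSystem dfs, Path subtree) throws Exception {
        dfs.allowSnapshot(subtree);             // admin-enable snapshots once
        String prev = "backup-0";
        dfs.createSnapshot(subtree, prev);
        copyFullSnapshot(subtree, prev);        // elided: initial full copy
        for (int i = 1; !unmounted(); i++) {
          String cur = "backup-" + i;
          dfs.createSnapshot(subtree, cur);
          SnapshotDiffReport diff = dfs.getSnapshotDiffReport(subtree, prev, cur);
          copyDeltas(diff);                     // elided: ship only the changes
          dfs.deleteSnapshot(subtree, prev);    // keep only the latest snapshot
          prev = cur;
        }
      }
      // Stubs standing in for work delegated to the DNs.
      static void copyFullSnapshot(Path p, String snap) {}
      static void copyDeltas(SnapshotDiffReport diff) {}
      static boolean unmounted() { return true; }
    }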
  21. In this work, we try to mount stores without expecting any APIs beyond those supported by the FileSystem API. Without additional support from the remote stores, these APIs are generally not sufficient to keep HDFS and the remote store in tight synchronization; even if we mount the remote store as read-only, we can only get eventual consistency. However, in general we provide workable semantics for big-data workloads. In most scenarios we target, churn is relatively rare and generally predictable; for example, most data ingest happens in year/month/day/hour layouts and is mostly additive. Because of this, we can use some simple heuristics that help resolve inconsistencies between the remote store and HDFS. We also assume that clusters are either producers or consumers of data. If clusters both produce and consume data, we might run into conflicts, and in most cases we do not have enough information across multiple storage systems to resolve such conflicts. Fundamentally, there is no magic here, but we try to provide a tractable solution that covers the most common cases and deployments.
  22. Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we'll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation, and we're still settling the design, so we're very open to collaboration. Thanks, and... let's take a couple of questions.
  23. There are a few points worth calling out here. * First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating copy jobs: in our cloud example, all the cluster's data is immediately available once it's in the namespace, even if the replication policy hasn't prefetched data onto local media. * Second, particularly for read-only mounts, this is a narrow API to implement. For cloud-backup scenarios, where the NN is the only writer to the namespace, we only need the block-to-object-ID map and NN metadata to mount a prefix/snapshot of the cluster. In our example, the cloud credentials are hidden from the client; S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store's credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren't directly supported by the backing store, as long as we can define the mapping. Finally, because the client reads through the DN, the DN can cache a copy of the block on read. Notably, the NN can direct the client to any DN that should cache a copy on read, opening up some interesting combinations of placement policies and read-through caching. That DN isn't necessarily the closest to the client; it may follow another objective function or replication policy.