Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop

HopsFS – Breaking 1 million ops/sec barrier in Hadoop
Dr Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO at Logical Clocks AB
www.hops.io
@hopshadoop

Evolution of Hadoop
2017-04-05 HopsFS - Breaking 1 million ops/s Barrier, J Dowling, Nov 2016 2/51
2009 2017
?
Tiny Brain (NameNode, ResourceMgr)
Huge Body (DataNodes)

HDFS Scalability Bottleneck – the NameNode
•Limited namespace/metadata
- JVM Heap (~200 GB)
•Limited concurrency
- Single global namespace lock (single-writer, multiple readers)
Diagram: HDFS Client, HDFS DataNode, NameNode

HopsFS
1. Scale-out Metadata
- Metadata in an in-memory distributed database
- Multiple stateless NameNodes
2. Remove the Global Namespace Lock
- Supports multiple concurrent read and write operations

HopsFS Architecture

MySQL Cluster: Network Database Engine (NDB)
•Open-Source, Distributed, In-Memory Database
- Scales to 48 database nodes
• 200 Million NoSQL Read Ops/Sec*
•NewSQL (Relational) DB
- Read Committed Transactions
- Row-level Locking
- User-defined partitioning
- Efficient cross-partition transactions
*https://www.mysql.com/why-mysql/benchmarks/mysql-cluster/
Components and licenses:
- NameNode (Apache v2)
- DAL API (Apache v2)
- NDB-DAL-Impl (GPL v2)
- Other DB (Other License)
Jars: hops-2.7.3.jar, ndb-2.7.3-7.5.6.jar

HopsFS Metadata and Metadata Operations
Example namespace: / contains user; user contains F1, F2, F3

HopsFS Metadata & Metadata Partitioning
INode Table (Inode_ID, Name, Parent_ID, ...)
➢Inode ID
➢Parent INode ID
➢Name
➢Size
➢Access Attributes
➢...

Block Table (Block_ID, Inode_ID, ...)
➢File INode to Blocks Mapping
➢Block Size
➢...

Replica Table (Inode_ID, Block_ID, DataNode_ID, ...)
➢Location of blocks on Datanodes
➢...

$> ls /user/*

INode Table
Inode_ID  Name  Parent_ID  ...
1         /     0
2         user  1
3         F1    2
4         F2    2
5         F3    2

Block Table
Inode_ID  Block_ID  ...
3         1
3         2
3         3

Replica Table
Inode_ID  Block_ID  DataNode_ID  ...
3         1         1
3         1         2
3         1         3
3         2         4
3         2         5
3         ...       ...

MySQL Cluster
Partition 1: /
Partition 2: user
Partition 3: F1, F2, F3
Partition 4: [{3,1},{3,2},{3,3}], [{3,1,1},{3,1,2},{3,1,3},{3,2,4} … {3,3,9}]
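
The partition layout above can be sketched in a few lines. This is only an illustration under stated assumptions: inode rows are placed by Parent_ID and block rows by Inode_ID (read off the partition drawing, not stated on the slide), and the modulo placement function and all function names are hypothetical stand-ins for whatever hashing NDB actually uses. Under those assumptions, listing a directory reads a single partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # the slide draws four partitions

# Rows copied from the slide: (inode_id, name, parent_id)
INODES = [(1, "/", 0), (2, "user", 1), (3, "F1", 2), (4, "F2", 2), (5, "F3", 2)]
# Rows copied from the slide: (inode_id, block_id)
BLOCKS = [(3, 1), (3, 2), (3, 3)]


def partition_of(key: int) -> int:
    """Hypothetical placement function (plain modulo, zero-based)."""
    return key % NUM_PARTITIONS


def build_partitions():
    """Place inode rows by parent_id and block rows by inode_id (assumed keys)."""
    parts = defaultdict(lambda: {"inodes": [], "blocks": []})
    for inode_id, name, parent_id in INODES:
        parts[partition_of(parent_id)]["inodes"].append((inode_id, name, parent_id))
    for inode_id, block_id in BLOCKS:
        parts[partition_of(inode_id)]["blocks"].append((inode_id, block_id))
    return parts


def list_children(parts, parent_id: int):
    """Directory listing: all children share parent_id, so one partition suffices."""
    p = partition_of(parent_id)
    names = [name for (_, name, pid) in parts[p]["inodes"] if pid == parent_id]
    return p, names


def blocks_of(parts, inode_id: int):
    """File read: all blocks of a file share inode_id, so one partition suffices."""
    p = partition_of(inode_id)
    ids = [bid for (iid, bid) in parts[p]["blocks"] if iid == inode_id]
    return p, ids


if __name__ == "__main__":
    parts = build_partitions()
    print(list_children(parts, 2))  # children of "user" (inode 2)
    print(blocks_of(parts, 3))      # blocks of F1 (inode 3)
```

With these keys, the three files under user land in one partition and the blocks of F1 in another, which mirrors the grouping drawn on the slide (the sketch numbers partitions from 0, the slide from 1).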

$> cat /user/F1
(Same INode, Block, and Replica tables and the same four-partition layout as for ls /user/*.)

Leader Election using NDB*
•Leader NN coordinates replication/lease mgmt
- NDB as shared memory for Election of Leader NN.
• Zookeeper not needed!
*Niazi, Berthou, Ismail, Dowling, "Leader Election in a NewSQL Database", DAIS 2015
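
To make the idea of "the database as shared memory" concrete, here is a deliberately simplified sketch. It is not the protocol from the cited DAIS 2015 paper: the rule used here (each NameNode increments a counter row, and the live NameNode with the smallest id is the leader) is an assumption chosen for illustration, and a plain dict stands in for a table in the shared database.

```python
def heartbeat(table: dict, nn_id: int) -> None:
    """A NameNode proves it is alive by incrementing its own counter row."""
    table[nn_id] = table.get(nn_id, 0) + 1


def elect_leader(table: dict, last_seen: dict):
    """Pick the smallest id whose counter advanced since the last_seen snapshot.

    Returns None when no NameNode has made progress.
    """
    alive = [
        nn_id
        for nn_id, counter in table.items()
        if counter > last_seen.get(nn_id, 0)
    ]
    return min(alive) if alive else None


if __name__ == "__main__":
    table = {}
    for nn in (1, 2, 3):
        heartbeat(table, nn)
    snapshot = dict(table)      # what an observer read in the previous round
    heartbeat(table, 2)         # NameNode 1 stops heartbeating
    heartbeat(table, 3)
    print(elect_leader(table, snapshot))
```

Because every NameNode reads the same rows, they can reach the same decision without a separate coordination service, which is the point the slide makes about ZooKeeper. A real implementation would run these reads and writes inside database transactions.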

Metadata Locking (contd.)
●Exclusive Lock
●Shared Lock
●Subtree Lock

Performance Evaluation for HopsFS
• On Premise
- Up to 72 servers
- Dual Intel® Xeon® E5-2620 v3 @ 2.40GHz
- 256 GB RAM, 4 TB Disks
• 10 GbE
- 0.1 ms ping latency

Evaluation: Spotify Workload

HopsFS Higher Throughput with Same Hardware
HopsFS outperforms HDFS on equivalent hardware.
HA-HDFS with five servers:
● 1 Active NameNode
● 1 Standby NameNode
● 3 Servers
  ○ Journal Nodes
  ○ ZooKeeper Nodes

Evaluation: Spotify Workload (contd.)
16X the performance of HDFS.
Further scaling possible with more hardware.

Write Intensive workloads
Workloads                               HopsFS ops/sec  HDFS ops/sec  Scaling Factor
Synthetic Workload (5.0% File Writes)   1.19 M          53.6 K        22
Synthetic Workload (10% File Writes)    1.04 M          35.2 K        30
Scalability of HopsFS and HDFS for write intensive workloads
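
As a quick check of the table, the scaling factor appears to be the ratio of HopsFS to HDFS throughput rounded to the nearest integer. That definition is an assumption (the slide does not define it), but it reproduces both values:

```python
def scaling_factor(hopsfs_ops: float, hdfs_ops: float) -> float:
    """Ratio of HopsFS throughput to HDFS throughput (assumed definition)."""
    return hopsfs_ops / hdfs_ops


# Throughput figures copied from the table, in ops/sec.
WORKLOADS = {
    "Synthetic Workload (5.0% File Writes)": (1.19e6, 53.6e3),
    "Synthetic Workload (10% File Writes)": (1.04e6, 35.2e3),
}

if __name__ == "__main__":
    for name, (hopsfs, hdfs) in WORKLOADS.items():
        print(name, round(scaling_factor(hopsfs, hdfs)))
```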

Metadata Scalability
37 times more files than HDFS

Operational Latency
File System Clients
No of Clients  HopsFS Latency  HDFS Latency
50             3.0             3.1
1500           3.7             15.5
6500           6.8             67.4

Erasure Coding with Data Locality
Reed-Solomon (140%)

ZFS with HopsFS
10 Gb/s
RAID-0: ~350 MB/s reads, ~250 MB/s writes
RAID-5 + HopsFS Erasure Coding: ~500 MB/s reads, ~350 MB/s writes
File classes shown: archive files, triple-replicated files

Strong Eventually Consistent Metadata
Diagram components: Database, Epipe, Elasticsearch, Kafka, Hive Metastore
- Changelog for HDFS Namespace
- Free-Text Search for Files/Dirs in the HopsFS Namespace

Extending Metadata in HopsFS
Metadata API (HopsFS->Elasticsearch)
public void attachMetadata(Json obj, String pathToFileorDir)
public void removeMetadata(String name, String pathToFileorDir)
•Design your own tables
- Use foreign keys for metadata integrity
- Transactions ensure metadata consistency
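
The two bullets about foreign keys and transactions can be illustrated with a small, self-contained sketch. It uses SQLite as a stand-in for MySQL Cluster, and the table names (inodes, file_tags) and the demo function are hypothetical, not the HopsFS schema. It shows a custom metadata table whose rows cannot outlive, or exist without, the inode they describe.

```python
import sqlite3


def demo():
    """Return (orphan_rejected, tags_left_after_inode_delete)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE file_tags ("
        " inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,"
        " tag TEXT NOT NULL)"
    )

    # One transaction: the inode and its custom metadata appear together.
    with conn:
        conn.execute("INSERT INTO inodes VALUES (3, 'F1')")
        conn.execute("INSERT INTO file_tags VALUES (3, 'genomics')")

    # Metadata for a non-existent inode violates the foreign key.
    try:
        with conn:
            conn.execute("INSERT INTO file_tags VALUES (99, 'orphan')")
        orphan_rejected = False
    except sqlite3.IntegrityError:
        orphan_rejected = True

    # Deleting the inode cascades to its custom metadata.
    with conn:
        conn.execute("DELETE FROM inodes WHERE id = 3")
    remaining = conn.execute("SELECT COUNT(*) FROM file_tags").fetchone()[0]
    conn.close()
    return orphan_rejected, remaining


if __name__ == "__main__":
    print(demo())
```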

HopsYARN

Hops scalability now limited by YARN
•YARN scheduler (triggered on node heartbeats)*
- Scheduling decisions cost O(N), where N is the number of active Applications
- We reduced the cost to O(M), where M is the number of applications currently
requesting resources. Typically M << N.
[Figure: cluster utilisation (0 to 1) versus number of Node Managers (1000 to 19000); series: Hadoop (fix), Hadoop (OFF), Hadoop (INFO)]
*Experiments based on workload from YARN paper at SOCC’13 using our own distributed benchmarking tool.
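
The O(N) to O(M) reduction described on this slide can be sketched as follows. This is a toy model, not the HopsYARN scheduler: the class and method names are hypothetical, and the only idea it demonstrates is keeping an index of applications that currently have pending requests, so a heartbeat visits M applications instead of all N.

```python
class ToyScheduler:
    """Toy model: one container is allocated per application visit."""

    def __init__(self):
        self.pending = {}        # app_id -> number of pending container requests
        self.requesting = set()  # ids of apps with pending > 0 (the index)

    def add_app(self, app_id: int) -> None:
        self.pending[app_id] = 0

    def request(self, app_id: int, containers: int) -> None:
        self.pending[app_id] += containers
        if self.pending[app_id] > 0:
            self.requesting.add(app_id)

    def _allocate(self, app_id: int) -> None:
        self.pending[app_id] -= 1
        if self.pending[app_id] == 0:
            self.requesting.discard(app_id)

    def heartbeat_scan_all(self, free: int):
        """O(N): visit every active application on each heartbeat."""
        visited, allocated = 0, []
        for app_id in sorted(self.pending):  # sorted only for deterministic output
            visited += 1
            if free > 0 and self.pending[app_id] > 0:
                self._allocate(app_id)
                allocated.append(app_id)
                free -= 1
        return allocated, visited

    def heartbeat_scan_requesting(self, free: int):
        """O(M): visit only applications that are currently requesting resources."""
        visited, allocated = 0, []
        for app_id in sorted(self.requesting):  # sorted() copies, so mutation is safe
            visited += 1
            if free > 0:
                self._allocate(app_id)
                allocated.append(app_id)
                free -= 1
        return allocated, visited


def make_scheduler(n_apps=1000, requesting=(10, 20, 30)):
    s = ToyScheduler()
    for app_id in range(n_apps):
        s.add_app(app_id)
    for app_id in requesting:
        s.request(app_id, 1)
    return s


if __name__ == "__main__":
    print(make_scheduler().heartbeat_scan_all(free=2))
    print(make_scheduler().heartbeat_scan_requesting(free=2))
```

Both variants make the same allocations in this example; they differ only in how many applications are visited per heartbeat.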

Hops Distribution (2.7.3)
Platform: On-Premise, GCE, AWS
Storage: HopsFS
Resource Manager: HopsYARN
Processing: Logstash, Tensorflow, Spark, Flink, Kafka
Also shown: Hopsworks, Elasticsearch, Kibana, Zeppelin

Hadoop Distributions Simplify Things
Install / Upgrade: Cloudera Mgr, Karamel/Chef, Ambari
Stack shown: On-Premise, HDFS, YARN
Frameworks: MR, Tensorflow, Spark, Flink, Kafka

Future of HopsFS

Hive Metastore is Moving in with HopsFS
Diagram labels: HopsFS, Hive MetaStore

Result: Strongly Consistent Hive Metadata
Removing the HDFS backing directory removes the table from the Hive Metastore.

Small Files in Hadoop
•In both Spotify and Yahoo 20% of the files are <= 4 KB

Small Files in HopsFS*
•In HopsFS, we can store small files co-located with the metadata in MySQL Cluster as on-disk data.
inode_id   varbinary (on-disk column)
32123432   [File contents go here]
*Niazi et al, Size Matters: Improving the Performance of Small Files in HDFS, Poster at Eurosys 2017
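
A minimal sketch of the idea, assuming a size threshold decides where a file's bytes live. The 4 KB threshold is borrowed from the small-file size mentioned elsewhere in the deck and is an assumption here, as are the class and method names; two dicts stand in for the on-disk database column and for DataNode block storage.

```python
SMALL_FILE_MAX_BYTES = 4 * 1024  # assumed threshold, not a documented HopsFS setting


class ToyFileStore:
    def __init__(self):
        self.inline = {}  # inode_id -> bytes; stands in for the varbinary on-disk column
        self.blocks = {}  # inode_id -> bytes; stands in for blocks on DataNodes

    def write(self, inode_id: int, data: bytes) -> str:
        """Store small files next to the metadata, everything else as blocks."""
        if len(data) <= SMALL_FILE_MAX_BYTES:
            self.inline[inode_id] = data
            return "database"
        self.blocks[inode_id] = data
        return "datanodes"

    def read(self, inode_id: int) -> bytes:
        """A small file is served from the database without contacting a DataNode."""
        if inode_id in self.inline:
            return self.inline[inode_id]
        return self.blocks[inode_id]


if __name__ == "__main__":
    store = ToyFileStore()
    print(store.write(32123432, b"x" * 4096))
    print(store.write(1, b"x" * 4097))
```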

HopsFS Small Files Performance (Early Results)
30 namenodes/datanodes and 6 NDB nodes were used. Small file size was 4 KB. HopsFS files were stored on Intel 750 Series SSDs.

Multi-Data-Center HopsFS
• Multi-Master Replication of Metadata with Conflict Detection/Resolution.
Diagram: two data centers, Hops-eu-west1 and Hops-eu-west2, each with NameNodes (NN), NDB, and DataNodes (DN); a Client; a Network Partition Identification Service.
- Synchronous Replication of Blocks
- Asynchronous Replication of Metadata (~2000 ms delay)

Summary
•Hops is the only European distribution of Hadoop
- More scalable, tinker-friendly, and open-source.
•HopsFS has made a quantum leap in the
performance for HDFS
•HopsFS opens up new possibilities for building data
processing frameworks with support for small files,
free-text search of the namespace, and extensible
strongly consistent metadata.

The Hops Team
Active:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman
Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias
Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi,
Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Zahin Azher
Rashid, Robin Andersson, ArunaKumari Yedurupaka, Tobias
Johansson, August Bonds, Filotas Siskos.
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram
Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto
Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro,
Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos
Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid
Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Hops Heads

Scalable Benchmarker for YARN
Diagram components: Resource manager, Lead simulator, simulator
Messages: start, Heartbeats (nodes and apps), Container allocations, stop, results
