2. About me
• 张松波 (Schubert Zhang)
• Background
• Senior Engineer, Tech Lead and Architect, Infrastructure Data Team, @Baidu
• VP Engineering, Cloud & Big Data R&D, @Hanborq
• Senior Engineering Manager, @UTStarcom
• 10 years of Telecom, 5 years of Cloud Storage & Big Data, 1 year of Internet
3. Categories of (Big) Data
• Rows / Records
  • Logs
  • User Profiles
  • Shopping Orders
  • …
  • Presentation
    • Tables with Schema
    • Data Types
    • Database, Data-Warehouse
  • A mess -> organizing, indexing -> fast to retrieve …
  • Batch and sequential processing …
• Files / Objects
  • Documents
  • Photos
  • Videos
  • …
  • Presentation
    • Files in File-System
    • Objects in Object-Storage-System
    • With metadata …
  • Organizing, indexing -> fast to retrieve …
  • Batch and sequential processing …
Over the common underlying storage and IO system: Hardware, Disk, Network …
6. Cloud Object Storage System : oNest
• Web Service and API
  • Amazon AWS S3 RESTful API
  • S3 Data Model (User->Buckets->Objects)
• Backend Distributed Object Storage System
  • Google GFS + Facebook Haystack
  • Triple copy of data trunks
  • Write-through, Strong consistency
  • Append only and Compaction
  • Highly efficient Local Index
  • …
• Backend Distributed Metadata Layer
  • Flexible data model
  • NoSQL
[Layer diagram, top to bottom: SDK (C++/Java/Python/PHP/Go…) -> Web Service (RESTful API over HTTP) -> Metadata Layer -> Object/Trunk Storage Layer]
7. Cloud Object Storage System : oNest
Data Model and Data Organization
[Figure: Logical view: User, Bucket (Bucket1 … Bucket4), Object/Pebble, Part. Physical view: Rocks and Chunks.]
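8. Cloud Object Storage System : RockStor -> oNest
[Architecture diagram of the distributed cloud object storage system, top to bottom:]
• Clients: Application System 1 … Application System N and the SDK (Java) for Developers, all over HTTP interfaces
• Interface layer
  • RESTful API (Cloud Service)
  • RockStor Service Load Balancers (request load balancers, deployed at multiple points, LVS) in front of the Web services
• Service layer: RockMaster and RockServer instances
  • Object functions: object access, object attributes
  • Container functions: container access, container attributes
  • User functions: authentication, authorization, user control (AAA, CAS)
  • Metering information, log management, statistical reports
• Management: Management Console and resource management platform, using the internal RESTful API and management interfaces for system management, load balancing, and operations management
• Storage layer (distributed storage system)
  • Hadoop distributed storage cluster (stores and manages Rock files)
  • HBase distributed database cluster (stores and manages metadata)
Fast/Simple Prototype, Leverage Open Source
To be a Product and Service.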
9. Cloud Object Storage System : oNest
• The oNest cloud object storage platform stores data as objects and provides Internet services and enterprise users with cloud storage at scales of up to hundreds of PB.
• Main features of the object storage service provided by oNest:
  (1) Highly reliable, multi-replica data storage, with automatic repair of data replicas in a dynamic environment
  (2) Large-scale storage (capacity at the x100 PB level and above), with linear scaling of object count and capacity
  (3) Data backup within a data center and across data centers
  (4) Large-scale concurrent access
  (5) Secure data access
[Deployment diagram: a Region spanning Data Center A and Data Center B. Components shown: Console, WebServer, ClusterMaster, AAA Cluster (Master/Slave), Stats Cluster (Master/Slave), Web Service, Proxy, and Discovery Service Cluster. Each AZ contains an OAS Cluster, a DataStorage Cluster (DataNodes), a Healer Cluster, and a MetaNode Cluster (Master/Slave).]
To be a more Complete Product and Service.
13. Dropbox-Like NetDisk Service: uDrop / eDrop
[Architecture diagram: PC Client, Mobile Client, and Browser; REST AccessServers and a Web Server exposing MetaAPI and DataAPI; Meta Servers, Register, and Matcher; backed by oNest, ZooKeeper, and HBase.]
14. Big Data Platform
[Architecture diagram, top to bottom:]
• Users, Applications: SQL/Scripts/Java/Web
• Smart SQL and Execution Engine
• Hive, Pig, HugeTable, Data Mining, ETL, Oozie, HCatalog, ……
• MapReduce/Impala
• Bigtable (HBase)
• HDFS (files)
• Data input: Big Data Sources -> BulkLoad (Flume, Flive); Backup
• Operations: Ganglia, Nagios, ClusterMaster (Deployment)
• Shared Cluster of Servers
16. Big Data Warehouse: HugeTable -> Horizon
[Layer diagram, top to bottom:]
• JDBC and ODBC, REST API, Management, ...
• SQL Engine (Standard, Familiar, Low Learning Curve, ...)
• Data Warehouse Utilities / Tools (SpeedLoader, SpeedScan, Data LifeCycle, ...)
• Data Model (Data Organization, Indexing, Partitioning, Encoding, Compressing, ...)
• Bigtable (HBase)
• DFS (Hadoop HDFS)
• Connectors, Integrating into Hadoop Ecosystem: Oozie, HCatalog, Pig, Hive, MapReduce
17. NoSQL vs. SQL
• NoSQL, BigTable, Cassandra, etc., are just the “Storage Engine Layer” of a DBMS.
• Users like SQL and are familiar with using it to access their data.
[Diagram: MySQL Server (SQL Engine Layer over Storage Engine Layer: MyISAM, InnoDB, etc.) vs. Horizon (Distributed SQL Engine over Distributed Storage Engine: NoSQL, HBase)]
How about building a Distributed DBMS? Megastore, Greenplum/Pivotal/CitusDB, etc.
19. Big Data Service Platform
[Architecture diagram:]
• Access: JDBC for Local Deployment, RESTful for Remote Deployment
• Load Balancer (LVS, with HA)
• HugeTable: Web Services and SQL Engine Servers
• HugeTable Data Model; LifeCycle (On/Offline, DataDrop)
• Data input: Online Generated Data (CDR) as files -> Flive, BulkLoad
• Analysis and ETL: Connector -> Hive/Pig, MapReduce
• HBase, Hadoop (with SpeedScan)
Principle: focus on real-time, low-latency data queries, while also supporting data analysis.
22. Hadoop and Open Source Ecosystem
• MapReduce
  • Runtime Job/Task Schedule & Latency
    • Work Pool
    • Transfer Job description information
    • …
  • Processing Engine Improvements
    • Shuffle: sendfile, Netty Server, Batch Fetch
    • Sort Avoidance: Spilling and Partitioning, Hash Aggregation
• HBase (to be a Data Warehouse backend)
  • Low Level HFile management
  • Speed Bulk Load
  • Speed Scan for Analysis
  • Flexible control of Flush, Compaction, Split, Balance
  • Coprocessor for parallel processing
• Flume
  • Support more Data Sources and Data Storages
  • More flexible Command Line tool
• Hive
  • Faster SQL Engine
  • Support more Storage Engines
  • More UDFs for database functions (such as NVL, DECODE from Oracle)
  • More UDFs for OLAP (such as Roll-Up, Cube, Efficient Aggregations, etc.)
  • More algorithms for efficient statistics and estimation (such as LogLog-Counter for estimated DISTINCT values)
• Pig
  • Support more Data Storages
  • More UDFs for analysis, statistics and data mining (such as K-Means, ID3 for Decision Tree, etc.)
• Tools
  • Deployment: Hdeploy, HTCfg, ClusterMaster
  • Management: Integrate Ganglia, Nagios, Puppet, etc.
  • Light and handy command line: Hman, etc.
  • Benchmark Tools: Hbench, etc.
31. 1. Right Design Comes from Basic Knowledge of Computer System / Computer Science
• Computer Architecture and How Computers Work
  • Representing and Manipulating Information and Programs
  • Processor Architecture (Pipeline, Parallel …)
  • Storage Architecture
  • IO System, etc.
• Memory/Storage Hierarchy
• Modern Operating System
• Networking
• Languages …
• The core issues of database.
• File-system …
• To be distributed now.
32. Basic Knowledge of CS
- Sequential vs. Random Access …
- Long latency of Disk Seek …
- Throughput
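A back-of-the-envelope sketch of why sequential access beats random access on a spinning disk. The seek time, transfer rate, and record size below are illustrative assumptions, not figures from this talk.

```python
# Illustrative assumptions (not measurements from this talk).
SEEK_MS = 10.0             # assumed seek + rotational latency per random read
TRANSFER_MB_PER_S = 100.0  # assumed sequential transfer rate
RECORD_BYTES = 1024        # assumed record size


def transfer_seconds(n_records: int) -> float:
    """Time spent purely moving bytes off the disk."""
    return n_records * RECORD_BYTES / (TRANSFER_MB_PER_S * 1e6)


def random_read_seconds(n_records: int) -> float:
    """Every record pays one seek plus its own transfer time."""
    return n_records * SEEK_MS / 1000.0 + transfer_seconds(n_records)


def sequential_read_seconds(n_records: int) -> float:
    """One initial seek, then the records stream at the transfer rate."""
    return SEEK_MS / 1000.0 + transfer_seconds(n_records)


if __name__ == "__main__":
    n = 1_000_000
    print(f"random:     {random_read_seconds(n):.2f} s")
    print(f"sequential: {sequential_read_seconds(n):.2f} s")
```

Under these assumed numbers the seek term dominates the random case by roughly three orders of magnitude, which is the reason batch systems favor large sequential scans.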
All database and big data processing solutions build on the characteristics of computer architecture, especially disk, network ...
34. Basic Knowledge of CS
• What every data engineer needs to know about disks
• Basic Algorithms (Sorting, Searching, Strings, Bitmap, …)
• Linux Virtual Memory, Exceptions, Concurrency, etc.
•…
35. 2. Keep Simple and Straightforward
• Master-Slave vs. Decentralized (DHT, Consistent Hash)
• Almost all Google products (and the open-source systems modeled on them) follow the Master-Slave pattern: GFS/BigTable/MapReduce/ZooKeeper, etc.
• MapReduce: Simplified Data Processing on Large Clusters
• A simple programming model that applies to many large-scale computing problems
• Hide messy details
• Bigtable provides a simple data model, a distributed B+ tree …
• Shards and Replicas
• Simple and clean API design
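To illustrate how small the MapReduce programming model is, here is a single-process sketch, assuming a word-count job. The helper names (`map_reduce`, `word_mapper`, `sum_reducer`) are hypothetical, and the real system distributes the same three phases across a cluster while hiding scheduling and failure handling.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    """Emit (word, 1) for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum all counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["big data big clusters", "simple model"]
    print(map_reduce(lines, word_mapper, sum_reducer))
```

The user writes only the mapper and the reducer; everything else belongs to the framework, which is the "hide messy details" point above.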
36. Keep Simple and Straightforward
• Example: Bigtable vs. Cassandra
[Figure: Bigtable architecture (Master, Tablet Servers, Tablets, GFS) side by side with Cassandra]
37. Keep Simple and Straightforward
Bigtable (++)
• Master – Tablet Servers
• Dynamic Tablet Splits
• WAL + MemTable + SSTable
• Three Level Distributed B+Tree
• Replication in GFS
• …
Cassandra (--)
• Identical Data Nodes, Gossip
• Consistent Hash, Virtual Nodes
• WAL + MemTable + SSTable
• Hinted Handoff
• DHT Ring (neighbor nodes)
• Eventual consistency
• Read Repair
• Merkle Tree
• Vector Clock
• Anti-entropy protocol
• …
Bigtable’s architecture and data model make more sense.
Too complex: a flawed architecture makes the system more and more complex …
http://www.slideshare.net/schubertzhang/cassandra-dynamo-paper
http://www.slideshare.net/schubertzhang/dastorcassandra-report-for-cdr-solution
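For readers unfamiliar with the decentralized side of this comparison, the following is a minimal sketch of consistent hashing with virtual nodes. It is not Cassandra's implementation: the class name `HashRing`, the MD5-based token function, and the default of 64 virtual nodes are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _token(key: str) -> int:
    """Map a string to a position on the ring (64-bit, from MD5)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; `nodes` must be non-empty."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` tokens scattered around the ring.
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the owner of the first token clockwise from the key."""
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
```

Adding a node moves only the keys that fall onto the new node's tokens. Membership, repair, and consistency still have to be solved separately, which is where the rest of the Cassandra list comes from.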
38. 3. There is no “one-size-fits-all” solution
• There are too many contradictory requirements in the structured data world.
• The contradiction of data processing:
• Real-time or near-real-time data availability.
• Batch processing for large volumes of data, such as aggregation.
• The contradiction of data access:
• Low-latency fast query response, like Lookup.
• High-latency ad-hoc analytic query for historical data.
• But there is no one-size-fits-all answer for the above contradictory requirements.
• Identify common problems, and build systems to address them in a general way.
• “Important not to try to be all things to all people!” – Jeff Dean, Keynote at LADIS’09
39. There is no “one-size-fits-all” solution
• MapReduce
• Dremel (MPP)
• Tez/Stinger
• NoSQL/Bigtable (and with
Coprocessor)
• DBMS
•…
Lambda Architecture: New data is sent to both
layers and queries merge views from both layers.
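The Lambda Architecture sentence above can be sketched as a toy in-memory event counter. The class `LambdaCounter` and its method names are hypothetical, and a real deployment would use separate batch and streaming systems rather than one Python object.

```python
from collections import Counter


class LambdaCounter:
    """Counts events per key using a batch layer and a speed layer."""

    def __init__(self):
        self.master_log = []         # batch layer input: append-only raw data
        self.batch_view = Counter()  # recomputed from the whole log
        self.speed_view = Counter()  # incremental view of data since the last batch run

    def ingest(self, key):
        # New data is sent to both layers.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Recompute the batch view from all data, then discard the
        # speed-layer state that the new batch view now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Queries merge views from both layers.
        return self.batch_view[key] + self.speed_view[key]


if __name__ == "__main__":
    counter = LambdaCounter()
    for event in ["a", "a", "b"]:
        counter.ingest(event)
    counter.run_batch()
    counter.ingest("a")
    print(counter.query("a"))
```

The batch layer trades latency for completeness and the speed layer does the opposite, so neither engine has to be all things to all queries.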
40. There is no “one-size-fits-all” solution
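[Diagram: SQL, Scripts, Java, etc. on top of several engines: Hive, Pig, MapReduce (Java), Impala, GoldenOrb, Dremel, Pregel]
Different query and analysis requests use different parallel execution engines to operate on the data.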
41. 4. Monitorable and Metrizable at any time
• Sufficient Statistics, Monitoring …
• Add Sufficient Monitoring/Status/Debugging Hooks
• If your system is slow or misbehaving, can you figure out why?
• Don’t rely too much on logs; logging is costly and inefficient.
• Use real-time statistics/metrics.
• Use tools: jmxetric, JMX, Ganglia, Nagios, Noah …
42. Monitorable and Metrizable at any time
The magic matrix ??!
Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.
43. Monitorable and Metrizable at any time
Write/Insert Operation Benchmark
Read/Query Operation Benchmark
44. Monitorable and Metrizable at any time
SLA Metrics:
• Latency
  o tAvgLat: Total Average Latency (ms)
  o dAvgLat: Delta Average Latency (ms)
  o dMaxLat: Delta Maximum Latency (ms)
  o dMinLat: Delta Minimum Latency (ms)
• Throughput
  o tThrou: Total Throughput (operation count)
  o dThrou: Delta Throughput (operation count)
• Total: from benchmark start to present.
• Delta: between each statistical interval (2 minutes here)
[Chart: Quantile %. Y axis: percentage of read ops (0.00% to 25.00%); X axis: 1 to 61 (100ms).]
Read Throughput: average ~140 ops/s
Latency: average ~500ms, 97% < 2s (SLA)
Bottleneck: disk IO (random seek) (CPU load is very low)
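The total and delta metrics above can be kept with a few in-memory counters instead of being derived from logs. This is a minimal sketch: the class `LatencyStats` is hypothetical, only the metric names (tAvgLat, dAvgLat, dMaxLat, dMinLat, tThrou, dThrou) come from the slide, and in practice a timer would call `snapshot()` once per statistical interval.

```python
class LatencyStats:
    """Tracks total and per-interval (delta) latency and throughput."""

    def __init__(self):
        self.total_count = 0
        self.total_ms = 0.0
        self._reset_delta()

    def _reset_delta(self):
        self.delta_count = 0
        self.delta_ms = 0.0
        self.delta_max = None
        self.delta_min = None

    def record(self, latency_ms):
        """Account for one completed operation."""
        self.total_count += 1
        self.total_ms += latency_ms
        self.delta_count += 1
        self.delta_ms += latency_ms
        if self.delta_max is None or latency_ms > self.delta_max:
            self.delta_max = latency_ms
        if self.delta_min is None or latency_ms < self.delta_min:
            self.delta_min = latency_ms

    def snapshot(self):
        """Report current metrics and start a new statistical interval."""
        report = {
            "tAvgLat": self.total_ms / self.total_count if self.total_count else None,
            "dAvgLat": self.delta_ms / self.delta_count if self.delta_count else None,
            "dMaxLat": self.delta_max,
            "dMinLat": self.delta_min,
            "tThrou": self.total_count,
            "dThrou": self.delta_count,
        }
        self._reset_delta()
        return report


if __name__ == "__main__":
    stats = LatencyStats()
    for ms in (100, 300):
        stats.record(ms)
    print(stats.snapshot())
```

Each `record` call is a handful of additions and comparisons, so the counters can stay enabled in production at any time.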
46. 5. Try to make data in-situ
• The ability to access data ‘in place’.
• ProtocolBuffers/Parquet encoding
• Example:
  • Horizon over HDFS + HBase
[Diagram: Real-Time Data Service. Writes (Puts) and Reads (Get/Scan) go through the Real-Time API, with Schema Meta, to HBase; Flush/Compaction produces HFiles. Batch Input arrives through Bulk Load as HFiles. Batch Processing (Coprocessor, MapReduce/Impala) works on the HFiles stored in HDFS.]
47. 6. Approximated vs. Precise
• For large data sets, it can be prohibitively expensive to find the precise
result, but there are efficient estimating methods.
• Example Queries:
• How many distinct elements are in the data set (i.e. what is the cardinality of the
data set)?
• What are the most frequent elements (the terms “heavy hitters” and “top-k
elements” are also used)?
• What are the frequencies of the most frequent elements?
• How many elements belong to the specified range (range query, in SQL it looks
like SELECT count(v) WHERE v >= c1 AND v < c2)?
• Does the data set contain a particular element (membership query)?
• …
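As one concrete case, the membership query is commonly answered approximately with a Bloom filter. The sketch below assumes SHA-256-derived hash positions and arbitrary default sizes; the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive one bit position per seed from a salted hash of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means probably present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("user-1")
    print(seen.might_contain("user-1"), seen.might_contain("user-2"))
```

The filter answers from a fixed-size bitmap no matter how many elements were added, at the cost of a tunable false-positive rate.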
48. Approximated vs. Precise
• The algorithms are approximate: with high probability they return approximately the correct result (e.g. ±2%).
• select count(distinct userid) from userlogs;
• select top(100) of count(*) from orders group by itemname;
•…
• Statistical and Probabilistic Analysis, Very interesting!
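A query like the count(distinct userid) example can be estimated with Linear Counting: hash every value into a bitmap of m bits and estimate the cardinality as -m * ln(V), where V is the fraction of bits still zero. The sketch below is illustrative only; the function name, the bitmap size, and the SHA-256 hash are assumptions.

```python
import hashlib
import math


def linear_count(items, num_bits=1 << 16):
    """Estimate the number of distinct items using a fixed-size bitmap."""
    bitmap = bytearray(num_bits)  # one byte per bit, for clarity over compactness
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        bitmap[int.from_bytes(digest[:8], "big") % num_bits] = 1
    zero_bits = num_bits - sum(bitmap)
    if zero_bits == 0:
        raise ValueError("bitmap is full; use more bits")
    # Linear Counting estimator: n ~= -m * ln(fraction of zero bits)
    return -num_bits * math.log(zero_bits / num_bits)


if __name__ == "__main__":
    user_ids = [f"user{i % 5000}" for i in range(20000)]  # 5000 distinct values
    print(round(linear_count(user_ids)))
```

Memory use depends only on the bitmap size, not on the number of rows scanned, and bitmaps from different partitions can be combined with a bitwise OR before estimating.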
49. Approximated vs. Precise
• Usually Sample/Hash/Bitmap …
• Cardinality Estimation
• Linear Counting
• LogLog Counting …
• Frequency Estimation / Heavy Hitters
• Count-Min Sketch
• Count-Mean-Min Sketch
• Stream-Summary …
• Range Query
• Array of Count-Min Sketches …
• Membership Query
• Bloom Filter
• …
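Of the techniques listed, the Count-Min Sketch is short enough to show in full. This is a minimal sketch assuming SHA-256-derived row hashes and arbitrary width and depth defaults; the class and method names are hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts; estimates never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for name in ["apple"] * 50 + ["pear"] * 3:
        sketch.add(name)
    print(sketch.estimate("apple"), sketch.estimate("pear"))
```

A heavy-hitters query can keep such a sketch next to a small candidate list, and an array of sketches over dyadic ranges is one way to build the range-query variant named above.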
50. 7. Open Source and Open Spirit
• Choose your Building Blocks from an Engineering view
• Know Your Basic Building Blocks: not just their interfaces, but also their implementations (at least at a high level)
• Make good use of open source, give back to open source, and make open source better and stronger
51. 8. And more …
• Description and Documents
• Avoid inventing new Interfaces for Users
• From simple to complete, From prototype to product
• Make the architecture robust, try it, and then improve and complete it.
• Product vs. Tech. vs. Trick
•…