Engineering practices in big data storage and processing

Engineering Practices in
Big Data Storage and Processing
Nov.20, 2013
Schubert (Songbo) Zhang

About me
• 张松波 (Schubert Zhang)
• Backgrounds
• Senior Engineer Tech Lead and Architect, Infrastructure Data Team, @Baidu
• VP Engineering, Cloud & Big Data R&D, @Hanborq
• Senior Engineering Manager, @UTStarcom
• 10 years of Telecom, 5 years of Cloud Storage & Big Data, 1 year of Internet


Categories of (Big) Data
• Rows / Records
  • Logs
  • User Profiles
  • Shopping Orders
  • …
  • Presentation
    • A mess -> organizing, indexing -> fast to retrieve …
    • Batch and sequential processing …
  • Tables with Schema
  • Data Types
  • Database, Data-Warehouse

• Files / Objects
  • Documents
  • Photos
  • Videos
  • …
  • Presentation
    • Organizing, indexing -> fast to retrieve …
    • Batch and sequential processing …
  • Files in File-System
  • Objects in Object-Storage-System
  • With metadata …

Over the common underlying storage and IO system: Hardware, Disk, Network …

Products and
Engineering Projects
Object Storage, Data Warehouse, Cluster Management, etc.
For enterprise!


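Products Line (Hanborq Products)

• Big Data: HB-CDW
The HB-CDW product line is a big data warehouse system, built on cloud computing technology, for storing, querying, analyzing and mining big data at PB scale. Core products include a big data warehouse based on the Hadoop ecosystem and HugeTable, a massive structured data management system. Built on Hadoop, HBase, Hive, Pig and other big data infrastructure software enhanced and extended by Hanborq, it implements its own data model, system architecture and standard SQL/API. It provides fast loading and real-time indexed queries over big data, as well as deep statistics, analysis and mining based on parallel computing technologies such as MapReduce and MPP. The system offers flexible scalability, security and reliability, and is widely used in big data industries such as telecom, electric power, transportation and large Internet companies.

• Cloud Storage: HB-CSS
The HB-CSS product line provides cloud storage solutions and services for enterprises and individuals. Its core is oNest, a scalable, secure and fast cloud object storage system with a service-layer API and user experience similar to Amazon AWS S3. On top of oNest it provides a Storage Gateway for connecting enterprises and individuals to the cloud storage service, and a Dropbox-like online cloud storage service (uDrop/eDrop). It is widely used in industries such as large Internet companies, education, telecom, media and transportation.

• Management: HB-ClusterMaster
HB-ClusterMaster is a software system for large-scale data center cluster planning, automated installation and deployment of operating systems and applications, configuration management, monitoring, and operations and maintenance. It enables efficient deployment and operation of large cloud computing clusters. The largest single system deployed and managed so far has more than 2000 physical server nodes.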

Cloud Object Storage System : oNest
• Web Service and API
  • Amazon AWS S3 RESTful API
  • S3 Data Model (User->Buckets->Objects)
• Backend Distributed Object Storage System
  • Google GFS + Facebook Haystack
  • Triple copy of data trunks
  • Write-through, Strong consistency
  • Append only and Compaction
  • Highly efficient Local Index
  • …
• Backend Distributed Metadata Layer
  • Flexible data model
  • NoSQL

[Figure: layered stack, top to bottom: SDK (C++/Java/Python/PHP/Go…) -> Web Service (RESTful API over HTTP) -> Metadata Layer -> Object/Trunk Storage Layer]
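The "append only and compaction" and "local index" ideas above can be illustrated with a small sketch. This is not oNest's implementation: the class name `AppendOnlyChunk`, the 4-byte length header and the in-memory buffer standing in for a large on-disk file are all assumptions made for illustration.

```python
import io
import struct


class AppendOnlyChunk:
    """Toy append-only chunk with an in-memory local index (hypothetical design)."""

    HEADER = struct.Struct(">I")  # 4-byte big-endian payload length

    def __init__(self):
        self.buf = io.BytesIO()   # stands in for one large on-disk file
        self.index = {}           # key -> (offset, length) of the live copy

    def put(self, key, data):
        # Never overwrite in place: append the new record, then repoint the index.
        self.buf.seek(0, io.SEEK_END)
        offset = self.buf.tell()
        self.buf.write(self.HEADER.pack(len(data)))
        self.buf.write(data)
        self.index[key] = (offset + self.HEADER.size, len(data))

    def get(self, key):
        # One index lookup, then one positioned read.
        offset, length = self.index[key]
        self.buf.seek(offset)
        return self.buf.read(length)

    def delete(self, key):
        # Deleting only drops the index entry; the bytes stay until compaction.
        self.index.pop(key, None)

    def size(self):
        return len(self.buf.getvalue())

    def compact(self):
        # Copy only live records into a fresh chunk, reclaiming dead space.
        fresh = AppendOnlyChunk()
        for key in list(self.index):
            fresh.put(key, self.get(key))
        self.buf, self.index = fresh.buf, fresh.index
```

Overwrites and deletes leave dead bytes behind, which keeps writes sequential; `compact()` is where that space is paid back.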

Data Model and Data Organization

[Figure: Rock & Chunks. Logical view: User -> Bucket (Bucket1 … Bucket4) -> Object/Pebble. Physical view: Rock -> Chunk -> Part. Objects in buckets are stored as chunks inside Rocks.]

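Cloud Object Storage System : RockStor -> oNest

[Figure: RockStor architecture, top to bottom]
• Application System 1 … Application System N, using the SDK (Java) for Developers over HTTP interfaces
• Interface layer
  • RESTful API (Cloud Service)
  • RockStor Service Load Balancers (load balancers for access requests, multi-point deployment, LVS)
  • Web services
• Distributed cloud object storage system
  • RockMaster; AAA, CAS; metering information
  • RockServers, exposing the RESTful API (Internal)
  • Service-layer functions: object-related (object access, object attributes), container-related (container access, container attributes), user-related (authentication, authorization, user control)
  • Management Console / resource management platform, over management interfaces: system management, load balancing, log management, statistics reports, operations and maintenance
• Storage layer (distributed storage system)
  • Distributed storage cluster: Hadoop (stores and manages Rock files)
  • Distributed database cluster: HBase (stores and manages metadata)

Fast/Simple Prototype, Leverage Open Source

To be a Product and Service.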

• The oNest object cloud storage platform stores data as objects and provides cloud storage services of up to hundreds of PB for Internet businesses and enterprise users.
• Main features of the object cloud storage service provided by oNest:
  (1) Highly reliable, multi-replica data storage, with automatic repair of data replicas in a dynamic environment
  (2) Large-scale storage (capacity of hundreds of PB and above), with linear scaling of object count and capacity
  (3) Data backup within one data center and across data centers
  (4) Large-scale concurrent access
  (5) Secure data access

[Figure: oNest deployment architecture. A Region spans Data Center A and Data Center B. Components shown: Console WebServers, ClusterMaster, AAA Cluster (Master/Slaves), Stats Cluster (Masters/Slaves), Proxy, AAA Web Services and Discovery Service Cluster; each AZ contains an OAS Cluster, a DataStorage Cluster (DataNodes), a Healer Cluster and a MetaNode Cluster (Master/Slaves).]

To be a more Complete Product and Service.

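[Figure: oNest web console screenshot. Toolbar: Create Bucket, New Directory, Upload Object, Refresh List, View Properties, Operation Log, User Name. Panels: bucket list, object list with basic object attributes, and a context menu. Clicking an object opens its detailed attributes (including the object download URL) or its ACL permission management.]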

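BC-oNest Object Cloud Storage Service
oNest is an elastic object cloud storage system, comparable to Amazon AWS S3. It provides storage for the education cloud's video, audio, image and document data.

• oNest provides a unified, standard cloud storage interface; education cloud applications store, read and manipulate their data objects through it.
• The education cloud applications are themselves the users of oNest cloud storage.

[Figure: users of the education cloud applications -> education cloud application services (App-1, App-2, each using the SDK) -> REST -> oNest cloud storage service; registration and login go through the Console.]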

Dropbox-Like NetDisk Service: uDrop / eDrop
• Hack Dropbox

[Figure: the Client talks to three groups of servers]
• 208.43.202.5 ... (Softlayer Datacenter): keep alive (http)
• 67.228.78.114, 67.228.78.116, 67.228.78.117 ... (Dropbox Web Server): login (https); list, delete, rename and sync (https)
• 75.101.145.128, 75.101.138.84 ... (Amazon S3 & EC2): download and upload data (https)

• Keep-alive mechanism
• Delta update
• Mechanism of shared file blocks
• Dropbox client database: SQLite

• Data/file chunking and fingerprinting
• Incremental upload algorithm
• The so-called "instant upload" (秒传)
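Chunking, fingerprinting, incremental upload and "instant upload" fit together as in the sketch below. It is not the Dropbox or uDrop/eDrop protocol: the `BlockServer` class, fixed-size blocks and the tiny block size are assumptions chosen to keep the example short.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for the demo; real clients would use far larger blocks


def fingerprints(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return (sha256 hex, block) pairs."""
    return [
        (hashlib.sha256(data[i:i + block_size]).hexdigest(), data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


class BlockServer:
    """Hypothetical server side: a content-addressed block store."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes
        self.files = {}   # file name -> ordered list of fingerprints

    def upload(self, name, data):
        """Store a file; return how many blocks actually had to be sent."""
        sent = 0
        recipe = []
        for digest, block in fingerprints(data):
            if digest not in self.blocks:   # only unknown blocks cross the wire
                self.blocks[digest] = block
                sent += 1
            recipe.append(digest)
        self.files[name] = recipe
        return sent

    def download(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])
```

Uploading a file whose blocks the server already holds sends zero blocks, which is the "instant upload" effect. Changing one block of a known file sends only that block, which is the delta update.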

Dropbox-Like NetDisk Service: uDrop / eDrop

[Figure: uDrop/eDrop architecture. PC Client, Mobile Client and Browser connect to REST AccessServers and a Web Server, each exposing a MetaAPI and a DataAPI. Behind them sit the Meta Servers, Register and Matcher, backed by oNest, ZooKeeper and HBase.]

Big Data Platform

[Figure: platform stack]
• Users, Applications: SQL/Scripts/Java/Web
• Smart SQL and Execution Engine
• Hive, Pig, HugeTable, ETL, Data Mining, Backup
• MapReduce/Impala, HCatalog, Oozie
• Bigtable: HBase
• Files on HDFS
• Big Data Sources, loaded through BulkLoad (Flume, Flive)
• Operations: Ganglia, Nagios, ClusterMaster (Deployment)
• Shared Cluster of Servers

Big Data Warehouse: HugeTable -> Horizon
• HDFS as the base storage platform, supporting multiple, extensible storage formats
  • HBase/HFile
  • Row storage: TextFile, SequenceFile
  • Columnar storage: RCFile/ORCFile, Parquet, …
• Multiple data access models
  • HBase
  • MapReduce
  • MPP: Impala
• HugeTable's own data storage model
  • Encoding/Decoding
  • Indexing
  • Partitioning
  • …
• Unified Data Schema Metadata management
• Smart SQL Engine and Server
  • High performance, high concurrency, high stability, distributed
  • Chooses among the different data access model paths
• Compatible with Hive and Pig
• Standardized JDBC client interface and client tools
• Engineering tools
  • Fast bulk load (BulkLoad) and export (with a SQL interface)
  • Fast deployment tools

[Figure: SQuirreL SQL Client (GUI), SQLLine (CLI), Web SQL Client and Apps (Programming) connect through the JDBC Driver to the Smart SQL Engine. Below it: Hive, Pig, the HugeTable Data Model and the Unified Schema; then Impala (MPP), MapReduce and HBase; then the file formats HFile (SSTables), TextFile (Recorded), SequenceFile (Key-Value Rows), RCFile/ORCFile (Columnar), Parquet (ColumnIO) and User-Defined Formats; all on HDFS.]

Big Data Warehouse: HugeTable -> Horizon

[Figure: Horizon layers, top to bottom]
• Interfaces: JDBC and ODBC, REST, API, Management, ...
• SQL Engine (Standard, Familiar, Low Learning Curve, ...)
• Data Warehouse Utilities / Tools (SpeedLoader, SpeedScan, Data LifeCycle, ...)
• Data Model (Data Organization, Indexing, Partitioning, Encoding, Compressing, ...)
• Bigtable (HBase)
• DFS (Hadoop HDFS)
• Connectors, Integrating into Hadoop Ecosystem: Oozie, HCatalog, Pig, Hive, MapReduce

NoSQL vs. SQL
• NoSQL, BigTable, Cassandra, etc., are just the "Storage Engine Layer" of a DBMS.
• Users like SQL and are used to accessing their data through it.

MySQL Server vs. Horizon
• SQL Engine Layer vs. Distributed SQL Engine
• Storage Engine Layer (MyISAM, InnoDB, etc.) vs. Distributed Storage Engine (NoSQL, HBase)

How about building a Distributed DBMS? Megastore, Greenplum/Pivotal/CitusDB, etc.

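Business Analysis Big Data Platform

• Plan & Design
  • Data storage model definition (Schema, Types, Indexes, StorageEngine, etc.)
  • Data processing operations and workflow definition (SQL, Scripts, Java, WorkFlow, etc.)
• Big data sources (diverse)
  • BOSS billing and detail CDR data
  • Network CDR data (Gn/Gb/IuPS ...)
  • Signaling data (Iub/Iucs/mmsc ...)
  • Log data (WAP, WLAN ...)
  • DPI collected data
  • CRM user profiles
  • Other data
• Data loading and preprocessing
  • Bulk load tools (Files, BulkLoad, etc.)
  • Real-time load tools (Flume, Flive, etc.)
  • Database data transfer tools (Sqoop, etc.)
  • ETL processing logic
  • Other engineering tools
  • Developed and ported according to the actual business data; offline interfaces generally need no modification
• Data storage, organization and processing platform (unified big data storage and analysis platform)
  • Hive, Horizon, Pig
  • HBase, MapReduce, Impala
  • Hadoop HDFS base storage layer
• Data processing and access
  • Client: SQL, Scripts, Java, ...
  • Data Mining
  • Developed and ported according to the actual business data; offline interfaces generally need no modification
• Business functions
  • Statistics, summary, analysis and reporting services
  • Ad-hoc query services
  • Data mining services
  • Other OLAP services

Principle: primarily offline, batch analysis, with data query and management as a secondary goal.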

Big Data Service Platform

[Figure: service platform]
• Access: JDBC for Local Deployment; RESTful for Remote Deployment
• Load Balancer (LVS, with HA)
• HugeTable Web Services, in front of SQL Engine Servers
• HugeTable Data Model over HBase and Hadoop
• Input: Online Generated Data (CDR) through Flive; files through BulkLoad
• LifeCycle (On/Offline, DataDrop)
• Connector to Hive/Pig and MapReduce (with SpeedScan) for Analysis and ETL

Principle: primarily real-time, low-latency data queries, with data analysis as a secondary goal.

Cluster Management: ClusterMaster


Hadoop and Open Source Ecosystem
• MapReduce
  • Runtime Job/Task Schedule & Latency
    • Worker Pool
    • Transfer Job description information
    • …
  • Processing Engine Improvements
    • Shuffle: sendfile, Netty Server, Batch Fetch
    • Sort Avoidance: Spilling and Partitioning, Hash Aggregation

• HBase (to be a Data Warehouse backend)
  • Low Level HFile management
  • Speed Bulk Load
  • Speed Scan for Analysis
  • Flexible control of Flush, Compaction, Split, Balance
  • Coprocessor for parallel processing

• Flume
  • Support more Data Sources and Data Storages
  • More flexible Command Line tool

• Hive
  • Faster SQL Engine
  • Support more Storage Engines
  • More UDFs for database functions (such as NVL, DECODE from Oracle)
  • More UDFs for OLAP (such as Roll-Up, Cube, Efficient Aggregations, etc.)
  • More algorithms for efficient statistics and estimates (such as LogLog-Counter for estimated DISTINCT values)

• Pig
  • Support more Data Storages
  • More UDFs for analysis, statistics and data mining (such as K-Means, ID3 for Decision Tree, etc.)

• Tools
  • Deployment: Hdeploy, HTCfg, ClusterMaster
  • Management: Integrate Ganglia, Nagios, Puppet, etc.
  • Light and handy command line: Hman, etc.
  • Benchmark Tools: Hbench, etc.

Know the Details of Hadoop …


MapReduce Runtime Optimization
• Job/Task Schedule & Latency
• Worker Pool

[Figure: the MapReduce Client submits the job (JobConf) over RPC to the JobTracker; each TaskTracker keeps a Worker Pool of Child Workers.]

[Figure: Job Latency (in seconds, lower is better), Total Tasks (96 maps, 4 reduces)]
• CDH3u2 (Cloudera), reuse.jvm disabled: 43
• CDH3u2 (Cloudera), reuse.jvm enabled: 24
• HDH3u2 (Hanborq), Worker Pool: 1

MapReduce Processing Engine Optimization
• Shuffle: Use sendfile to reduce data copy and context switch.

• Shuffle: Netty Shuffle Server (map side) and Batch Fetch (reduce side).
• Sort Avoidance.
• Spilling and Partitioning, Counting Sort, Bytes Merge, Early Reduce, etc.
• Hash Aggregation in job implementation.

[Figure: Real Aggregation Jobs, time in seconds (lower is better). Series: CDH3u2 (Cloudera) and HDH (Hanborq). Bar values per case: Case1 197 and 175; Case2 216 and 198; Case3 2186 and 615.]

[Figure: Sort Avoidance and Aggregation, time in seconds (lower is better)]
• Case1-1: CDH3u2 (Cloudera) 238, HDH (Hanborq) 233
• Case2-1: CDH3u2 (Cloudera) 603, HDH (Hanborq) 578
• Case1-2: CDH3u2 (Cloudera) 136, HDH (Hanborq) 96
• Case2-2: CDH3u2 (Cloudera) 206, HDH (Hanborq) 151
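The idea behind "Sort Avoidance" and "Hash Aggregation" can be shown on a plain sum-per-key job. This is a single-process sketch, not Hanborq's MapReduce changes; the function names are made up for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(records):
    """Classic path: sort all (key, value) pairs, then reduce each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))  # O(n log n) comparison sort
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_based_sum(records):
    """Sort-avoiding path: a single pass with one hash-table update per record."""
    totals = defaultdict(int)
    for key, value in records:                    # O(n), no ordering required
        totals[key] += value
    return dict(totals)
```

Both functions return the same totals. The second never orders the keys, which is acceptable whenever the aggregation does not need sorted output.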

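China Mobile BigCloud
Since 2008 we have worked with the China Mobile Research Institute to define, design and develop the "BigCloud" 1.0 architecture and product series, and the R&D work for "BigCloud" 2.0 is now complete.
We have supported wide deployment of BigCloud systems at China Mobile and at users in other industries, providing hardware and software solutions and services. For the cloud storage and data warehouse products and services, a single data center deployment has exceeded 2,000 nodes and manages more than 20 PB of storage. Storage, analysis and retrieval services are provided for telecom CDRs, logs, signaling, documents, videos, images and Internet web page data.
• BC-HugeTable (massive structured data management system)
  • Big data warehouse (analysis and query)
  • Big database (analysis and query)

• BC-Hadoop (massive data storage and analysis platform)
  • Research Institute distribution
  • Hanborq distribution HDH

• BC-oNest (distributed object storage system)
• BC-NAS (distributed file system middleware)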

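CDR Billing Detail Record Warehouse and Query

[Figure: system architecture. Telecom operator network (mobile core network, network switching equipment, OSS servers, BSS) -> collection devices (real-time and batch, timeseries) -> HB-CDW cluster system: data storage and analysis server cluster (storage, indexing, analysis), RDBMS and Web servers, cluster monitoring and management server -> Terminals over Internet/Intranet: query from PC browsers and smartphones, monitoring from PC browsers and smartphones, analysis reports.]

[Figure: two monthly charts for 200906 to 200912: record volume (in units of 100 million records, axis 0 to 450) and query volume (number of queries, axis 0 to 8,000,000).]

Solution designed: 2009-10

• CDR real-time visibility delay < 1 minute
• Query response latency < 3 seconds (average < 0.5 seconds)
• Query throughput: 200 million queries per month, 1000 per second at peak
• Data safety: data is replicated on 3 nodes
• Data analysis: KPI reports generated daily or monthly

User scale: about 100 million users
CDR data volume
• Per month: 50 billion records, 20 TB (more than 20,000 records per second)
• 6 months of total storage: 300 billion records, 120 TB
• Mobile Internet service CDR volume is more than 5 times that of ordinary service CDRs
Data storage and processing cluster
• 32 DELL PE C2100 servers
• Each with 12 x 1TB data disks and 64GB memory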
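The sizing figures on this slide can be cross-checked with a few lines of arithmetic. The 30-day month, decimal terabytes and the reading of 120 TB as the volume before 3x replication are assumptions, not statements from the slide.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600     # assumption: a 30-day month

records_per_month = 50_000_000_000     # 50 billion CDRs per month (from the slide)
bytes_per_month = 20 * 10**12          # 20 TB per month (from the slide), decimal TB assumed
months_retained = 6                    # from the slide
replicas = 3                           # data kept on 3 nodes (from the slide)
servers, disks_per_server, disk_tb = 32, 12, 1
queries_per_month = 200_000_000        # from the slide

avg_ingest_per_sec = records_per_month / SECONDS_PER_MONTH
avg_record_bytes = bytes_per_month / records_per_month
logical_tb = 20 * months_retained
replicated_tb = logical_tb * replicas  # assumption: 120 TB is the pre-replication volume
raw_capacity_tb = servers * disks_per_server * disk_tb
avg_queries_per_sec = queries_per_month / SECONDS_PER_MONTH

print(f"average ingest:    {avg_ingest_per_sec:,.0f} records/s")
print(f"average record:    {avg_record_bytes:.0f} bytes")
print(f"6 months x 3:      {replicated_tb} TB needed vs {raw_capacity_tb} TB raw disk")
print(f"average query rate: {avg_queries_per_sec:.0f} queries/s")
```

Under these assumptions the average ingest rate comes out near the slide's figure of roughly 20,000 records per second, and three replicas of six months of data would take most of the cluster's raw disk. That suggests encoding and compression matter, although the slide gives no compression ratio.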

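China Mobile: Business Analysis ETL

• Every hour, a Pig script runs on the interface machine and drives MapReduce jobs that read data from the interface machine in parallel, perform format conversion, encoding, compression and cleansing, and write the result to HDFS as SequenceFiles. This saves storage space, makes later processing more efficient, and makes it easy to add new ETL functions.
• Input: WAP log files from the Huawei WAP log servers (FTP Server #1, #2, …). The interface machine pulls files every hour: 400 GB per day, about 46,000 small files.
• Intermediate (fine-grained) summary output: 180 GB per month, kept on HDFS for 31 days until the monthly summary runs.
• Daily summary jobs (Hive SQL), then normalization to the first-level business analysis specification (Pig/Scripts); 5 GB of normalized data is written to the interface machine every day.
• Monthly summary jobs (Hive SQL) over the 31 days of daily data, then normalization (Pig/Scripts); normalized data is written to the interface machine every month.
• The number-segment dimension table is updated daily; the user information dimension table is updated monthly.
• The AsiaInfo-Linkage system fetches the previous day's summary data on a daily schedule and the previous month's summary data on a monthly schedule. The data must conform to the first-level business analysis specification.

[Figure: Huawei WAP log servers -> firewall -> big data platform interface machine (FTP Server), the platform's single external data interface for input and output -> big data platform (Hadoop/Hive/Pig/HugeTable) running on Hadoop Nodes, for high performance, high concurrency and large storage, driven by a WorkFlow/Pipeline controller -> AsiaInfo-Linkage system.]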

Lessons Learned
Many lessons and many feelings.


1. Right Design Comes from Basic Knowledge of Computer System / Computer Science
• Computer Architecture and How a Computer Works
  • Representing and Manipulating Information and Programs
  • Processor Architecture (Pipeline, Parallel …)
  • Storage Architecture
  • IO System, etc.
• Memory/Storage Hierarchy
• Modern Operating System
• Networking
• Languages …

• The core issues of database.
• File-system …
• To be distributed now.

Basic Knowledge of CS

- Sequential vs. Random Access …
- Long latency of Disk Seek …
- Throughput
All database and big data processing solutions build on the characteristics of computer architecture, especially disk and network ...
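A back-of-the-envelope calculation shows why the sequential versus random distinction dominates design choices on spinning disks. Every constant below is an assumed round number for illustration, not a measurement from the talk.

```python
SEEK_MS = 10.0          # assumed cost of one random access (seek + rotation)
SEQ_MB_PER_S = 100.0    # assumed sequential transfer rate of one disk
RECORD_BYTES = 1_000    # assumed record size
RECORDS = 1_000_000     # assumed number of records to read


def random_read_seconds(records=RECORDS, seek_ms=SEEK_MS):
    """Each record costs one seek; transfer time is ignored as negligible."""
    return records * seek_ms / 1000.0


def sequential_read_seconds(records=RECORDS, record_bytes=RECORD_BYTES,
                            mb_per_s=SEQ_MB_PER_S, seek_ms=SEEK_MS):
    """One seek up front, then stream all the bytes."""
    return seek_ms / 1000.0 + records * record_bytes / (mb_per_s * 1_000_000)


print(f"random:     {random_read_seconds():,.0f} s")
print(f"sequential: {sequential_read_seconds():,.2f} s")
```

With these assumptions one disk serves about 100 random reads per second, while streaming the same million records takes about ten seconds. Batch scans, append-only writes and compaction all exploit this gap.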


[Figure: slide image, credited "by Jeff Dean"]

• What every data engineer needs to know about disks
• Basic Algorithms (Sorting, Searching, Strings, Bitmap, …)
• Linux Virtual Memory, Exceptions, Concurrency, etc.
• …

2. Keep Simple and Straightforward
• Master-Slave vs. Decentralized (DHT, Consistent Hash)
• Almost all Google products follow the Master-Slave pattern: GFS/BigTable/MapReduce/ZooKeeper, etc.
• MapReduce: Simplified Data Processing on Large Clusters
  • A simple programming model that applies to many large-scale computing problems
  • Hide messy details
• Bigtable provides a simple data model, a distributed B+ tree …
  • Shards and Replicas
• Simple and clean API design

Keep Simple and Straightforward
• Example: Bigtable vs. Cassandra

[Figure: two architecture diagrams side by side. Bigtable: a Master and several Tablet Servers serving Tablets, on top of GFS. Cassandra: diagram not recoverable from the transcript.]

Keep Simple and Straightforward
Bigtable (++)
• Master – Tablet Servers
• Dynamic Tablet Splits
• WAL + MemTable + SSTable
• Three Level Distributed B+Tree
• Replication in GFS
• …

Bigtable's architecture and data model make more sense.

Cassandra (--)
• Identical Data Nodes, Gossip
• Consistent Hash, Virtual Nodes
• WAL + MemTable + SSTable
• Hinted Handoff
• DHT Ring (neighbor nodes)
• Eventual consistency
• Read Repair
• Merkle Tree
• Vector Clock
• Anti-entropy protocol
• …

So complex: architectural mistakes make the system more and more complicated …

http://www.slideshare.net/schubertzhang/cassandra-dynamo-paper
http://www.slideshare.net/schubertzhang/dastorcassandra-report-for-cdr-solution
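For readers unfamiliar with the "Consistent Hash, Virtual Nodes" item on the decentralized side, here is a minimal ring. It is a generic sketch, not Cassandra's partitioner: the class name `HashRing`, the use of MD5 and the default of 64 virtual nodes per server are illustrative assumptions.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a ring position (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions of all virtual nodes
        self._owner = {}   # ring position -> physical node
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns many small arcs of the ring.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

Removing a node only moves the keys that node owned; every other key keeps its owner. The price, as the list above suggests, is the extra machinery (gossip, hinted handoff, repair) needed once no master holds the authoritative placement.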

3. There is no “one-size-fits-all” solution
• There are too many contradictory requirements in the structured data world.
• The contradiction of data processing
• Real-time or near-real-time data availability.
• Batch processing for large size of data, such as aggregation.

• The contradiction of data access:
• Low-latency fast query response, like Lookup.
• High-latency ad-hoc analytic query for historical data.

• But there is no one-size-fits-all answer to the contradictory requirements above.
• Identify common problems, and build systems to address them in a general way.

• “Important not to try to be all things to all people!” – Jeff Dean, Keynote at LADIS’09

There is no “one-size-fits-all” solution
• MapReduce
• Dremel (MPP)
• Tez/Stinger
• NoSQL/Bigtable (and with Coprocessor)
• DBMS
• …

Lambda Architecture: New data is sent to both layers and queries merge views from both layers.
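The one-line description of the Lambda Architecture can be made concrete with a toy counter. The class and attribute names are illustrative assumptions; a real deployment would put a batch engine and a streaming engine behind the two views.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda-style page-view counter (all names are illustrative)."""

    def __init__(self):
        self.master = []              # append-only master dataset
        self.batch_view = Counter()   # recomputed from scratch by the batch layer
        self.speed_view = Counter()   # incremental view of events since the last batch run

    def ingest(self, page):
        # New data is sent to both layers.
        self.master.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Batch layer: full recomputation over the master dataset.
        self.batch_view = Counter(self.master)
        # The speed layer only has to cover what the batch view has not absorbed yet.
        self.speed_view = Counter()

    def query(self, page):
        # Queries merge the views from both layers.
        return self.batch_view[page] + self.speed_view[page]
```

The batch view is slow to refresh but simple to recompute; the speed view stays small and covers only recent data.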

There is no “one-size-fits-all” solution
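[Figure: SQL, Scripts, Java, etc. on top; below them the engines Hive, Pig, MapReduce, Java, Impala, GoldenOrb, Dremel and Pregel.]

Different query and analysis requests use different parallel execution engines to operate on the data.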

4. Monitorable and Metrizable at any time
• Sufficient Statistics, Monitoring …
• Add Sufficient Monitoring/Status/Debugging Hooks
• If your system is slow or misbehaving, can you figure out why?
• Don’t rely on logs too much; logging is too costly and inefficient.
• Use real-time statistics/metrics.
• Use tools: jmxetric, JMX, Ganglia, Nagios, Noah …

Monitorable and Metrizable at any time
The magic matrix ??!

Captured from UTStarcom mSwitch R5 system, Guangxi Site, 2004.

Write/Insert Operation Benchmark

Read/Query Operation Benchmark


SLA Metrics:
• Latency
  o tAvgLat: Total Average Latency (ms)
  o dAvgLat: Delta Average Latency (ms)
  o dMaxLat: Delta Maximum Latency (ms)
  o dMinLat: Delta Minimum Latency (ms)
• Throughput
  o tThrou: Total Throughput (operation count)
  o dThrou: Delta Throughput (operation count)
• Total: from benchmark start to present.
• Delta: between each statistical interval (2 minutes here)

[Figure: latency distribution of read operations. Y-axis: percentage of read ops (0% to 25%). X-axis: latency quantile in units of 100ms (1 to 61).]

• Read Throughput: average ~140 ops/s
• Latency: average ~500ms, 97% < 2s (SLA)
• Bottleneck: disk IO (random seek) (CPU load is very low)
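The total and delta metrics defined above are cheap to maintain in process, which is the point of preferring metrics to logs. The sketch below reuses the slide's metric names as dictionary keys; the class `LatencyStats` and its methods are hypothetical, not the benchmark tool used in the talk.

```python
class LatencyStats:
    """Tracks total and per-interval (delta) latency and throughput."""

    def __init__(self):
        self.t_count = 0
        self.t_sum_ms = 0.0
        self._reset_delta()

    def _reset_delta(self):
        self.d_count = 0
        self.d_sum_ms = 0.0
        self.d_max_ms = None
        self.d_min_ms = None

    def record(self, latency_ms):
        # Two additions and two comparisons per operation; no string formatting, no IO.
        self.t_count += 1
        self.t_sum_ms += latency_ms
        self.d_count += 1
        self.d_sum_ms += latency_ms
        self.d_max_ms = latency_ms if self.d_max_ms is None else max(self.d_max_ms, latency_ms)
        self.d_min_ms = latency_ms if self.d_min_ms is None else min(self.d_min_ms, latency_ms)

    def snapshot(self):
        """Report the metrics, then start a new statistical interval."""
        report = {
            "tThrou": self.t_count,
            "dThrou": self.d_count,
            "tAvgLat": self.t_sum_ms / self.t_count if self.t_count else 0.0,
            "dAvgLat": self.d_sum_ms / self.d_count if self.d_count else 0.0,
            "dMaxLat": self.d_max_ms,
            "dMinLat": self.d_min_ms,
        }
        self._reset_delta()
        return report
```

A timer would call `snapshot()` once per statistical interval (2 minutes in the benchmark above) and ship the result to a monitoring system.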



5. Try to make data in-situ
• The ability to access data ‘in place’.
• ProtocolBuffers/Parquet encoding
• Example:
  • Horizon over HDFS + HBase

[Figure: Real-Time Data Service. Writes (Puts) and Reads (Get/Scan) go through the Real-Time API (Schema, Meta) to HBase. HBase Flush/Compaction produces HFiles on HDFS; Bulk Load (Batch Input) also produces HFiles. Coprocessor and MapReduce/Impala (Batch Processing) work on the same HFiles in HDFS.]

6. Approximated vs. Precise
• For large data sets, it can be prohibitively expensive to find the precise
result, but there are efficient estimating methods.
• Example Queries:

• How many distinct elements are in the data set (i.e. what is the cardinality of the
data set)?
• What are the most frequent elements (the terms “heavy hitters” and “top-k
elements” are also used)?
• What are the frequencies of the most frequent elements?
• How many elements belong to the specified range (range query, in SQL it looks
like SELECT count(v) WHERE v >= c1 AND v < c2)?
• Does the data set contain a particular element (membership query)?
• …


Approximated vs. Precise
• The algorithms are approximate: with high probability they return approximately the correct result (e.g. ±2%).
• select count(distinct userid) from userlogs;
• select top(100) of count(*) from orders group by itemname;
• …
• Statistical and Probabilistic Analysis, Very interesting!
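An approximate `count(distinct userid)` can be sketched with Linear Counting: hash every value to one bit of a fixed bitmap, then estimate the cardinality from the fraction of bits still zero. The function name, the SHA-1 hash and the default bitmap size are assumptions chosen for the example.

```python
import hashlib
import math


def linear_count(items, m=1 << 16):
    """Estimate the number of distinct items with an m-bit bitmap (m a multiple of 8)."""
    bitmap = bytearray(m // 8)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") % m
        bitmap[h >> 3] |= 1 << (h & 7)          # duplicates hit the same bit
    ones = sum(bin(byte).count("1") for byte in bitmap)
    zeros = m - ones
    if zeros == 0:
        raise ValueError("bitmap saturated; use a larger m")
    # Linear Counting estimator: n ~ -m * ln(fraction of zero bits)
    return -m * math.log(zeros / m)


userids = [i % 5000 for i in range(50_000)]     # 50,000 log rows, 5,000 distinct users
estimate = linear_count(userids)
print(f"estimated distinct users: {estimate:.0f}")
```

Memory use is fixed at `m` bits (8 KB here) no matter how many rows are scanned, in exchange for a small relative error.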

Approximated vs. Precise
• Usually Sample/Hash/Bitmap …
• Cardinality Estimation
• Linear Counting
• Loglog Counting …

• Frequency Estimation / Heavy Hitters
• Count-Min Sketch
• Count-Mean-Min Sketch
• Stream-Summary …

• Range Query

• Array of Count-Min Sketches …

• Membership Query
• Bloom Filter

• …
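Of the structures listed, the Count-Min Sketch is compact enough to show in full. The class below is a generic sketch of the technique; the width, depth and salted SHA-256 hashing are illustrative choices, not parameters from the talk.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator that may overcount but never undercounts."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows is the best bound.
        return min(self.rows[row][col] for row, col in self._columns(item))
```

Feeding every `itemname` from an orders stream into `add()` and keeping a small heap of the largest estimates gives an approximate top-k without a full group-by.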

7. Open Source and Open Spirit
• Choose your Building Blocks from an engineering point of view
• Know Your Basic Building Blocks: not just their interfaces, but understand their implementations (at least at a high level)

• Make good use of open source, give back to open source, and make open source better and stronger

8. And more …
• Description and Documents
• Avoid inventing new Interfaces for Users
• From simple to complete, from prototype to product
• Make the architecture robust, try it, and then improve and complete it.

• Product vs. Tech. vs. Trick
• …

9. Read Books – Read English Books

Find me outside
• SlideShare:
http://www.slideshare.net/schubertzhang
http://www.slideshare.net/hanborq

• Github:
https://github.com/schubertzhang
https://github.com/hanborq

• Email & Gtalk:
schubert.zhang@gmail.com
• Weibo:
@schubertzh

• LinkedIn:
http://cn.linkedin.com/pub/schubertzhang/6/b51/b5b/

• Blog:
http://cloudepr.blogspot.com

• WeChat:
schubertzh

• Facebook:
https://www.facebook.com/schubertzhang
