分布式计算与Hadoop - 刘鹏

LAMP人主题分享交流会

www.LAMPER.cn
QQ群：3330312
http://weibo.com/lampercn

分布式计算与Hadoop
MediaV 刘鹏
Email: liupeng@mediav.com
微博: weibo.com/bmchs

MediaV(聚胜万合)简介
• 创建于2009年
• 领先的互联网精准营销技术服务机构
• 分支：上海（总部）, 北京、深圳、广州、南京、厦门
• 员工人数：超过300人

• 营业额：2009年2000万，2010年2.4亿，2011年预计5亿
• 客户数：超过100位B2C企业找到客户，成就业绩

MediaVers
董事长兼首席执行官首席技术官

杨炯纬胡宁博士

复旦大学计算机 MBA Carnegie Mellon
原好耶广告网络总裁 University 博士
原GOOGLE 技术总监

北京研发中心总经理首席科学家

魏小勇刘鹏博士

复旦大学计算机硕士清华大学电子工程博士
原微软大中华区软件原雅虎Y！ Labs高级科
安全事务部总监学家

Hadoop 概况
• Apache 开源项目
– 源于Lucene项目的一部分, 2006.1成为子项目, 现为Apache顶级项目之一
– Yahoo! 是最主要的源代码贡献者, 其他贡献者: Powerset, Facebook 等
– 已知为接近150家的大型组织实际使用: Yahoo!, Amazon, EBay, AOL,
Google, IBM, Facebook, Twitter, Baidu, Alibaba, Tencent, …
(http://wiki.apache.org/hadoop/PoweredBy)
• Hadoop 核心功能
– 高可靠性, 高效率的分布式文件系统
– 一个海量数据处理的编程框架
• Hadoop 目标
– 可扩展性: Petabytes (1015 Bytes) 级别的数据量, 数千个节点
– 经济性: 利用商品级(commodity)硬件完成海量数据存储和计
– 可靠性: 在大规模集群上提供应用级别的可靠性

分布式计算常用工具

GFS
BigTable

Avro Hbase

Storm S4
Oozie

Scribe Chuhwa Pig
Elephant-bird
Zoo
Chubby Keeper Hive

Map/Reduce
• 什么是Map/Reduce
– 一种高效, 海量的分布式计算编程模型
– 海量: 相比于MPI, Map处理之间的独立性使得整个系统的可
靠性大为提高.
– 高效: 用调度计算代替调度数据!
– 分布式操作和容错机制由系统实现, 应用级编程非常简单.
• 计算流程非常类似于简单的Unix pipe:
– Pipe: cat input | grep | sort | uniq -c > output
– M/R: Input | map | shuffle & sort | reduce | output
• 多样的编程接口:
– Java native map/reduce – 可以操作M/R各细节
– Streaming – 利用标准输入输出模拟以上pipeline
– Pig –只关注数据逻辑，无须考虑M/R实现

Map/reduce 计算流程

Input Map Shuffle & Reduce Output
Sort

Map

Reduce
Input Output
data data
Map
Reduce

Map

Combine Partition

用Java进行Map/Reduce编程
Mapper.run() Reducer.run()
public void run(Context context) public void run(Context context)
throws IOException, throws IOException,
InterruptedException InterruptedException
{ {
setup(context); setup(context);
while (context.nextKeyValue()) while (context.nextKey())
{ {
map( reduce(
context.getCurrentKey(), context.getCurrentKey(),
context.getCurrentValue(), context.getValues();
context); context);
} }
cleanup(context); cleanup(context);
} }

Combiner
1
1
1
Map 1Combiner 6
1
1
1 1
Map 1 1Combiner 6 Reduce

1 1
Map Combiner 4
1 1

• 实现: Combiner是一个Reducer的子类
• 调用: mapreduce.Job类的setCombinerClass()方法:
job. setCombinerClass(CombinerName.class)

Partitioner
Map

Reduce Part-0

Map
Reduce Part-1

Map

• 实现: 继承hadoop.mapreduce.Partitioner<KEY, VALUE>,
实现抽象方法 int getPartition(KEY key, VALUE value, int
numPartitions)
• 调用: mapreduce.Job类的setPartitionerClass()方法:
job. setPartitionerClass(PartitionerName.class)

Hadoop Streaming
• 模拟Pipe方式执行Map/Reduce Job, 并利用标准输入/输
出调度数据:
– Input | map | shuffle & sort | reduce > output
• 开发者可以使用任何编程语言实现map和reduce过程,
只需要从标准输入读入数据, 并将处理结果打印到标
准输出.
• 只支持文本格式数据, 数据缺省配置为每行为一个
Record, Key和value之间用t分隔.
• 例，生成大量文本上的字典：
– map：awk „{for (i=1; i <=NF; i ++){print $i}}‟
– reduce: uniq

常用统计模型
• 指数族分布:
– 举例: Gaussian, multinomial, maximum entropy
– Maximum likelihood (ML) estimation is linked to data through
2
sufficient statistics. (e.g. x , x for Gaussian)
i i

• 指数族混合分布:
– Examples: Mixture of Gaussians, hidden Markov models,
probabilistic latent semantic analysis (pLSA)
– ML estimation can be iteratively approached. In each iteration,
we update the model with the current statistics.
• 指数族分布的贝叶斯学习:
– Examples:

14

Map/Reduce 统计学习流程
(sufficient) statistics

data
mapper reducer

model

• Key points:
a. Only generate compact statistics (with size proportional to the model size) in a
mapper, to reduce the shuffle/sort cost.
b. Summarize such a generic flowchart by a template library with c++.

15

Map/reduce 基本统计模型训练
Mapper Reducer
template <class TModel> template <class TModel>
class CTrainMapper : public CFeature, public class CTrainReducer : public IDataAnalyzer{
IDataNnalyzer{ protected:
protected: TModel * pModel;
TModel * pModel;
public:
public: /// Comsume a data record author Peng Liu
/// Comsume a data record author Peng Liu virtual bool consume(const CRecord &
virtual bool consume(const CRecord & record)
record){ {return pModel -> consume(record);}
CFeature::consume(record);
pModel -> accumulate(*this, 1.0f); /// Try to update model after all input
return true; data finish author Peng Liu
} virtual void finish()
{pModel -> update();}
/// Produced statistics (or modified data
in case needed) author Peng Liu /// Produced model author Peng Liu
virtual bool produce(CRecord & record){ virtual bool produce(CRecord & record){
static bool first = true; pModel -> produce(record);
pModel -> produce(record); if (record.getField("STAT") != NULL){
if (record.getField("STAT") != NULL){ record.rmvField("STAT");
record.rmvField("PARAM") return true;
return true; }
} return false;
return false; }
}; };
}
16

示例: Gaussian模型训练
• Sufficient statistics: xi , xi2 • Model parameter: μ, σ 2

• Model update:
• Statistics accumulation:
void CGaussDiag::update()
void CGaussDiag::accumulate(CFeature
& x, float occ) {
{ size_t dim = getFeaDim();
size_t dim = getFeaDim(); for (size_t d = 0; d < dim; d ++)
assert(x.size() == dim); {
float X = stats[d],
accumOcc(occ); float X2 = stats[dim + d];
for (size_t d = 0; d < dim; d ++) params[ d] = X / occ();
{
params[dim + d] = occ() / (X2 -
stats[ d] += occ * x[d]; X * X / occ());
stats[dim + d] += occ * x[d] *
}
x[d];
} }
}

17

Pig – 类SQL Hadoop数据处理
• Example:
Users = load „users‟ as (name, age);
Fltrd = filter Users by
age >= 18 and age <= 25;
Pages = load „pages‟ as (user, url);
Jnd = join Fltrd by name, Pages by user; Native Java code
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into „top5sites‟;
• Pig解释器进行整体规划以减少总的map/reduce次数
• 可用UDF自定义数据格式，在只需要访问大量数据的
部分字段时，可以采用列存储的Zebra（Pig子项目)

Pig 常用语句一览
Category Operator Description
Load and LOAD Loads data from the file system into a relation
storing STORE Saves a relation to the file system
DUMP Prints a relation to the console
Filtering FILTER Remove unwanted rows from a relation
DISTINCT Remove duplicate rows from a relation
FOREACH…GENERATE Adds or removes fields from a relation
STREAM Transfers a relation using an external program
SAMPLE Selects a random sample of a relation
Grouping and JOIN Joins two or more relations
joining COGROUP Groups the data in two or more relations
GROUP Groups the data in a single relation
CROSS Creates the cross-product of two or more relations
Sorting ORDER Sorts a relation by one or more fields
LIMIT Limits the size of a relation
Combining and UNION Combines two or more relations into one
splitting SPLIT Splits a relation into two or more relations

若干用户定向技术
阶段定向方式
曝光(exposure) 效果
上下文
(2.1, 3.1)
关注(attention) 重定向
(2.2, 2.3, 3.1) 兴趣
Look-alike (2.3, 3.1)
理解(comprehension) (2.3, 3.1, 4.1, 6.1)

信息接受 Hyper-local 地域
(2.3, 4.1) 网站/频道 (2.3, 4.1)
(message (2.3, 3.1, 4.2) 用户属性
acceptance) (2.3, 3.1, 6.1)
团购
保持 (retention) (2.3, 4.1, 6.1)

购买(purchase) 作用阶段

精准广告系统框架
Ad server Zookeeper

Cache
Thrift +
Scribe Storm
Audience
Realtime
Targeting
CTR feedback
Anti
-spam
CTR User Realtime
Model Profiling Billing

Hadoop上的工作流引擎 - Oozie
• 连接多个Map/reduce Job, 完成复杂的数据处理
• 处理各Job以及数据之间的依赖关系（可以依赖的条
件：数据，时间，其他Job)
• 使用hPDL(一种XML流程语言) 来定义DAG工作流, 例：
<start to='ingestor'/>
<action name='ingestor'> … <ok to=“proc"/> <error to="fail"/> </action>
<fork name=“proc">
<path start=“proc1"/> <path start=" proc2"/> </fork>
<action name=„proc1‟> … <ok to="completed"/> <error to="fail"/> </action>
<action name='proc2'> … <ok to="completed"/> <error to="fail"/> </action>
<join name="completed" to="end"/>
<kill name="fail"> <message>Java failed, error message… </kill>
<end name='end'/>
• 注意： Oozie目前不支持Iteration

服务搭建基础工具 - Thrift
• 跨语言服务快速搭建 (c++, java, python, ruby, c# …)
• 用struct定义语言无关的通信数据结构:
• struct KV
{1:optional i32 key=10; 2:optional string value=“x”}
• 用service定义RPC服务接口:
• service KVCache{void set(1:i32 key, 2:string value);
string get(1:32 key); void delete(1:i32 key);}
• 将上述声明放在IDL文件(比如service.thrift)中, 用thrift –r –
gen cpp service.thrift生成服务框架代码
• 能实现结构体和接口的Backward compatible
• 类似工具：Hadoop子项目Avro，Google开发的ProtoBuf

Thrift 相关工具
• Scribe:
• 大规模分布式日志收集系统
• 利用Thrift实现底层服务
• Elephant-bird:
• ProtoBuf, Thrift的Pig, Hive代码生成器
• 输入Thrift IDL文件，输出：
• 对应的Pig Loader jar包；
• 一段包含结构体声明的Pig code:
KV_data = load „input‟ using …() as (key: long, value:
char_array)

分布式同步服务 - Zookeeper
• 以分布式方式解决分布式系统中的各种同步问题
• 用层次式Namespace维护同步需要的状态空间

• 保证实现以下特性
• Sequential Consistency, Atomicity, Single System
Image, Reliability, Timeliness
• 较为复杂的同步模式，需要利用API编程实现

流式计算平台 - Storm
• 大规模实时数据处理框架, 自动完成数据分发和可靠性管
理, 开发者只需要关注处理逻辑.
• 数据流基本在网络和内存进行.
• 计算逻辑类似于Map/Reduce, 区别在固定计算而调度数据

Topology Tasks

分布式计算与Hadoop - 刘鹏

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (6)

Ähnlich wie 分布式计算与Hadoop - 刘鹏

Ähnlich wie 分布式计算与Hadoop - 刘鹏 (20)

Mehr von Shaoning Pan

Mehr von Shaoning Pan (20)

分布式计算与Hadoop - 刘鹏