14. Common statistical models
• Exponential family distributions:
– Examples: Gaussian, multinomial, maximum entropy
– Maximum likelihood (ML) estimation is linked to data through
sufficient statistics (e.g. $\sum_i x_i$, $\sum_i x_i^2$ for a Gaussian).
• Mixtures of exponential family distributions:
– Examples: Mixture of Gaussians, hidden Markov models,
probabilistic latent semantic analysis (pLSA)
– ML estimation can be approached iteratively. In each iteration,
we update the model with the current statistics.
• Bayesian learning of exponential family distributions:
– Examples:
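For reference, the link between ML estimation and sufficient statistics can be written out as below. This is the standard textbook form rather than something taken from the slides; the symbols $h$, $T$, $A$ and $\theta$ are introduced here only for illustration.

```latex
p(x \mid \theta) = h(x)\,\exp\!\big(\theta^{\top} T(x) - A(\theta)\big)
\quad\Longrightarrow\quad
\mathbb{E}_{\hat\theta}\,[T(x)] = \frac{1}{N}\sum_{i=1}^{N} T(x_i)

\text{Gaussian: } T(x) = (x,\; x^2), \qquad
\hat\mu = \frac{1}{N}\sum_{i} x_i, \qquad
\hat\sigma^2 = \frac{1}{N}\sum_{i} x_i^2 - \hat\mu^2
```

The ML estimate therefore depends on the data only through the accumulated sums of $T(x_i)$, which is what makes it possible to compute those sums in parallel.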
15. Map/Reduce statistical learning workflow
[Figure: flowchart with the nodes data, mapper, (sufficient) statistics, reducer, model]
• Key points:
a. Only generate compact statistics (with size proportional to the model size) in a
mapper, to reduce the shuffle/sort cost.
b. Capture this generic flow in a C++ template library.
16. Map/Reduce training of basic statistical models
Mapper:

    template <class TModel>
    class CTrainMapper : public CFeature, public IDataAnalyzer {
    protected:
        TModel * pModel;

    public:
        /// Consume a data record (author: Peng Liu)
        virtual bool consume(const CRecord & record) {
            CFeature::consume(record);           // parse the record into a feature vector
            pModel->accumulate(*this, 1.0f);     // add it to the statistics with weight 1
            return true;
        }

        /// Produce statistics (or modified data in case needed) (author: Peng Liu)
        virtual bool produce(CRecord & record) {
            pModel->produce(record);
            if (record.getField("STAT") != NULL) {
                record.rmvField("PARAM");        // emit only the statistics
                return true;
            }
            return false;
        }
    };

Reducer:

    template <class TModel>
    class CTrainReducer : public IDataAnalyzer {
    protected:
        TModel * pModel;

    public:
        /// Consume a data record (author: Peng Liu)
        virtual bool consume(const CRecord & record)
        { return pModel->consume(record); }

        /// Try to update the model after all input data has been read (author: Peng Liu)
        virtual void finish()
        { pModel->update(); }

        /// Produce the model (author: Peng Liu)
        virtual bool produce(CRecord & record) {
            pModel->produce(record);
            if (record.getField("STAT") != NULL) {
                record.rmvField("STAT");         // emit only the model parameters
                return true;
            }
            return false;
        }
    };
17. Example: Gaussian model training
• Sufficient statistics: $\sum_i x_i$, $\sum_i x_i^2$
• Model parameters: $\mu$, $\sigma^2$
• Statistics accumulation:

    void CGaussDiag::accumulate(CFeature & x, float occ)
    {
        size_t dim = getFeaDim();
        assert(x.size() == dim);

        accumOcc(occ);                            // accumulate the total weight
        for (size_t d = 0; d < dim; d ++)
        {
            stats[      d] += occ * x[d];         // weighted sum of x
            stats[dim + d] += occ * x[d] * x[d];  // weighted sum of x^2
        }
    }

• Model update:

    void CGaussDiag::update()
    {
        size_t dim = getFeaDim();
        for (size_t d = 0; d < dim; d ++)
        {
            float X  = stats[d];
            float X2 = stats[dim + d];
            params[      d] = X / occ();                        // mean
            // Note: this stores the inverse variance, occ / (X2 - X^2 / occ)
            params[dim + d] = occ() / (X2 - X * X / occ());
        }
    }
18. Pig: SQL-like data processing on Hadoop
• Example:
Users = load 'users' as (name, age);
Fltrd = filter Users by
        age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
       COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
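• The Pig interpreter plans the whole script globally to reduce the total number of map/reduce passes
• UDFs can be used to define custom data formats; when only a few fields of a
large data set need to be accessed, the column-oriented store Zebra (a Pig subproject) can be used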
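To make the data flow of the example script concrete, here is a rough single-machine C++ sketch of what it computes. The `User` and `Page` structs and the `topSites` function are hypothetical names introduced for illustration, and the sketch assumes user names are unique. It does not reflect how Pig actually executes the script.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct User { std::string name; int age; };
struct Page { std::string user; std::string url; };

typedef std::pair<std::string, long> SiteClicks;  // (url, clicks)

// Returns up to `limit` (url, clicks) pairs, most-clicked first, counting
// only page views by users aged 18 to 25 inclusive.
std::vector<SiteClicks> topSites(const std::vector<User>& users,
                                 const std::vector<Page>& pages,
                                 std::size_t limit = 5) {
    // Fltrd = filter Users by age >= 18 and age <= 25
    std::set<std::string> fltrd;
    for (const User& u : users)
        if (u.age >= 18 && u.age <= 25) fltrd.insert(u.name);

    // Jnd / Grpd / Smmd: join on user name, group by url, count rows
    std::map<std::string, long> clicks;
    for (const Page& p : pages)
        if (fltrd.count(p.user) > 0) ++clicks[p.url];

    // Srtd = order by clicks desc; Top5 = limit
    std::vector<SiteClicks> srtd(clicks.begin(), clicks.end());
    std::stable_sort(srtd.begin(), srtd.end(),
                     [](const SiteClicks& a, const SiteClicks& b) {
                         return a.second > b.second;
                     });
    if (srtd.size() > limit) srtd.resize(limit);
    return srtd;
}
```

The Pig script expresses the same filter, join, group, count, sort and limit steps declaratively, and leaves it to the interpreter to decide how many map/reduce passes are needed.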
19. Overview of common Pig statements
Category                 Operator            Description
Loading and storing      LOAD                Loads data from the file system into a relation
                         STORE               Saves a relation to the file system
                         DUMP                Prints a relation to the console
Filtering                FILTER              Removes unwanted rows from a relation
                         DISTINCT            Removes duplicate rows from a relation
                         FOREACH…GENERATE    Adds or removes fields from a relation
                         STREAM              Transforms a relation using an external program
                         SAMPLE              Selects a random sample of a relation
Grouping and joining     JOIN                Joins two or more relations
                         COGROUP             Groups the data in two or more relations
                         GROUP               Groups the data in a single relation
                         CROSS               Creates the cross-product of two or more relations
Sorting                  ORDER               Sorts a relation by one or more fields
                         LIMIT               Limits the size of a relation
Combining and splitting  UNION               Combines two or more relations into one
                         SPLIT               Splits a relation into two or more relations
21. Targeted advertising system architecture
[Figure: architecture diagram with the components Ad server, Cache, Zookeeper, Thrift + Scribe, Storm, Audience Targeting, Realtime CTR feedback, Anti-spam, CTR Model, User Profiling, Realtime Billing]