SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Downloaden Sie, um offline zu lesen
FeatHub: 流批一体的实时特征工程平台
林东
阿里巴巴高级技术专家
Apache Flink/Kafka committer
01 为什么需要FeatHub
02 FeatHub架构和概念
03 FeatHub API展示
01 为什么需要FeatHub
目标场景
需要Python环境的
数据科学家
生成实时特征 需要开源方案
支持多云部署
实时特征工程的痛点
开发难度高 部署难度高 监控难度高 分享难度高
特征穿越
需要手动翻译
性能压力大
特征分布变化 开发工作重复
point-in-time correct语
义
时间 用户ID 最近2分钟点击数
5:00 1 5
6:00 1 10
7:00 1 6
时间 用户ID Label
5:02 1 1
6:03 1 0
6:10 1 1
7:05 1 0
时间 用户ID 最近2分钟点击数 Label
5:02 1 6 1
6:03 1 6 0
6:10 1 6 1
7:05 1 6 0
错误的join
多版本特征
样本数据
训练数据
point-in-time correct语
义
时间 用户ID 最近2分钟点击数
5:00 1 5
6:00 1 10
7:00 1 6
时间 用户ID Label
5:02 1 1
6:03 1 0
6:10 1 1
7:05 1 0
时间 用户ID 最近2分钟点击数 Label
5:02 1 5 1
6:03 1 10 0
6:10 1 10 1
7:05 1 6 0
正确的join
多版本特征
样本数据
训练数据
Feature Store的核心场景
特征开发 特征部署 特征监控 特征分享
Feature Store的核心场景
特征开发 特征部署 特征监控 特征分享
SDK 执行引擎 特征储存 指标 报警 注册 搜索
Feature Store的核心场景
特征开发 特征部署 特征监控 特征分享
SDK 执行引擎 特征储存 指标 报警 注册 搜索
简单易用
point-in-time correct
语义
高吞吐
实时特征
离线储存
在线储存
数值分布变化
实时监控
简单易用
Python SDK 简单易用
快速生产部署
快速开发实验
高性能实时特征
代码开源
处理引擎可扩展
为什么需要FeatHub
02 FeatHub架构和概念
架构
架构 (续)
核心概念
TableDecscriptor TableDecscriptor TableDecscriptor
Transform Transform
TableDecscriptor
FeatureTable
保存在储存上的
物理特征表
FeatureView
从FeatureTable计算得到的
逻辑特征表
DerivedFeatureView SlidingFeatureView OnDemandFeatureView
Transform
Expression Join
OverWindow SlidingWindow
PythonUDF
核心概念 (续)
FlinkProcessor
LocalProcessor SparkProcessor
Processor
Source Sink
Table Table
KafkaSource
FIleSystem
Source
HiveSource KafkaSink
FileSystemSink RedisSink
HiveSink
03 FeatHub API展示
特征计算功能
f_price = Feature(
name="price",
dtype=types.Float32,
transform=JoinTransform(
table_name="price_updates",
feature_name="price"
),
keys=["item_id"],
)
f_total_payment_last_two_minutes = Feature(
name="total_payment_last_two_minutes",
dtype=types.Float32,
transform=SlidingWindowTransform(
expr="item_count * price",
agg_func="SUM",
window_size=timedelta(minutes=2),
step_size=timedelta(minutes=1),
group_by_keys=["user_id"]
)
)
f_total_payment_last_two_minutes = Feature(
name="total_payment_last_two_minutes",
dtype=types.Float32,
transform=OverWindowTransform(
expr="item_count * price",
agg_func="SUM",
window_size=timedelta(minutes=2),
group_by_keys=["user_id"]
)
)
特征拼接 Over窗口聚合 滑动窗口聚合
f_trip_time_duration = Feature(
name="f_trip_time_duration",
dtype=types.Int32, FeatHub性能优化
transform=
"UNIX_TIMESTAMP(taxi_dropoff_datetime) -
UNIX_TIMESTAMP(taxi_pickup_datetime)",
)
内置函数调用
f_lower_case_name = Feature(
name="lower_case_name",
dtype=types.String,
transform=PythonUdfTransform(lambda row: row["name"].lower()),
)
Python UDF调用
样例场景
Purchase Events
- user_id (PK)
- item_id
- item_count
- timestamp
Item Price Events
- item_id (PK)
- price
- timestamp
拼接
- user_id (PK)
- item_id
- item_count
- price
- timestamp
聚合
- user_id (PK)
- item_id
- item_count
- price
- total_payment_last_two_minutes
- timestamp
生成机器学习训练数据集
样例代码
client = FeathubClient(
config={
"processor": {
"processor_type": "flink",
"flink": {
"rest.address": "localhost",
"rest.port":" 8081"
},
},
...
}
)
f_total_payment_last_two_minutes = Feature(
name="total_payment_last_two_minutes",
dtype=types.Float32,
transform=OverWindowTransform(
expr="item_count * price",
agg_func="SUM",
window_size=timedelta(minutes=2),
group_by_keys=["user_id"],
),
)
purchase_events_with_features =
DerivedFeatureView(
name="purchase_events_with_features",
source=purchase_events_source,
features=[
"item_price_events.price",
f_total_payment_last_two_minutes,
],
keep_source_fields=True,
)
purchase_events_source = FileSystemSource(
name="purchase_events",
path="/tmp/data/purchase_events.json",
data_format="json",
schema=purchase_events_schema,
timestamp_field="timestamp",
timestamp_format="%Y-%m-%d %H:%M:%S",
)
item_price_events_source = FileSystemSource(
name="item_price_events",
path="/tmp/data/item_price_events.json",
data_format="json",
schema=item_price_events_schema,
keys=["item_id"],
timestamp_field="timestamp",
timestamp_format="%Y-%m-%d %H:%M:%S",
)
创建FeatHub
Client
创建Source 创建FeatureView
样例代码 (续)
result_table = client.get_features(
features=purchase_events_with_features
)
result_table_df = result_table.to_pandas()
print(result_table_df)
hdfs_sink = FileSystemSink(
path="/tmp/data/output.json",
data_format="CSV"
)
获取特征到本地Pandas
DF
创建Sink 输出特征到离线储存HDFS
result_table.execute_insert(
sink=hdfs_sink,
allow_overwrite=True
)
FeatHub性能优化
• 滑动窗口聚合的特征仅在特征数值变化时输出数据,减少网络带宽使用
• 滑动窗口聚合的特征在窗口为空的时候输出数据,无需下游周期性扫描数据来移除过期数据
• 提供高性能内置UDAF (e.g. VALUE_COUNTS),减少稀疏特征的网络带宽使用量
• E.g. 最近一分钟每个用户点击每种产品分类的数量
• 滑动窗口聚合的特征共用状态,减少计算量和内存使用 (尚未完成)
• E.g. 同时计算最近1分钟/5分钟/10分钟/30分钟内的用户点击数
• …
FeatHub未来工作
• 完善LocalProcessor和FlinkProcessor的功能和性能
• 支持常用的离线/在线储存 (e.g. HDFS, Redis)
• 支持对接Notebook
• 提供可视化UI来访问特征元数据 (e.g. 特征定义,特征血缘)
• 支持特征质量监控和报警
• 支持Spark作为处理引擎
• 支持特征授权,鉴权,审计等企业级功能
• …
欢迎加入FeatHub社区
FeatHub代码库:https://github.com/alibaba/feathub
FeatHub代码样例:https://github.com/flink-extended/feathub-
examples
THANK YOU
谢 谢 观 看

Weitere ähnliche Inhalte

Was ist angesagt?

The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...Altinity Ltd
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practicesconfluent
 
Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -
Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -
Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -Yahoo!デベロッパーネットワーク
 
基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统Dong Lin
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
Local Apache NiFi Processor Debug
Local Apache NiFi Processor DebugLocal Apache NiFi Processor Debug
Local Apache NiFi Processor DebugDeon Huang
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019Timothy Spann
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidJan Graßegger
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David AndersonVerverica
 
Building ZingMe News Feed System
Building ZingMe News Feed SystemBuilding ZingMe News Feed System
Building ZingMe News Feed SystemChau Thanh
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 

Was ist angesagt? (20)

The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 
Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -
Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -
Yahoo! JAPANのデータパイプラインで起きた障害とチューニング - Apache Kafka Meetup Japan #5 -
 
基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统基于 Flink 和 AI Flow 的实时推荐系统
基于 Flink 和 AI Flow 的实时推荐系统
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Local Apache NiFi Processor Debug
Local Apache NiFi Processor DebugLocal Apache NiFi Processor Debug
Local Apache NiFi Processor Debug
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Real-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and DruidReal-time Analytics with Apache Flink and Druid
Real-time Analytics with Apache Flink and Druid
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
Building ZingMe News Feed System
Building ZingMe News Feed SystemBuilding ZingMe News Feed System
Building ZingMe News Feed System
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 

Ähnlich wie FeatHub_FFA_2022

FeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptxFeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptxDong Lin
 
Scaling Offline Database Usage On GCP @ Dcard
Scaling Offline Database Usage On GCP @ DcardScaling Offline Database Usage On GCP @ Dcard
Scaling Offline Database Usage On GCP @ DcardJui An Huang (黃瑞安)
 
Python在移动社交平台中的应用
Python在移动社交平台中的应用Python在移动社交平台中的应用
Python在移动社交平台中的应用Leo Zhou
 
Phalcon phpconftw2012
Phalcon phpconftw2012Phalcon phpconftw2012
Phalcon phpconftw2012Rack Lin
 
Phalcon the fastest php framework 阿土伯
Phalcon   the fastest php framework 阿土伯Phalcon   the fastest php framework 阿土伯
Phalcon the fastest php framework 阿土伯Hash Lin
 
基于 lucene 的站内搜索
基于 lucene 的站内搜索基于 lucene 的站内搜索
基于 lucene 的站内搜索fulin tang
 
Python 于 webgame 的应用
Python 于 webgame 的应用Python 于 webgame 的应用
Python 于 webgame 的应用勇浩 赖
 
JCConf 2015 TW 高效率資料爬蟲組合包
JCConf 2015 TW 高效率資料爬蟲組合包JCConf 2015 TW 高效率資料爬蟲組合包
JCConf 2015 TW 高效率資料爬蟲組合包書豪 李
 
Streaming architecture zx_dec2015
Streaming architecture zx_dec2015Streaming architecture zx_dec2015
Streaming architecture zx_dec2015Zhenzhong Xu
 
HiveSQL如何平迁到FlinkSQL
HiveSQL如何平迁到FlinkSQLHiveSQL如何平迁到FlinkSQL
HiveSQL如何平迁到FlinkSQLJarkWu
 
Software Engineer Talk
Software Engineer TalkSoftware Engineer Talk
Software Engineer TalkLarry Cai
 
Kamigo reviews 20191127
Kamigo reviews 20191127Kamigo reviews 20191127
Kamigo reviews 20191127Jia Yu Lin
 
Weibo lamp improvements
Weibo lamp improvementsWeibo lamp improvements
Weibo lamp improvementsXinchen Hui
 
GaoLei\'s Summer Intern Report.pdf
GaoLei\'s Summer Intern Report.pdfGaoLei\'s Summer Intern Report.pdf
GaoLei\'s Summer Intern Report.pdfLeon Gao(高磊)
 
探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan
探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan
探索 API 開發的挑戰與解決之道 | .NET Conf 2023 TaiwanAlan Tsai
 
Intro to Dialogflow Chatbot Development
Intro to Dialogflow Chatbot DevelopmentIntro to Dialogflow Chatbot Development
Intro to Dialogflow Chatbot DevelopmentRyan Chung
 
LikeCoin SDK 及 API 分享
LikeCoin SDK 及 API 分享LikeCoin SDK 及 API 分享
LikeCoin SDK 及 API 分享William Chong
 
Mochimedia's Success Story - Case Study I (Python-based Company)
Mochimedia's Success Story - Case Study I (Python-based Company)Mochimedia's Success Story - Case Study I (Python-based Company)
Mochimedia's Success Story - Case Study I (Python-based Company)Sting Chen
 
Langchain and Azure ML and Open AI
Langchain and Azure ML and Open AILangchain and Azure ML and Open AI
Langchain and Azure ML and Open AIKo Ko
 
Python与抓包
Python与抓包Python与抓包
Python与抓包Leo Zhou
 

Ähnlich wie FeatHub_FFA_2022 (20)

FeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptxFeatHub_GAIDC_2022.pptx
FeatHub_GAIDC_2022.pptx
 
Scaling Offline Database Usage On GCP @ Dcard
Scaling Offline Database Usage On GCP @ DcardScaling Offline Database Usage On GCP @ Dcard
Scaling Offline Database Usage On GCP @ Dcard
 
Python在移动社交平台中的应用
Python在移动社交平台中的应用Python在移动社交平台中的应用
Python在移动社交平台中的应用
 
Phalcon phpconftw2012
Phalcon phpconftw2012Phalcon phpconftw2012
Phalcon phpconftw2012
 
Phalcon the fastest php framework 阿土伯
Phalcon   the fastest php framework 阿土伯Phalcon   the fastest php framework 阿土伯
Phalcon the fastest php framework 阿土伯
 
基于 lucene 的站内搜索
基于 lucene 的站内搜索基于 lucene 的站内搜索
基于 lucene 的站内搜索
 
Python 于 webgame 的应用
Python 于 webgame 的应用Python 于 webgame 的应用
Python 于 webgame 的应用
 
JCConf 2015 TW 高效率資料爬蟲組合包
JCConf 2015 TW 高效率資料爬蟲組合包JCConf 2015 TW 高效率資料爬蟲組合包
JCConf 2015 TW 高效率資料爬蟲組合包
 
Streaming architecture zx_dec2015
Streaming architecture zx_dec2015Streaming architecture zx_dec2015
Streaming architecture zx_dec2015
 
HiveSQL如何平迁到FlinkSQL
HiveSQL如何平迁到FlinkSQLHiveSQL如何平迁到FlinkSQL
HiveSQL如何平迁到FlinkSQL
 
Software Engineer Talk
Software Engineer TalkSoftware Engineer Talk
Software Engineer Talk
 
Kamigo reviews 20191127
Kamigo reviews 20191127Kamigo reviews 20191127
Kamigo reviews 20191127
 
Weibo lamp improvements
Weibo lamp improvementsWeibo lamp improvements
Weibo lamp improvements
 
GaoLei\'s Summer Intern Report.pdf
GaoLei\'s Summer Intern Report.pdfGaoLei\'s Summer Intern Report.pdf
GaoLei\'s Summer Intern Report.pdf
 
探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan
探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan
探索 API 開發的挑戰與解決之道 | .NET Conf 2023 Taiwan
 
Intro to Dialogflow Chatbot Development
Intro to Dialogflow Chatbot DevelopmentIntro to Dialogflow Chatbot Development
Intro to Dialogflow Chatbot Development
 
LikeCoin SDK 及 API 分享
LikeCoin SDK 及 API 分享LikeCoin SDK 及 API 分享
LikeCoin SDK 及 API 分享
 
Mochimedia's Success Story - Case Study I (Python-based Company)
Mochimedia's Success Story - Case Study I (Python-based Company)Mochimedia's Success Story - Case Study I (Python-based Company)
Mochimedia's Success Story - Case Study I (Python-based Company)
 
Langchain and Azure ML and Open AI
Langchain and Azure ML and Open AILangchain and Azure ML and Open AI
Langchain and Azure ML and Open AI
 
Python与抓包
Python与抓包Python与抓包
Python与抓包
 

FeatHub_FFA_2022

Hinweis der Redaktion

  1. 开发难度高:需要较大的工作量来避免特征穿越问题 部署难度高:数据科学家需要工程师支持完成特征的高性能在线部署 监控难度高:需要较大的工作量来发现和解决特征质量问题 分享难度高:缺乏元数据中心来方便开发者注册/搜索/复用特征定义和特征数据
  2. 特征开发:提供简单易用的SDK来定义特征,避免特征穿越问题,提升开发效率 特征部署:将特征定义自动翻译成高性能分布式数据处理作业 (e.g. Flink, Spark),减少额外的开发工作 特征监控:提供丰富的metric以及报警,来及时发现,定位以及解决特征质量问题,提升机器学习推理效果 特征分享:提供SDK以及UI来方便开发者注册,搜索和读取特征。鼓励合作分享,减少重复的开发工作
  3. 特征开发:提供简单易用的SDK来定义特征,避免特征穿越问题,提升开发效率 特征部署:将特征定义自动翻译成高性能分布式数据处理作业 (e.g. Flink, Spark),减少额外的开发工作 特征监控:提供丰富的metric以及报警,来及时发现,定位以及解决特征质量问题,提升机器学习推理效果 特征分享:提供SDK以及UI来方便开发者注册,搜索和读取特征。鼓励合作分享,减少重复的开发工作
  4. 容易使用的特征开发SDK 支持丰富的数据源和数据储存 定义具有point-in-time correctness的丰富的特征计算逻辑 特征部署/计算/储存 输出特征到离线存储 (e.g. HDFS) 用于离线训练 输出特征到在线存储 (e.g. Redis) 用于在线推理,支持基于时间的覆盖规则 提供高吞吐,低延迟的实时特征计算 特征质量监控 监控特征作业延迟,稳定性 etc. 监控特征数值分布变化,异常数值比例 etc. 特征元数据中心 支持特征注册,搜索和读取
  5. SDK简单易用:基于Python的SDK,可以插入Python/Java/C++ UDF, 贴近数据科学家的开发习惯 支持快速实验:数据科学家可以在单机运行实验,无需依赖远程分布式集群 支持快速部署:数据科学家无需改代码就能完成从实验到生产的切换 支持实时特征:支持基于Flink的原生实时特征计算,实现流批一体 处理引擎可扩展: 处理引擎框架能被扩展来支持Spark, 公司内部自研的引擎 etc. 开源且支持本地化部署:支持多云,不绑定任何云服务
  6. TableDescriptor – 声明式的table定义,代表一组特征 FeatureTable – 代表具有物理位置的table (e.g. KafkaSource, FileSystemSink) FeatureView – 代表对某个table应用Transform逻辑所得到的table DerivedFeatureView – 支持单行计算,Over窗口聚合,以及table join SlidingFeatureView – 支持单行计算和滑动窗口聚合 OnDemandFeatureView – 支持单行计算和table join。用于特征在线计算 Transformation – 声明式的特征计算逻辑 ExpressionTransform – 表达单行的计算逻辑 (e.g. x + y * z) JoinTransform – 表达多个table之间的特征拼接 OverWindowTransform – 表达基于Over窗口聚合的计算逻辑 SlidingWindowTransform – 表达基于滑动窗口聚合的计算逻辑 PythonUDFTransform – 支持用户自定义的Python UDF …
  7. Processor – 可插拔的计算引擎,用于特征ETL LocalProcessor – 使用单机上的CPU来执行特征ETL,方便单机开发调试,无需依赖分布式集群 FlinkProcessor – 使用分布式Flink集群来执行特征ETL,提供生产所需的高并发,稳定性 etc. … Feature Registry – 特征元数据中心,用于特征注册,搜索和读取 LocalRegistry – 使用单机上的内存来储存特征元数据,方便单机开发调试 … FeatureService - 特征在线服务,用于计算/读取/拼接在线特征 LocalFeatureService – 使用单机上的CPU来计算在线特征,方便单机开发调试 …