CDSW: A Collaborative Data Science Analytics Platform
Jianwei Li
A collaborative data science analytics platform built on Hadoop and Spark
1.
© Cloudera, Inc. All rights reserved.
Cloudera Data Science Workbench: an enterprise self-service analytics and collaboration platform for data scientists
Jianwei Li | Big Data Architect @ Cloudera
2.
Agenda
• Data science and the challenges it faces
• CDSW features
• CDSW principles and architecture
• Customer churn early warning with CDSW
• Q&A
3.
Customer churn early warning
4.
Customer churn early warning
5.
Customer churn early warning: data
• KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
• OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
• NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
• OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
• OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, True
Columns: customer tenure, phone number, international roaming, voicemail, number of voicemail messages, daytime call minutes, daytime call count, daytime call charges, evening…, night…, customer service call count, churned?
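The sample records above can be loaded into named fields with a few lines of Python. This is a minimal sketch rather than the presenter's code: the slide gives only an abbreviated column list, so the 21 names in `COLUMNS` (including `state`, `area_code` and the `intl_*` fields) are assumptions inferred from the number and order of values in each row.

```python
import csv
import io

# Assumed column names: the slide abbreviates the header, so these 21 names
# are hypothetical labels matched to the 21 values in each sample row.
COLUMNS = [
    "state", "account_length", "area_code", "phone_number",
    "intl_plan", "voicemail_plan", "voicemail_messages",
    "day_minutes", "day_calls", "day_charge",
    "eve_minutes", "eve_calls", "eve_charge",
    "night_minutes", "night_calls", "night_charge",
    "intl_minutes", "intl_calls", "intl_charge",
    "service_calls", "churned",
]

# Two of the rows shown on the slide.
SAMPLE = """\
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, True
"""


def parse_records(text):
    """Parse comma-separated churn rows into dictionaries with typed fields."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, (field.strip() for field in row)))
        # The label may carry a trailing period, as in "False."
        record["churned"] = record["churned"].rstrip(".") == "True"
        for name in ("intl_plan", "voicemail_plan"):
            record[name] = record[name] == "yes"
        record["service_calls"] = int(record["service_calls"])
        record["day_minutes"] = float(record["day_minutes"])
        records.append(record)
    return records


records = parse_records(SAMPLE)
```

Only a few fields are converted to numbers or booleans here; the remaining values stay as strings until a later feature-preparation step needs them.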
6.
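Three approaches to equipment maintenance
Reactive maintenance
• Manual repair after equipment fails
• Passive response
Preventive maintenance
• Maintain equipment on a regular schedule
• Scheduled response
Predictive maintenance
• Continuously monitor equipment operating metrics and intervene when anomalies appear
• Proactive response
Business value: passive → proactive
Most enterprises use these two approaches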
7.
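• Monitor equipment status and performance metrics in real time through sensors
• Detect anomalous variables and patterns that may lead to failures, and predict when equipment will fail
• Draw up the corresponding inspection and maintenance plans
Reduce cost | Reduce downtime | Improve quality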
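As a rough illustration of the "detect anomalous variables" step, the sketch below flags sensor readings that deviate sharply from their recent history. The slide does not name a detection method; the trailing-window z-score used here, the function name `find_anomalies`, the window size, the threshold and the sample temperatures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that lie more than `threshold` standard
    deviations from the mean of the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: a z-score is undefined, so skip
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical temperature readings with one spike at index 6.
temps = [70.1, 70.3, 69.9, 70.2, 70.0, 70.1, 85.4, 70.2]
spikes = find_anomalies(temps)
```

A production system would more likely learn failure patterns from labelled history, but a simple rule like this is often the first baseline.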
8.
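Business value of predictive maintenance
• Reduce downtime: predicting from real-time data and preventing system outages reduces downtime by 50%
• Reduce cost: predictive maintenance reduces equipment maintenance costs by 10% to 40%
Chart labels: repair & replace, predict & prevent; 50%, 50%, 40%
Data source: McKinsey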
9.
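Energy » Equipment failure prediction » Higher production efficiency » Lower cost
Wind power case study
• Overall wind turbine condition assessment
• Anemometer health assessment
• Turbine internal gearbox condition assessment
• Turbine external component condition assessment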
10.
Wind turbine failure prediction
11.
An open data science toolset
12.
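Challenges facing data science
Stages: Data engineering | Data science (Exploratory) | Production (Operational) | Data Governance
• Most data science algorithms run on personal tools over small-scale data, so solutions are hard to replicate
• Very few models reach the production stage
• Different departments and teams have different requirements for tools and programming languages
• Large amounts of data must be copied between different systems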
13.
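Problems data scientists encounter
Data access
• Security restrictions on internal enterprise data prevent access to data in the big data cluster
• Existing data analysis tools cannot connect to the enterprise Hadoop system
Platform scalability
• Personal computers provide limited storage and compute capacity
• Modeling is based on sampled data
• Model training takes a long time (8 hours for SAS-based model training)
User experience
• Software tool versions are hard to maintain
• Python vs R
• Python 2.7 vs 3.5
• Developed models are hard to move into production
• Notebook tools are hard to connect to big data technologies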
14.
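Problems IT teams encounter
• Multi-tenant management:
  • Managing multiple software packages and their dependencies
  • Software version management
  • Cluster sharing between data engineers and data scientists
• Security and oversight:
  • Data lineage analysis is lost when work goes through notebook tools
• Data quality and data copies:
  • Local copies of data go stale
  • Multiple copies of the same datasets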
15.
Hadoop and machine learning
Improve data science efficiency and shorten the time to extract value from data
Stages: data sources | data ingestion | distributed storage and processing | data analysis and intelligence (machine learning)
Data sources: IoT data, internal enterprise data
• Apache Kafka: Stream or batch ingestion of IoT data
• Apache Sqoop: Ingestion of data from relational sources
• Apache HDFS: Storage (HDFS) & deep batch processing
• Apache Kudu: Storage & serving for fast changing data
• Apache HBase: NoSQL data store for real time applications
• Apache Impala: MPP SQL for fast analytics
• Cloudera Search: Real time search
• Apache Spark: Stream & iterative processing, ML
Secure, scalable & easy to manage
Flexible deployment: data center, cloud
16.
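Hadoop and machine learning
Improve data science efficiency and shorten the time to extract value from data
• More data, not just better algorithms
• More kinds of data, not just structured data
• More compute engines, not just schema-based SQL engines
• Easy horizontal scaling vs vertical scaling
• One platform with multiple compute frameworks, supporting batch processing, stream processing, data serving and more, vs multiple systems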
17.
https://medium.com/@KevinSchmidtBiz/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc
18.
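Cloudera Data Science Workbench
An enterprise self-service data science platform
• Data science analysis on Hadoop
  • Data is stored centrally in HDFS
  • Uses Spark, Impala and other Hadoop compute engines
  • Solves the analytics "silo" problem
• Self-service collaboration platform
  • Run Python, R and Scala in the browser
  • Customize project software and environment variables
  • Collaborate during analysis and share results
• Meets enterprise user needs
  • Self-service data exploration and analysis for business departments
  • Data analysis with data security guaranteed (Kerberos)
  • Flexible deployment: data center, cloud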
19.
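Cloudera enterprise data hub (architecture diagram)
• CDH: 100% open source; commercial edition
• Deployment: public cloud, private cloud, data center, any x86 server; PaaS
• Cloud application migration; Navigator Optimizer: migrating traditional databases to Hadoop
• Cloudera Data Science Workbench (CDSW): R, Python, Scala; Data Science at Scale
• Workloads: data processing, discovery and analysis, online serving
• Unified data services: batch processing (MR, Hive, Pig), stream processing (Spark Streaming), SQL (Impala), full-text search (Solr), modeling (Spark MLlib), online (HBase)
• Resource management: YARN, ZooKeeper
• Security management: Sentry + RecordService
• Storage: distributed file system (HDFS), relational data (Kudu), NoSQL (HBase)
• Data ingestion: Sqoop, Flume, Kafka
• Data governance: Cloudera Navigator (security, audit, lineage, encryption)
• Operations management: Cloudera Manager (management, monitoring, diagnostics, integration)
• Cloudera Director: big data in the cloud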
20.
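The end-to-end data science workflow
Stages: Data engineering | Data science (Exploratory) | Production (Operational)
Steps shown: data cleansing, feature selection, data visualization and analysis, model training and testing, production model preparation, offline applications, online applications, model serving, data transformation, data preprocessing, data acquisition, model quality, model experimentation
Development tools: IDEs/Notebooks, collaboration
Operations tools: version control, scheduled jobs, workflows, model publishing
Data Governance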
21.
Cloudera Data Science Workbench
An enterprise self-service data science platform
Development | Tool integration | Operations | Job management
22.
Cloudera Data Science Workbench
An enterprise self-service data science platform
Development | Tool integration | Operations | Job management
23.
Features: data preprocessing
Supports many types of data sources and simplifies the large volume of heavy, repetitive data processing and cleansing work that precedes data modeling and analysis.
24.
Features: model development
Use the most powerful tools, including R, Python, SQL and Spark, to build data science and advanced analytics solutions, accelerating data science from exploration to deployment.
25.
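Features: data visualization
Automatically deploy model programs and publish data visualization charts, so that data scientists and business teams work closely together to build analytics pipelines and models that give the enterprise deeper insight.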
26.
Features: job scheduling and management
Build and manage ETL and model analysis workflows in R, Python, SQL, Spark and more. Build an analytics infrastructure that enables unrestricted analysis.
27.
CDSW deployment architecture
Diagram: Users connect over HTTP to the CDSW Application running on CDSW Nodes; CDH Cluster 1 and CDH Cluster 2 each have a Cloudera Manager and CDH Nodes; the clusters provide Config and Spark, Impala, Hive, HDFS, etc.
• Runs as an "edge node cluster"
  • On Docker + Kubernetes
  • CDH 5.11, Spark 2.0+
• Or in cloud environments such as AWS
  • Using virtual machine images (VMs/AMIs)
  • Scripted installation
• Security policy support: LDAP/SAML/Kerberos
28.
CDSW software architecture
Diagram: CDH Nodes managed by Cloudera Manager provide Spark, Impala, Hive, HDFS, …; one CDSW Master Node and the CDSW Worker Nodes each act as a CDH Gateway.
Layers on each CDSW node:
• Application Pods and Engine Pods: CDSW application components and user workloads
• Kubernetes: container scheduling service
• Docker: container runtime environment
• Cloudera Manager Agent: local management of CDH services
29.
CDSW + Spark Architecture
30.
Gateway node requirements
• Operating system: RHEL/CentOS 7.2
• Hardware configuration
  • 1 CDSW master node, 0 or more CDSW worker nodes
  • CPU: 16+ CPU (vCPU) cores
  • Memory: 32+ GB
  • Disk:
    • Root Volume: 100+ GB
    • Docker Image Block Device(s): 500+ GB
    • Application Block Device(s) (Master Node Only): 500+ GB
• Network:
  • Wildcard domain name, for example: *.cdsw.company.com
  • Firewall disabled
• Recommendation: 8 CPU cores and 16 GB of RAM per user
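The per-user recommendation above turns into a quick back-of-the-envelope capacity estimate. The sketch below is only arithmetic on the slide's figures (8 cores and 16 GB per user, nodes with at least 16 cores and 32 GB); the function name `size_cluster` is hypothetical, and the estimate assumes every node hosts user workloads and ignores operating system and Kubernetes overhead.

```python
import math

# Figures taken from the slide.
CORES_PER_USER = 8
RAM_GB_PER_USER = 16
MIN_NODE_CORES = 16
MIN_NODE_RAM_GB = 32


def size_cluster(concurrent_users,
                 node_cores=MIN_NODE_CORES,
                 node_ram_gb=MIN_NODE_RAM_GB):
    """Estimate total cores, RAM and node count for a number of users.

    Assumption: all nodes, the master included, run user sessions and
    nothing is reserved for system services.
    """
    total_cores = concurrent_users * CORES_PER_USER
    total_ram_gb = concurrent_users * RAM_GB_PER_USER
    nodes = max(
        1,  # a deployment always has one master node
        math.ceil(total_cores / node_cores),
        math.ceil(total_ram_gb / node_ram_gb),
    )
    return {"cores": total_cores, "ram_gb": total_ram_gb, "nodes": nodes}


plan = size_cluster(5)
```

With minimum-spec nodes, each node covers two users under this guideline, so five concurrent users need three nodes.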
31.
Customer churn early warning: data
32.
Modeling workflow
33.
Data acquisition
34.
Feature extraction & feature transformation
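The slide itself carries only the step title, so the following is a hedged sketch of what feature extraction and transformation could look like for a churn record, not the presenter's implementation. The field names, the choice of six features and the helper `to_features` are assumptions: yes/no flags become 0/1, numeric strings become floats, and the phone number is dropped because it identifies a customer rather than describing behavior.

```python
def to_features(record):
    """Map one raw churn record (a dict of strings) to (features, label)."""
    features = [
        float(record["account_length"]),
        1.0 if record["intl_plan"] == "yes" else 0.0,       # flag to 0/1
        1.0 if record["voicemail_plan"] == "yes" else 0.0,  # flag to 0/1
        float(record["voicemail_messages"]),
        float(record["day_minutes"]),
        float(record["service_calls"]),
    ]
    # The raw label may carry a trailing period, as in "False."
    label = 1 if record["churned"].rstrip(".") == "True" else 0
    return features, label


# Hypothetical raw record using values from one of the sample rows.
raw = {
    "account_length": "75",
    "phone_number": "330-6626",  # deliberately not used as a feature
    "intl_plan": "yes",
    "voicemail_plan": "no",
    "voicemail_messages": "0",
    "day_minutes": "166.7",
    "service_calls": "3",
    "churned": "True",
}
features, label = to_features(raw)
```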
35.
Training dataset & test dataset
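Splitting the data into a training set and a test set can be sketched as a seeded shuffle followed by a cut. The slide does not state the split ratio or tooling, so the 20% default, the seed and the helper name `train_test_split` below are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical row ids, 30% held out for testing.
train, test = train_test_split(range(10), test_fraction=0.3)
```

Holding the test rows out of training is what makes the later evaluation step an honest estimate of how the model behaves on unseen customers.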
36.
Model evaluation
37.
Model evaluation: ROC
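For readers unfamiliar with ROC evaluation, the sketch below computes the curve and the area under it from predicted churn scores. It is a plain-Python illustration under assumptions: the labels and scores are invented, and the helper names `roc_points` and `auc` are hypothetical rather than taken from the deck.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold from the highest score to the lowest.

    Requires at least one positive (1) and one negative (0) label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented example: 1 = churned, 0 = stayed.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

An area of 1.0 means every churned customer is scored above every retained one, while 0.5 corresponds to random ranking.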
38.
Model evaluation
39.
Thank you