1. © Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi + TensorFlow + Hadoop:
How to Make a Big Data AI Sandwich
Zhen Zeng
Solution Engineer
5th July, 2018
2.
Agenda
• Self-introduction
• How to make a Big Data AI sandwich
• NiFi
• TensorFlow
• Combining NiFi, TensorFlow, and Hadoop
3.
About Me
• 曾 臻(Zhen Zeng)
• Solution Engineer, Hortonworks Japan
• Java Engineer, BigData Engineer
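4.
Hortonworks Company Profile
Company overview: headquartered in Santa Clara, California, USA
A global leader among open source software companies, providing the de facto standard for next-generation data platforms
2017 revenue: $261.8M (+42% year over year)
Q4 2017 vs. Q4 2016 support subscription revenue: +63% YoY
Revenue is growing steadily as data lakes penetrate the market and adoption as the standard technology underpinning big data and IoT accelerates
Founded: 2011, by 24 engineers from the original Apache Hadoop team at Yahoo!
Executives: CEO Rob Bearden, COO Scott Davidson
100% committed to open source software
The world's largest contributor to the Apache Hadoop project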
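2011: Founded; partnered with Microsoft (Azure HDInsight)
2014, September: Japanese subsidiary Hortonworks Japan K.K. established
2014, December: Listed on NASDAQ (NASDAQ: HDP)
2015: Reached $100M in revenue in record time since founding
2015: Acquired Onyara, the company behind Apache NiFi, and launched Hortonworks DataFlow (HDF)
2016: Billings exceeded $270M
2016: Launched Hortonworks Data Cloud (HDC) for AWS
2016: Partnered with Dell EMC; the Pivotal Hadoop distribution moved to Hortonworks Data Platform (HDP)
2017, June: Partnered with IBM; the BigInsights Hadoop distribution moved to HDP
2017, September: Launched HCP (cybersecurity) and DataPlane Service (DPS)
2017, September: Global agreement signed with NEC
2018, January: HDF 3.1 released
2018, June: HDP 3.0 released
2018, June: Expanded integration with Google Cloud
2018, June: Strengthened partnership with Microsoft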
Revenue since founding
2011: Founded
2013: $24.085M
2014: $46.048M (+91.1%), IPO
2015: $121.944M (+164.8%)
2016: $184.461M (+51.3%)
2017: $261.810M (+41.9%)
5.
How to Make a Big Data AI Sandwich
6.
What Goes Inside the AI Sandwich
How do we integrate these machine learning and deep learning workflows?
Computer Vision
• Object Recognition
• Image Classification
• Object Detection
• Motion Estimation
• Annotation
• Visual Question and Answer
• Autonomous Driving
Speech Recognition
• Speech to Text
• Speech Recognition
• Chat Bot
• Voice UI
Natural Language Processing
• Sentiment Analysis
• Text Classification
• Named Entity Recognition
Recommender Systems
• Content-based Recommendations
https://github.com/zackchase/mxnet-the-straight-dope
7.
Big Data AI Sandwich: Recipe
• Ingredients
• Apache NiFi
• MiNiFi Agent
• TensorFlow
• Apache Hadoop
8.
Big Data AI Sandwich: Architecture Diagram (Basic Edition)
Ingestion
Simple Event Processing
Destination
Build Predictive Model from Historical Data
Deploy Predictive Model for Real-time Insights
Perishable Insights
Historical Insights
9.
Big Data AI Sandwich: Architecture Diagram (Professional Edition)
Ingestion
Simple Event Processing
Engine
Stream Processing
Destination
Data Bus
Build Predictive Model from Historical Data
Deploy Predictive Model for Real-time Insights
Perishable Insights
Historical Insights
10.
Deep Learning Components
Streaming Analytics Manager
Machine Learning
Distributed queue
Buffering
Process decoupling
Streaming and SQL
Orchestration
Queueing
Simple Event Processing
REST API
Secure Spark Execution
11.
Streaming Analytics Manager
Detect metadata and data
Extract metadata and data
Content Analysis
Deep Learning Framework
Entity Resolution
Natural Language Processing
Deep Learning Components
Work with MiNiFi Agent
Deep Learning Framework
12.
What do we want to do?
• MiNiFi ingests camera images and sensor data
• MiNiFi executes algorithms at the edge
• Run a trained Inception classification model to recognize objects in images
• Apache NiFi stores images, metadata and enriched data in Hadoop
• Apache NiFi ingests social data and REST feeds
• Apache OpenNLP and Apache Tika for textual data
13.
Recommendations
• Model Training
• Install CPU Version on CPU YARN Nodes
• Install GPU Version on Nvidia (CUDA)
• Do training on GPU YARN Nodes where possible
• Model Applying
• Apply Model on All Nodes and Trigger with Apache NiFi
• What helps Hadoop and Spark will help TensorFlow.
• More RAM, More and Faster Cores, More Nodes.
• Try YARN 3.1 Containerized TensorFlow.
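The advice to install the GPU build only where a GPU exists, and to train on GPU nodes where possible, can be sketched as a small planning step. This is only an illustration and not how YARN schedules work: the `Node` class, the node names, and both helper functions are hypothetical. The package names `tensorflow-gpu` and `tensorflow` are the pip names used for the GPU and CPU builds around the time of this talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one YARN node."""
    name: str
    has_gpu: bool


def plan_install(nodes: List[Node]) -> Dict[str, str]:
    """Pick the TensorFlow pip package for each node: GPU build only where a GPU exists."""
    return {n.name: ("tensorflow-gpu" if n.has_gpu else "tensorflow") for n in nodes}


def training_nodes(nodes: List[Node]) -> List[str]:
    """Prefer GPU nodes for training; fall back to every node if none has a GPU."""
    gpu_nodes = [n.name for n in nodes if n.has_gpu]
    return gpu_nodes if gpu_nodes else [n.name for n in nodes]


if __name__ == "__main__":
    cluster = [Node("yarn-gpu-1", True), Node("yarn-cpu-1", False)]
    print(plan_install(cluster))
    print(training_nodes(cluster))
```

Applying the model is a separate concern: per the slide, inference runs on all nodes, so only training is restricted to GPU nodes here.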
14.
Collect: Bring Together
Aggregate all data from sensors, drones, logs, geo-location devices, machines and social feeds.
Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, delivering data reliably to Apache HBase, Apache Hive, HDFS, Slack and Email.
Curate: Gain Insights
Parse, filter, join, transform, fork, query, sort, dissect; enrich with weather, location, sentiment analysis, image analysis, object detection, image recognition, voice recognition with Apache Tika, Apache OpenNLP, TensorFlow and Apache MXNet.
16.
Why Apache NiFi?
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Over fifty sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version Control
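Two of the ideas in this list, backpressure and prioritized queuing, can be illustrated with a toy bounded queue. This is a conceptual sketch built on Python's standard `queue` module, not NiFi's implementation or API. The class name `BoundedPriorityConnection` and its `offer`/`poll` methods are made up for the example.

```python
import queue
from typing import Any, Optional


class BoundedPriorityConnection:
    """Toy stand-in for a connection between two processors (hypothetical)."""

    def __init__(self, max_items: int) -> None:
        # A bounded priority queue: lower priority number is served first.
        self._q: queue.PriorityQueue = queue.PriorityQueue(maxsize=max_items)
        self._seq = 0  # tie-breaker so payloads are never compared directly

    def offer(self, priority: int, item: Any) -> bool:
        """Try to enqueue; return False when full so the producer must slow down."""
        try:
            self._q.put_nowait((priority, self._seq, item))
        except queue.Full:
            return False  # backpressure signal to the upstream side
        self._seq += 1
        return True

    def poll(self) -> Optional[Any]:
        """Return the highest-priority item, or None if the queue is empty."""
        try:
            return self._q.get_nowait()[2]
        except queue.Empty:
            return None
```

When `offer` returns `False`, the upstream side is expected to wait or buffer elsewhere instead of dropping data, which is the essence of backpressure.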
17.
Nothing Starts Without Data
HDP cluster
Data analytics
Neither big data nor AI can start without data
So how should we collect the data?
Web App, Logs, RDBMS, NoSQL
TCP, HTTP, WebSocket,
JMS, Syslog, Email, Image
JSON, CSV, XML, Avro, Parquet
… etc.: a wide variety of inputs
18.
Data Ingestion with Apache NiFi
MiNiFi
Web App, Logs, RDBMS, NoSQL
TCP, HTTP, WebSocket,
JMS, Syslog, Email, Image
JSON, CSV, XML, Avro, Parquet
… etc.: a wide variety of inputs
Secure data transfer between edge, on-premises, and cloud
Hadoop cluster
NiFi cluster
Data analytics
19.
More Than 220 Processors for Ecosystem Integration
Hash
Extract
Merge
Duplicate
Scan
GeoEnrich
Replace
Convert
Split
Translate
Route Content
Route Context
Route Text
Control Rate
Distribute Load
Generate Table Fetch
Jolt Transform JSON
Prioritized Delivery
Encrypt
Tail
Evaluate
Execute
All Apache project logos are trademarks of the ASF and the respective projects.
Fetch
HTTP
Syslog
Email
HTML
Image
HL7
FTP
UDP
XML
SFTP
AMQP
WebSocket
20.
A few possible scenarios with NiFi
• Ingestion: connectors to read/write data from/to several data sources
• Protocols: FTP, HTTP, Syslog, email, WS, etc.
• Databases: JDBC, MongoDB, HBase, Cassandra, etc.
• Brokers: Kafka, JMS, AMQP, MQTT, etc.
• Transformation:
• Format conversion (JSON to Avro, CSV to ORC, etc.)
• Compression/decompression, merge, split, encryption, etc.
• Data enrichment
• Attribute, content, rules, etc.
• Routing
• Priority, dynamic/static, based on content or metadata, etc.
• Parsing (XML, JSON, Regex, Grok, etc.)
• Etc.
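To make two of these scenarios concrete, the sketch below does a format conversion (CSV to JSON) followed by content-based routing, using only the Python standard library. In NiFi these steps would be configured as processors in the UI, so this is not NiFi code. The `temperature` field, the 30.0 threshold, and the route names `alert` and `normal` are hypothetical.

```python
import csv
import io
import json
from typing import List


def csv_to_json_records(csv_text: str) -> List[str]:
    """Format conversion: turn each CSV row into one JSON document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row, sort_keys=True) for row in reader]


def route(record_json: str, threshold: float = 30.0) -> str:
    """Content-based routing: choose a destination from a field in the record."""
    record = json.loads(record_json)
    return "alert" if float(record["temperature"]) > threshold else "normal"


if __name__ == "__main__":
    sample = "sensor,temperature\na,21.5\nb,35.0\n"
    for rec in csv_to_json_records(sample):
        print(route(rec), rec)
```

Keeping conversion and routing as separate functions mirrors how a flow chains small single-purpose steps rather than one large transformation.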
21.
Build Data Flows with Drag-and-Drop
22.
HDP + HDF Component Landscape
SAM
Storm
MiNiFi
Web App, Logs, RDBMS, NoSQL
TCP, HTTP, WebSocket,
JMS, Syslog, Email, Image
JSON, CSV, XML, Avro, Parquet
… etc.: multiple data sources/formats
Securely transfer data between edge, on-premises and cloud
HDP Cluster
HDF (NiFi/Kafka/Storm) Cluster
Streaming Application Development
Cluster operation and management
Data Analytics
Model
Authorization policy management
23.
What HDF Can Do
[Diagram labels: Sensor Sources (Truck Sensors); Flow Management Clusters (Ingress Gateway, NiFi Site-to-Site Protocol, Egress Gateway); Event Broker Cluster; Stream Analytics Cluster (Ingest Streams, Generate Insights); Real-Time Apps (Real-time Apps & Exploration Platform)]
25.
What is TensorFlow?
• Google
• Multi-platform support
• Hadoop integration
• Spark integration
• Keras
• Large Community
• Python and Java APIs
• GPU Support
• Mobile Support
• Inception v3
• Clustering
• Fully functional demos
• Open Source
• Apache Licensed
• Large Model Library
• Buzz
• Extensive Documentation
• Raspberry Pi Support
26.
TensorFlow with Hadoop 3.1
27.
TensorFlow Serving on YARN 3.1
We use NVIDIA Docker containers on top of YARN
https://github.com/NVIDIA/nvidia-docker
28.
Run TensorFlow on YARN 3.1
https://community.hortonworks.com/articles/83872/data-lake-30-containerization-erasure-coding-gpu-p.html
29.
Run TensorFlow on YARN 3.1
https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
30.
TensorFlow via Python or C++ Binary
python classify_image.py --image_file /opt/demo/dronedata/Bebop2_20160920083655-0400.jpg
solar dish, solar collector, solar furnace (score = 0.98316)
window screen (score = 0.00196)
manhole cover (score = 0.00070)
radiator (score = 0.00041)
doormat, welcome mat (score = 0.00041)
bazel-bin/tensorflow/examples/label_image/label_image --image=/opt/demo/dronedata/Bebop2_20160920083655-0400.jpg
I tensorflow/examples/label_image/main.cc:204] solar dish (577): 0.983162
I tensorflow/examples/label_image/main.cc:204] window screen (912): 0.00196204
I tensorflow/examples/label_image/main.cc:204] manhole cover (763): 0.000704005
I tensorflow/examples/label_image/main.cc:204] radiator (571): 0.000408321
I tensorflow/examples/label_image/main.cc:204] doormat (972): 0.000406186
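If the Python script is called from a flow, its text output has to be turned into structured fields before it can be stored or routed. A minimal sketch of that parsing step follows, assuming the output format matches the `label (score = x)` lines shown on this slide. The function name `parse_classification` is hypothetical and is not part of `classify_image.py`.

```python
import re
from typing import List, Tuple

# Matches lines such as: "window screen (score = 0.00196)"
_LINE = re.compile(r"^(?P<labels>.+?)\s*\(score = (?P<score>[0-9.]+)\)\s*$")


def parse_classification(output: str) -> List[Tuple[str, float]]:
    """Extract (label text, score) pairs; lines in any other format are skipped."""
    results = []
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            results.append((match.group("labels"), float(match.group("score"))))
    return results


if __name__ == "__main__":
    sample = (
        "solar dish, solar collector, solar furnace (score = 0.98316)\n"
        "window screen (score = 0.00196)\n"
    )
    for labels, score in parse_classification(sample):
        print(f"{score:.5f}  {labels}")
```

The label text is kept as one string because a single ImageNet class can list several synonyms separated by commas.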
31.
TensorFlow Java Processor in NiFi
https://community.hortonworks.com/content/kbentry/116803/building-a-custom-processor-in-apache-nifi-12-for.html
https://github.com/tspannhw/nifi-tensorflow-processor
https://community.hortonworks.com/articles/178498/integrating-tensorflow-16-image-labelling-with-hdf.html
32.
TensorFlow Java Processor in NiFi
Installation on a single node of Apache NiFi 1.5+
Download the NAR here: https://github.com/tspannhw/nifi-tensorflow-processor/releases/tag/1.6
Install the NAR file to /usr/hdf/current/nifi/lib/
Create a model directory (/opt/demo/models) and download the model files into it:
wget https://raw.githubusercontent.com/tspannhw/nifi-tensorflow-processor/master/nifi-tensorflow-processors/src/test/resources/models/imagenet_comp_graph_label_strings.txt
wget -O tensorflow_inception_graph.pb "https://github.com/tspannhw/nifi-tensorflow-processor/blob/master/nifi-tensorflow-processors/src/test/resources/models/tensorflow_inception_graph.pb?raw=true"
Restart Apache NiFi via Ambari
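Before restarting NiFi, it can help to confirm that both model files actually landed in the model directory. The following is a small optional check, not part of the processor's documented install steps. The function name `missing_model_files` is hypothetical; the two file names are the ones downloaded above.

```python
from pathlib import Path
from typing import List

# File names taken from the two download URLs in the installation steps.
REQUIRED_FILES = (
    "imagenet_comp_graph_label_strings.txt",
    "tensorflow_inception_graph.pb",
)


def missing_model_files(model_dir: str) -> List[str]:
    """Return the required files that are absent or empty in model_dir."""
    base = Path(model_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # /opt/demo/models is the model directory named on the slide.
    print(missing_model_files("/opt/demo/models"))
```

An empty list means both files are present and non-empty; it does not verify that their contents are valid.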
33.
TensorFlow Java Processor in NiFi
34.
TensorFlow Running on Edge Nodes (MiNiFi)
CREATE EXTERNAL TABLE IF NOT EXISTS tfimage (
  image STRING,
  ts STRING,
  host STRING,
  score STRING,
  human_string STRING,
  node_id FLOAT
)
STORED AS ORC
LOCATION '/tfimage';
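The table's columns suggest the shape of the record an edge node would send for each classified image. The sketch below assembles one such record as JSON with the same column names and types (five strings and a float). How the actual flow populates the table is not shown on the slide, so the function `build_tfimage_row` and the sample values in the usage example are hypothetical.

```python
import json

# Column order taken from the CREATE EXTERNAL TABLE statement above.
COLUMNS = ("image", "ts", "host", "score", "human_string", "node_id")


def build_tfimage_row(image, ts, host, score, human_string, node_id) -> str:
    """Build one JSON record whose fields mirror the tfimage table columns."""
    row = {
        "image": str(image),
        "ts": str(ts),
        "host": str(host),
        "score": str(score),          # declared STRING in the table
        "human_string": str(human_string),
        "node_id": float(node_id),    # declared FLOAT in the table
    }
    return json.dumps(row)


if __name__ == "__main__":
    print(build_tfimage_row(
        "Bebop2_20160920083655-0400.jpg",  # image name from the earlier slide
        "2018-07-05 10:00:00",             # hypothetical timestamp
        "edge-node-1",                     # hypothetical host name
        0.98316, "solar dish", 577,
    ))
```

A record in this shape still has to be converted to ORC and written under `/tfimage` by the flow before Hive can query it.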
35.
Use Case: Autonomous Driving Car
Edge: Nvidia Drive PX 2
Core: capture billions of images in the data lake
Pool GPUs and CPUs (think of a giant supercomputer) for 100x faster processing
Train deep learning models using GPUs and images in the data lake
Deploy data-intensive containerized deep learning microservices in minutes
37.
Watson: A Wide Range of APIs Is Available
• https://console.bluemix.net/docs/
38.
Image Recognition API
• https://console.bluemix.net/docs/services/visual-recognition/getting-started.html#getting-started-tutorial
41.
Big Data AI Sandwich: Summary
• NiFi
• Data collection, data flow
• TensorFlow
• Deep Learning
• Hadoop/Spark
• Data storage and processing
42.
Big Data AI Sandwich: Architecture Diagram (Professional Edition)
Ingestion
Simple Event Processing
Engine
Stream Processing
Destination
Data Bus
Build Predictive Model from Historical Data
Deploy Predictive Model for Real-time Insights
Perishable Insights
Historical Insights
Editor's Notes
Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications.
Kafka: reads events in memory and writes them to a distributed log
https://www.tensorflow.org/tutorials/image_recognition
https://github.com/tensorflow/models
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion
https://community.hortonworks.com/articles/54954/setting-up-gpu-enabled-tensorflow-to-work-with-zep.html
We install the GPU-enabled TensorFlow on the nodes that have GPUs and the CPU version on the others. We label the nodes that have GPUs and send training jobs to them.