Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi, Kafka, and Spark - Chris Fregly

Chris Fregly

Generating Real-time,
Streaming Recommendations
[NiFi + Kafka + Spark ML]
Kafka Summit SF
April 26, 2016

Who am I?
Chris Fregly, Principal Data Solutions Engineer
@ IBM Spark Technology Center
Previously, Data Engineer @ Netflix and Databricks
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow
Author @ Advanced Spark (advancedspark.com)

Relevant Spark Contribution
SPARK-1981: Add Kinesis support for Spark Streaming
Me

Fun Workshop!!
San Jose: May 14th (full details @ advancedspark.com)

Agenda
Live, Interactive Demo!
NiFi
Spark Streaming
Streaming Recommendations
Netflix Pipeline (Bonus!)

Live, Interactive Demo
http://demo2.advancedspark.com

Agenda
Live, Interactive Demo!
NiFi
Spark Streaming
Streaming Recommendations
Netflix Recommendations (Bonus!)

NiFi
NiFi = “Niagara Files”
Maintainers @ Hortonworks since 2015
Developed @ NSA over last 8+ years
Integrates with EVERYTHING!
Provides Data Provenance
Data Flow Management
Me, Normal Guy
Joe Witt, NiFi Co-Creator
Buffalo Wild Wings Hat

NiFi Provenance Event Types
ATTRIBUTES_MODIFIED (e.g. Extract Topic Name)
CONTENT_MODIFIED (e.g. Enrich with Geo)
RECEIVE (e.g. Handle Http Request)
ROUTE (e.g. Check Http Method)
SEND (e.g. PutKafka)
DROP (e.g. Handle Http Response)
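
As a rough illustration of how these event types line up with the demo flow, here is a minimal Python sketch. It is a toy model, not the NiFi API: the DEMO_FLOW list and the lineage helper are hypothetical names, and the step order is an assumption taken from the order of the NiFi slides in this deck (request routing, geo-enrichment, topic extraction, Kafka put, HTTP response).

```python
from enum import Enum


class ProvenanceEventType(Enum):
    """The six event types named on the slide (toy model, not NiFi's own class)."""
    RECEIVE = "RECEIVE"
    ROUTE = "ROUTE"
    CONTENT_MODIFIED = "CONTENT_MODIFIED"
    ATTRIBUTES_MODIFIED = "ATTRIBUTES_MODIFIED"
    SEND = "SEND"
    DROP = "DROP"


# (flow step from the slide, event type it emits), in the assumed demo order.
DEMO_FLOW = [
    ("Handle Http Request", ProvenanceEventType.RECEIVE),
    ("Check Http Method", ProvenanceEventType.ROUTE),
    ("Enrich with Geo", ProvenanceEventType.CONTENT_MODIFIED),
    ("Extract Topic Name", ProvenanceEventType.ATTRIBUTES_MODIFIED),
    ("PutKafka", ProvenanceEventType.SEND),
    ("Handle Http Response", ProvenanceEventType.DROP),
]


def lineage(flow):
    """Render the chain of event types one record would accumulate."""
    return " -> ".join(event.name for _, event in flow)


trace = lineage(DEMO_FLOW)
```

Reading the trace left to right gives the same picture a provenance lineage view would give for a single request.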

Spark Streaming
Submits Time-Based Micro Batches of Data as Spark Jobs
Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA!
Framework for Custom Streaming Receivers
Flexible Window Operations, Optimized State Management
Basic Back Pressure and Throttling Support
At Least Once Guarantees through Write Ahead Log (WAL)
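
The micro-batch idea can be sketched in plain Python without Spark. This is only a toy model under stated assumptions: micro_batches and run_job are hypothetical helpers, events are (timestamp, value) pairs, and the "job" is a simple count per batch standing in for a real Spark job.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed, time-based batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Floor division assigns each event to the interval that contains it.
        batch_start = int(timestamp // batch_interval) * batch_interval
        batches[batch_start].append(value)
    # Return batches in time order, the order in which they would be submitted.
    return sorted(batches.items())


def run_job(batch):
    """Stand-in for a Spark job: count occurrences of each value in one batch."""
    counts = defaultdict(int)
    for value in batch:
        counts[value] += 1
    return dict(counts)


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (1.7, "a"), (2.4, "b")]
results = [(start, run_job(batch)) for start, batch in micro_batches(events, 1)]
```

Each entry in results corresponds to one batch interval, so latency is bounded below by the interval length.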

Spark Streaming KafkaRDD
Kafka “Direct” Streaming Implementation (Spark 1.4+)
Recover/Replay from Kafka using File System-like Offsets
Removes need for Write Ahead Log (WAL)
Uses Kafka, itself, as the WAL!
KafkaRDD
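
Why offsets can replace the WAL is easier to see in a small sketch. The code below is a toy model, not the Spark or Kafka API: PartitionLog is a hypothetical in-memory stand-in for one Kafka partition, and process is a placeholder for the batch computation. The point is that a batch is defined by an offset range, so a failed batch can be re-read from the log itself.

```python
class PartitionLog:
    """Toy stand-in for one Kafka partition: an append-only list addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, from_offset, until_offset):
        # Half-open range [from_offset, until_offset), much like a byte range in a file.
        return self._records[from_offset:until_offset]

    def latest_offset(self):
        return len(self._records)


def process(records):
    """Placeholder for the real batch computation."""
    return sum(records)


log = PartitionLog()
for value in [3, 1, 4, 1, 5, 9]:
    log.append(value)

committed = 0  # last offset known to be fully processed
batch_range = (committed, log.latest_offset())  # fixed before processing starts

first_attempt = process(log.read(*batch_range))
# Suppose the job died before committing: re-reading the same range replays the batch.
replayed = process(log.read(*batch_range))
committed = batch_range[1]  # advance only after the batch succeeds
```

Because the range is deterministic, the replay sees exactly the records of the first attempt and no separate copy of the data is needed.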

Streaming Recommendations
Incremental Matrix Factorization!!
(Based on github.com/brkyvz/streaming-matrix-factorization)
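
To make "incremental" concrete, here is a minimal sketch of matrix factorization updated one rating at a time with stochastic gradient descent. It is not the code from the repository named above; the IncrementalMF class, its hyperparameters (rank, learning rate, regularization), and the sample ratings are all assumptions chosen for illustration.

```python
import random


class IncrementalMF:
    """Toy matrix factorization that learns from one (user, item, rating) at a time."""

    def __init__(self, rank=4, learning_rate=0.05, reg=0.01, seed=42):
        self.rank = rank
        self.learning_rate = learning_rate
        self.reg = reg
        self.rng = random.Random(seed)
        self.user_factors = {}
        self.item_factors = {}

    def _factors(self, table, key):
        # Unseen users and items get small random factors on first contact.
        if key not in table:
            table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.rank)]
        return table[key]

    def predict(self, user, item):
        u = self._factors(self.user_factors, user)
        v = self._factors(self.item_factors, item)
        return sum(a * b for a, b in zip(u, v))

    def update(self, user, item, rating):
        """Apply one SGD step for a single rating and return the pre-update error."""
        u = self._factors(self.user_factors, user)
        v = self._factors(self.item_factors, item)
        error = rating - sum(a * b for a, b in zip(u, v))
        for k in range(self.rank):
            u_k, v_k = u[k], v[k]
            u[k] += self.learning_rate * (error * v_k - self.reg * u_k)
            v[k] += self.learning_rate * (error * u_k - self.reg * v_k)
        return error


# A tiny, made-up rating stream, replayed to mimic events that keep arriving.
stream = [("u1", "m1", 5.0), ("u1", "m2", 1.0), ("u2", "m1", 4.0)]
model = IncrementalMF()
first_pass = [abs(model.update(*event)) for event in stream]
for _ in range(300):
    last_pass = [abs(model.update(*event)) for event in stream]
```

The model never needs the full rating matrix in memory; each event touches only one user vector and one item vector, which is what makes the approach a fit for a stream.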

Recommendation Serving Layer
Use Case: Recommendation Service Depends on Redis Cache
Problem: Redis Cache Goes Down!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States:
Closed: Service OK
Open: Service DOWN
Fallback to Non-Personalized Recommendations from Disk
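
The circuit breaker pattern above can be sketched in a few lines of Python. This is not the Hystrix API (Hystrix is a Java library); CircuitBreaker, personalized_from_redis, and popular_from_disk are hypothetical names, and the failure threshold and reset timeout are assumed values. The trial call after the timeout is also an addition beyond the two states listed on the slide.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: closed while the service is OK, open after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        if self.opened_at is None:
            return False
        # After the timeout, let one trial call through to probe the service.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, primary, fallback):
        if self.is_open:
            return fallback()  # short-circuit: do not touch the failing service
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


calls = {"redis": 0}


def personalized_from_redis():
    calls["redis"] += 1
    raise ConnectionError("redis is down")


def popular_from_disk():
    return ["top-1", "top-2"]


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
responses = [breaker.call(personalized_from_redis, popular_from_disk) for _ in range(5)]
```

Once the circuit opens, callers get the non-personalized list immediately instead of waiting on a cache that is known to be down.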

Netflix Data Pipeline
9 million events, 22 GB per second @ peak!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority

Recommendations Pipeline
Batch Matrix Factorization
Keep Batch Video (V) Matrix
Calculate Newer User (U) Matrix
Compute U x V Dot Product
Save Model to Disk and EVCache
https://github.com/Netflix/EVCache
Throw away batch user factors (U)
Keep video factors (V)
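
The "keep V, recalculate U" step can be sketched as follows. This is an illustrative toy, not Netflix's implementation: fit_user_factors is a hypothetical helper that refits one user's vector by gradient descent against fixed video factors, and the video factors, ratings, and hyperparameters are made-up values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_user_factors(ratings, video_factors, rank, steps=500, lr=0.05, reg=0.01):
    """Recompute one user's factors from recent ratings while V stays fixed."""
    u = [0.0] * rank
    for _ in range(steps):
        for video_id, rating in ratings.items():
            v = video_factors[video_id]
            error = rating - dot(u, v)
            # Only the user vector moves; the video factors are never modified.
            u = [u_k + lr * (error * v_k - reg * u_k) for u_k, v_k in zip(u, v)]
    return u


# Video factors (V) kept from the batch job; values are made up for the example.
video_factors = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [1.0, 1.0]}
recent_ratings = {"v1": 4.0, "v2": 1.0}

user = fit_user_factors(recent_ratings, video_factors, rank=2)
# Scores are the U x V dot products for this user.
scores = {video_id: dot(user, v) for video_id, v in video_factors.items()}
best_unseen = max((vid for vid in scores if vid not in recent_ratings), key=scores.get)
```

Because V is fixed, each user's vector can be refreshed independently and cheaply, and the resulting scores are what would be written to disk and the cache.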

Thank You, Kafka Summit SF!
Chris Fregly
@cfregly
All Source Code, Demos, and Docker Images Available
@ advancedspark.com,
github.com/fluxcapacitor/pipeline
Join the Global Meetup for Slides, Videos, Book
@ advancedspark.com


1. Generating Real-time, Streaming Recommendations [NiFi + Kafka + Spark ML] Kafka Summit SF April 26, 2016

2. Who am I? Chris Fregly, Principal Data Solutions Engineer @ IBM Spark Technology Center Previously, Data Engineer @ Netflix and Databricks Contributor @ Apache Spark, Committer @ Netflix OSS Founder @ Advanced Spark and TensorFlow Author @ Advanced Spark (advancedspark.com)

3. Relevant Spark Contribution SPARK-1981: Add Kinesis support for Spark Streaming Me

4. Fun Meetup!

5. Fun Workshop!! San Jose: May 14th (full details @ advancedspark.com)

6. Fun Github Repo!!!

7. Agenda Live, Interactive Demo! NiFi Spark Streaming Streaming Recommendations Netflix Pipeline (Bonus!)

8. Live, Interactive Demo http://demo2.advancedspark.com

9. Agenda Live, Interactive Demo! NiFi Spark Streaming Streaming Recommendations Netflix Recommendations (Bonus!)

10. NiFi NiFi = “Niagara Files” Maintainers @ Hortonworks since 2015 Developed @ NSA over last 8+ years Integrates with EVERYTHING! Provides Data Provenance Data Flow Management Me, Normal Guy Joe Witt, NiFi Co-Creator Buffalo Wild Wings Hat

11. NiFi + Kafka

12. NiFi Routing: Http Request

13. NiFi Geo-Enrichment

14. NiFi Extract Kafka Topic

15. NiFi Kafka PUT (Finally!)

16. NiFi Post-Kafka HttpResponse

17. NiFi Data Provenance

18. NiFi Provenance Event Types ATTRIBUTES_MODIFIED (e.g. Extract Topic Name) CONTENT_MODIFIED (e.g. Enrich with Geo) RECEIVE (e.g. Handle Http Request) ROUTE (e.g. Check Http Method) SEND (e.g. PutKafka) DROP (e.g. Handle Http Response)

19. NiFi Search Data Provenance

20. NiFi Kafka Provenance Event

21. NiFi Kafka Provenance Event

22. NiFi Kafka Provenance Event

23. NiFi Provenance Lineage

24. Agenda Live, Interactive Demo! NiFi Spark Streaming Streaming Recommendations Netflix Recommendations (Bonus!)

25. Spark Streaming Submits Time-Based Micro Batches of Data as Spark Jobs Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA! Framework for Custom Streaming Receivers Flexible Window Operations, Optimized State Management Basic Back Pressure and Throttling Support At Least Once Guarantees through Write Ahead Log (WAL)

26. Original Kafka Receiver

27. Newer Kafka “Direct” Receiver

28. Spark Streaming KafkaRDD Kafka “Direct” Streaming Implementation (Spark 1.4+) Recover/Replay from Kafka using File System-like Offsets Removes need for Write Ahead Log (WAL) Uses Kafka, itself, as the WAL! KafkaRDD

29. Agenda Live, Interactive Demo! NiFi Spark Streaming Streaming Recommendations Netflix Recommendations (Bonus!)

30. Streaming Recommendations Incremental Matrix Factorization!! (Based on github.com/brkyvz/streaming-matrix-factorization)

31. Recommendation Serving Layer Use Case: Recommendation Service Depends on Redis Cache Problem: Redis Cache Goes Down!? Answer: github.com/Netflix/Hystrix Circuit Breaker! Circuit States: Closed: Service OK Open: Service DOWN Fallback to Non-Personalized Recommendations from Disk

32. Agenda Live, Interactive Demo! NiFi Spark Streaming Streaming Recommendations Netflix Recommendations (Bonus!)

33. Netflix Data Pipeline 9 million events, 22 GB per second @ peak! EC2 D2XL Disk: 6 TB, 475 MB/s RAM: 30 GB Network: 700 Mbps Auto-scaling, Fault tolerance A/B Tests, Trending Now SAMZA Splits high and normal priority

34. Recommendations Pipeline Batch Matrix Factorization Keep Batch Video (V) Matrix Calculate Newer User (U) Matrix Compute U x V Dot Product Save Model to Disk and EVCache https://github.com/Netflix/EVCache Throw away batch user factors (U) Keep video factors (V)

35. Thank You, Kafka Summit SF! Chris Fregly @cfregly All Source Code, Demos, and Docker Images Available @ advancedspark.com, github.com/fluxcapacitor/pipeline Join the Global Meetup for Slides, Videos, Book @ advancedspark.com
