2. Who am I?
Chris Fregly, Principal Data Solutions Engineer
@ IBM Spark Technology Center
Previously, Data Engineer @ Netflix and Databricks
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow
Author @ Advanced Spark (advancedspark.com)
10. NiFi
NiFi = “Niagara Files”
Maintainers @ Hortonworks since 2015
Developed @ NSA over last 8+ years
Integrates with EVERYTHING!
Provides Data Provenance
Data Flow Management
[Photo: Me, Normal Guy; Joe Witt, NiFi Co-Creator; Buffalo Wild Wings Hat]
25. Spark Streaming
Submits Time-Based Micro Batches of Data as Spark Jobs
Supports Kinesis, Flume, MQTT, ZeroMQ, Sockets, KAFKA!
Framework for Custom Streaming Receivers
Flexible Window Operations, Optimized State Management
Basic Back Pressure and Throttling Support
At Least Once Guarantees through Write Ahead Log (WAL)
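The "time-based micro batches" idea can be sketched without Spark at all. The snippet below is only an illustration in plain Python, not the Spark Streaming API: `micro_batches`, `process_batch`, and the sample events are hypothetical names and data, and the "job" run per batch is assumed to be a simple count.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed, time-based batches.

    Returns a list of (batch_start_time, [values]) pairs in time order.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Each event lands in the batch whose interval contains its timestamp.
        batch_start = (timestamp // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return sorted(batches.items())


def process_batch(values):
    # Stand-in for the job submitted per batch: here, just count the events.
    return len(values)


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
counts = [(start, process_batch(values)) for start, values in batches]
```

A longer batch interval means fewer, larger jobs and higher latency; a shorter one means the opposite. A real streaming system also has to handle late data, receivers, and recovery, all of which this sketch leaves out.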
31. Recommendation Serving Layer
Use Case: Recommendation Service Depends on Redis Cache
Problem: Redis Cache Goes Down!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States:
Closed: Service OK
Open: Service DOWN
Fallback to Non-Personalized Recommendations from Disk
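The Closed/Open behavior above can be sketched in a few lines. This is a simplified illustration in Python, not the Hystrix API (Hystrix is a Java library): `CircuitBreaker`, `personalized_from_cache`, and `non_personalized_from_disk` are hypothetical names, the thresholds are made up, and the sketch simply closes the circuit again after a timeout rather than probing the service first.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is Closed (service OK)

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_timeout:
            # Timeout elapsed: close the circuit so the next call tries again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary, fallback):
        if self.is_open():
            # Open: skip the failing service entirely and use the fallback.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


cache_calls = 0


def personalized_from_cache():
    # Hypothetical primary path: pretend the cache is down.
    global cache_calls
    cache_calls += 1
    raise ConnectionError("cache is down")


def non_personalized_from_disk():
    # Hypothetical fallback: generic recommendations that need no cache.
    return ["top-10 overall"]


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
results = [
    breaker.call(personalized_from_cache, non_personalized_from_disk)
    for _ in range(3)
]
```

Once the circuit opens, callers get the fallback immediately instead of waiting on a dead dependency, which is the point of the pattern.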
33. Netflix Data Pipeline
9 million events and 22 GB per second @ peak!
EC2 D2XL
Disk: 6 TB, 475 MB/s
RAM: 30 GB
Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority
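As a rough sanity check on the peak figures for this slide, the average event size can be worked out directly. This is back-of-the-envelope arithmetic only; it assumes both numbers are per second, that the two peaks coincide, and that "GB" means decimal gigabytes.

```python
events_per_second = 9_000_000
bytes_per_second = 22 * 10**9  # assumption: decimal gigabytes

# Average payload per event, if both peak figures describe the same moment.
avg_event_bytes = bytes_per_second / events_per_second
print(f"{avg_event_bytes:.0f} bytes per event")  # prints "2444 bytes per event"
```

So each event averages roughly 2.4 KB under those assumptions.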
34. Recommendations Pipeline
Batch Matrix Factorization
Keep Batch Video (V) Matrix
Calculate Newer User (U) Matrix
Compute U x V Dot Product
Save Model to Disk and EVCache
https://github.com/Netflix/EVCache
Throw away batch user factors (U)
Keep video factors (V)
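One way to read the steps above: hold the batch video factors (V) fixed, solve for a fresh user vector from that user's recent ratings, then score videos with a dot product. The sketch below is a minimal plain-Python illustration of that idea, not the pipeline's actual code. It assumes a regularized least-squares fit for the user vector, and `fold_in_user`, `solve`, `score`, the factor values, and the ratings are all hypothetical.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fold_in_user(video_factors, ratings, reg=0.1):
    """Least-squares user vector for fixed video factors.

    ratings maps video id -> observed rating; reg is an L2 penalty.
    """
    k = len(next(iter(video_factors.values())))
    a = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for video, rating in ratings.items():
        v = video_factors[video]
        for i in range(k):
            b[i] += rating * v[i]
            for j in range(k):
                a[i][j] += v[i] * v[j]
    return solve(a, b)


def score(user, video):
    # Predicted preference is the dot product of user and video factors.
    return sum(u * v for u, v in zip(user, video))


# Hypothetical batch video factors (V), kept from the last batch run.
video_factors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
# Hypothetical recent ratings for one user.
ratings = {"A": 5.0, "B": 1.0}

user = fold_in_user(video_factors, ratings)
unseen = {vid: score(user, v) for vid, v in video_factors.items() if vid not in ratings}
```

Because V stays fixed, each user vector is a small k-by-k solve, which is much cheaper than rerunning the full batch factorization whenever new ratings arrive.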
35. Thank You, Kafka Summit SF!
Chris Fregly
@cfregly
All Source Code, Demos, and Docker Images Available
@ advancedspark.com,
github.com/fluxcapacitor/pipeline
Join the Global Meetup for Slides, Videos, Book
@ advancedspark.com