Spark After Dark
Real-time, Advanced Analytics with Spark
Chris Fregly
Hadoop Summit San Jose
June 11th, 2015
Who am I?
Streaming Platform Engineer
playboy.com
Streaming Platform Engineer
netflix.com
Data Solutions Engineer
spark.apache.org, databricks.com
What is Spark?
Spark Core
Spark Streaming (real-time)
Spark SQL (structured data)
MLlib (machine learning)
GraphX (graph analytics)
…
BlinkDB (approx queries)
What is Databricks?
Founded by creators of Spark, largest contributor
Offers a hosted service
Spark on EC2
Notebooks
Plot visualizations
Cluster management
Scheduled jobs
Why After Dark?
Playboy After Dark
Late 1960s
Progressive Show
It rhymes!
What is After Dark?
Goal: Generate high-quality recommendations for its users
Side goal: Demonstrate Spark Libraries
Spark Streaming -> Kafka
Spark SQL -> DataFrames
MLlib -> Machine Learning
GraphX -> Graph Analytics
Interactive Demo:
Streaming + Spark SQL + DataFrames
① Navigate to sparkafterdark.com
② Click 3 actors and 3 actresses
2 Types of Recommendations
① Non-personalized
No data about the user yet
“Cold Start” problem
② Personalized
Adapt to user preferences and behavior
Recommend from other users with similar preferences and behavior
Non-personalized Recommendations
① Top Users by Like Count
“I might like users who are most-liked overall.”
SparkSQL + DataFrame: Aggregations
② Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly based on overall likes.”
GraphX: PageRank
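To make the two non-personalized rankings concrete, here is a minimal plain-Python sketch. It is not the Spark SQL or GraphX code from the talk: the `likes` data, the function names, and the damping and iteration values are all assumptions chosen for illustration. The first function is the group-by-and-count that a DataFrame aggregation would perform; the second is a textbook power-iteration PageRank over the like graph.

```python
from collections import Counter

# Hypothetical sample data: one (liker, liked) pair per "like" event.
likes = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("bob", "cat"), ("dan", "cat"),
    ("ann", "dan"),
]

def top_by_like_count(likes, n=3):
    """Rank users by how many likes they received (group by, then count)."""
    counts = Counter(liked for _, liked in likes)
    return counts.most_common(n)

def pagerank(likes, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over the directed like graph."""
    nodes = sorted({user for pair in likes for user in pair})
    out_links = {node: [] for node in nodes}
    for liker, liked in likes:
        out_links[liker].append(liked)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by users who like nobody is spread evenly over everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for liker, targets in out_links.items():
            if not targets:
                continue
            share = damping * rank[liker] / len(targets)
            for liked in targets:
                new_rank[liked] += share
        rank = new_rank
    return rank

print(top_by_like_count(likes))
ranks = pagerank(likes)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The like count only looks at how many likes a user received, while PageRank also weighs who gave them, which is the "probability of me liking them randomly" reading on the slide.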
Demos:
Spark SQL + DataFrames + GraphX
Personalized Recommendations
③ Collaborative Filtering
“I like the same people that you like.
What other people do you like that I haven’t seen?”
MLlib: ALS, User-Item Similarity
④ Text Analytics
“Our profiles have similar, unique keywords.
We might like each other.”
MLlib: RowMatrix, Doc Similarity, TF/IDF
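The collaborative-filtering idea in ③ can be sketched in a few lines of plain Python. Note the swap: the slide names MLlib's ALS (a matrix-factorization method), whereas this sketch uses a simpler neighborhood approach with Jaccard overlap, only to illustrate the "what do you like that I haven't seen?" reasoning. The `liked_by` data and function names are hypothetical.

```python
# Hypothetical sample data: the set of people each user has liked.
liked_by = {
    "me":   {"ava", "ben"},
    "you":  {"ava", "ben", "cam"},
    "them": {"dee", "eli"},
}

def jaccard(a, b):
    """Overlap between two like-sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, liked_by, n=3):
    """Score unseen people by the similarity of the users who liked them."""
    mine = liked_by[user]
    scores = {}
    for other, theirs in liked_by.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0.0:
            continue  # no shared likes, so no evidence either way
        for person in theirs - mine:
            scores[person] = scores.get(person, 0.0) + similarity
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("me", liked_by))
```

Here "you" shares two likes with "me", so the one person "you" liked that "me" has not seen is recommended; "them" shares nothing and contributes nothing.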
Demo:
MLlib
2 Types of Feedback
① Explicit Feedback
Ratings, Like/Dislike
② Implicit Feedback
Searches, Clicks, Hovers, Views, Scrolls
Bonus!
No demos.
Exercises for the reader.
More Personalized Recommendations!!
④ Eigenfaces
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
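A rough sketch of the Eigenfaces idea, in plain Python rather than MLlib's RowMatrix and PCA: mean-center the flattened images, approximate the first principal axis by power iteration, project every face onto it, and pick the candidate whose projection is closest to the faces already liked. The four-pixel "faces", the names, and the use of a single component are assumptions for illustration; a real eigenface model would keep many components over full-size images.

```python
import math

# Hypothetical data: each "face" is a tiny flattened grayscale image.
faces = {
    "liked_1": [0.90, 0.80, 0.10, 0.20],
    "liked_2": [0.80, 0.90, 0.20, 0.10],
    "new_a":   [0.85, 0.85, 0.15, 0.15],
    "new_b":   [0.10, 0.20, 0.90, 0.80],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_center(rows):
    """Subtract the per-pixel mean so PCA sees variation, not brightness."""
    dims = len(rows[0])
    mean = [sum(row[i] for row in rows) / len(rows) for i in range(dims)]
    return [[row[i] - mean[i] for i in range(dims)] for row in rows]

def first_component(centered, iterations=100):
    """Power iteration on X^T X to approximate the first principal axis."""
    dims = len(centered[0])
    v = [1.0] + [0.0] * (dims - 1)  # arbitrary non-zero starting direction
    for _ in range(iterations):
        scores = [dot(row, v) for row in centered]          # X v
        w = [sum(s * row[i] for s, row in zip(scores, centered))
             for i in range(dims)]                          # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

names = list(faces)
centered = mean_center([faces[name] for name in names])
axis = first_component(centered)
projection = {name: dot(row, axis) for name, row in zip(names, centered)}

# Compare candidates with the average projection of the liked faces.
liked = [projection[name] for name in names if name.startswith("liked")]
liked_center = sum(liked) / len(liked)
candidates = [name for name in names if name.startswith("new")]
closest = min(candidates, key=lambda name: abs(projection[name] - liked_center))
print(closest)
```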
More Personalized Recommendations!!!
⑤ Compromise Recommendations (For Couples)
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
[Diagram: titles linked by "similar plot" and "similar actor" edges]
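One way to read "find something in between" is as a shortest path through a similarity graph, with the titles along the path serving as compromise candidates. The sketch below uses a plain breadth-first search instead of GraphX. The graph itself is made up: "Film A", "Film B", and "Film C" are placeholders, and the edges are assumptions, not real similarity data.

```python
from collections import deque

# Hypothetical similarity graph: an edge means two titles share a similar
# plot or a similar actor. "Film A/B/C" are made-up placeholders.
similar = {
    "Mad Max": ["Film A"],
    "Film A": ["Mad Max", "Film B", "Film C"],
    "Film B": ["Film A", "Message In a Bottle"],
    "Film C": ["Film A"],
    "Message In a Bottle": ["Film B"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a fewest-hop path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

def compromises(graph, mine, yours):
    """Titles strictly between the two picks on a shortest path."""
    path = shortest_path(graph, mine, yours)
    return path[1:-1] if path else []

print(compromises(similar, "Mad Max", "Message In a Bottle"))
```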
More Personalized Recommendations!
⑥ Text Analytics
“Our profiles have similar, unique keywords.
We might like each other.”
MLlib: RowMatrix, Doc Similarity, TF/IDF
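The "similar, unique keywords" idea can be sketched with TF/IDF weights and cosine similarity in plain Python. This is not MLlib's implementation: it uses the textbook idf = log(N / df) with no smoothing, and the profile text is hypothetical. Words that appear in every profile get zero weight, so only the distinctive keywords drive the similarity.

```python
import math
from collections import Counter

# Hypothetical profile text for three users.
profiles = {
    "a": "hiking spark kafka coffee",
    "b": "spark kafka jazz coffee",
    "c": "yoga wine coffee",
}

def tf_idf_vectors(docs):
    """Return a sparse {term: weight} vector per document."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n_docs = len(docs)
    vectors = {}
    for name, words in tokens.items():
        term_counts = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    shared = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return shared / norm if norm else 0.0

vectors = tf_idf_vectors(profiles)
print(cosine(vectors["a"], vectors["b"]))
print(cosine(vectors["a"], vectors["c"]))
```

Profiles "a" and "c" share only "coffee", which every profile contains, so their similarity is zero, while "a" and "b" share the rarer "spark" and "kafka".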
More Personalized Recommendations!!!!
⑦ High-Value Emails
“Your email has similar keywords to my profile.
I might like you for making the effort.”
MLlib: TF/IDF, Entity Recognition, Doc Similarity
[Diagram: Her Email compared with My Profile]
Conversation-Starter Bot
⑧ MLlib: TF/IDF, DecisionTree, Sentiment
“If your responses to my trite opening lines are positive,
I might actually read your profile.”
Other Talks
① Online Approximate OLAP in SparkSQL
Sameer Agarwal & Kai Zeng, Tues 6/9 @ 3:25pm
② Recipes for Running Streaming Apps in Prod
Tathagata Das, Wed 6/10 @ 12:05pm
③ Practical Distributed ML Pipelines on Hadoop
Joseph Bradley, Wed 6/10 @ 4:35pm
④ Dynamically Allocate Spark Cluster Resources
Andrew Or & Aaron Davidson, Thurs 6/11 @ 11:15am
Thank you!
cfregly@databricks.com
@cfregly
