SlideShare a Scribd company logo
1 of 34
Download to read offline
Prototyping Data Intensive Apps:
       TrendingTopics.org
              Pete Skomoroch
              Research Scientist at LinkedIn
              Consultant at Data Wrangling




           @peteskomoroch
                09/29/09                       1
Talk Outline
• TrendingTopics Overview
• Wikipedia Page View Dataset
• Hadoop on Amazon EC2
• Loading Data on EC2: Amazon EBS & S3
• Daily Timelines with Hadoop Streaming
• Hive Data Warehouse Layer
• Trend Computation with Hive
• Hooking It All Together
• Front End & Visualizations
Data Intensive Web Apps
• Batch data mining or prediction with Hadoop
• Iterate quickly with high level languages & tools
 – Pig, Hive, Clojure, Cascading, Python, Ruby
• EC2: Get running with limited initial capital
• Use external data and APIs in novel ways
• Recent real world example: FlightCaster




                                                  3
TrendingTopics.org




                     4
Daily Pageview Timeline Charts




                                 5
Detects Rising Trends with Hadoop




                                    6
TrendingTopics is Open Source
• Built as a side project at Data Wrangling
• Core code completed over 2 weeks in June
• Code on Github
• Data released on Amazon Public Datasets




                                              7
Technology Stack




                   8
Application Data Flow




                        9
Hadoop on Amazon EC2
• EC2: Launch servers on demand
• S3: Simple Storage Service
• EBS: Persistent disks for EC2
• Used Cloudera Hadoop Distribution
 – Includes Pig, Hive, HBase, Sqoop, EBS integration
 – Allows customization of Hadoop EC2 instances
 – See EC2 Docs and Tutorials on the Cloudera site




                                                       10
Wikipedia Page View Log Data
• 1 TB of hourly log files from Wikipedia Squid proxy
• Filename contains timestamp
• Columns: project, pagename, pageviews, & bytes




                                                 11
Wikipedia Redirect Data
• Many articles in the logs redirect to canonical titles
• Amazon Public Dataset contains a lookup table:




                                                    12
Loading Data into Hadoop on EC2
• Hadoop can read directly from Amazon S3

• Pulling in data from EBS snapshots




                                            13
Python Streaming for Daily Timelines




                                       14
Streaming: Filter & Aggregate Logs




                                     15
Hive Data Warehouse Layer on EC2
• Easy for analysts familar with SQL
• Familar syntax (SELECT, GROUP BY, SORT...)
• Hide the MR muck, for JOIN, UNION, etc.
• Supports streaming UDFs
• Sampling: generate datasets for R&D
• Used on Timeline and Redirect data




                                           16
Fixing Wiki Redirects with Hive JOIN




                                       17
Trend Detection With Hadoop




                              18
Hourly Data: “Java” vs. “Hangover”




                                     19
Robust Regression with R & Hadoop
• Fit R model to hourly data for 2.3M Wiki articles
• “Trending” if article views >> model prediction



                               Page View “Spikes”




                                                      20
Call Python Trend Mapper from Hive




                                     21
Simple Trend Calculation in Python




                                     22
Use Trend Scores to Rank Articles




                                    23
Exporting Data from our EC2 Cluster
• Output files are delimited text for bulk DB load
• Distcp outputs to S3:
 – hadoop distcp /user/output s3n://trendingtopics/exports


• Hive can create tables directly over S3 filesystem:
 – create external table foo ... location 's3n://trendingtopics/
   hiveoutput’




                                                           24
Hooking It All Together




                          25
Ruby on Rails Front End
• Run a development version of the app locally:




                                                  26
Automate & Deploy with EC2onRails
• EC2onRails: Ubuntu server images for EC2, runs
  instances of Ruby on Rails, MySQL, Apache
• Daily cron job on App server launches Hadoop
• Hadoop EC2 boot script checks out latest trend
  detection code from Github
• Capistrano tasks deploy new application code,
  maintain DB, cache, archives




                                             27
Cron Triggers Hadoop Jobs on EC2
 Modify Cloudera hadoop-ec2-init-remote.sh boot script




                                                    28
Bash Scripts Automate Daily Run
• run_daily_merge.sh kicks off Hadoop/Hive jobs
• Queries MySQL for last run date
• Generates timelines with Hadoop streaming
• Hive operates on timeline data, computes Trends
  with Python
• Hive output copied to S3
• Calls remote script to load MySQL staging tables
• Hadoop cluster self terminates
• Remote script swaps staging and prod after load



                                               29
Capistrano Tasks
• Capistrano: Ruby tool for automating tasks on one
  or more remote servers (www.capify.org)
• Deploys Rails code and cron jobs to App server
• Cleanup tasks are triggered at end of load:
 – Flags featured Wikipedia articles
 – Indexes MySQL tables, Flushes cache




                                               30
Visualization APIs
• Google Viz API: Annotated Timeline for Rails



• Google Charts sparklines: gc4r



• Yahoo BOSS image search in Rails




                                                 31
Ideas & Next Steps
• Try out the code on Github
• Show related articles using link graph data
• Extract trending phrases from article text
• Boost Twitter trending topics with Wikipedia logs
• Update top trends hourly
• Use date partitioned tables, incremental updates
• Add a real search engine (uses MySQL now)




                                                 32
LinkedIn Analytics Team




                     We’re Hiring




                                    33
Questions?
• Contact
 – Blog: datawrangling.com
 – Twitter: @peteskomoroch

• Code
 – http://github.com/datawrangling/trendingtopics
 – Hadoop blog posts & tutorials on the Cloudera site
 – Amazon EMR tutorial on Finding Similar Artists with
   Hadoop & Last.FM data http://bit.ly/EMRsimilarity




                                                         34

More Related Content

What's hot

Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Gavin Lin
 
Summingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceSummingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceDataWorks Summit
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with DaskUwe Korn
 
Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Gavin Lin
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot InstancesAmazon Web Services
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit
 
Paris DataGeek - SummingBird
Paris DataGeek - SummingBirdParis DataGeek - SummingBird
Paris DataGeek - SummingBirdSofian Djamaa
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014Amazon Web Services
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleDung Ngua
 
Community-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraphCommunity-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraphJason Plurad
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA SummitOpen Analytics
 
Custom object detection_gcp
Custom object detection_gcpCustom object detection_gcp
Custom object detection_gcpVaibhav Sahu
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Diminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesDiminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesSebastian Werner
 

What's hot (20)

Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018
 
Summingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceSummingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduce
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with Dask
 
Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Paris DataGeek - SummingBird
Paris DataGeek - SummingBirdParis DataGeek - SummingBird
Paris DataGeek - SummingBird
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
Community-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraphCommunity-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraph
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA Summit
 
Custom object detection_gcp
Custom object detection_gcpCustom object detection_gcp
Custom object detection_gcp
 
Luigi future
Luigi futureLuigi future
Luigi future
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Diminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesDiminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations Slides
 

Viewers also liked

Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...Beat Signer
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Software Requirement Specification
Software Requirement SpecificationSoftware Requirement Specification
Software Requirement SpecificationVishal Singh
 
In plant training review 1 msc
In plant training review 1 mscIn plant training review 1 msc
In plant training review 1 mscmanish chauhan
 
Manager Accounts (SRE)
Manager Accounts (SRE)Manager Accounts (SRE)
Manager Accounts (SRE)Nasir Ali
 
EDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistorEDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistorJitin Pillai
 
戒除甜食,享受健康
戒除甜食,享受健康戒除甜食,享受健康
戒除甜食,享受健康八正 文化
 
Andrew SFX-ADF
Andrew SFX-ADFAndrew SFX-ADF
Andrew SFX-ADFsavomir
 
Minturnae Roman Resort
Minturnae Roman ResortMinturnae Roman Resort
Minturnae Roman ResortWilliam Hay
 
Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)Global R & D Services
 
chocolate cake
chocolate cakechocolate cake
chocolate cakepagusti
 
Compare Nation Food Prices
Compare Nation Food PricesCompare Nation Food Prices
Compare Nation Food PricesShaikhani.
 
Amigo De Verdad Gardfield
Amigo De Verdad GardfieldAmigo De Verdad Gardfield
Amigo De Verdad Gardfieldhuehue 1
 
ประวัติส่วนตัว
ประวัติส่วนตัวประวัติส่วนตัว
ประวัติส่วนตัวKatesuda Fon
 
«Добрая точка»
«Добрая точка»«Добрая точка»
«Добрая точка»nfnfrz
 
Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014al-nashra
 

Viewers also liked (20)

Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
 
Prototyping
PrototypingPrototyping
Prototyping
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Software Requirement Specification
Software Requirement SpecificationSoftware Requirement Specification
Software Requirement Specification
 
lzonepsu
lzonepsu lzonepsu
lzonepsu
 
In plant training review 1 msc
In plant training review 1 mscIn plant training review 1 msc
In plant training review 1 msc
 
Manager Accounts (SRE)
Manager Accounts (SRE)Manager Accounts (SRE)
Manager Accounts (SRE)
 
EDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistorEDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistor
 
戒除甜食,享受健康
戒除甜食,享受健康戒除甜食,享受健康
戒除甜食,享受健康
 
Andrew SFX-ADF
Andrew SFX-ADFAndrew SFX-ADF
Andrew SFX-ADF
 
CV
CVCV
CV
 
Minturnae Roman Resort
Minturnae Roman ResortMinturnae Roman Resort
Minturnae Roman Resort
 
Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)
 
chocolate cake
chocolate cakechocolate cake
chocolate cake
 
Compare Nation Food Prices
Compare Nation Food PricesCompare Nation Food Prices
Compare Nation Food Prices
 
P.cemas belia
P.cemas beliaP.cemas belia
P.cemas belia
 
Amigo De Verdad Gardfield
Amigo De Verdad GardfieldAmigo De Verdad Gardfield
Amigo De Verdad Gardfield
 
ประวัติส่วนตัว
ประวัติส่วนตัวประวัติส่วนตัว
ประวัติส่วนตัว
 
«Добрая точка»
«Добрая точка»«Добрая точка»
«Добрая точка»
 
Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014
 

Similar to Prototyping Data Intensive Apps: TrendingTopics.org

CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing VeniceYan Yan
 
Latest Developments in H2O
Latest Developments in H2OLatest Developments in H2O
Latest Developments in H2OSri Ambati
 
Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Felix GV
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWSPaolo latella
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Petr Novotný
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...Radhika Puthiyetath
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
NoSQL on the move
NoSQL on the moveNoSQL on the move
NoSQL on the moveCodemotion
 
PHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaarPHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaarvitoc
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018Rachit Arora
 

Similar to Prototyping Data Intensive Apps: TrendingTopics.org (20)

CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing Venice
 
Latest Developments in H2O
Latest Developments in H2OLatest Developments in H2O
Latest Developments in H2O
 
Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
NoSQL on the move
NoSQL on the moveNoSQL on the move
NoSQL on the move
 
PHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaarPHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaar
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
 

More from Peter Skomoroch

Bridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportBridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportPeter Skomoroch
 
Managing Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackManaging Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackPeter Skomoroch
 
Product Management for AI
Product Management for AIProduct Management for AI
Product Management for AIPeter Skomoroch
 
Executive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkExecutive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkPeter Skomoroch
 
Building Competitive Moats With Data
Building Competitive Moats With DataBuilding Competitive Moats With Data
Building Competitive Moats With DataPeter Skomoroch
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustPeter Skomoroch
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsPeter Skomoroch
 
Skills, Reputation, and Search
Skills, Reputation, and SearchSkills, Reputation, and Search
Skills, Reputation, and SearchPeter Skomoroch
 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingLinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingPeter Skomoroch
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data ProductsPeter Skomoroch
 
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPractical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPeter Skomoroch
 
Street Fighting Data Science
Street Fighting Data ScienceStreet Fighting Data Science
Street Fighting Data SciencePeter Skomoroch
 
Data Mashups -Data Science Summit
Data Mashups -Data Science SummitData Mashups -Data Science Summit
Data Mashups -Data Science SummitPeter Skomoroch
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopPeter Skomoroch
 

More from Peter Skomoroch (15)

Bridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportBridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder Support
 
Managing Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackManaging Machines: The New AI Dev Stack
Managing Machines: The New AI Dev Stack
 
Product Management for AI
Product Management for AIProduct Management for AI
Product Management for AI
 
Executive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkExecutive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you think
 
Building Competitive Moats With Data
Building Competitive Moats With DataBuilding Competitive Moats With Data
Building Competitive Moats With Data
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data Exhaust
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data Products
 
Skills, Reputation, and Search
Skills, Reputation, and SearchSkills, Reputation, and Search
Skills, Reputation, and Search
 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingLinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data Products
 
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPractical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
 
Street Fighting Data Science
Street Fighting Data ScienceStreet Fighting Data Science
Street Fighting Data Science
 
Data Mashups -Data Science Summit
Data Mashups -Data Science SummitData Mashups -Data Science Summit
Data Mashups -Data Science Summit
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 

Recently uploaded

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Prototyping Data Intensive Apps: TrendingTopics.org

  • 1. Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1
  • 2. Talk Outline • TrendingTopics Overview • Wikipedia Page View Dataset • Hadoop on Amazon EC2 • Loading Data on EC2: Amazon EBS & S3 • Daily Timelines with Hadoop Streaming • Hive Data Warehouse Layer • Trend Computation with Hive • Hooking It All Together • Front End & Visualizations
  • 3. Data Intensive Web Apps • Batch data mining or prediction with Hadoop • Iterate quickly with high level languages & tools – Pig, Hive, Clojure, Cascading, Python, Ruby • EC2: Get running with limited initial capital • Use external data and APIs in novel ways • Recent real world example: FlightCaster 3
  • 6. Detects Rising Trends with Hadoop 6
  • 7. TrendingTopics is Open Source • Built as a side project at Data Wrangling • Core code completed over 2 weeks in June • Code on Github • Data released on Amazon Public Datasets 7
  • 10. Hadoop on Amazon EC2 • EC2: Launch servers on demand • S3: Simple Storage Service • EBS: Persistent disks for EC2 • Used Cloudera Hadoop Distribution – Includes Pig, Hive, HBase, Sqoop, EBS integration – Allows customization of Hadoop EC2 instances – See EC2 Docs and Tutorials on the Cloudera site 10
  • 11. Wikipedia Page View Log Data • 1 TB of hourly log files from Wikipedia Squid proxy • Filename contains timestamp • Columns: project, pagename, pageviews, & bytes 11
  • 12. Wikipedia Redirect Data • Many articles in the logs redirect to canonical titles • Amazon Public Dataset contains a lookup table: 12
  • 13. Loading Data into Hadoop on EC2 • Hadoop can read directly from Amazon S3 • Pulling in data from EBS snapshots 13
  • 14. Python Streaming for Daily Timelines 14
  • 15. Streaming: Filter & Aggregate Logs 15
  • 16. Hive Data Warehouse Layer on EC2 • Easy for analysts familar with SQL • Familar syntax (SELECT, GROUP BY, SORT...) • Hide the MR muck, for JOIN, UNION, etc. • Supports streaming UDFs • Sampling: generate datasets for R&D • Used on Timeline and Redirect data 16
  • 17. Fixing Wiki Redirects with Hive JOIN 17
  • 18. Trend Detection With Hadoop 18
  • 19. Hourly Data: “Java” vs. “Hangover” 19
  • 20. Robust Regression with R & Hadoop • Fit R model to hourly data for 2.3M Wiki articles • “Trending” if article views >> model prediction Page View “Spikes” 20
  • 21. Call Python Trend Mapper from Hive 21
  • 22. Simple Trend Calculation in Python 22
  • 23. Use Trend Scores to Rank Articles 23
  • 24. Exporting Data from our EC2 Cluster • Output files are delimited text for bulk DB load • Distcp outputs to S3: – hadoop distcp /user/output s3n://trendingtopics/exports • Hive can create tables directly over S3 filesystem: – create external table foo ... location 's3n://trendingtopics/ hiveoutput’ 24
  • 25. Hooking It All Together 25
  • 26. Ruby on Rails Front End • Run a development version of the app locally: 26
  • 27. Automate & Deploy with EC2onRails • EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache • Daily cron job on App server launches Hadoop • Hadoop EC2 boot script checks out latest trend detection code from Github • Capistrano tasks deploy new application code, maintain DB, cache, archives 27
  • 28. Cron Triggers Hadoop Jobs on EC2 Modify Cloudera hadoop-ec2-init-remote.sh boot script 28
  • 29. Bash Scripts Automate Daily Run • run_daily_merge.sh kicks off Hadoop/Hive jobs • Queries MySQL for last run date • Generates timelines with Hadoop streaming • Hive operates on timeline data, computes Trends with Python • Hive output copied to S3 • Calls remote script to load MySQL staging tables • Hadoop cluster self terminates • Remote script swaps staging and prod after load 29
  • 30. Capistrano Tasks • Capistrano: Ruby tool for automating tasks on one or more remote servers (www.capify.org) • Deploys Rails code and cron jobs to App server • Cleanup tasks are triggered at end of load: – Flags featured Wikipedia articles – Indexes MySQL tables, Flushes cache 30
  • 31. Visualization APIs • Google Viz API: Annotated Timeline for Rails • Google Charts sparklines: gc4r • Yahoo BOSS image search in Rails 31
  • 32. Ideas & Next Steps • Try out the code on Github • Show related articles using link graph data • Extract trending phrases from article text • Boost Twitter trending topics with Wikipedia logs • Update top trends hourly • Use date partitioned tables, incremental updates • Add a real search engine (uses MySQL now) 32
  • 33. LinkedIn Analytics Team We’re Hiring 33
  • 34. Questions? • Contact – Blog: datawrangling.com – Twitter: @peteskomoroch • Code – http://github.com/datawrangling/trendingtopics – Hadoop blog posts & tutorials on the Cloudera site – Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data http://bit.ly/EMRsimilarity 34