SlideShare a Scribd company logo
1 of 34
Download to read offline
Prototyping Data Intensive Apps:
       TrendingTopics.org
              Pete Skomoroch
              Research Scientist at LinkedIn
              Consultant at Data Wrangling




           @peteskomoroch
                09/29/09                       1
Talk Outline
• TrendingTopics Overview
• Wikipedia Page View Dataset
• Hadoop on Amazon EC2
• Loading Data on EC2: Amazon EBS & S3
• Daily Timelines with Hadoop Streaming
• Hive Data Warehouse Layer
• Trend Computation with Hive
• Hooking It All Together
• Front End & Visualizations
Data Intensive Web Apps
• Batch data mining or prediction with Hadoop
• Iterate quickly with high level languages & tools
 – Pig, Hive, Clojure, Cascading, Python, Ruby
• EC2: Get running with limited initial capital
• Use external data and APIs in novel ways
• Recent real world example: FlightCaster




                                                  3
TrendingTopics.org




                     4
Daily Pageview Timeline Charts




                                 5
Detects Rising Trends with Hadoop




                                    6
TrendingTopics is Open Source
• Built as a side project at Data Wrangling
• Core code completed over 2 weeks in June
• Code on Github
• Data released on Amazon Public Datasets




                                              7
Technology Stack




                   8
Application Data Flow




                        9
Hadoop on Amazon EC2
• EC2: Launch servers on demand
• S3: Simple Storage Service
• EBS: Persistent disks for EC2
• Used Cloudera Hadoop Distribution
 – Includes Pig, Hive, HBase, Sqoop, EBS integration
 – Allows customization of Hadoop EC2 instances
 – See EC2 Docs and Tutorials on the Cloudera site




                                                       10
Wikipedia Page View Log Data
• 1 TB of hourly log files from Wikipedia Squid proxy
• Filename contains timestamp
• Columns: project, pagename, pageviews, & bytes




                                                 11
Wikipedia Redirect Data
• Many articles in the logs redirect to canonical titles
• Amazon Public Dataset contains a lookup table:




                                                    12
Loading Data into Hadoop on EC2
• Hadoop can read directly from Amazon S3

• Pulling in data from EBS snapshots




                                            13
Python Streaming for Daily Timelines




                                       14
Streaming: Filter & Aggregate Logs




                                     15
Hive Data Warehouse Layer on EC2
• Easy for analysts familar with SQL
• Familar syntax (SELECT, GROUP BY, SORT...)
• Hide the MR muck, for JOIN, UNION, etc.
• Supports streaming UDFs
• Sampling: generate datasets for R&D
• Used on Timeline and Redirect data




                                           16
Fixing Wiki Redirects with Hive JOIN




                                       17
Trend Detection With Hadoop




                              18
Hourly Data: “Java” vs. “Hangover”




                                     19
Robust Regression with R & Hadoop
• Fit R model to hourly data for 2.3M Wiki articles
• “Trending” if article views >> model prediction



                               Page View “Spikes”




                                                      20
Call Python Trend Mapper from Hive




                                     21
Simple Trend Calculation in Python




                                     22
Use Trend Scores to Rank Articles




                                    23
Exporting Data from our EC2 Cluster
• Output files are delimited text for bulk DB load
• Distcp outputs to S3:
 – hadoop distcp /user/output s3n://trendingtopics/exports


• Hive can create tables directly over S3 filesystem:
 – create external table foo ... location 's3n://trendingtopics/
   hiveoutput’




                                                           24
Hooking It All Together




                          25
Ruby on Rails Front End
• Run a development version of the app locally:




                                                  26
Automate & Deploy with EC2onRails
• EC2onRails: Ubuntu server images for EC2, runs
  instances of Ruby on Rails, MySQL, Apache
• Daily cron job on App server launches Hadoop
• Hadoop EC2 boot script checks out latest trend
  detection code from Github
• Capistrano tasks deploy new application code,
  maintain DB, cache, archives




                                             27
Cron Triggers Hadoop Jobs on EC2
 Modify Cloudera hadoop-ec2-init-remote.sh boot script




                                                    28
Bash Scripts Automate Daily Run
• run_daily_merge.sh kicks off Hadoop/Hive jobs
• Queries MySQL for last run date
• Generates timelines with Hadoop streaming
• Hive operates on timeline data, computes Trends
  with Python
• Hive output copied to S3
• Calls remote script to load MySQL staging tables
• Hadoop cluster self terminates
• Remote script swaps staging and prod after load



                                               29
Capistrano Tasks
• Capistrano: Ruby tool for automating tasks on one
  or more remote servers (www.capify.org)
• Deploys Rails code and cron jobs to App server
• Cleanup tasks are triggered at end of load:
 – Flags featured Wikipedia articles
 – Indexes MySQL tables, Flushes cache




                                               30
Visualization APIs
• Google Viz API: Annotated Timeline for Rails



• Google Charts sparklines: gc4r



• Yahoo BOSS image search in Rails




                                                 31
Ideas & Next Steps
• Try out the code on Github
• Show related articles using link graph data
• Extract trending phrases from article text
• Boost Twitter trending topics with Wikipedia logs
• Update top trends hourly
• Use date partitioned tables, incremental updates
• Add a real search engine (uses MySQL now)




                                                 32
LinkedIn Analytics Team




                     We’re Hiring




                                    33
Questions?
• Contact
 – Blog: datawrangling.com
 – Twitter: @peteskomoroch

• Code
 – http://github.com/datawrangling/trendingtopics
 – Hadoop blog posts & tutorials on the Cloudera site
 – Amazon EMR tutorial on Finding Similar Artists with
   Hadoop & Last.FM data http://bit.ly/EMRsimilarity




                                                         34

More Related Content

What's hot

Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Gavin Lin
 
Summingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceSummingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceDataWorks Summit
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with DaskUwe Korn
 
Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Gavin Lin
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot InstancesAmazon Web Services
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit
 
Paris DataGeek - SummingBird
Paris DataGeek - SummingBirdParis DataGeek - SummingBird
Paris DataGeek - SummingBirdSofian Djamaa
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014Amazon Web Services
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleDung Ngua
 
Community-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraphCommunity-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraphJason Plurad
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA SummitOpen Analytics
 
Custom object detection_gcp
Custom object detection_gcpCustom object detection_gcp
Custom object detection_gcpVaibhav Sahu
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Diminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesDiminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesSebastian Werner
 

What's hot (20)

Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018
 
Summingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduceSummingbird: Streaming Portable, MapReduce
Summingbird: Streaming Portable, MapReduce
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
Scalable Scientific Computing with Dask
Scalable Scientific Computing with DaskScalable Scientific Computing with Dask
Scalable Scientific Computing with Dask
 
Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Paris DataGeek - SummingBird
Paris DataGeek - SummingBirdParis DataGeek - SummingBird
Paris DataGeek - SummingBird
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
Community-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraphCommunity-Driven Graphs with JanusGraph
Community-Driven Graphs with JanusGraph
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA Summit
 
Custom object detection_gcp
Custom object detection_gcpCustom object detection_gcp
Custom object detection_gcp
 
Luigi future
Luigi futureLuigi future
Luigi future
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Diminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations SlidesDiminuendo! Tactics in Support of FaaS Migrations Slides
Diminuendo! Tactics in Support of FaaS Migrations Slides
 

Viewers also liked

Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...Beat Signer
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Software Requirement Specification
Software Requirement SpecificationSoftware Requirement Specification
Software Requirement SpecificationVishal Singh
 
In plant training review 1 msc
In plant training review 1 mscIn plant training review 1 msc
In plant training review 1 mscmanish chauhan
 
Manager Accounts (SRE)
Manager Accounts (SRE)Manager Accounts (SRE)
Manager Accounts (SRE)Nasir Ali
 
EDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistorEDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistorJitin Pillai
 
戒除甜食,享受健康
戒除甜食,享受健康戒除甜食,享受健康
戒除甜食,享受健康八正 文化
 
Andrew SFX-ADF
Andrew SFX-ADFAndrew SFX-ADF
Andrew SFX-ADFsavomir
 
Minturnae Roman Resort
Minturnae Roman ResortMinturnae Roman Resort
Minturnae Roman ResortWilliam Hay
 
Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)Global R & D Services
 
chocolate cake
chocolate cakechocolate cake
chocolate cakepagusti
 
Compare Nation Food Prices
Compare Nation Food PricesCompare Nation Food Prices
Compare Nation Food PricesShaikhani.
 
Amigo De Verdad Gardfield
Amigo De Verdad GardfieldAmigo De Verdad Gardfield
Amigo De Verdad Gardfieldhuehue 1
 
ประวัติส่วนตัว
ประวัติส่วนตัวประวัติส่วนตัว
ประวัติส่วนตัวKatesuda Fon
 
«Добрая точка»
«Добрая точка»«Добрая точка»
«Добрая точка»nfnfrz
 
Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014al-nashra
 

Viewers also liked (20)

Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
Requirements Analysis, Prototyping and Evaluation - Lecture 03 - Next Generat...
 
Prototyping
PrototypingPrototyping
Prototyping
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Software Requirement Specification
Software Requirement SpecificationSoftware Requirement Specification
Software Requirement Specification
 
lzonepsu
lzonepsu lzonepsu
lzonepsu
 
In plant training review 1 msc
In plant training review 1 mscIn plant training review 1 msc
In plant training review 1 msc
 
Manager Accounts (SRE)
Manager Accounts (SRE)Manager Accounts (SRE)
Manager Accounts (SRE)
 
EDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistorEDC_2131006_BipolarJunctionTransistor
EDC_2131006_BipolarJunctionTransistor
 
戒除甜食,享受健康
戒除甜食,享受健康戒除甜食,享受健康
戒除甜食,享受健康
 
Andrew SFX-ADF
Andrew SFX-ADFAndrew SFX-ADF
Andrew SFX-ADF
 
CV
CVCV
CV
 
Minturnae Roman Resort
Minturnae Roman ResortMinturnae Roman Resort
Minturnae Roman Resort
 
Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)Grds international conference on pure and applied science (8)
Grds international conference on pure and applied science (8)
 
chocolate cake
chocolate cakechocolate cake
chocolate cake
 
Compare Nation Food Prices
Compare Nation Food PricesCompare Nation Food Prices
Compare Nation Food Prices
 
P.cemas belia
P.cemas beliaP.cemas belia
P.cemas belia
 
Amigo De Verdad Gardfield
Amigo De Verdad GardfieldAmigo De Verdad Gardfield
Amigo De Verdad Gardfield
 
ประวัติส่วนตัว
ประวัติส่วนตัวประวัติส่วนตัว
ประวัติส่วนตัว
 
«Добрая точка»
«Добрая точка»«Добрая точка»
«Добрая точка»
 
Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014Daily newsletter-e no500 6-6-2014
Daily newsletter-e no500 6-6-2014
 

Similar to Prototyping Data Intensive Apps: TrendingTopics.org

CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing VeniceYan Yan
 
Latest Developments in H2O
Latest Developments in H2OLatest Developments in H2O
Latest Developments in H2OSri Ambati
 
Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Felix GV
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWSPaolo latella
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Petr Novotný
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...Radhika Puthiyetath
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
NoSQL on the move
NoSQL on the moveNoSQL on the move
NoSQL on the moveCodemotion
 
PHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaarPHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaarvitoc
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018Rachit Arora
 

Similar to Prototyping Data Intensive Apps: TrendingTopics.org (20)

CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing Venice
 
Latest Developments in H2O
Latest Developments in H2OLatest Developments in H2O
Latest Developments in H2O
 
Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017Introducing Venice - Strata NYC 2017
Introducing Venice - Strata NYC 2017
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
NoSQL on the move
NoSQL on the moveNoSQL on the move
NoSQL on the move
 
PHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaarPHP and the Cloud: The view from the bazaar
PHP and the Cloud: The view from the bazaar
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
 

More from Peter Skomoroch

Bridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportBridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportPeter Skomoroch
 
Managing Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackManaging Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackPeter Skomoroch
 
Product Management for AI
Product Management for AIProduct Management for AI
Product Management for AIPeter Skomoroch
 
Executive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkExecutive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkPeter Skomoroch
 
Building Competitive Moats With Data
Building Competitive Moats With DataBuilding Competitive Moats With Data
Building Competitive Moats With DataPeter Skomoroch
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustPeter Skomoroch
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsPeter Skomoroch
 
Skills, Reputation, and Search
Skills, Reputation, and SearchSkills, Reputation, and Search
Skills, Reputation, and SearchPeter Skomoroch
 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingLinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingPeter Skomoroch
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data ProductsPeter Skomoroch
 
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPractical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPeter Skomoroch
 
Street Fighting Data Science
Street Fighting Data ScienceStreet Fighting Data Science
Street Fighting Data SciencePeter Skomoroch
 
Data Mashups -Data Science Summit
Data Mashups -Data Science SummitData Mashups -Data Science Summit
Data Mashups -Data Science SummitPeter Skomoroch
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopPeter Skomoroch
 

More from Peter Skomoroch (15)

Bridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportBridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder Support
 
Managing Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackManaging Machines: The New AI Dev Stack
Managing Machines: The New AI Dev Stack
 
Product Management for AI
Product Management for AIProduct Management for AI
Product Management for AI
 
Executive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkExecutive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you think
 
Building Competitive Moats With Data
Building Competitive Moats With DataBuilding Competitive Moats With Data
Building Competitive Moats With Data
 
O'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data ExhaustO'Reilly Strata: Distilling Data Exhaust
O'Reilly Strata: Distilling Data Exhaust
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data Products
 
Skills, Reputation, and Search
Skills, Reputation, and SearchSkills, Reputation, and Search
Skills, Reputation, and Search
 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingLinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data Products
 
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPractical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
 
Street Fighting Data Science
Street Fighting Data ScienceStreet Fighting Data Science
Street Fighting Data Science
 
Data Mashups -Data Science Summit
Data Mashups -Data Science SummitData Mashups -Data Science Summit
Data Mashups -Data Science Summit
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
Rapid Data Exploration With Hadoop
Rapid Data Exploration With HadoopRapid Data Exploration With Hadoop
Rapid Data Exploration With Hadoop
 

Recently uploaded

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Prototyping Data Intensive Apps: TrendingTopics.org

  • 1. Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1
  • 2. Talk Outline • TrendingTopics Overview • Wikipedia Page View Dataset • Hadoop on Amazon EC2 • Loading Data on EC2: Amazon EBS & S3 • Daily Timelines with Hadoop Streaming • Hive Data Warehouse Layer • Trend Computation with Hive • Hooking It All Together • Front End & Visualizations
  • 3. Data Intensive Web Apps • Batch data mining or prediction with Hadoop • Iterate quickly with high level languages & tools – Pig, Hive, Clojure, Cascading, Python, Ruby • EC2: Get running with limited initial capital • Use external data and APIs in novel ways • Recent real world example: FlightCaster 3
  • 6. Detects Rising Trends with Hadoop 6
  • 7. TrendingTopics is Open Source • Built as a side project at Data Wrangling • Core code completed over 2 weeks in June • Code on Github • Data released on Amazon Public Datasets 7
  • 10. Hadoop on Amazon EC2 • EC2: Launch servers on demand • S3: Simple Storage Service • EBS: Persistent disks for EC2 • Used Cloudera Hadoop Distribution – Includes Pig, Hive, HBase, Sqoop, EBS integration – Allows customization of Hadoop EC2 instances – See EC2 Docs and Tutorials on the Cloudera site 10
  • 11. Wikipedia Page View Log Data • 1 TB of hourly log files from Wikipedia Squid proxy • Filename contains timestamp • Columns: project, pagename, pageviews, & bytes 11
  • 12. Wikipedia Redirect Data • Many articles in the logs redirect to canonical titles • Amazon Public Dataset contains a lookup table: 12
  • 13. Loading Data into Hadoop on EC2 • Hadoop can read directly from Amazon S3 • Pulling in data from EBS snapshots 13
  • 14. Python Streaming for Daily Timelines 14
  • 15. Streaming: Filter & Aggregate Logs 15
  • 16. Hive Data Warehouse Layer on EC2 • Easy for analysts familar with SQL • Familar syntax (SELECT, GROUP BY, SORT...) • Hide the MR muck, for JOIN, UNION, etc. • Supports streaming UDFs • Sampling: generate datasets for R&D • Used on Timeline and Redirect data 16
  • 17. Fixing Wiki Redirects with Hive JOIN 17
  • 18. Trend Detection With Hadoop 18
  • 19. Hourly Data: “Java” vs. “Hangover” 19
  • 20. Robust Regression with R & Hadoop • Fit R model to hourly data for 2.3M Wiki articles • “Trending” if article views >> model prediction Page View “Spikes” 20
  • 21. Call Python Trend Mapper from Hive 21
  • 22. Simple Trend Calculation in Python 22
  • 23. Use Trend Scores to Rank Articles 23
  • 24. Exporting Data from our EC2 Cluster • Output files are delimited text for bulk DB load • Distcp outputs to S3: – hadoop distcp /user/output s3n://trendingtopics/exports • Hive can create tables directly over S3 filesystem: – create external table foo ... location 's3n://trendingtopics/ hiveoutput’ 24
  • 25. Hooking It All Together 25
  • 26. Ruby on Rails Front End • Run a development version of the app locally: 26
  • 27. Automate & Deploy with EC2onRails • EC2onRails: Ubuntu server images for EC2, runs instances of Ruby on Rails, MySQL, Apache • Daily cron job on App server launches Hadoop • Hadoop EC2 boot script checks out latest trend detection code from Github • Capistrano tasks deploy new application code, maintain DB, cache, archives 27
  • 28. Cron Triggers Hadoop Jobs on EC2 Modify Cloudera hadoop-ec2-init-remote.sh boot script 28
  • 29. Bash Scripts Automate Daily Run • run_daily_merge.sh kicks off Hadoop/Hive jobs • Queries MySQL for last run date • Generates timelines with Hadoop streaming • Hive operates on timeline data, computes Trends with Python • Hive output copied to S3 • Calls remote script to load MySQL staging tables • Hadoop cluster self terminates • Remote script swaps staging and prod after load 29
  • 30. Capistrano Tasks • Capistrano: Ruby tool for automating tasks on one or more remote servers (www.capify.org) • Deploys Rails code and cron jobs to App server • Cleanup tasks are triggered at end of load: – Flags featured Wikipedia articles – Indexes MySQL tables, Flushes cache 30
  • 31. Visualization APIs • Google Viz API: Annotated Timeline for Rails • Google Charts sparklines: gc4r • Yahoo BOSS image search in Rails 31
  • 32. Ideas & Next Steps • Try out the code on Github • Show related articles using link graph data • Extract trending phrases from article text • Boost Twitter trending topics with Wikipedia logs • Update top trends hourly • Use date partitioned tables, incremental updates • Add a real search engine (uses MySQL now) 32
  • 33. LinkedIn Analytics Team We’re Hiring 33
  • 34. Questions? • Contact – Blog: datawrangling.com – Twitter: @peteskomoroch • Code – http://github.com/datawrangling/trendingtopics – Hadoop blog posts & tutorials on the Cloudera site – Amazon EMR tutorial on Finding Similar Artists with Hadoop & Last.FM data http://bit.ly/EMRsimilarity 34