Hadoop Based Intelligent Text Processing System
October 12, 2010
Hadoop World, NYC
Page 2
Who are we?
•Vaijanath N. Rao
•AOL
•Contact: vaijanath.rao@teamaol.com
•Rohini Uppuluri
•AOL
•Contact: rohini.uppuluri@teamaol.com
Page 3
Agenda
1. Introduction
2. Problem Statement
3. Our Intelligent Text Processing System
1. Overview
2. Detailed
3. Application(s)
4. Q and A
Page 4
Introduction
Page 5
Introduction (Continued…)
• Information Extraction - Extracting information from text
• Part of Speech Analysis
Ex: BlackBeauty<noun> is<verb> a<det> pretty<adjective> horse<noun>
• Named Entity Extraction
Ex: The CEO <Person>Mr. A</Person> of <Location>New York</Location> based Firm
<Organization>Foo.Inc</Organization> announced its new Product
<date>today</date>
• Sentiment Analysis
Ex: Watch this film. AVATAR is an achievement in many technical departments. It is a
beautiful experience
• Sentence Detection
Ex: <Start Sentence>BlackBeauty is a pretty horse <End Sentence>
• Some Tools: OpenNLP[5], LingPipe[6], GATE[7], NLTK[8], etc.
• Categorization/Classification - Categorize items into one of the predefined
classes
Ex: An article about a baseball match is a “Sports” article.
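As a rough illustration of the Sentence Detection task above, here is a minimal rule-based sketch. It is not how OpenNLP, LingPipe, GATE or NLTK do it; the function names `detect_sentences` and `mark_sentences` are made up for this example, and the rule (split after `.`, `!` or `?` followed by whitespace) is an assumption chosen for brevity.

```python
import re

def detect_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def mark_sentences(text):
    """Wrap each detected sentence in the markers used on the slide."""
    return " ".join(
        "<Start Sentence>%s<End Sentence>" % sentence
        for sentence in detect_sentences(text)
    )

REVIEW = ("Watch this film. AVATAR is an achievement in many technical "
          "departments. It is a beautiful experience")

for sentence in detect_sentences(REVIEW):
    print(sentence)
```

A rule like this wrongly splits after abbreviations such as "Mr." in the Named Entity example, which is one reason trained models are normally used for this task.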
Page 6
Introduction (Continued…)
• Challenges
• Processing large amounts of data
• Most approaches use machine learning methods
• Need to be trained on large amounts of data
• Need a way to perform the computations in a scalable manner
• Domain dependency
Page 7
Problem Statement
• What do we want to do?
• Build large-scale applications (processing text)
• Why is this useful?
• Analyze the large volume of content available at AOL
• Applications: User Interest Mining, Ad Targeting, Personalization, etc.
• We need
• A large-scale NLP system
• A pipeline-style architecture in which users can plug components in or out
• Abstraction or transparency of the algorithms used, as requested by the user
Page 8
Our Intelligent
Text Processing System
• Overview
• Pipelined Architecture
• Pluggable components
• Work Flow Manager
• Recovery Manager
• Job Manager
• Applications
• Large-scale applications using a scalable way of applying NLP models
Page 9
Overview
Page 10
Job Manager
•Creates a series of parallel and sequentially dependent jobs (takes a configuration
file)
•Example:
Jobs A, B, C, D, E and F
Job B depends on Job A; Job E depends on Job D
•Job Manager creates the following tree
•Jobs A, D and F are executed in parallel
•Jobs B and E are executed in parallel once their parent jobs have
completed.
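The scheduling idea on this slide can be sketched as grouping jobs into "waves", where every job in a wave has all of its parents finished and can therefore run in parallel. This is only an illustrative sketch, not the Job Manager's actual code: the function name `execution_waves` is hypothetical, and because the slide does not say where Job C sits in the tree, the example assumes (purely for illustration) that C depends on B.

```python
def execution_waves(jobs, depends_on):
    """Group jobs into waves; all jobs in one wave can run in parallel."""
    remaining = {job: set(depends_on.get(job, ())) for job in jobs}
    done = set()
    waves = []
    while remaining:
        # A job is ready once every parent it depends on has completed.
        ready = sorted(job for job, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies: %s" % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves

JOBS = ["A", "B", "C", "D", "E", "F"]
# B -> A and E -> D come from the slide; C -> B is an assumption for this sketch.
DEPENDS_ON = {"B": ["A"], "E": ["D"], "C": ["B"]}

print(execution_waves(JOBS, DEPENDS_ON))
# [['A', 'D', 'F'], ['B', 'E'], ['C']]
```

Under these assumptions the first wave matches the slide (A, D and F together), followed by B and E. A wave-based grouping is simpler than the slide's wording implies: a real scheduler could start B as soon as A finishes, without waiting for D.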
Page 11
Recovery Manager
•Each job writes its configuration, start time and end time
(if completed) into the status file
•The Recovery Manager periodically checks the status file for updates
to see whether any job has failed; if so, it restarts the job by calling
the Job Manager with the required configuration
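One pass of the check described above might look like the following sketch. The slide does not specify the status file format or how a failure is recorded, so everything concrete here is an assumption: one JSON record per line, a hypothetical `state` field with the value `"failed"`, and the names `find_jobs_to_restart`, `recover` and `submit_job` (a stand-in for the call into the Job Manager).

```python
import json

def find_jobs_to_restart(status_lines):
    """Return the configurations of jobs whose most recent record says 'failed'."""
    latest = {}
    for line in status_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        latest[record["name"]] = record  # later records override earlier ones
    return [rec["config"] for rec in latest.values() if rec.get("state") == "failed"]

def recover(status_lines, submit_job):
    """Resubmit every failed job through submit_job and return what was restarted."""
    restarted = []
    for config in find_jobs_to_restart(status_lines):
        submit_job(config)  # hypothetical hook into the Job Manager
        restarted.append(config)
    return restarted

# Assumed status file content: configuration, start time, end time (if completed).
STATUS = [
    '{"name": "postagger", "config": {"jar": "postagger.jar"}, '
    '"start_time": 100, "end_time": 160, "state": "completed"}',
    '{"name": "nounphrase", "config": {"jar": "chunker.jar"}, '
    '"start_time": 161, "end_time": null, "state": "failed"}',
]

submitted = []
print(recover(STATUS, submitted.append))
```

The periodic part would be a timer or loop that rereads the status file and calls `recover` on each pass; it is left out here to keep the sketch short.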
Page 12
Sample Configuration
<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputLocation</output>
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputlocation</output>
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>
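To show how a configuration like this could be turned into a dependency list, here is a small sketch using Python's standard `xml.etree.ElementTree`. It is not the system's own loader: the function name `load_dependencies` is hypothetical, and the sketch assumes that `depends` holds either `none` or a single step name, since the slide shows no example with several parents.

```python
import xml.etree.ElementTree as ET

SAMPLE_CONFIG = """
<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputLocation</output>
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputlocation</output>
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>
"""

def load_dependencies(xml_text):
    """Return (job name, {step name: details}) parsed from the configuration."""
    root = ET.fromstring(xml_text.strip())
    steps = {}
    for step in root.findall("mapreduce"):
        depends = step.get("depends", "none")
        steps[step.get("name")] = {
            # Assumption: "none" means no parent, anything else is one step name.
            "depends_on": [] if depends == "none" else [depends],
            "jar": step.findtext("jar"),
            "main_class": step.findtext("mainClass"),
        }
    return root.get("name"), steps

job_name, steps = load_dependencies(SAMPLE_CONFIG)
for name, details in steps.items():
    print(job_name, name, details["depends_on"], details["jar"])
```

Here `postagger` has no parent and `nounphrase` waits for `postagger`, which is the kind of ordering the Job Manager needs to derive from the file.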
Page 13
Overview
Page 14
NLP Modeling Engine
Page 15
Detailed
Page 16
Applications
Page 17
Application 1 - Location Aware Contextual Advertising - Example
Page 18
Location Aware Contextual Advertising - Overview
Page 19
Application 2 - User Aware Ad Targeting - Example
This is an illustrative example and does not represent any real user
Page 20
User Aware Ad Targeting
Page 21
Conclusions
• Pipelined Architecture
• NLP System
• Large Scale Applications
• Location Aware Contextual Ad Targeting
• User Aware Ad Targeting
Page 22
Future Work
• Developing distributed algorithms for
• POS Tagger
• Sentiment Analyzer models
• Exploring whether it would be useful to integrate with an
open source distributed ML/TM framework
Page 23
References
1. Part-of-Speech Tagging: en.wikipedia.org/wiki/Part-of-speech_tagging
2. Coreference Resolution: en.wikipedia.org/wiki/Coreference
3. Named Entity Recognition: en.wikipedia.org/wiki/Named_entity_recognition
4. Sentiment Analysis: en.wikipedia.org/wiki/Sentiment_analysis
5. OpenNLP: http://opennlp.sourceforge.net/
6. LingPipe: http://alias-i.com/lingpipe/
7. GATE: http://gate.ac.uk/ie/
8. NLTK: www.nltk.org
Page 24
Q & A
Thank You 
