5. Page 5
Introduction (Continued…)
• Information Extraction - extracting information from text
• Part-of-Speech Analysis
Ex: BlackBeauty<noun> is<verb> a<det> pretty<adjective> horse<noun>
• Named Entity Extraction
Ex: The CEO <Person>Mr. A</Person> of <Location>New York</Location> based Firm
<Organization>Foo.Inc</Organization> announced its new Product
<date>today</date>
• Sentiment Analysis
Ex: Watch this film. AVATAR is an achievement in many technical departments. It is a
beautiful experience
• Sentence Detection
Ex: <Start Sentence>BlackBeauty is a pretty horse <End Sentence>
• Some Tools: OpenNLP[5], LingPipe[6], GATE[7], NLTK[8], etc.
• Categorization/Classification - categorize items into one of the predefined
classes
Ex: An article talking about some baseball match is a “Sports” article.
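To make the categorization idea concrete, here is a minimal Python sketch that assigns a text to one of a few predefined classes by counting keyword overlaps. The class names, keyword sets, and the `categorize` function are hypothetical and chosen only for illustration. This toy rule is not the method used in the presented system, and real categorizers are usually trained machine learning models.

```python
import re

# Hypothetical keyword lists, one per predefined class (illustration only).
KEYWORDS = {
    "Sports": {"baseball", "match", "team", "score"},
    "Finance": {"stock", "market", "earnings", "bank"},
}


def categorize(text, keywords=KEYWORDS, default="Other"):
    """Return the class whose keyword set overlaps most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Score each class by how many of its keywords appear in the text.
    scores = {label: len(words & vocab) for label, vocab in keywords.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default class when no keyword matched at all.
    return best if scores[best] > 0 else default
```

With these keyword sets, a sentence that mentions a baseball match is labelled "Sports", while a text containing none of the keywords falls into the default class.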
6. Page 6
Introduction (Continued…)
• Challenges
• Processing large amounts of data
• Most approaches use machine learning methods
• These need to be trained on large amounts of data
• Need a way to perform the computations in a scalable manner
• Domain Dependency
7. Page 7
Problem Statement
• What do we want to do?
• Build large-scale applications (processing text)
• Why is this useful?
• Analyze the large volume of content available at AOL
• Applications: User Interest Mining, Ad Targeting, Personalization, etc.
• We need
• A large-scale NLP system
• A pipeline-style architecture in which users can plug components
in or out
• Abstraction or transparency of the algorithms used, as requested by the user
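The pipeline requirement above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Pipeline` class, its `plug_in` and `plug_out` methods, and the two toy components are hypothetical names, not the interfaces of the presented system. Each component is a function that takes a document dictionary and returns an enriched copy.

```python
import re


class Pipeline:
    """Ordered list of named components that users can plug in or out."""

    def __init__(self):
        self._stages = []  # list of (name, component) pairs, in run order

    def plug_in(self, name, component):
        self._stages.append((name, component))
        return self  # allow chained calls

    def plug_out(self, name):
        self._stages = [(n, c) for n, c in self._stages if n != name]
        return self

    def run(self, document):
        # Pass the document through every stage in order.
        for _, component in self._stages:
            document = component(document)
        return document


def sentence_detector(doc):
    """Toy component: naive split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    return {**doc, "sentences": [p.strip() for p in parts if p.strip()]}


def token_counter(doc):
    """Toy component: count whitespace-separated tokens."""
    return {**doc, "token_count": len(doc["text"].split())}
```

A caller would build a pipeline with `Pipeline().plug_in("sentences", sentence_detector).plug_in("tokens", token_counter)` and later drop a stage with `plug_out("tokens")`. The caller does not need to know how any stage is implemented, which is one way to read the abstraction requirement.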
8. Page 8
Our Intelligent
Text Processing System
• Overview
• Pipelined Architecture
• Pluggable components
• Work Flow Manager
• Recovery Manager
• Job Manager
• Applications
• Large-scale applications that apply NLP models in a scalable way
10. Page 10
Job Manager
• Creates a series of parallel and sequentially dependent jobs (takes a configuration
file)
• Example:
Jobs A, B, C, D, E and F
Job B depends on Job A; Job E depends on Job D
• Job manager creates the following tree
• Jobs A, D and F are executed in parallel
• Jobs B and E are executed in parallel once their parent jobs
complete.
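The scheduling idea in this example can be sketched in Python as follows. The function names `execution_waves` and `run_all` are hypothetical, and the sketch is not the system's actual Job Manager. Job C is left out because the text of the slide does not state where it sits in the tree.

```python
from concurrent.futures import ThreadPoolExecutor


def execution_waves(jobs, depends_on):
    """Group jobs into waves; each job's parents are all in earlier waves."""
    remaining = set(jobs)
    done = set()
    waves = []
    while remaining:
        # A job is ready when every parent it depends on has completed.
        ready = sorted(j for j in remaining if depends_on.get(j, set()) <= done)
        if not ready:
            raise ValueError("circular or unsatisfiable dependency")
        waves.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return waves


def run_all(jobs, depends_on, run_job):
    """Run each wave in parallel threads; finish a wave before starting the next."""
    results = {}
    with ThreadPoolExecutor() as pool:
        for wave in execution_waves(jobs, depends_on):
            for job, result in zip(wave, pool.map(run_job, wave)):
                results[job] = result
    return results


# Dependencies as stated on the slide (Job C omitted, see the note above).
JOBS = ["A", "B", "D", "E", "F"]
DEPENDS_ON = {"B": {"A"}, "E": {"D"}}
```

For these inputs, `execution_waves(JOBS, DEPENDS_ON)` yields a first wave of A, D and F and a second wave of B and E. Running in waves is a simplification: B waits for the whole first wave instead of only for A, so a scheduler that starts each job as soon as its own parents finish would be less conservative.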
11. Page 11
Recovery Manager
• Each job writes its configuration, start time, and end time
(if completed) into the status file
• Periodically checks the status file for updates to see whether
any job has failed; if so, it restarts the job by calling the Job
Manager with the required configuration
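A rough Python sketch of this status-file mechanism is shown below. Several details are assumptions and do not come from the slides: the status file is stored as JSON, a job counts as failed when it has no end time and started more than `timeout_seconds` ago, and `restart_job` is a hypothetical callback standing in for the call to the Job Manager.

```python
import json
import time
from pathlib import Path


def write_status(status_file, job_name, config, start_time, end_time=None):
    """Record a job's configuration and timing in the shared status file."""
    path = Path(status_file)
    status = json.loads(path.read_text()) if path.exists() else {}
    status[job_name] = {
        "config": config,
        "start_time": start_time,
        "end_time": end_time,  # stays None until the job completes
    }
    path.write_text(json.dumps(status, indent=2))


def find_failed_jobs(status_file, timeout_seconds, now=None):
    """Return {job_name: config} for jobs presumed failed.

    Assumption: a job with no end time that started more than
    timeout_seconds ago is treated as failed.
    """
    now = time.time() if now is None else now
    path = Path(status_file)
    if not path.exists():
        return {}
    status = json.loads(path.read_text())
    return {
        name: entry["config"]
        for name, entry in status.items()
        if entry["end_time"] is None and now - entry["start_time"] > timeout_seconds
    }


def recover_once(status_file, restart_job, timeout_seconds, now=None):
    """One polling pass: restart each presumed-failed job with its saved config."""
    restarted = []
    failed = find_failed_jobs(status_file, timeout_seconds, now)
    for name, config in sorted(failed.items()):
        restart_job(name, config)  # stand-in for calling the Job Manager
        restarted.append(name)
    return restarted
```

The periodic check would then be a loop that calls `recover_once` and sleeps between passes. A real deployment would also need to handle concurrent writers to the status file, which this sketch ignores.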
21. Page 21
Conclusions
• Pipelined Architecture
• NLP System
• Large-Scale Applications
• Location-aware Contextual Ad Targeting
• User-aware Ad Targeting
22. Page 22
Future Work
• Developing distributed algorithms for
• POS Tagger
• Sentiment Analyzer models
• Exploring whether it would be useful to integrate with an
open source distributed ML/TM framework