The Role of Natural Language Processing in Information Retrieval
1. The Role of Natural Language Processing in Information Retrieval Searching for Meaning in Text Tony Russell-Rose, PhD 21-Mar-2011
2. Contents Introduction & Overview Background, Terminology NLP Fundamentals Paradigms & Perspectives NLP Tools & Techniques NLP Applications Text mining, Question Answering, etc. Conclusions & Further Reading 21-Mar-2011 2 NLP & IR
3. Introduction & Overview The Role of Natural Language Processing in Information Retrieval 21-Mar-2011 3 NLP & IR
4. Information Retrieval Models, techniques & frameworks for satisfying an information need Common paradigm: document retrieval 21-Mar-2011 4 NLP & IR
5. Document Retrieval Methods include Boolean, vector space, probabilistic Rely on index terms “bag of words” Stop lists + stemming But text is “unstructured” Information may be “hidden” 21-Mar-2011 5 NLP & IR
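A minimal sketch (not from the deck) of the "bag of words" document retrieval described above, with an illustrative stop list and crude suffix stripping standing in for a real stemmer:

```python
from collections import defaultdict

STOP_WORDS = {"the", "on", "a", "of", "and", "is"}   # illustrative stop list

def terms(text):
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        # naive "stemming": strip a trailing -s (a real system would use Porter)
        yield token[:-1] if token.endswith("s") else token

def build_index(docs):
    index = defaultdict(set)          # index term -> set of document ids
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index

docs = {1: "The cat sat on the mat", 2: "Cats and dogs"}
index = build_index(docs)
print(index["cat"])                   # {1, 2}: both documents match "cat"
```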
6. NLP – a solved problem? As humans we do it effortlessly ... don’t we? DRUNK GETS NINE YEARS IN VIOLIN CASE PROSTITUTES APPEAL TO POPE STOLEN PAINTING FOUND BY TREE RED TAPE HOLDS UP NEW BRIDGE DEER KILL 300,000 RESIDENTS CAN DROP OFF TREES INCLUDE CHILDREN WHEN BAKING COOKIES MINERS REFUSE TO WORK AFTER DEATH 21-Mar-2011 6 NLP & IR
7. Problems with Text (1) Polysemy One word maps to many concepts e.g. bat Synonymy One concept maps to many words 21-Mar-2011 7 NLP & IR
8. Problems with Text (2) Word order Venetian blind vs. Blind venetian 21-Mar-2011 8 NLP & IR
9. Problems with Text (3) Language is generative Starbucks coffee is the best The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere Many different ways to express a given idea Synonymy, paraphrase, metaphor, etc. 21-Mar-2011 9 NLP & IR
10.
11. Problems with Text (4) The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them Language is a form of communication All communication has a context: time and place of utterance, the writer, the reader, their background knowledge, intentions, assumptions re the reader’s knowledge/intentions, etc. 21-Mar-2011 10 NLP & IR
12. Problems with Text (5) Language is changing I want to buy a mobile Ill-formed input “accomodation office” Co-ordination, negation, etc. This is not a talk about neuro-linguistic programming Multi-linguality Claudia Schiffer is on the cover of Elle Also sarcasm, irony, slang, jargon, etc. That was a wicked lecture Yep – the coffee break was the best part 21-Mar-2011 11 NLP & IR
13. Enter NLP / Text Analytics… Text Analytics: a set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents NLP: the academic term for Text Analytics analogous to “search” vs. “IR” Text Analytics ~= Natural Language Processing ~= Text Mining Text Mining -> Scientific / technical context, automated processing Text Analytics -> Business context, interactive apps 21-Mar-2011 12 NLP & IR
14. NLP Fundamentals The Role of Natural Language Processing in Information Retrieval 21-Mar-2011 13 NLP & IR
15. NLP Perspectives Computational Linguistics Use of computational techniques to study linguistic phenomena Cognitive science Study of human information processing (perception, language, reasoning, etc.) Computer science Theoretical foundations of computation and practical techniques for implementation Information science Analysis, classification, manipulation, retrieval and dissemination of information 21-Mar-2011 14 NLP & IR
16. NLP Paradigms Symbolic approaches Rule-based, hand coded (by linguists) Knowledge-intensive Statistical approaches Shallow statistical models, trained on annotated corpora Data-intensive 21-Mar-2011 15 NLP & IR
17. NLP Fundamentals (word level) Language is AMBIGUOUS To find structure, we must remove ambiguity! Lexical analysis (tokenisation) The cat sat on the mat I can’t tokenise this sentence Stop word removal No definitive list The Who, The The, Take That… To be or not to be 21-Mar-2011 16 NLP & IR
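A small sketch, assuming NLTK is installed (with its 'punkt' tokenizer data downloaded), showing that even tokenisation involves decisions such as how to split "can't":

```python
import nltk

sentence = "I can't tokenise this sentence"
print(nltk.word_tokenize(sentence))
# ['I', 'ca', "n't", 'tokenise', 'this', 'sentence'] -- "can't" becomes two tokens
```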
18. NLP Fundamentals (word level) Stemming fishing, fished, fish, fisher -> fish Lemmatization Stemming + context Passing -> pass + ING Were -> be + PAST Delegate = de-leg-ate (?) Ratify = rat-ify (?) Morphology (prefixes, suffixes, etc.) Gebäudereinigungsfirmenangestellter -> Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee) 21-Mar-2011 17 NLP & IR
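A sketch contrasting a stemmer with a lemmatizer, assuming NLTK and its WordNet data are installed:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["fishing", "fished", "fishes"]:
    print(word, "->", stemmer.stem(word))        # all reduce to the stem "fish"

# Lemmatization uses context (here the part of speech) to recover the base form.
print(lemmatizer.lemmatize("passing", pos="v"))  # pass
print(lemmatizer.lemmatize("were", pos="v"))     # be
```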
19. NLP Fundamentals (sentence level) Syntax (part of speech tagging) book -> NOUN, VERB that -> DETERMINER flight -> NOUN Book that flight -> VB DT NN Ambiguity problem Time flies like an arrow -> ? Fruit flies like a banana -> ? Eats shoots and leaves -> ? 21-Mar-2011 18 NLP & IR
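A sketch of part-of-speech tagging with NLTK's off-the-shelf tagger (assuming the 'averaged_perceptron_tagger' data is downloaded); the tagger has to commit to one reading of an ambiguous sentence:

```python
import nltk

tokens = nltk.word_tokenize("Time flies like an arrow")
print(nltk.pos_tag(tokens))
# e.g. [('Time', 'NNP'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]
# "flies" could equally be a plural noun (cf. "fruit flies") -- the tagger picks one reading.
```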
20. NLP Fundamentals (sentence level) Parsing (grammar) I saw a venetian blind I saw a blind venetian I saw the man on the hill with a telescope Rugby is a game played by men with odd-shaped balls Sentence boundary detection Punctuation denotes the end of a sentence! “But not always!”, said Fred... 21-Mar-2011 19 NLP & IR
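A sketch of sentence boundary detection with NLTK's pre-trained punkt model (assumed downloaded); abbreviations such as "Dr." show why splitting on every full stop is not enough:

```python
import nltk

text = 'Punctuation denotes the end of a sentence! "But not always!", said Dr. Fred.'
print(nltk.sent_tokenize(text))
```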
21. NLP Tools & Techniques The Role of Natural Language Processing in Information Retrieval 21-Mar-2011 20 NLP & IR
22. Early NLP Systems ELIZA Weizenbaum 1966 Simple pattern matching PARRY Colby et al 1971 Pattern matching & planning SHRDLU Winograd 1972 Natural language understanding Comprehensive grammar of English 21-Mar-2011 21 NLP & IR
23. Word Prediction Assistive technologies Aurora Systems, TextHelp Google, Bing, Yahoo query suggestions 21-Mar-2011 22 NLP & IR
25. Text Categorization News agencies: classify incoming news stories Search engines: classify queries, e.g. search Google for ‘LOG313’ Identifying spam emails, e.g. http://www.paulgraham.com/spam.html Routing email or documents to appropriate people 21-Mar-2011 24 NLP & IR
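A minimal sketch, in the spirit of the Bayesian spam filtering referenced above, of bag-of-words text categorization with NLTK's Naive Bayes classifier; the training examples are invented and far too small for real use:

```python
import nltk

def features(text):
    return {word.lower(): True for word in text.split()}

train = [
    (features("win cash prizes now"), "spam"),
    (features("cheap meds online"), "spam"),
    (features("meeting agenda attached"), "ham"),
    (features("lunch tomorrow?"), "ham"),
]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("win a cheap prize")))   # likely "spam"
```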
26. Terminology Extraction Differentiate between useful index terms and ‘noise’ Help lexicographers identify new terminology: the C4 carbons of the nicotinamide rings X-ray therapy vs. X-ray therapies PAHO vs. Pan-American Health Organization Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list 21-Mar-2011 25 NLP & IR
27. Speech Recognition Spoken Dialogue Systems Directory enquiries (AT&T) Railways (Germany, Switzerland, Italy) Airlines (USA) iPhone Voice Search IBM Watson 21-Mar-2011 26 NLP & IR
28. Named Entity Recognition Identification of key concepts, e.g. people, places, organisations, etc. Also postcodes, temporal/numerical expressions, etc. “Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York. 21-Mar-2011 27 NLP & IR
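A sketch of named entity recognition with NLTK's bundled chunker (assuming the 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words' data are downloaded), applied to a fragment of the quote above:

```python
import nltk

sentence = "Matthew Hickman of Lehman Brothers in New York"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees():
    if subtree.label() != "S":                       # skip the sentence root
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# Typical output: PERSON Matthew Hickman / ORGANIZATION Lehman Brothers / GPE New York
```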
29. Named Entity Recognition: Applications Increase precision of IR New companies in York vs. Companies in New York Support navigation SMART tags, Apture, etc. Improve machine translation: “I live in a new house in new york” Speech synthesis, auto-summarisation, etc. 21-Mar-2011 28 NLP & IR
30. Named Entity Recognition Systems are usually developed for a specific domain and are typically language- and domain-specific Accuracy > 90% for newswire text Best system at MUC-7 scored 93.39% f-measure; human annotators scored 97.60% and 96.95% Lower for certain domains, e.g. life sciences (species, substances, proteins, genes, etc.): extended naming conventions, extensive use of conjunction and disjunction, high occurrence of neologisms and abbreviations, etc. 21-Mar-2011 29 NLP & IR
34. Summarisation single-document vs. multi-document MS Word Bing Example summary output: “Section 3 discusses two information access applications (text mining and question answering) closely associated with NLP. NLP Techniques Named entity recognition Information Extraction Current document retrieval technologies could not identify information as specific as this within text. Word Sense Disambiguation Text Mining Question Answering “Wh” questions List questions Examples include those mentioned here: text mining, question answering and cross-language information retrieval. Ponte and Croft (1998) “A Language Modeling Approach to Information Retrieval” SIGIR. Sanderson, M. (1994) “Word sense disambiguation and information retrieval” Proceedings of the 17th ACM SIGIR Conference. Sanderson, M. (2000) “Retrieving with Good Sense” Information Retrieval 2(1):49-69. Question Answering (Section 3.2).” 21-Mar-2011 33 NLP & IR
35.
36. Information Extraction Based on pre-defined structures, e.g. movements of company executives, victims of terrorist attacks, corporate mergers / acquisitions, interactions between genes and proteins Example (from MUC-6): Now, Mr. James is preparing to sail into the sunset, and Mr. Dooner is poised to rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45. 21-Mar-2011 34 NLP & IR
37. Information Extraction <SUCCESSION_EVENT-2> := SUCCESSION_ORG: <ORGANIZATION-1> POST: "chairman" IN_AND_OUT: <IN_AND_OUT-4> VACANCY_REASON: DEPART_WORKFORCE <IN_AND_OUT-4> := IO_PERSON: <PERSON-1> NEW_STATUS: IN ON_THE_JOB: NO OTHER_ORG: <ORGANIZATION-1> REL_OTHER_ORG: SAME_ORG <ORGANIZATION-1> := ORG_NAME: "McCann-Erickson" ORG_ALIAS: "McCann" ORG_TYPE: COMPANY <PERSON-1> := PER_NAME: "John J. Dooner Jr." PER_ALIAS: "John Dooner" "Dooner" 21-Mar-2011 35 NLP & IR
38. Information Extraction Can now use as metadata (for retrieval) Or store in DB and query against it, e.g. “show me all the events where a finance officer left a company to take up the position of CEO in another company” 21-Mar-2011 36 NLP & IR
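A minimal sketch (with an invented schema) of storing extracted succession events in a relational table and querying them, as suggested above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE succession_event
                (person TEXT, org TEXT, post TEXT, new_status TEXT)""")
conn.executemany(
    "INSERT INTO succession_event VALUES (?, ?, ?, ?)",
    [("John J. Dooner Jr.", "McCann-Erickson", "chairman", "IN"),
     ("Mr. James", "McCann-Erickson", "chief executive officer", "OUT")],
)
# e.g. "show me everyone who took up the post of chairman"
for row in conn.execute(
        "SELECT person, org FROM succession_event "
        "WHERE post = 'chairman' AND new_status = 'IN'"):
    print(row)
```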
42. Information Extraction But IE is difficult when concepts are spread across multiple sentences: Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of “self-dealing” and “undisclosed financial relationships.” The executives are Don H. Pace, president and CEO; and Greg S. Kaplan, SVP and CFO. The event & organisation are in the first sentence, but the names are in the second Need to identify the connection between “two top executives” and “the executives”. 21-Mar-2011 40 NLP & IR
43. Information Extraction Anaphora resolution relies on context (semantic / pragmatic knowledge): We gave the bananas to the monkeys because they were hungry. We gave the bananas to the monkeys because they were ripe. We gave the bananas to the monkeys because they were here. 21-Mar-2011 41 NLP & IR
44. WordNet Semantic database for English See also EuroWordNet Based on synsets car = automobile | railway car | elevator car | cable car 1st synset = car, motorcar, machine, auto, automobile Connected via relationships e.g. hypernymy (a kind of) Separate hierarchies for nouns, verbs, adjectives and adverbs 21-Mar-2011 42 NLP & IR
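A sketch of exploring synsets and hypernyms through NLTK's WordNet interface (assuming the 'wordnet' data is downloaded):

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("car")[:3]:
    print(synset.name(), synset.lemma_names())
# The first synset groups car, auto, automobile, machine, motorcar.

print(wn.synsets("car")[0].hypernyms())   # e.g. [Synset('motor_vehicle.n.01')]
```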
45. Word Sense Disambiguation Resolve ambiguities due to polysemy Often characterised by syntax, e.g. Magnesium is a light metal (ADJ) The light in the kitchen is quite bright (NOUN) Can use a Part of Speech Tagger to resolve broad ambiguities PoS tags = NN, NNP, NNS, NNPS, etc. 21-Mar-2011 43 NLP & IR
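A sketch of word sense disambiguation using NLTK's simplified Lesk implementation; Lesk is one possible technique, not one the deck prescribes, and the WordNet data is assumed to be downloaded:

```python
from nltk.wsd import lesk

sent1 = "Magnesium is a light metal".split()
sent2 = "The light in the kitchen is quite bright".split()
print(lesk(sent1, "light", pos="a"))   # some adjective sense of "light"
print(lesk(sent2, "light", pos="n"))   # some noun sense of "light"
```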
46.
47. 22. to adjust (a timer, alarm of a clock, etc.) so as to sound when desired: He set the alarm for seven o'clock.
48. 23. to fix or mount (a gem or the like) in a frame or setting.
49. 24. to ornament or stud with gems or the like: a bracelet set with pearls.
50. 25. to cause to sit; seat: to set a child in a highchair.
70. 41. to cause (glue, mortar, or the like) to become fixed or hard.
71. 42. to urge, goad, or encourage to attack: to set the hounds on a trespasser.
72. 43. Bridge . to cause (the opposing partnership or their contract) to fall short: We set them two tricks at four spades. Only perfect defense could set four spades.
73. 44. to affix or apply, as by stamping: The king set his seal to the decree.
74. 45. to fix or engage (a fishhook) firmly into the jaws of a fish by pulling hard on the line once the fish has taken the bait.
75. 46. to sharpen or put a keen edge on (a blade, knife, razor, etc.) by honing or grinding.
76. 47. to fix the length, width, and shape of (yarn, fabric, etc.).
82. 53. to assume a fixed or rigid state, as the countenance or the muscles.
83. 54. (of the hair) to be placed temporarily on rollers, in clips, or the like, in order to assume a particular style: Long hair sets more easily than short hair.
84. 55. to become firm, solid, or permanent, as mortar, glue, cement, or a dye, due to drying or physical or chemical change.
87. 58. to begin to move; start (usually followed by forth, out, off, etc.).
88. 59. (of a flower's ovary) to develop into a fruit.
89. 60. (of a hunting dog) to indicate the position of game. 21-Mar-2011 44 NLP & IR
90.
91. 77. an apparatus for receiving radio or television programs; receiver.
92. 78. Philately . a group of stamps that form a complete series.
93. Tennis . a unit of a match, consisting of a group of not fewer than six games with a margin of at least two games between the winner and loser: He won the match in straight sets of 6–3, 6–4, 6–4.
94. 80. a construction representing a place or scene in which the action takes place in a stage, motion-picture, or television production.
95. 81. Machinery . a. the bending out of the points of alternate teeth of a saw in opposite directions.
96. b. a permanent deformation or displacement of an object or part.
97. c. a tool for giving a certain form to something, as a saw tooth.
98. 82. a chisel having a wide blade for dividing bricks.
99. 83. Horticulture . a young plant, or a slip, tuber, or the like, suitable for planting.
100. 84. Dance . a. the number of couples required to execute a quadrille or the like.
101. b. a series of movements or figures that make up a quadrille or the like.
102. 85. Music . a. a group of pieces played by a band, as in a night club, and followed by an intermission.
103. b. the period during which these pieces are played.
104. 86. Bridge . a failure to take the number of tricks specified by one's contract: Our being vulnerable made the set even more costly.
105. 87. Nautical . a. the direction of a wind, current, etc.
106. b. the form or arrangement of the sails, spars, etc., of a vessel.
121. –interjection 100. (in calling the start of a race): Ready! Set! Go! Also, get set! 21-Mar-2011 45 NLP & IR
122. Word Sense Disambiguation Surely PoS tagging would help IR performance? Could index docs using words AND PoS tags Very sensitive to tagger performance Does WSD improve IR performance? Conflicting evidence …“Perfect” WSD engine may provide only marginal gain, e.g. >76% accuracy needed for homonymy >55% for polysemy 21-Mar-2011 46 NLP & IR
123. NLP Applications The Role of Natural Language Processing in Information Retrieval 21-Mar-2011 47 NLP & IR
124. NLP Applications Web & enterprise search Text classification Text summarisation Machine translation Human-computer interfaces (spelling correction etc.) Speech recognition & synthesis Natural language generation (e.g. in games) Text mining Question answering 21-Mar-2011 48 NLP & IR
125. Finally becoming mainstream? ‘80% of corporate information is unstructured’ Text is the entire value chain for some organisations (media / publishing etc.) Retail / eCommerce: product reviews User-generated content: blogs, forums, wikis Voice of the Customer: social media + sentiment analysis 161 billion gigabytes of digital information in 2006; approximately 988 exabytes by 2010 Audio / video still needs summaries & tags etc. 21-Mar-2011 49 NLP & IR
126. Market Outlook Combine component technologies e.g. PoS tagging + NER, etc. Healthy growth: massive expansion in social media Lightweight NLP for buzz analysis, brand monitoring, etc. Dominant markets: Customer Experience (VoC), media/publishing, financial services & insurance, intelligence, life sciences, e-discovery Solutions still not standardized Need for self-service tuning & configuration Partner ecosystem developing Marketing services providers, platform vendors, CRM + call centre vendors, system integrators 21-Mar-2011 50 NLP & IR
127. Text Mining Analogy with Data Mining Discover or infer new knowledge from unstructured text resources A <-> B and B <-> C Infer A <-> C? e.g. link between migraine headaches and magnesium deficiency Applications in life sciences, media/publishing, counter terrorism and competitive intelligence 21-Mar-2011 51 NLP & IR
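A toy sketch of the A-B-C inference above (Swanson-style literature-based discovery) over co-occurrence pairs; the "extracted" pairs are invented for illustration:

```python
from collections import defaultdict

pairs = [("migraine", "serotonin"), ("serotonin", "magnesium"),
         ("migraine", "stress"), ("stress", "cortisol")]

links = defaultdict(set)
for a, b in pairs:
    links[a].add(b)
    links[b].add(a)

def candidate_links(term):
    direct = links[term]
    # terms reachable in two hops but not linked directly are new hypotheses
    return {c for b in direct for c in links[b]} - direct - {term}

print(candidate_links("migraine"))   # {'magnesium', 'cortisol'}
```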
128. Text Mining Levels of text mining, e.g. for bioscience: Tagging entities: e.g. genes, proteins, diseases and chemical compounds Information extraction: identify meaningful structural relations within the text (e.g. interactions between proteins) Knowledge discovery: combine with other knowledge sources (such as external ontologies) to infer or discover new knowledge 21-Mar-2011 52 NLP & IR
129. Question Answering Going beyond the document retrieval paradigm provide specific answers to specific questions Introduced as a TREC track in 1999 21-Mar-2011 53 NLP & IR
130. Question Answering Yes/no questions “Is George W. Bush the current president of the USA?” “Is the Sea of Tranquility deep?” “Wh” questions “Who was the British Prime Minister before Margaret Thatcher?” “When was the Battle of Hastings?” List questions “Which football teams have won the Champions League this decade?” “Which roads lead to Rome?” Instruction-based questions “How do I cook lasagne?”, “What is the best way to build a bridge?” Explanation questions “Why did World War I start?” “How does a computer process floating point numbers?” Commands “Tell me the height of the Eiffel Tower.” “Name all the Kings of England” 21-Mar-2011 54 NLP & IR
131. Question Answering Common use cases handled by major search engines Google, Yahoo!, Bing, Baidu, ... Specialist QA systems Wolfram Alpha: http://www.wolframalpha.com True Knowledge: http://www.trueknowledge.com Powerset: http://www.powerset.com/ IBM Watson 21-Mar-2011 55 NLP & IR
132. Question Answering Basic approach: (1) question analysis: predict the type of answer expected and create a query; (2) document retrieval: based on the output of (1); (3) answer extraction: varies from regex matching to deep linguistic analysis 21-Mar-2011 56 NLP & IR
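A skeleton of the three-stage pipeline described above; the helper functions are illustrative stubs, not a real QA system:

```python
import re

def analyse_question(question):
    # 1. Question analysis: predict the expected answer type and build a query.
    answer_type = "DATE" if question.lower().startswith("when") else "OTHER"
    query = [w for w in question.rstrip("?").split()
             if w.lower() not in {"when", "was", "the"}]
    return answer_type, query

def retrieve(query, documents):
    # 2. Document retrieval: keep documents sharing any query term (toy scoring).
    return [d for d in documents if any(term in d for term in query)]

def extract_answer(answer_type, passages):
    # 3. Answer extraction: here just a regex for years when a DATE is expected.
    if answer_type == "DATE":
        for passage in passages:
            match = re.search(r"\b1\d{3}\b|\b20\d{2}\b", passage)
            if match:
                return match.group()
    return None

docs = ["The Battle of Hastings was fought in 1066.", "Hastings is a town in England."]
atype, query = analyse_question("When was the Battle of Hastings?")
print(extract_answer(atype, retrieve(query, docs)))   # 1066
```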
133. Evaluation How do we measure NLP accuracy? Compare against human annotation However: Humans often disagree Particularly for complex tasks Comparison can be measured in many ways “Bill Gates is chairman of Microsoft” -> “Gates” = person? 21-Mar-2011 57 NLP & IR
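A small sketch of scoring system output against a human-annotated gold standard with precision, recall and F-measure; the toy spans echo the "Gates = person?" question above:

```python
def prf(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("Bill Gates", "PERSON"), ("Microsoft", "ORG")}
predicted = {("Gates", "PERSON"), ("Microsoft", "ORG")}
print(prf(gold, predicted))   # (0.5, 0.5, 0.5) -- should "Gates" alone count as correct?
```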
134. Conclusions The Role of Natural Language Processing in Information Retrieval 21-Mar-2011 58 NLP & IR
135. Conclusions NLP’s contribution to IR not always apparent Overemphasis on precision & recall? NLP provides benefits that go beyond quantitative metrics to support broader tasks & richer experiences Insight, analysis, visualisation, etc. NLP required for text mining, question answering & cross-language IR Bag of words model is insufficient Web becoming increasingly multi-lingual 21-Mar-2011 59 NLP & IR
136. State of the Art Commodity technologies: Tokenisation, normalization, regular expression search Mainstream, but room for improvement: Lemmatization, spelling correction, speech synthesis Mainstream but needs improvement: PoS tagging, term extraction, summarization, speech recognition, text categorization, spoken dialogue systems Almost there: Information extraction, NL generation, simple speech translation Don’t hold your breath: Full machine translation, advanced dialogue systems 21-Mar-2011 60 NLP & IR
137. Platforms & Toolkits Many commercial vendors Open source/access: GATE (Sheffield University) RapidMiner Stanford NLP OpenNLP OpenCalais LingPipe NLTK 21-Mar-2011 61 NLP & IR
138. Further Reading Goker, A. and Davies, J. Information Retrieval: Searching in the 21st Century. Wiley-Blackwell, 2009. Jurafsky, D. and Martin, J. Speech and Language Processing. Prentice-Hall, 2nd edition, 2008. Manning, C. and Schutze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999. 21-Mar-2011 62 NLP & IR
139. Thank you Tony Russell-Rose, PhD Vice-chair, BCS IRSG Chair, IEHF HCI Group Email: tgr@uxlabs.co.uk Blog: http://isquared.wordpress.com LinkedIn: http://www.linkedin.com/tonyrussellrose/ Twitter: @tonygrr
Editor's Notes
Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stop words. Try queries that are dominated by stop words, with and without quotes. How do they differ?
Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stemming. Try queries that are dominated by inflected forms, with and without quotes. How do they differ?
Do as class exercise
Demo: ELIZA http://nlp-addiction.com/eliza/
Demo Google suggest
Demo Auto-correct & DYM
Demo: try Google Translate (or Babelfish) with sentences containing ambiguous terms or NEs, e.g. “I live in a new house in new york”. Use the speech synthesizer: http://translate.google.com/#en|de|I%20live%20in%20a%20new%20house%20in%20new%20york
Demo Yahoo search assist + “explore concepts”
Paired exercise: what is the most polysemous English word you can think of? How many senses do you think it has in an average dictionary? Think of some polysemous words (examples include “ball”, “bank”, “port” but there are many others). Use these as queries in a search engine and comment on what you notice about the results.
Paired Exercise: try each of above queries in Google, Yahoo, Bing. Now try to answer the same questions with a normal web search (by entering a set of query terms instead of a natural language question). Are the results of this search any better or worse?
Paired exercise: try previous queries on these sites