2. Goals of information extraction “Processing of natural language texts for the extraction of relevant content pieces” (Martí and Castellón, 2000) – Raw texts => structured databases – Template filling – Improving search engines – Auxiliary tool for other language applications
3. Named Entity Recognition Named entities are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities. NER is the task of processing a text and identifying the named entities it contains.
4. Why is Named Entity Recognition difficult? – Names are too numerous to include in dictionaries – Variation: e.g. John Smith, Mr Smith, John – Constant change: new names introduce unknown words – Ambiguity: for some proper nouns it is hard to determine the category
5. Example Delimit the named entities in a text and tag them with NE categories: – entity names - ENAMEX – temporal expressions - TIMEX – number expressions - NUMEX Subcategories of tags are captured by an SGML tag attribute called TYPE
6. Example • Original text: The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million • Tagged text: The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew <NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during <TIMEX TYPE="DATE">the past year</TIMEX> to 5.35 million
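The tagging shown above can be sketched with a toy rule-based tagger. This is purely illustrative (real MUC systems used far richer machinery); the pattern list and the `tag` helper are assumptions made for this example.

```python
import re

# Toy, hand-picked patterns (an assumption for illustration, not a real
# MUC system): each pattern is wrapped in a MUC-style SGML tag.
PATTERNS = [
    (r"\b\d+(?:\.\d+)?\s+percent\b", '<NUMEX TYPE="PERCENT">{}</NUMEX>'),
    (r"\bthe past year\b",           '<TIMEX TYPE="DATE">{}</TIMEX>'),
    (r"\bU\.K\.",                    '<ENAMEX TYPE="LOCATION">{}</ENAMEX>'),
]

def tag(text):
    """Wrap every pattern match in its SGML tag template."""
    for pat, template in PATTERNS:
        text = re.sub(pat, lambda m: template.format(m.group(0)), text)
    return text

sentence = ("The U.K. satellite television broadcaster said its subscriber "
            "base grew 17.5 percent during the past year to 5.35 million")
```

Calling `tag(sentence)` reproduces the tagged text on this slide; the hard part of NER is precisely that such hand-written patterns do not scale.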
7. Maximum Entropy for NER Use the probability distribution that has maximum entropy, i.e. is maximally uncertain, among those consistent with the observed evidence • P = {models consistent with evidence} • H(p) = entropy of p • p_ME = argmax_{p ∈ P} H(p)
8. Maximum Entropy for NER – Given a set of answer candidates – Model the probability of each candidate – Define feature functions – Apply a decision rule to pick the most probable answer
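The steps above can be sketched as a minimal maximum-entropy classifier: p(y|x) ∝ exp(Σᵢ λᵢ fᵢ(x, y)), with the decision rule picking the most probable label. The feature functions, weights, and label set here are toy assumptions (the weights would normally be learned from data).

```python
import math

# Hypothetical binary feature functions f_i(x, y): fire (1.0) when a
# property of token x co-occurs with candidate label y.
def f_capitalized_person(x, y):
    return 1.0 if x[0].isupper() and y == "PERSON" else 0.0

def f_has_digit_numex(x, y):
    return 1.0 if any(c.isdigit() for c in x) and y == "NUMEX" else 0.0

def f_lowercase_other(x, y):
    return 1.0 if x.islower() and y == "O" else 0.0

FEATURES = [f_capitalized_person, f_has_digit_numex, f_lowercase_other]
WEIGHTS = [1.2, 1.5, 0.8]          # lambdas, assumed already learned
LABELS = ["PERSON", "NUMEX", "O"]  # toy label set

def p_label_given_token(x):
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(WEIGHTS, FEATURES)))
              for y in LABELS}
    z = sum(scores.values())  # normalizer Z(x)
    return {y: s / z for y, s in scores.items()}

def decide(x):
    """Decision rule: choose the label with highest probability."""
    dist = p_label_given_token(x)
    return max(dist, key=dist.get)
```

For instance, `decide("Smith")` yields "PERSON" and `decide("17.5")` yields "NUMEX" under these toy weights.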
9. Template Filling A template is a frame (a record-like structure) consisting of slots and fillers; it denotes an event or a semantic concept. After extracting NEs, relations and events, IE fills in an appropriate template.
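A template can be sketched as a dictionary of slots awaiting fillers, populated from extracted NEs. The slot names, entity types, and mapping below are assumptions for illustration, not a real MUC template definition.

```python
# A template is a frame of named slots; fillers come from extracted NEs.
# Slot names here are illustrative assumptions.
growth_template = {
    "location": None,
    "growth":   None,
    "period":   None,
}

# Pretend output of NER on the earlier example sentence.
extracted = {
    "LOCATION": "U.K.",
    "PERCENT":  "17.5 percent",
    "DATE":     "the past year",
}

# Assumed mapping from NE type to template slot.
SLOT_FOR_TYPE = {"LOCATION": "location", "PERCENT": "growth", "DATE": "period"}

def fill(template, entities):
    """Fill each slot whose NE type has a mapping; leave the rest empty."""
    filled = dict(template)
    for etype, value in entities.items():
        slot = SLOT_FOR_TYPE.get(etype)
        if slot:
            filled[slot] = value
    return filled
```

Here `fill(growth_template, extracted)` puts "17.5 percent" in the `growth` slot, "U.K." in `location`, and "the past year" in `period`.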
10. Template filling techniques Two common approaches for template filling: – Statistical approach – Finite-state cascade approach
11. Statistical Approach Again, using a sequence labeling method: – Label sequences of tokens as potential fillers for a particular slot – Train a separate sequence classifier for each slot – Fill slots with the text segments identified by each slot’s corresponding classifier
12. Statistical Approach – Resolve multiple labels assigned to the same or overlapping text segments by adding weights (heuristic confidences) to the slots – State-of-the-art performance: F1 scores of 75 to 98 However, these methods have been shown to be effective only on small, homogeneous data
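The overlap-resolution step can be sketched as a greedy pass over candidate fills sorted by confidence. The candidate tuples and the greedy strategy are assumptions for illustration; real systems may weight and resolve differently.

```python
# Hypothetical candidate fills: (slot, start, end, confidence), as might be
# produced by per-slot sequence classifiers over the same text.
candidates = [
    ("amount",  10, 12, 0.90),
    ("date",    11, 13, 0.60),  # overlaps the "amount" span
    ("company",  0,  2, 0.85),
]

def overlaps(a, b):
    """True if the [start, end) spans of two candidate fills intersect."""
    return a[1] < b[2] and b[1] < a[2]

def resolve(cands):
    """Greedy resolution: keep highest-confidence fills, drop overlaps."""
    kept = []
    for c in sorted(cands, key=lambda c: -c[3]):
        if not any(overlaps(c, k) for k in kept):
            kept.append(c)
    return kept
```

On the example list, `resolve(candidates)` keeps the "amount" and "company" fills and discards the lower-confidence, overlapping "date" fill.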
13. Finite-State Template-Filling Systems The Message Understanding Conferences (MUC) – the genesis of IE. DARPA funded significant efforts in IE in the early to mid 1990s. MUC was an annual event/competition where results were presented.
14. Finite-State Template-Filling Systems – Focused on extracting information from news articles: • Terrorist events (MUC-4, 1992) • Industrial joint ventures (MUC-5, 1993) • Company management changes – Information extraction was of particular interest to the intelligence community (CIA, NSA). (Note: early 1990s)
15. Applications IE has a wide range of applications: – Search engines – The biomedical field – Customer profile analysis – Trend analysis – Information filtering and routing – Classification of news stories for event tracking
16. Conclusion In this presentation we covered: – Goals of information extraction – Entity extraction: the Maximum Entropy method – Template filling – Applications
17. Visit more self-help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free and self-guiding and does not involve any additional support. Visit us at www.dataminingtools.net