Multimodal Information Extraction: Disease, Date and Location Retrieval
1. Multimodal Information Extraction:
Disease, DateTime, and Location Retrieval
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
Dr. William H. Hsu, Associate Professor of Computing and Information Sciences
Svitlana O. Volkova, Graduate Research Assistant
Timothy E. Weninger, Research Associate
Jing Xia, Graduate Research Assistant
Surya Teja Kallumadi, Graduate Research Assistant
Wesam S. Elshamy, Graduate Research Assistant
4. MAIN STEPS
Extend the basic document-level IE, temporal annotation, and spatial annotation components with more state-of-the-field analytical functions
Perform collection-level analysis and interactive visualization of timelines and maps
Assist the integrator (Elder Research, Inc.) in incorporating these into a single system
5. HOW CAN WE GET DATA?
Information Retrieval (IR) from the Web by crawling news, blogs, reports, etc.
[Diagram labels: WWW, EMAIL, LITERATURE; CRAWLER; DB; QUERY; DOCUMENTS COLLECTION]
6. DOCUMENTS COLLECTION
DOMAIN SPECIFIC KNOWLEDGE: medical ontology, containing names of diseases, viruses, animal species, etc., organized in a conceptual hierarchy.
DOMAIN INDEPENDENT KNOWLEDGE: location hierarchy, containing names of countries, states or provinces, cities, etc.; canonical date and time representation.
7. A TWO-LEVEL ANALYTICAL FRAMEWORK IN
THE DOMAIN OF EPIZOOTICS
Document Level Analysis
Web document content extraction:
Named entity recognition (NER)
Co-reference & association resolution, relation extraction
Geotagging: location extraction, map view
Temporal tagging: date/time extraction, timeline view
Event Identification <…what, where, when, …>
Collection Level Analysis
Semi-supervised Document Clustering & Linking by Finding Similarities by Keywords
Document Categorization as Topics Summarization Task (pLSA, LDA)
8. HIGH LEVEL SYSTEM’S ARCHITECTURE
[Architecture diagram. Components: Internet Browser (IE/Mozilla/…); Web Server; User Access and Query Control API (Java); Access Privilege; Data Search; Temporal Tagging: TimeLine View; Spatial Tagging: Map View; Event Detection; Deduplication; Data Store (MSSQL); Data Storage; IAAC Server]
Users: researchers, public health professionals, governmental health agencies, and other users
10. EXTENSION OF ENTITIES FOR MULTIMODAL
INFORMATION EXTRACTION SYSTEM
Stanford NER Entities
Person (e.g. “John Lenin”, “William K. Smith”)
Organization (e.g. “U.K. Department for Environment, Food and Rural Affairs”)
Location (e.g. “Europe”, “Canada”)
Miscellaneous (e.g. “African”, researcher etc.)
KDD Group’s NER Entities
Animal diseases (e.g. “rift valley fever”, “fmd”);
Date and time (e.g. “May 24 2001”, “last year”);
Location (e.g. “London, Great Britain”, “Manhattan, KS, USA”)
Animal Species (e.g. “cow”, “horse”, “mammals”)
Quantities (e.g. # of animals died, amount of money spent, $)
11. INFORMATION EXTRACTION TASK
Goal: Extract structured information with facts and entities related to events from unstructured/semistructured sources.
Result: The US saw its latest FMD outbreak in Montebello, California in 1929 where 3,600 animals were slaughtered.
Extracted from the DOCUMENTS COLLECTION: Animal Disease Names, Locations, Dates/Times, Quantities
12. NAMED ENTITY REPRESENTATION
FOR NER TASK
Disease: Multi-Faceted Quantitative Summary
Location: Map View
Date and time: Timeline View
Timeline View Example:
http://press.jrc.it/NewsExplorer/timelineedition/en/timeline.html
Map View Example:
http://www.healthmap.org/promed/en
13. DISEASE EXTRACTOR MODULE
INPUT AND OUTPUT
Input: Text from file
Disease Extractor Module
Output:
Index of the first character
Index of the last character
Length of the matched text
Matched Text
Canonical disease name
Disease Extraction Task
The task of disease recognition can be considered an NER/information extraction (IE) task. The main purpose is to retrieve tokens that match at least one term from the list of disease names.
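The input/output contract above can be illustrated with a minimal dictionary-matching sketch. It is not the module's actual implementation: the vocabulary, the function names, and the choice of regular-expression matching are all assumptions made for illustration.

```python
import re

# Hypothetical mini-vocabulary (surface form -> canonical disease name).
# A real vocabulary would be far larger; these entries are illustrative only.
VOCABULARY = {
    "foot and mouth disease": "foot-and-mouth disease",
    "foot-and-mouth disease": "foot-and-mouth disease",
    "fmd": "foot-and-mouth disease",
    "rift valley fever": "rift valley fever",
    "rvf": "rift valley fever",
}


def build_pattern(vocabulary):
    """Compile one case-insensitive pattern covering every vocabulary term."""
    # Longest terms first, so a longer name wins over a shorter one
    # that starts at the same position.
    terms = sorted(vocabulary, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def extract_diseases(text, vocabulary=VOCABULARY):
    """Return one record per match, with the five output fields of the slide."""
    pattern = build_pattern(vocabulary)
    results = []
    for match in pattern.finditer(text):
        matched = match.group(0)
        results.append({
            "start": match.start(),      # index of the first character
            "end": match.end() - 1,      # index of the last character (inclusive)
            "length": len(matched),      # length of the matched text
            "matched_text": matched,
            "canonical_name": vocabulary[matched.lower()],
        })
    return results


if __name__ == "__main__":
    sample = "The US saw its latest FMD outbreak in Montebello, California in 1929"
    for record in extract_diseases(sample):
        print(record)
```

Mapping several surface forms to one canonical name is what lets "FMD" and "foot and mouth disease" count as the same disease in later steps.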
15. RESULTS FOR DISEASE EXTRACTOR MODULE
INPUT A: Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals…
OUTPUT A: [figure]
INPUT B: Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease …
OUTPUT B: [figure]
16. VOCABULARY CONSTRUCTION FOR DISEASE
EXTRACTOR
1. Disease names and fact sheets from Iowa State University Center for Food
Security and Public Health (CFSPH):
http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm
2. World Organisation for Animal Health (OIE) Animal Disease Data:
http://www.oie.int/eng/maladies/en_alpha.htm
3. Department for Environment, Food and Rural Affairs, UK (DEFRA):
http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm
4. United States Department of Agriculture (USDA), Animal and Plant Health
Inspection Service
http://www.aphis.usda.gov/animal_health/animal_diseases/
5. MedlinePlus, a service of the National Library of Medicine and the National Institutes of Health:
http://www.nlm.nih.gov/medlineplus/animaldiseasesandyourhealth.html
6. Wikipedia:
http://en.wikipedia.org/wiki/Animal_diseases
17. RESULTS FOR DISEASE EXTRACTOR MODULE
ClearForest Gnosis Software: http://www.clearforest.com/
18. COMPARATIVE RESULTS FOR DISEASE EXTRACTORS:
KDD GROUP’S VS. GNOSIS
[Charts comparing Gnosis Soft. and the KDD Group's Disease Extractor:
Disease Extraction "FMD": quantities of extracted diseases vs. number of seeds
Disease Extraction "RVF": quantities of extracted diseases vs. number of seeds
Non-unique Animal Disease Extraction: non-unique extracted diseases vs. number of seeds]
19. COMPARATIVE RESULTS FOR UNIQUE DISEASE
EXTRACTORS: KDD GROUP’S VS. GNOSIS
[Charts comparing Gnosis Soft. and the KDD Group's Disease Extractor:
Unique Disease Extraction: extracted unique diseases vs. number of seeds
Random Permutation of Extracted Diseases: # of extracted animal diseases vs. run number]
21. FUTURE IMPROVEMENTS FOR DISEASE
EXTRACTOR MODULE
Intermediate Functionality
to add functionality for species extraction and construct the corresponding vocabulary;
to enrich the dictionary with animal diseases by species:
National Center for Infectious Diseases:
http://www.cdc.gov/healthypets/browse_by_animal.htm
United States Department of Agriculture (APHIS), Animal Health:
http://www.aphis.usda.gov/animal_health/animal_dis_spec/
to construct a disease ontology with Protégé software.
Advanced Functionality
to apply the “seed set expansion” approach to improve disease extraction.
23. DATE/TIME EXTRACTOR AND EVENT TAGGER MODULE
INPUT AND OUTPUT
Input: Text from file
Date Time Extractor
Output:
Disease Name
Event Trigger
Location
Canonical date/time
Temporal Extraction and Events Tagging Task
The main purpose is to extract temporal quantities associated with events from text, to identify events and the semantic relatedness between them, and to summarize them.
Extraction of temporal events involves identifying dates and times and the entities associated with these events.
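To make "canonical date/time" concrete, here is a minimal pattern-based sketch. It covers only three surface forms ("May 24 2001", "24 May 2001", and "last/this/next year"); the patterns, the ISO 8601 output format, and the function names are assumptions for illustration and do not describe the module's actual rules.

```python
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTHS, start=1)}
MONTH_RE = "|".join(MONTHS)

# "May 24 2001" or "May 24, 2001"
MDY = re.compile(r"\b(" + MONTH_RE + r")\s+(\d{1,2}),?\s+(\d{4})\b", re.IGNORECASE)
# "24 May 2001"
DMY = re.compile(r"\b(\d{1,2})\s+(" + MONTH_RE + r")\s+(\d{4})\b", re.IGNORECASE)
# Relative expressions such as "last year"
REL_YEAR = re.compile(r"\b(last|this|next)\s+year\b", re.IGNORECASE)
YEAR_OFFSET = {"last": -1, "this": 0, "next": 1}


def _iso(year, month, day):
    """Return an ISO 8601 date string, or None for impossible dates."""
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


def extract_dates(text, reference):
    """Return (start index, matched text, canonical form) tuples in text order.

    `reference` is the date against which relative expressions are resolved,
    for example the publication date of the document.
    """
    found = []
    for m in MDY.finditer(text):
        canonical = _iso(int(m.group(3)), MONTH_NUMBER[m.group(1).lower()], int(m.group(2)))
        if canonical:
            found.append((m.start(), m.group(0), canonical))
    for m in DMY.finditer(text):
        canonical = _iso(int(m.group(3)), MONTH_NUMBER[m.group(2).lower()], int(m.group(1)))
        if canonical:
            found.append((m.start(), m.group(0), canonical))
    for m in REL_YEAR.finditer(text):
        year = reference.year + YEAR_OFFSET[m.group(1).lower()]
        found.append((m.start(), m.group(0), str(year)))
    return sorted(found)


if __name__ == "__main__":
    sentence = "The outbreak was confirmed on May 24 2001, after a case last year."
    for hit in extract_dates(sentence, reference=date(2009, 6, 9)):
        print(hit)
```

Relative expressions are reduced to a bare year here because nothing more precise can be inferred from "last year" alone.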
24. COMPONENTS OF DATE/TIME EXTRACTOR
AND EVENT TAGGER MODULE
Date/Time Extractor
It is based on a quantities and units chunker
Standard Time data structure
Pattern-Based Event Extractor
It is built through analysis of the reports of disease outbreaks, e.g. “a report has been confirmed that …”
Named Entity Recognition Tool
It extracts Named Entities: Location, Person, Organization and Disease
Goal: Extracting facts and entity relations associated with events.
Disease outbreaks: disease, organisms, victim, symptoms, location, country, date, containment measures …
25. RESULTS FOR DATE/TIME EXTRACTOR AND
EVENT TAGGER MODULE
iiac.ksu.edu/Event Extractor
26. EVENT REPRESENTATION BY DATE/TIME:
TIMELINE VIEW
Advanced functionality of the Date/Time Extractor Module includes mapping extracted events onto a timeline. A representative example can be found on EMM NewsExplorer:
http://press.jrc.it/NewsExplorer/timelineedition/en/timeline.html
27. FUTURE IMPROVEMENTS FOR DATE/TIME
EXTRACTOR MODULE
Intermediate Functionality
to implement event extraction as an event tuple <what[Disease], where[Location], when[DateTime]> built from the individual entities obtained from the Disease, Temporal, and Spatial Extraction Modules in the Basic Phase.
Advanced Functionality
to add spatiotemporal clustering, extraction of qualitative and quantitative details about events from documents, and relationship extraction among events;
to integrate the information extraction and information visualization components.
29. LOCATION EXTRACTOR MODULE
INPUT AND OUTPUT
NGA GEOnet Names Server (GNS):
http://earth-info.nga.mil/gns/html/
Input: Text from file
Location Extractor Module
Output:
Matched text (location)
Location’s latitude
Location’s longitude
Location’s radius
Location Extraction Task
The goal is to extract and tag geographical location mentions in the given text as part of the multimodal event extraction application. Locations extracted from the given text are presented to the user with their geographical latitude and longitude coordinates.
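A gazetteer lookup is one simple way to produce the four output fields listed above. The sketch below assumes a tiny in-memory gazetteer with approximate coordinates and placeholder radius values; a real system would query a resource such as GNS, and the function and field names here are hypothetical.

```python
import re

# Hypothetical gazetteer: name -> (latitude, longitude, radius in km).
# Coordinates are approximate; radius values are placeholders for illustration.
GAZETTEER = {
    "Kansas": (38.5, -98.0, 330.0),
    "Topeka": (39.05, -95.68, 10.0),
    "Wichita": (37.69, -97.34, 15.0),
    "Leavenworth": (39.31, -94.92, 5.0),
}


def extract_locations(text, gazetteer=GAZETTEER):
    """Tag every gazetteer name found in `text` with its coordinates."""
    # Longest names first, so multi-word names are preferred over their parts.
    names = sorted(gazetteer, key=len, reverse=True)
    # Case-sensitive on purpose: place names are capitalized in running text.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    results = []
    for match in pattern.finditer(text):
        latitude, longitude, radius_km = gazetteer[match.group(0)]
        results.append({
            "matched_text": match.group(0),
            "latitude": latitude,
            "longitude": longitude,
            "radius_km": radius_km,
        })
    return results


if __name__ == "__main__":
    sample = ("A third case was reported yesterday in a small farm North East of Topeka. "
              "The previous two cases were reported in Wichita and Leavenworth.")
    for location in extract_locations(sample):
        print(location)
```

Plain lookup cannot disambiguate names shared by several places (Leavenworth exists in more than one state), which is why disambiguation is treated as a later-stage task.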
30. RESULTS FOR LOCATION EXTRACTOR MODULE
INPUT: A third case of Foot-and-Mouth Disease in Kansas was reported yesterday in a small farm North East of Topeka. Roger Pride, who owns the farm where foot-and-mouth was discovered, said the financial hardship of losing his cattle was not as devastating as the impact on his reputation. It is to be noted that the previous two cases were reported earlier this month in Wichita and Leavenworth.
OUTPUT: [figure]
iiac.ksu.edu/LocationExtractor
31. FUTURE IMPROVEMENTS FOR LOCATION
EXTRACTOR MODULE
Intermediate Functionality
to improve on the results of the basic phase by filtering out outliers, deduplicating, and possibly clustering the extracted locations.
Advanced Functionality
to consider implicit spatial relationships and independent observations, which would enrich the data presented to the user and help in detecting patterns among them.
32. EVENT REPRESENTATION BY LOCATION:
MAP VIEW
Advanced functionality of the Location Extractor Module includes resolving the geotagging task, that is, mapping events extracted from different sources. A representative example can be found at
http://www.healthmap.org/promed/en
34. ESSENTIAL TASKS FOR EVENT TRACKING
Automatic population of large databases with factual information
from many text sources
Rapid semantic processing of large volumes of unstructured text
Automatic merging of facts and entity relationships across sets of
documents
Innovative techniques for extracting, summarizing and tracking
information about events and their progressions over time from
unstructured text
Identification of events and outbreaks includes the constituent tasks of date, time, and quantity extraction and timeline visualization, while geospatial IE includes location disambiguation (in later stages) and map view visualization.
35. EVENT FORMAL REPRESENTATION
An event is an occurrence of a disease within a particular time and space range, so the attributes of a single event are the specific disease, the date and time, and the location:
Event examples with missing values:
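One possible encoding of such an event, with missing attributes represented as `None`, is sketched below. The class and field names are assumptions for illustration, and the example values are taken from elsewhere in these slides rather than from the original figures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """An event as a <disease, location, date/time> triple; None marks a missing value."""
    disease: Optional[str] = None
    location: Optional[str] = None
    date_time: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the attributes that could not be extracted."""
        return [name for name in ("disease", "location", "date_time")
                if getattr(self, name) is None]

    def is_complete(self) -> bool:
        return not self.missing()


if __name__ == "__main__":
    complete = Event("foot-and-mouth disease", "Taiwan", "06/09/2009")
    no_date = Event(disease="foot-and-mouth disease", location="Taiwan")
    disease_only = Event(disease="rift valley fever")
    for event in (complete, no_date, disease_only):
        print(event, "missing:", event.missing())
```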
36. ADDITIONAL ASPECTS OF EVENT/OUTBREAK
Outbreak Status - confirmed
Date of event’s report - 12.18.2007
Reported source - www.dafra.gov/reports
Affected species - cattle
Morbidity/Mortality - 155 infected/12 died
Damage measure, $ - $155,000
Standard features for event identification:
<disease, location, date/time…>
+ <…person, organization,…>
+ <…, length of sentence, quantities, temporal/spatial term occurrences…>
37. OUTBREAK FORMAL REPRESENTATION
An outbreak is a collection of events connected by a common disease that happened within a restricted space and time:
For outbreak identification, events should be similar in temporal features (time overlap) and in spatial features (space overlap).
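The overlap criterion can be sketched as a simple single-link grouping: two events are linked when they share a disease and fall within a distance and time threshold of each other, and linked events form one outbreak. The thresholds, the haversine distance, the greedy grouping strategy, and the sample events are all assumptions for illustration, not the method used in the system.

```python
import math
from collections import namedtuple
from datetime import date

Event = namedtuple("Event", "disease lat lon day")


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def related(a, b, max_km, max_days):
    """Same disease, close in time, and close in space."""
    return (a.disease == b.disease
            and abs((a.day - b.day).days) <= max_days
            and haversine_km(a.lat, a.lon, b.lat, b.lon) <= max_km)


def group_outbreaks(events, max_km=300.0, max_days=30):
    """Greedy single-link grouping of events into outbreaks."""
    outbreaks = []
    for event in events:
        linked, rest = [], []
        for group in outbreaks:
            is_linked = any(related(event, other, max_km, max_days) for other in group)
            (linked if is_linked else rest).append(group)
        # Merge every group the new event touches into one outbreak.
        merged = [e for group in linked for e in group] + [event]
        outbreaks = rest + [merged]
    return outbreaks


if __name__ == "__main__":
    # Hypothetical events with approximate coordinates.
    events = [
        Event("foot-and-mouth disease", 39.05, -95.68, date(2009, 6, 1)),
        Event("foot-and-mouth disease", 37.69, -97.34, date(2009, 6, 5)),
        Event("foot-and-mouth disease", 23.7, 121.0, date(2009, 6, 9)),
    ]
    for outbreak in group_outbreaks(events):
        print(len(outbreak), "event(s):", outbreak)
```

Single-link grouping lets an outbreak spread: two events that are far apart still end up together if a chain of intermediate events connects them.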
38. DATA FLOW FOR EVENT IDENTIFICATION
BASED ON SENTENCE CLASSIFICATION
OUTBREAK
Disease: foot-and-mouth disease
Species: hog
Location: Taiwan
DateTime: 06/09/2009
Status: N/A
39. NLP TASKS
Foot-and-mouth disease[DIS] killed 15 hog on farm in Taiwan[LOC]
Syntactic Analysis: Foot-and-mouth disease [SUBJ] killed[VP] 15 hog on farm in Taiwan [PP]
Extraction:
Fact: killed
Disease: foot-and-mouth disease
Location: Taiwan
Species: hog
Quantity: 15
Co-reference Resolution: Foot-and-mouth disease killed 15 hog on farm in Taiwan. Outbreak was reported on 9 June.
Template Generation:
Event: outbreak
Species: 15 hog
Disease: foot-and-mouth disease
Location: Taiwan
DateTime: 9 June
41. SEMANTIC ROLE LABELING TASK: EXAMPLE 2
http://l2r.cs.uiuc.edu/~cogcomp/srl-demo.php
Ecuador[LOC] - The Ecuadorian government[ORG] on Tuesday[DT] confirmed 48[QT] cases
of foot-and-mouth disease[DIS] in domestic animals, which prompted neighboring
Colombia[LOC] and Peru[LOC] to take preventive measures on their meat imports