Anastasios Martidis developed a system to extract maritime safety events from news articles. The system uses rule-based text classification, information extraction, and named entity recognition to identify event types, ships, locations, dates, and relations between extracted information. Evaluation on a test set showed the system achieved over 80% precision and recall for most extraction tasks, though ship name extraction was less accurate. The system accomplishes the research goal but could be improved, particularly for ship name extraction.
Maritime safety events extraction from news articles
1. Vrije Universiteit
MSc Information Sciences
Maritime Safety Events Extraction
from News Articles
Anastasios Martidis
anastasios.martidis@student.vu.nl
July 31, 2012
Supervisors:
Willem R. van Hage, Dr
Davide Ceolin, MSc
2. Outline
Introduction
Information Spectrum
Problem Statement
Significance of Research
Research Questions
Hypotheses
Materials and Methods
Training Sets
System Overview
Test sets
Evaluation
Results
Conclusions
3. Introduction
“We are drowning in information, and starved for knowledge.”
John Naisbitt
4. Information Spectrum
Structured Data: Automatic Identification System (AIS)
theoceandreamer.files.wordpress.com/2011/03/img_21861.jpg
Free Text: News Articles
http://www.tideway.nl/images/NorthWestEveningMail-PortSettoRockasTurbinesGetBoostfromaRollingstone-Walney2010-kleinbestan.jpg
5. Problem Statement
News Articles:
Descriptive and informative, but…
Vast in number, growing and updated daily
Free text, difficult to process automatically
Generic Natural Language Processing tools:
Popular and useful, but…
Present limitations in recognizing specific
types of maritime safety events and ship
names
6. Significance of the Research
Applications:
Risk assessments
Improvement of vessel safety standards
Port facility security assessments
Recognition of problematic areas (Piracy)
Identification of shipping companies, ships, ship constructors with history in maritime safety events
Maritime education and training
Potential Stakeholders:
Ship owners, operators and managers
Insurance Companies
Coast Guard
International Maritime Organization (IMO)
International Maritime Security (IMS)
Private Security Companies (PCSs)
7. Research Questions
1. Can we automatically process a news article in order to
determine if it concerns a maritime safety event?
2. Can we automatically extract a description of a maritime
safety event? The objective of the description is to
automatically recognize the type of maritime safety event,
ships involved, location, date and time.
3. Can we recognize relations and significance of the
extracted information from the text?
- Can we recognize the dominant event? The dominant
event is the event that is primarily described
in the news article.
- Can we identify relations between extracted locations
and specific event types described in the text?
8. Hypotheses
1. We can define sets of keywords that, if present in
certain combinations in the text being processed, indicate
that it concerns a maritime safety event.
2. We can extract a description of the event described in
the news article using rule-based text classification and
sets of keywords, datasets of ship names, regular
expression matching and Named Entity Recognition tasks.
3. We can evaluate the extracted information from the text by:
- identifying the dominant event by measuring the
frequency of keyword indicators for each event type
- recognizing relations between locations and event types
by examining the position of locations and event type
indicators in the text
9. Materials & Methods
Rule Based Text Classification
Information Extraction
OpenCalais
NLTK
AIS
DBpedia
10. Training Set
200 news articles (retrieved from CBS News)
100 related to maritime safety (53937 tokens)
100 of general domains (47053 tokens)
Word frequency (figure): maritime safety related vs. general domains
11. Training Set Outcomes
Manual discrimination of significant words
Categorize into sets of keywords by their
meaning
Use of keywords for text classification
Mapping of keywords into maritime safety
event types
Use of keywords as event type indicators
12. Text Classification
Document D
Lists of keywords:
L1, most frequent keywords
L2, safety related keywords
L3, vessel type keywords
L4, maritime related keywords
L5, naval hierarchy keywords
L6, part of ship keywords
L7, water based locations keywords
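The slides list the keyword sets but not the exact combination rule, so the following is only a minimal sketch of keyword-based classification. The abbreviated keyword lists and the rule itself (a safety keyword plus at least one maritime-context keyword) are assumptions made for illustration, not the system's actual lists or rule.

```python
import re

# Hypothetical, heavily abbreviated keyword lists; the real lists were
# derived from the training set and are not reproduced in the slides.
KEYWORD_LISTS = {
    "L2_safety": {"collision", "sank", "piracy", "hijacked", "aground", "capsized"},
    "L3_vessel_type": {"tanker", "ferry", "freighter", "trawler", "vessel", "ship"},
    "L7_water_location": {"sea", "ocean", "gulf", "strait", "harbour", "port"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matched_lists(document):
    """Return the names of the keyword lists with at least one hit in the document."""
    tokens = set(tokenize(document))
    return {name for name, words in KEYWORD_LISTS.items() if tokens & words}

def is_maritime_safety_event(document):
    """Assumed rule: a safety keyword AND at least one maritime-context keyword."""
    hits = matched_lists(document)
    return "L2_safety" in hits and bool(hits - {"L2_safety"})

if __name__ == "__main__":
    print(is_maritime_safety_event("The tanker was hijacked off the coast."))
    print(is_maritime_safety_event("The bus collision closed the road."))
```

Requiring a combination of lists, rather than any single keyword, is what keeps a general-domain article that merely mentions a collision from being classified as maritime.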
14. Ship Names Extraction
Dataset of ship names retrieved from AIS
messages and DBpedia
Comparison of the dataset entries to the
text
Compromises
Location names
Part of names
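The comparison of dataset entries to the text could be sketched as a whole-word dictionary lookup. The three ship names below are a hypothetical stand-in for the AIS and DBpedia dataset, and the case-sensitive, word-boundary matching is an assumption rather than the system's documented behaviour.

```python
import re

# Hypothetical sample; the real dataset was built from AIS messages and DBpedia.
SHIP_NAMES = ["Costa Concordia", "Maersk Alabama", "Rena"]

def extract_ship_names(text, ship_names=SHIP_NAMES):
    """Return the dataset entries that occur in the text as whole words."""
    found = []
    for name in ship_names:
        # re.escape guards against names containing regex metacharacters.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text):
            found.append(name)
    return found

if __name__ == "__main__":
    print(extract_ship_names("The Costa Concordia ran aground near Giglio."))
```

A lookup like this cannot tell a ship from a place or person with the same name, which may be related to the compromises listed above.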
15. Locations Extraction
Use of OpenCalais for NER tasks
Interested in locations only
Four types of locations recognized by
Calais:
Continent
Country
City
Province or State
16. Date and Time Extraction
Chunked sentences
Pattern matching using regular
expressions
Numeric representation of date (e.g., 1322012, 22-07-12)
Months (e.g., January or Jan.)
Days (e.g., Monday or Mon.)
Day periods (e.g., morning, afternoon)
Time (e.g., 11:00am or 11.00 a.m.)
Presented in specific order for each
sentence
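The pattern categories above could be expressed roughly as follows. The regular expressions, the category labels, and the fixed output order (day, numeric date, month, day period, time) are illustrative assumptions; the slides do not give the actual patterns or ordering.

```python
import re

# Full names carry no trailing dot; abbreviations may ("Jan." or "Mon.").
MONTH = (r"\b(?:(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)\b|"
         r"(?:Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec)\b\.?)")
DAY = (r"\b(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b|"
       r"(?:Mon|Tues|Tue|Wed|Thurs|Thu|Fri|Sat|Sun)\b\.?)")

# Assumed fixed order in which results are reported for each sentence.
PATTERNS = [
    ("day", re.compile(DAY)),
    ("numeric_date", re.compile(r"\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b")),
    ("month", re.compile(MONTH)),
    ("day_period", re.compile(r"\b(?:morning|afternoon|evening|night)\b", re.IGNORECASE)),
    # Requires an am/pm suffix so that fragments of numeric dates are not matched.
    ("time", re.compile(r"\b\d{1,2}[:.]\d{2}\s?(?:a\.m\.|p\.m\.|am\b|pm\b)", re.IGNORECASE)),
]

def extract_date_time(sentence):
    """Return (category, matched text) pairs, grouped in the fixed category order."""
    return [(label, match.group(0))
            for label, pattern in PATTERNS
            for match in pattern.finditer(sentence)]

if __name__ == "__main__":
    print(extract_date_time("The ferry sank on Monday morning at 11:00am."))
```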
17. Dominant Event Recognition
For each list of event type indicators
keywords
Sum of keywords occurrence in the text
Event type with the highest sum is
predicted as the dominant event
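The scoring described above can be sketched as a per-event sum of keyword counts followed by picking the maximum. The indicator lists here are hypothetical examples, and returning no event when nothing matches is an added assumption.

```python
import re
from collections import Counter

# Hypothetical indicator keywords per event type.
EVENT_INDICATORS = {
    "piracy": {"pirates", "hijacked", "ransom"},
    "collision": {"collided", "collision", "rammed"},
    "sinking": {"sank", "sinking", "capsized"},
}

def dominant_event(text, indicators=EVENT_INDICATORS):
    """Return the event type whose indicator keywords occur most often, or None."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {event: sum(counts[word] for word in words)
              for event, words in indicators.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first event listed
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    article = ("Pirates hijacked the tanker and demanded ransom "
               "after it collided with a trawler.")
    print(dominant_event(article))
```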
18. Location to Event Relations
Chunked sentences
For every sentence containing an
extracted location, if a keyword indicator
of an event type also occurs in the same
sentence
Then it is predicted that the location is
related to the event type
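The co-occurrence rule above could look like the following sketch. The naive punctuation-based sentence splitter is an assumption standing in for whatever chunker the system used, and the locations and indicator keywords are supplied by the caller as hypothetical inputs.

```python
import re

def relate_locations_to_events(text, locations, event_indicators):
    """Pair each location with every event type indicated in the same sentence."""
    relations = set()
    # Naive sentence chunking on end-of-sentence punctuation (assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for location in locations:
            if location not in sentence:
                continue
            for event, words in event_indicators.items():
                if tokens & words:
                    relations.add((location, event))
    return relations

if __name__ == "__main__":
    text = "Pirates hijacked a tanker off Somalia. The crew was later flown to Kenya."
    indicators = {"piracy": {"pirates", "hijacked"}}
    print(relate_locations_to_events(text, ["Somalia", "Kenya"], indicators))
```

In this example only Somalia shares a sentence with a piracy indicator, so Kenya is left unrelated.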
19. Test Set
200 news articles (BBC, Reuters)
100 maritime safety related
100 of general domains (50 of them
selected as an attempt to mislead the
system)
Each news article manually labeled and
automatically processed by the system
Comparison of the results to the labeled
news article
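Comparing system output to the manual labels is commonly summarized with precision and recall. The sketch below shows that computation for one article's set of extracted items; the function name and the set-based representation are assumptions, not details taken from the slides.

```python
def precision_recall(predicted, labeled):
    """Compute precision and recall of predicted items against manual labels."""
    predicted, labeled = set(predicted), set(labeled)
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall

if __name__ == "__main__":
    # Hypothetical example: locations extracted by the system vs. manual labels.
    system_output = {"Somalia", "Kenya", "Oman", "Yemen"}
    manual_labels = {"Somalia", "Kenya", "Aden"}
    print(precision_recall(system_output, manual_labels))
```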
30. Conclusions
The system accomplished the extraction
of maritime safety events from news
articles
Overall performance of the system was
satisfactory
The system can be improved and refined
Ship name extraction requires a different
approach