Cedem futia-2014

G. Futia F. Cairo F. Morando L. Leschiutta
Exploiting Linked Open Data
and Natural Language Processing for
Classification of Political Speech
Krems, 22nd
May 2014

22nd
May 2014 Giuseppe Futia – Politecnico di Torino 2
Introduction
●
Our goal:
● assist anyone interested in automatic categorization of political
speeches, to identify unambiguously the main political trends
addressed by the White House
●
What we have to achieve our goal:
● TellMeFirst (http://tellmefirst.polito.it/), a topic extraction tool:
– it leverages DBpedia knowledge base and English Wikipedia
linguistic corpus
– it exploits Linked Open Data (LOD) and Natural Language
Processing (NLP) techniques

22nd
DBpedia
● A crowd-sourced community effort to extract
structured information from Wikipedia and a
central interlinking hub for the Linking Open Data
project.
● It is a suitable knowledge base for text classification
(Mendes et al., 2012; Hellmann et al., 2013; Steinmetz
et al., 2013)

22nd
Why DBpedia for US
political speeches?
Comparison between the
coverage of US politics and the
coverage of politics of other
countries
The coverage of politics in Wikipedia is “often very good for recent or
prominent topics but is lacking on older or more obscure topics”
(Brown, 2011).

22nd
Text Categorization Approach
● An instance-based approch:TellMeFirst assigns target
documents to classes based on a local comparison between
a set of pre-classified documents and the target
document itself
● This training set consists of all the Wikipedia paragraphs
where a wikilink occurs.These paragraphs are stored in a
Lucene index, where each document represents a DBpedia
resource

22nd
Success rate (%) of the TellMeFirst classification
process on US Presidents profiles
1st topic Within the
first 2 topics
Within the
first 7 topics
Full text of the Presidents profiles 95.4% 100% 100%
President profiles without name
and surname
45.4% 61.3% 90.9%
TellMeFirst provides as output the seven most relevant topics
(in the form of DBpedia URI) of the document sorted by relevance

22nd
whitehouse.gov
●
3173 videos in English were available on the White House
website on the 24th of November 2013
● These videos are categorized according to a taxonomy not
related to the subject of the speeches
● They need a semantic layer that point out the content of the
speeches, so that questions such as “what is the First Lady
talking about?” could be automatically answered

22nd
Not just a bag-of-words tool
Results obtained with TellMeFirst (on the left) and withTagCrowd (on the right)
«President Obama Speaks on the Affordable Care Act»
http://1.usa.gov/1jR4Ky2

22nd
Results (i)
Occ. % overall % 2013 % 2012 % 2011 % 2010 % 2009
Barack Obama 607 4.88% 5.68% 4.52% 5.51% 4.45% 3.88%
Patient Protection and
Affordable Care Act
286 2.30% 3.06% 1.35% 1.91% 2.47% 2.71%
American Recovery and
Reinvestment Act of 2009
278 2.23% 1.09% 1.82% 2.88% 2.84% 1.88%
Social Security 272 2.19% 2.58% 1.77% 3.54% 1.61% 0.78%
Amount and percentage of topic
occurrences extracted with TellMeFirst

22nd
Results (ii)
● “New Deal” (141 occurrences), probably used as a metaphor
within the political speeches of President Obama
● “Libya” has a value corresponding to 1.00% in 2011.This result can
be related to the full-scale revolt beginning on 17 February 2011 in
Libya
● “Deepwater Horizon oil spill” reaches the 1.05% in 2010.This
result is related to the marine oil spill which took place in the Gulf
of Mexico that began on 20 april 2010

22nd
Correlation among topics

22nd
A focus on the First Lady (i)
● According to Michelle Obama’s page on the White House
website, the First Lady “looks forward to continuing her work
on the issues close to her heart”:
● supporting military families
● helping working women balance career and family
encouraging national service
● promoting the arts and arts education
● fostering healthy eating and healthy living for children and
families across the country

22nd
A focus on the First Lady (ii)
● We tested whether TellMeFirst confirms or not these
impressions and claims, manually selecting nine Wikipedia
categories which seemed to be related to these issues
● We then interrogated the SPARQL end-point of DBpedia with
a query to collect all the topics of these categories
●
We then associated each topic to one or more of the nine
high-level categories: these categories encompassed
almost 75% of the topics

22nd
A focus on the First Lady (iii)
Wikipedia Category First Lady sp.
9 categories
All speeches
9 categories
Government of the United States 26.68% 32.68%
Education 21.64% 5.40%
Nutrition 19.96% 1.61%
Social issues 14.71% 28.38%
Barack Obama 13.66% 14.00%
Health care 11.34% 7.57%
Arts 8.61% 1.11%
Military personnel 3.99% 3.16%
Gender equality 2.73% 0.84%
Others (unclassified topics) 25.63% 38.34%

22nd
Conclusions (i)
● The ability for citizens to easily retrieve the content of political
speeches and decisions is a crucial factor in e-participation
● Not guaranteed by a traditional keywords search, as in
most of the public administration websites (the White
House website included)
● Example: in a keyword-based system, by typing the word
"education", for instance, users get as result only videos that
have the word education in their title
● All terms that belong to the semantic area of education
are omitted

22nd
Conclusions (ii)
● When documents are semantically classified through
DBpedia URIs all synonyms, hypernyms and hyponyms of
lemmas are traced to the same concept making
user search more effective
● Leveraging Wikipedia categories would allow to go
even a step further, taking advantage of the links
between concepts as designed by the Wikipedia
community

22nd
Next steps
● Building a content search/navigation layer around the
scraping/classification module
● Integration with other Linked Open Data repositories on the
Web, combining the extracted topics with other information
(President Obama's federal budget proposal?)

Thank you!
Giuseppe Futia (giuseppe.futia@polito.it)
This paper was drafted in the context of the Network of Excellence in Internet Science EINS (GA n°288021), and, in
particular, in relation with the activities concerning Evidence and Experimentation (JRA3).

22nd
Appendix - The algorithm
●
The classifier needs to hold in memory all the instances of the
training set and calculate, during classification stage, the vector
distance between training documents and target documents.
● Specifically, the algorithm used by TMF is k-Nearest Neighbor
(kNN), a type of memory-based approach which selects the
categories for a target document on the basis of the k most
similar documents within the vector space.

22nd
Appendix - Scoring formula
●
In a Lucene query, both the target document and the training
set become weighed terms vectors, where terms are weighted
by means of the TF-IDF algorithm.The query returns a list of
documents in the form of DBpedia URIs, ordered by similarity
score. Scoring formula is:

22nd
Appendix - Basic concepts
● Natural Language Processing - A field of computer science,
concerned with the interactions between computers and human
(natural) languages.
● Linked Data - A recommended best practice for exposing, sharing,
and connecting pieces of data, information, and knowledge on the
Semantic Web using URIs and RDF.
● DBpedia - A crowd-sourced community effort to extract
structured information from Wikipedia and a central interlinking hub
for the Linking Open Data project.

Cedem futia-2014

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Andere mochten auch

Andere mochten auch (12)

Ähnlich wie Cedem futia-2014

Ähnlich wie Cedem futia-2014 (20)

Mehr von Danube University Krems, Centre for E-Governance

Mehr von Danube University Krems, Centre for E-Governance (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Cedem futia-2014