In this paper we present an algorithm that, using Wikipedia as a reference, extracts semantic information from an arbitrary text. Our algorithm refines a previously proposed procedure that mines the full text of Wikipedia. Our refinement, based on a clustering approach, exploits the semantic information contained in certain types of Wikipedia hyperlinks and also introduces an analysis based on multi-words. Our algorithm outperforms current methods in that its output contains far fewer false positives. We were also able to determine which structural part of a text provides most of the semantic information extracted by the algorithm.
4. A sample (ESA)
Input text:
The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.
ESA output:
- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms; Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis
5. A sample (ESA)
Input text:
Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).
Lonely Planet Tourist Guide
ESA output:
1-Glossary_of_cue_sports_terms
2-Swimming
3-Ian_Thorpe
4-NCAA_football_bowl_games, 2005-06
5-Swimming_machine
6-American_football_strategy
7-Contract_bridge_glossary
8-Olympic_Games
9-Pingu_episodes_series_6
10-Venice
…
15-Corruption_in_Ghana
…
27-Legislative_system_of_the_Peopleʼs_Republic_of_China
11. After clustering:
Only 3 clusters have cardinality larger than 1.
The first cluster, with cardinality 21, was automatically named Swimming.
The second and the third both have cardinality 2, and they are named Training and Venice-bucentaur.
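The slides do not spell out the clustering procedure, so the following is only a minimal sketch of one plausible reading: concepts are grouped when a hyperlink connects them, and only clusters with cardinality larger than 1 are kept. The function names, the connected-components grouping, and the toy link pairs are all assumptions made for illustration, not the authors' method or real Wikipedia link data.

```python
from collections import defaultdict

def cluster_by_links(concepts, links):
    """Group concepts into connected components of an undirected link graph.

    Assumption: two concepts belong together if a hyperlink joins them.
    """
    parent = {c: c for c in concepts}

    def find(c):
        # Follow parent pointers to the component root (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in links:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in concepts:
        groups[find(c)].append(c)
    # Largest clusters first.
    return sorted(groups.values(), key=len, reverse=True)

def non_trivial(clusters):
    """Keep only clusters with cardinality larger than 1."""
    return [c for c in clusters if len(c) > 1]

# Toy data: the link pairs below are hypothetical.
concepts = ["Swimming", "Ian_Thorpe", "Swimming_machine", "Venice", "Corruption_in_Ghana"]
links = [("Swimming", "Ian_Thorpe"), ("Swimming_machine", "Swimming")]
clusters = non_trivial(cluster_by_links(concepts, links))
```

With this toy input, the unrelated singleton concepts are dropped and a single swimming-related cluster remains, which mirrors how clustering is used on the slide to discard false positives.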
12. Validation: Turing test
Which one is machine-generated?
[Figure: a text paired with classifications]
13. Outcome
20 texts of length ranging between 60 and 200 words. Texts were collected from various sources such as newspaper articles, textbooks, random web pages, and MSN Encarta.
15. Using only nouns
Use a POS tagger to identify syntactic roles in the document to be classified
Keep only nouns (throw away the rest)
No degradation in the results!
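The noun-only filtering step can be sketched as follows. This is a minimal illustration that assumes the tagger emits Penn Treebank tags, where noun tags start with `NN`; the slides do not name the tagger or tagset. The tagged sentence is hand-written, and `keep_nouns` is a hypothetical helper name.

```python
def keep_nouns(tagged_tokens):
    """Keep tokens whose tag marks a noun (NN, NNS, NNP, NNPS).

    Assumption: tags follow the Penn Treebank convention.
    """
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Hand-tagged example; a real pipeline would obtain the tags from a POS tagger.
tagged = [
    ("Venice", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("desert", "NN"), ("for", "IN"), ("swimmers", "NNS"),
]
nouns = keep_nouns(tagged)
```

Only the nouns would then be passed on to classification.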
16. Define Multiwords
Lexical multiword identification approach:
The following generative pattern is considered:
((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun
+: one or more; *: zero or more; ?: zero or one; |: or
Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
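As a rough sketch of how the pattern above might be applied, each POS tag can be collapsed to a single letter so that the pattern becomes an ordinary regular expression over the tag string. This assumes Penn Treebank tags (`JJ*` for adjectives, `NN*` for nouns, `IN` for prepositions). The requirement of at least two tokens, the helper names, and the `known_titles` set (a toy stand-in for a Wikipedia title lookup) are all assumptions introduced for illustration.

```python
import re

def tag_letter(tag):
    """Collapse a Penn Treebank tag (assumed tagset) to A, N, P, or O (other)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return "O"

# ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun, over tag letters.
PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def multiword_candidates(tagged_tokens):
    """Return maximal token spans matching the pattern, at least two tokens long."""
    letters = "".join(tag_letter(tag) for _, tag in tagged_tokens)
    candidates = []
    for m in PATTERN.finditer(letters):
        if m.end() - m.start() >= 2:  # assumption: a multiword has 2+ tokens
            words = [w for w, _ in tagged_tokens[m.start():m.end()]]
            candidates.append(" ".join(words))
    return candidates

def validate(candidates, known_titles):
    """Keep candidates found in a title set (stand-in for a Wikipedia lookup)."""
    return [c for c in candidates if c.lower() in known_titles]

# Hand-tagged example sentence.
tagged = [
    ("Gene", "NN"), ("therapy", "NN"), ("trials", "NNS"), ("use", "VBP"),
    ("haematopoietic", "JJ"), ("stem", "NN"), ("cells", "NNS"),
]
candidates = multiword_candidates(tagged)
valid = validate(candidates, known_titles={"haematopoietic stem cells"})
```

Here exact title matching is used for validation; the slide's looser criterion (an entry "related to" the candidate) would need a more permissive lookup.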