Using Wikipedia as a reference for extracting semantic information from a text

Andrea Prato & Marco Ronchetti
Università di Trento, Italy

Explicit Semantic Analysis

Gabrilovich & Markovitch, 2007

Throw away:

Stopwords
Fragment pages (<100 words)
Suffixes (stemming)
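
The preprocessing and concept scoring behind ESA can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-page corpus, the stopword list, the crude suffix stripper, and the `MIN_WORDS = 3` threshold (standing in for the 100-word limit on the slide) are all assumptions made to keep the example small.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "is", "in", "and", "to", "for"}  # illustrative only
MIN_WORDS = 3  # stand-in for the <100-word fragment threshold on the slide


def stem(word):
    # Crude suffix stripping; a real system would use a proper stemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    words = [w.strip(".,!?()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOPWORDS]


def build_index(pages):
    """Map each term to {concept: tf-idf weight}, skipping fragment pages."""
    docs = {title: tokenize(body) for title, body in pages.items()}
    docs = {t: toks for t, toks in docs.items() if len(toks) >= MIN_WORDS}
    df = Counter(term for toks in docs.values() for term in set(toks))
    index = defaultdict(dict)
    for title, toks in docs.items():
        for term, tf in Counter(toks).items():
            index[term][title] = tf * math.log(len(docs) / df[term])
    return index


def esa_vector(text, index):
    """Sum the concept weights of every term in the text."""
    scores = Counter()
    for term in tokenize(text):
        for concept, weight in index.get(term, {}).items():
            scores[concept] += weight
    return scores


pages = {  # hypothetical miniature corpus
    "Swimming": "Swimming is a sport practised in pools and open water",
    "Venice": "Venice is a city of canals and bridges in Italy",
    "Stub": "Short page",
}
index = build_index(pages)
ranking = esa_vector("Swimming pools in the city", index).most_common()
```

Here `ranking` lists the concepts by decreasing weight; the `Stub` page never appears because it is dropped as a fragment before indexing.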
A sample (ESA)

Text:
   The development of T-cell leukaemia following the otherwise successful
   treatment of three patients with X-linked severe combined immune
   deficiency (X-SCID) in gene-therapy trials using haematopoietic stem
   cells has led to a re-evaluation of this approach. Using a mouse model
   for gene therapy of X-SCID, we find that the corrective therapeutic gene
   IL2RG itself can act as a contributor to the genesis of T-cell lymphomas,
   with one-third of animals being affected. Gene-therapy trials for X-SCID,
   which have been based on the assumption that IL2RG is minimally
   oncogenic, may therefore pose some risk to patients.

Concepts:
- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms;
- Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis

A sample (ESA)

Text (Lonely Planet Tourist Guide):
   Being so tightly packed, Venice doesn't make an ideal place to come to
   practise your favourite sport, although you'll get a decent workout just
   walking around and up and down bridges! If you've got any energy left for
   some extra exercise, try a spot of swimming (although pools are rare) or
   even a jog. Venice is a bit of a desert for swimmers. You can go in off
   the Lido (if you're game) or at one of Venice's two public swimming pools
   (handily, they close in summer).

Concepts:
1 - Glossary_of_cue_sports_terms
2 - Swimming
3 - Ian_Thorpe
4 - NCAA_football_bowl_games,_2005-06
5 - Swimming_machine
6 - American_football_strategy
7 - Contract_bridge_glossary
8 - Olympic_Games
9 - Pingu_episodes_series_6
10 - Venice
…
15 - Corruption_in_Ghana
…
27 - Legislative_system_of_the_People's_Republic_of_China

Clustering
Wikipedia is hyperlinked
Swimming is clustered with Olympic Games
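
One simple way to exploit the hyperlinks is to treat the returned concepts as nodes of a graph and group those connected by links. The slides do not state which clustering algorithm was used, so the connected-components approach below is an assumption, and the `links` list is made up for illustration.

```python
from collections import defaultdict


def cluster_by_links(concepts, links):
    """Group concepts into connected components of the undirected link graph."""
    wanted = set(concepts)
    neighbours = defaultdict(set)
    for source, target in links:
        # Ignore links that lead outside the set of returned concepts.
        if source in wanted and target in wanted:
            neighbours[source].add(target)
            neighbours[target].add(source)

    clusters, seen = [], set()
    for concept in concepts:
        if concept in seen:
            continue
        stack, component = [concept], set()
        while stack:  # depth-first walk over linked concepts
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


concepts = ["Swimming", "Ian_Thorpe", "Olympic_Games", "Venice", "Corruption_in_Ghana"]
links = [  # hypothetical hyperlinks, not taken from a real Wikipedia dump
    ("Ian_Thorpe", "Swimming"),
    ("Swimming", "Olympic_Games"),
    ("Venice", "Italy"),
]
clusters = cluster_by_links(concepts, links)
```

With these toy links, Swimming, Ian_Thorpe and Olympic_Games end up in one cluster, while Venice and Corruption_in_Ghana remain singletons.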
Throw away:

Large aggregators
Category links
Numbers
Pages with more than N links (N = 100)
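
A filter along these lines could look like the sketch below. The exact rules are not given on the slide, so the tests used here are assumptions: category pages are recognised by a `Category:` title prefix, "numbers" are read as titles made only of digits (such as years), and large aggregators are approximated by the link-count rule.

```python
import re

MAX_LINKS = 100  # N on the slide


def keep_concept(title, outgoing_links):
    """Return False for pages that should be thrown away before clustering."""
    if title.startswith("Category:"):        # category links
        return False
    if re.fullmatch(r"[\d_\-]+", title):     # pure numbers such as years
        return False
    if len(outgoing_links) > MAX_LINKS:      # large aggregators
        return False
    return True


pages = {  # hypothetical pages with their outgoing links
    "Swimming": ["Olympic_Games", "Ian_Thorpe"],
    "2005": ["January"],
    "Category:Sports": [],
    "List_of_sports": [f"Sport_{i}" for i in range(150)],
}
kept = [title for title, links in pages.items() if keep_concept(title, links)]
```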
After clustering:

- Only 3 clusters have cardinality larger than 1.
- The first cluster, with cardinality 21, was automatically named Swimming.
- The second and the third both have cardinality equal to 2, and they are named Training and Venice-bucentaur.
Validation: Turing test

Which one is machine-generated?

[Figure: one text shown with three candidate classifications]
Outcome

20 texts of length ranging between 60 and 200 words. Texts were collected from various sources such as newspaper articles, textbooks, random web pages, and MSN Encarta.
Further improvements
Using only nouns

Use a POS tagger to identify the syntactic roles in the document to be classified
Keep only nouns (throw away the rest)

No degradation in the results!
Define multiwords

Lexical multiword identification approach. The following generative pattern is considered:

   ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun

   +: one or more   *: zero or more   ?: zero or one   |: or

Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
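
The pattern can be tried out by encoding each POS tag as one letter and matching the tag string with a regular expression. This is a sketch under assumptions: the one-letter tag encoding, the `max_len` cap, the hand-tagged tokens, and the `titles` set (standing in for a lookup against Wikipedia entries) are not from the slides.

```python
import re

# One letter per POS tag: A = adjective, N = noun, P = preposition, X = anything else.
MULTIWORD = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def multiword_candidates(tagged, max_len=5):
    """Yield every span of 2..max_len tokens whose tag sequence fits the pattern."""
    tags = "".join(tag for _, tag in tagged)
    for start in range(len(tagged)):
        for end in range(start + 2, min(start + max_len, len(tagged)) + 1):
            if MULTIWORD.fullmatch(tags[start:end]):
                yield " ".join(word for word, _ in tagged[start:end])


def valid_multiwords(tagged, wikipedia_titles):
    """Keep only candidates that have a matching Wikipedia entry."""
    return [c for c in multiword_candidates(tagged) if c.lower() in wikipedia_titles]


tagged = [  # hand-tagged for illustration; a POS tagger would produce this
    ("gene", "N"), ("therapy", "N"), ("for", "P"),
    ("severe", "A"), ("immune", "A"), ("deficiency", "N"),
]
titles = {"gene therapy", "immune deficiency"}  # hypothetical title set
found = valid_multiwords(tagged, titles)
```

Testing every span with `fullmatch` avoids depending on how the regex engine orders the two alternatives; the validation step then discards candidates such as "severe immune deficiency" that have no entry in the title set.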
Text with multiwords:

Keep all nouns
Keep all adjectives that are part of a multiword
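
A rough sketch of this reduction step, assuming the text is already POS-tagged (N = noun, A = adjective, X = other) and the validated multiwords are known. As a simplification that is not in the slides, an adjective is kept whenever it occurs as a word of any validated multiword, without checking its position in the text.

```python
def reduce_text(tagged, multiwords):
    """Keep every noun, plus adjectives that occur inside a validated multiword."""
    multiword_words = {w for mw in multiwords for w in mw.lower().split()}
    kept = []
    for word, tag in tagged:
        if tag == "N" or (tag == "A" and word.lower() in multiword_words):
            kept.append(word)
    return kept


tagged = [  # hand-tagged for illustration
    ("severe", "A"), ("immune", "A"), ("deficiency", "N"),
    ("is", "X"), ("rare", "A"),
]
reduced = reduce_text(tagged, ["immune deficiency"])
```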
Evaluation (human inspection of results)
100 samples (50 technical, 50 generic)
Multiwords improved the result significantly: 7 (5 technical)
Improved marginally: 13
Worsened marginally: 6

Overall improvement: 10% on technical text
Work in progress
Concept-mediated mapping among documents
How similar are two docs? Jaccard index

[Figure: Doc 1 is linked to Concepts 1, 2 and 3; Doc 3 is linked to Concepts 2, 3 and 4]
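
The Jaccard index of two concept sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch follows, reading the diagram as Doc 1 mapping to Concepts 1-3 and Doc 3 mapping to Concepts 2-4 (an interpretation of the figure); returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(concepts_a, concepts_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


doc1 = {"Concept 1", "Concept 2", "Concept 3"}
doc3 = {"Concept 2", "Concept 3", "Concept 4"}
similarity = jaccard(doc1, doc3)  # 2 shared concepts out of 4 distinct ones
```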
Syllabi comparison
Interlinks
Mapping documents in different languages
Deploying Wikipedia interlinks, Jaccard index

[Figure: Doc 1 is linked to Concepts 1, 2 and 3; Doc 3 is linked to Concepts 2, 3 and 4; the two concept columns are connected through INTERLINKS]
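
For documents in different languages, one possible reading of the diagram is to translate one document's concepts through the interlanguage links first and then compare the sets. The sketch below assumes this reading; the `interlinks_it_en` table and the concept sets are hypothetical, and concepts without an interlink are simply dropped.

```python
def to_english(concepts, interlinks):
    """Translate concept titles via interlinks; drop titles without a counterpart."""
    return {interlinks[c] for c in concepts if c in interlinks}


def cross_language_similarity(concepts_it, concepts_en, interlinks_it_en):
    """Jaccard index computed after mapping Italian concepts to English titles."""
    mapped = to_english(concepts_it, interlinks_it_en)
    english = set(concepts_en)
    union = mapped | english
    return len(mapped & english) / len(union) if union else 0.0


interlinks_it_en = {  # hypothetical interlink table
    "Nuoto": "Swimming",
    "Venezia": "Venice",
    "Giochi_olimpici": "Olympic_Games",
}
doc_it = {"Nuoto", "Venezia", "Bucintoro"}
doc_en = {"Swimming", "Venice", "Ian_Thorpe"}
score = cross_language_similarity(doc_it, doc_en, interlinks_it_en)
```

Dropping unmapped concepts (here `Bucintoro`) keeps the comparison within one language, at the cost of ignoring concepts that exist in only one edition of Wikipedia.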
