Semantic Transforms Using
 Collaborative Knowledge Bases


Yegin Genc, Winter Mason, Jeffrey V. Nickerson

          Stevens Institute of Technology
Overview


• Automatically understand online information

• Using network artifacts, such as Wikipedia, to
  help
Topic Models
       Algorithms to understand and
       organize documents by
       uncovering semantic structure
       of a document collection

       • Discover hidden themes –
         patterns of word use
       • Connect documents that
         exhibit similar patterns
Latent Dirichlet Allocation (LDA)

   “In the computer science field of artificial intelligence, a genetic algorithm (GA) is a
   search heuristic that mimics the process of natural evolution. This heuristic is
   routinely used to generate useful solutions to optimization and search problems.
   Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which
   generate solutions to optimization problems using techniques inspired by natural
   evolution, such as inheritance, mutation, selection, and crossover.” 1


            Algorithms      – 0.28               Genetic         – 0.18
            Optimization    – 0.28               Natural         – 0.18
            Algorithm       – 0.14               Evolution       – 0.18
            Computer        – 0.14               Evolutionary    – 0.09
            Techniques      – 0.14               …
            ….
1 http://en.wikipedia.org/wiki/Genetic_algorithm
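The generative story behind these word lists can be sketched in a few lines of Python. This is a minimal illustration, not the inference procedure: the two topics reuse the word weights shown on the slide, while the topic names `topic_a` and `topic_b`, the 50/50 topic mixture, and the fixed seed are assumptions made for the example. In full LDA the per-document mixture is itself drawn from a Dirichlet prior rather than fixed by hand.

```python
import random

# Word weights copied from the slide; topic names are hypothetical labels.
topics = {
    "topic_a": {"algorithms": 0.28, "optimization": 0.28, "algorithm": 0.14,
                "computer": 0.14, "techniques": 0.14},
    "topic_b": {"genetic": 0.18, "natural": 0.18, "evolution": 0.18,
                "evolutionary": 0.09},
}

# Assumed document-level topic mixture (fixed here for illustration).
topic_mixture = {"topic_a": 0.5, "topic_b": 0.5}


def generate_document(topic_mixture, topics, n_words, rng):
    """Draw (topic, word) pairs: pick a topic per word, then a word from it."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topic_names]
    pairs = []
    for _ in range(n_words):
        # Each word position first gets a topic assignment ...
        z = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # ... and the word is then drawn from that topic's distribution.
        vocab = list(topics[z])
        word = rng.choices(vocab, weights=[topics[z][v] for v in vocab], k=1)[0]
        pairs.append((z, word))
    return pairs


document = generate_document(topic_mixture, topics, n_words=8, rng=random.Random(0))
for z, word in document:
    print(z, word)
```

`random.choices` accepts unnormalized weights, so the truncated lists from the slide can be used as they are.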
Topics from LDA
     computer          chemistry           cortex             orbit           infection
     methods            synthesis         stimulus            dust            immune
      number            oxidation             fig            jupiter             aids
         two            reaction            vision             line            infected
      principle          product           neuron            system              viral
       design            organic         recordings           solar              cells
Five topics from a 50-topic LDA model fit to Science from 1980 – 2002 (Blei and Lafferty, 2009)



   methods      k               of   the    for  the              the      operations     the
      the      the           objects of     the   o               and         the          of
       a        of              to     a  linear we                of      functional       a
       of   algorithm         and     to problem and               to       requires       is
   problems    for             the   we problems a                that        and          in
Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of
the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).
The interpretation problem
1. Labeling the topics is difficult (J. Chang et al.,
   2009)
2. The relationships between topics are not
   identified
3. The information in the topics is based solely
   on the input corpus
4. The external validity of the topics may be
   limited
Collaborative Knowledge Bases
1. Labeled topics
2. Connected to each other in a meaningful way
3. Contain rich, focused information on
   particular topics
4. Contain fresh, up-to-date information about
   practically everything
Wikipedia Pages as Topics
LDA topic:       orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field

Wikipedia Page:  Solar System
                 “The Solar System consists of the Sun and the astronomical objects
                 gravitationally bound in orbit around it, all of which formed from the
                 collapse of a giant molecular cloud approximately 4.6 billion years ago…”
                 (http://en.wikipedia.org/wiki/Solar_System)
Wikipedia Pages as Topics
Topics are characterized as distributions over observed words in
Wikipedia pages

 Wikipedia Word   Freq.   p(Wi | k)
 orbit              34      0.12
 dust                7      0.02
 jupiter            36      0.12
 line                0      0.00
 system             76      0.26
 solar             110      0.38
 gas                11      0.04
 atmospheric         1      0.00
 mars                8      0.03
 field               8      0.03

 \beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}

 βk : Per-topic word distribution
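As a quick check of the table, the per-topic word distribution can be computed by dividing each word count by the total. The sketch below uses the counts from the slide. It assumes the normalization runs over the ten listed words only, which reproduces the probabilities shown; the function name is illustrative.

```python
# Word counts for the Solar System page, copied from the slide.
counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def per_topic_word_distribution(word_counts):
    """Normalize raw word counts into a probability distribution."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("cannot normalize: all counts are zero")
    return {word: count / total for word, count in word_counts.items()}


beta_k = per_topic_word_distribution(counts)
for word, p in beta_k.items():
    print(f"{word:12s} {p:.2f}")
```

Words that never occur on the page, such as "line", receive probability zero under this estimate, so a smoothing term may be worth considering in practice.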
[Diagram: matrix view of LDA and the Wikipedia-based method]

DOCUMENT – TOPIC: Θ (D x K)      DOCUMENT – WORD: W (D x W)      TOPIC – WORD: β (K x W)

LDA:   panel labels Z d,n and W d,n
WIKI:  Θ (D x K) = W (D x W) * Wiki (W x K)

D: Documents           K: Topics   W: Words
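The WIKI row of the diagram reads as a matrix product: a document-word matrix multiplied by a word-topic matrix built from Wikipedia pages yields a document-topic matrix. The sketch below illustrates that product on toy data. The vocabulary, the counts, the two page distributions, and the final row normalization are all assumptions made for the example and are not taken from the slides.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[j] * b[j][k] for j in range(inner)) for k in range(cols)]
            for row in a]


def normalize_rows(matrix):
    """Scale each row to sum to 1 (assumed step, for readability)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result


# Toy data: D = 2 documents, W = 3 words, K = 2 Wikipedia pages.
vocab = ["orbit", "solar", "algorithm"]
doc_word = [      # W (D x W): word counts per document
    [2, 3, 1],
    [0, 0, 4],
]
wiki = [          # Wiki (W x K): column k is page k's word distribution
    [0.4, 0.0],   # orbit
    [0.6, 0.0],   # solar
    [0.0, 1.0],   # algorithm
]

theta = normalize_rows(matmul(doc_word, wiki))  # Θ (D x K)
for row in theta:
    print([round(value, 3) for value in row])
```

Unlike LDA, nothing is inferred iteratively here: the topic-word side is fixed by the Wikipedia pages, so scoring a document reduces to one multiplication.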
Experiment
Data
617 abstracts from Journal of the ACM
Classified into 80 categories by their authors
53 categories have corresponding Wikipedia Pages

Abstracts
{Article Name:        On the (Im)possibility of Obfuscating Programs,
    Category:         D.4. Operating Systems
    Add. Category:    F.1 Computation by Abstract Devices
    …
}

Category Mappings
    Category                                Wikipedia Page
    D.4 Operating Systems:                  Operating System
    F.1 Computation by Abstract Devices :   Abstract Machine
Three variations of our method



- Inbound links are Wikipedia pages that link to the topic page
- Outbound links are Wikipedia pages linked to by the topic
  page
- Text-based method only uses word distributions in topic pages
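The two link-based variations differ only in the direction of the links they collect. A minimal sketch of that distinction follows, assuming the link structure is available as a mapping from each page to the set of pages it links to. The toy graph and function names are hypothetical, and how the collected links are then compared with a document is not specified on the slide.

```python
# Hypothetical toy link graph: page -> set of pages it links to.
links = {
    "Operating system": {"Kernel (operating system)", "Computer hardware"},
    "Kernel (operating system)": {"Operating system"},
    "Device driver": {"Operating system", "Computer hardware"},
    "Computer hardware": set(),
}


def outbound_links(graph, page):
    """Pages that the topic page links to."""
    return set(graph.get(page, set()))


def inbound_links(graph, page):
    """Pages that link to the topic page."""
    return {source for source, targets in graph.items() if page in targets}


print(sorted(outbound_links(links, "Operating system")))
print(sorted(inbound_links(links, "Operating system")))
```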
Results
      Method                    Primary                   Primary or Additional

         Text                 182 (29.5%)                      314 (50.8%)

   Inbound links              131 (21.2%)                      249 (40.0%)

  Outbound links               79 (12.8%)                      166 (26.9%)



The number (and percentage) of authors’ primary ACM topic labels, or authors’
primary + additional ACM topics successfully identified by each method.

LDA cannot be compared without an additional step mapping word distributions to
ACM topics.
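The two columns of the table correspond to two ways of counting a hit. The sketch below shows one plausible way to compute them, assuming each abstract receives a single predicted category that is compared first with the author's primary category and then with the additional ones. The records and field names are hypothetical toy data, not the experiment's.

```python
# Hypothetical toy records; category codes follow the ACM style used above.
records = [
    {"primary": "D.4", "additional": ["F.1"], "predicted": "D.4"},
    {"primary": "D.4", "additional": ["F.1"], "predicted": "F.1"},
    {"primary": "F.2", "additional": [], "predicted": "G.2"},
    {"primary": "G.2", "additional": ["F.2"], "predicted": "I.2"},
]


def score(records):
    """Count primary matches and primary-or-additional matches."""
    primary_hits = 0
    any_hits = 0
    for record in records:
        if record["predicted"] == record["primary"]:
            primary_hits += 1
            any_hits += 1
        elif record["predicted"] in record["additional"]:
            any_hits += 1
    return primary_hits, any_hits


def as_percent(hits, total):
    """Express a hit count as a percentage rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


primary_hits, any_hits = score(records)
print(primary_hits, as_percent(primary_hits, len(records)))
print(any_hits, as_percent(any_hits, len(records)))
```

Because every primary match also counts as a primary-or-additional match, the second column can never be smaller than the first.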
Results (Qualitative)
Concluding Remarks
The Wiki categories often match the categories that
were chosen by the authors. When they don’t
match, they generally appear plausible.

Among the variations of our method, the text-based
approach performed better than the link-based
approaches.

Among the link-based approaches, inbound links
performed better than outbound links.
Next Steps

Dependent topic structures

Combine heuristics with generative models:
  Wikipedia as a prior for the topic distribution
  Learn from the documents observed.

Editor's Notes

  1. Blei: “Much of my research is in topic models, which are a suite of algorithms to uncover the hidden thematic structure of a collection of documents. These algorithms help us develop new ways to search, browse and summarize large archives of texts.”
  2. Here is an example of a paragraph. We assume that some number of topics exist in a document set. Each document is a mixture of these corpus-wide topics. Each topic is a distribution over words. Each word is drawn from one of those topics.
  3. Describing what they mean is different.
  4. Use posterior expectations / approximate posterior inference: Gibbs sampling, variational inference.
  5. We chose this so that we can validate our results.
  6. Pause… Thank you.