Textometry and Information Discovery : A New
 Approach to Mining Textual Data on the Web

         Erin MacMurray*, Marguerite Leenhardt **
    SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne
                      Nouvelle Paris 3
               *erin.macmurray@gmail.com
            ** marguerite.leenhardt@gmail.com




     ICAI’11 Workshop on Intelligent Linguistic Technologies
In a nutshell
             •   Introduction & background
             •   Textometry and Web Mining: why?
             •   Textometry and Web Mining: how?
             •   Textometry and Web Mining: application?
             •   Conclusion




22/07/2011        E. MacMurray & M. Leenhardt   ICAI’11 Workshop on Intelligent Linguistic Technologies 2
Introduction & background
Structure?

    Seth Grimes sees « three categories of data: (i) Quantities, whether measured,
    observed, or computed (ii) Content, which I’ll characterize as non-quantitative
    information (iii) Metadata describing quantities and content.
    Structured/unstructured is a false dichotomy. »
    (July 2011 – IKS Semantic Workshop, France)

Man versus machine?

    Neil Glassman: « between those on one side who feel the accuracy of automated
    [content analysis] is sufficient and those on the other side who feel we can only rely
    on human analysis […] most in the field concur with the idea that we need to
    define a methodology where the software and the analyst collaborate to get over the
    noise and deliver accurate analysis. »
    (May 2011 – Sentiment Analysis Symposium review)
Textometry and Web Mining: why ?
• Improving Linguistic Models
   – Semantic complexity of simple units such as named entities (NE)
   – Identifying paraphrases of NE

NE: a heterogeneous category
   INTEL, Paris, Gone with the wind, Harry Potter, JIF Peanut Butter, lyzozym,
   The 4th of July, 20GB, Dulles International Airport, Le Tour de France, www.nytimes.com
   Ehrmann M. (2008) Les EN, de la linguistique au TAL : statut théorique et méthodes de désambiguïsation.

Paraphrases of a single NE
   Le président de la République, Sarko, Nicolas Sarkozy, Sarkoland, M. Sarkozy,
   Sarkozyste, Mr Sarkozy, Sarkozysme

Textometry and Web Mining: why ?
         •   Text is considered to have its own internal structure
         •   Statistical and probabilistic calculations are applied directly to the textual
             units of comparable texts in a corpus




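Textometry and Web Mining: how?

Specificness of forms per day (Hypergeometric Distribution)

Date             Form    Specificness
July 4th 2011    b       23.43
July 4th 2011    b       12.68
July 4th 2011    b       5.57
July 4th 2011    b       5.66
July 5th 2011    d       13.73
July 5th 2011    d       21.86
July 5th 2011    d       7.75
July 5th 2011    d       6.55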
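
A specificness score of this kind can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes the common textometric convention of reporting the order of magnitude of a hypergeometric tail probability, with a positive sign for over-represented forms and a negative sign for under-represented ones. The names `T` (corpus size), `t` (size of the part, e.g. one day), `F` (total frequency of the form) and `k` (its frequency in the part) are chosen here for illustration.

```python
from math import comb, log10

def tail_probability(lo, hi, T, t, F):
    """Probability that a form of total frequency F occurs between lo and hi
    times in a part of t occurrences drawn from a corpus of T occurrences."""
    total = comb(T, t)
    favourable = sum(comb(F, i) * comb(T - F, t - i) for i in range(lo, hi + 1))
    # Exact integer arithmetic; for very large corpora a log-gamma
    # formulation would be preferable to avoid huge intermediate values.
    return favourable / total

def specificness(k, T, t, F):
    """Signed score (assumed convention): positive when the form is
    over-represented in the part, negative when under-represented."""
    expected = F * t / T
    if k >= expected:
        # Upper tail: P(X >= k)
        return -log10(tail_probability(k, min(F, t), T, t, F))
    # Lower tail: P(X <= k)
    return log10(tail_probability(0, k, T, t, F))
```

Under this convention, a score of 23.43 would mean that observing the form that often in the part by chance has a probability on the order of 10^-23.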




Textometry and Web Mining: how?
Co-occurrence: two or more words that appear together within a predetermined span of text,
forming lexical relationships around a pivot form (William Martinez, 2003)
Result: a network of associative relationships

Example spans:
  ---A---C---B---D.
  ---B---C---H---E.
  ---B---C---A---E.
  ---E---B---D---F.
  ---C---A---D---H.
  ---F---C---B---D.
  ---E---B---D---A.

[Diagram: resulting network linking the forms A, B, C and E]
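
The counting step behind such a network can be sketched on the toy spans from the slide. This is a simplified illustration: it uses a raw frequency threshold (`min_count`, an assumption made here) where a textometric tool would normally apply a probabilistic filter, and the function names are hypothetical.

```python
from collections import Counter

# The seven toy spans shown on the slide, one sentence per span.
spans = [
    "A C B D", "B C H E", "B C A E", "E B D F",
    "C A D H", "F C B D", "E B D A",
]

def cooccurrences(spans, pivot):
    """Count, for each form, the number of spans it shares with the pivot."""
    counts = Counter()
    for span in spans:
        forms = set(span.split())
        if pivot in forms:
            counts.update(forms - {pivot})
    return counts

def network(spans, pivot, min_count=2):
    """Keep only the pivot-form links seen in at least min_count spans."""
    return {
        (pivot, form): n
        for form, n in cooccurrences(spans, pivot).items()
        if n >= min_count
    }
```

Calling `network(spans, "B", min_count=3)` keeps the links from B to the forms it shares at least three spans with; repeating the call with each retained form as the new pivot would extend the network outward.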
Textometry and Web Mining : how?
1/ POINT OF ENTRY
   Article selection based on NE (companies and people)
   Company NE = Xerox
   People NE = Nicolas Sarkozy

2/ CORPUS
   160 articles: 184,761 occurrences / 13,075 forms / 5,194 hapax
   103 articles: 197,341 occurrences / 17,807 forms / 9,416 hapax

3/ TEXTOMETRIC ANALYSIS
   Hypergeometric distribution: specificness, co-occurrences

4/ INTERPRETATION OF RESULTS
   Quantitative information used to formulate qualitative interpretations.
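
The three corpus figures reported in step 2 (occurrences, forms, hapax) can be computed along these lines. The tokenization below (lowercasing, splitting on word characters) is an assumption made for the sketch; the segmentation rules actually used for the corpora are not stated on the slide.

```python
import re
from collections import Counter

def corpus_profile(text):
    """Return the number of occurrences (tokens), forms (distinct tokens)
    and hapax (forms occurring exactly once) in a text."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokenization
    freq = Counter(tokens)
    return {
        "occurrences": len(tokens),
        "forms": len(freq),
        "hapax": sum(1 for n in freq.values() if n == 1),
    }
```

A different segmentation choice, for instance keeping case or treating hyphenated words as one form, would change all three counts, so figures are only comparable across corpora processed the same way.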




Textometry and Web Mining: results?
   Observing the forms and repeated segments of « Nicolas Sarkozy »
   makes it possible to identify polarities of opinion in its paraphrases,
   providing clues as to how the NE is perceived.

   [Figure: paraphrases grouped into two categories: contextually dependent, negative]
Textometry and Web Mining: results?




             Figure - Monthly variation of specificness for paraphrases for the NE « Nicolas Sarkozy ».

Textometry and Web Mining: results?
When a current event is discussed in the media, the lexical network produced by the
co-occurrence calculation is larger than during periods of calm
or low activity of the NE (« buzz effect »).
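
One simple way to observe this effect is to compare the size of the pivot's co-occurrence network from one period to the next. The sketch below is an illustration under assumptions: network size is taken to be the number of distinct forms sharing a span with the NE, and the `(period, span)` input format and function name are hypothetical.

```python
from collections import defaultdict

def network_size_by_period(dated_spans, pivot):
    """Map each period to the number of distinct forms that share
    at least one span with the pivot during that period."""
    neighbours = defaultdict(set)
    for period, span in dated_spans:
        forms = set(span.split())
        if pivot in forms:
            neighbours[period] |= forms - {pivot}
    return {period: len(forms) for period, forms in neighbours.items()}
```

A sharp rise in this count between two consecutive periods would be a candidate signal of an event involving the NE, to be confirmed by reading the corresponding contexts.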




Conclusion
• Two intelligence use-cases on Le Monde and The New York Times
• Two complementary approaches: specificness and co-occurrence analysis

• Three main contributions:
   – Building corpus-driven linguistic resources (saving time and cost)
   – Identifying trends with the specificness calculation
   – Targeting zones of activity or events through co-occurrence networks

• In sum, this method:
   – Helps derive knowledge from corpora without predefined information
     models
   – Provides adequate functions enabling interaction between the
     expertise of the user and processing tools
References
Bloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p 279-289.
Bollier, D. The Promise and Peril of Big Data. Washington, DC : The Aspen Institute, 2010.
Delanoë, A. 2010. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution.
        Proceedings, JADT’2010.
Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST
        Conference, Toulouse, France, 2010.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
Feldman R. & Sanger J., The Text Mining Handbook : Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, 422 p.
Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, Linguistic Analysis Philological Society, Oxford, 1957.
Grishman, R. & Sundheim, B. Message Understanding Conference-6 : A Brief History. Proceedings of the 16th International Conference on Computational
        Linguistics (COLING), I, Copenhagen, 1996, p. 466–471.
Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent
        Systems, 1999, volume LNAI 1609, p. 16–29.
Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994.
 Lent, B., Agrawal, R., & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17 p. 227–230.
MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.
Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.
Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en
        Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
Née, E. Insécurité et élections présidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem,
        2008.
Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris : Hermès Sciences, 2003.
Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.
Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7 èmes Journées internationales d’Analyse Statistique des Données Textuelles),
        2004, p 986-992.
Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
Tufféry, S. Data mining et statistique décisionnelle: l'intelligence des données. Paris : Editions Technip, 2007.
Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, the International Journal of Applied Management and Technology,
        2006 vol 4 n°2 p.349-387.





Weitere Àhnliche Inhalte

Ähnlich wie Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web

Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...Gilbert Paquette
 
Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...Touradj Ebrahimi
 
Computing for Human Experience and Wellness
Computing for Human Experience and WellnessComputing for Human Experience and Wellness
Computing for Human Experience and WellnessAmit Sheth
 
Network of future 2011 befemto
Network of future 2011   befemtoNetwork of future 2011   befemto
Network of future 2011 befemtoThierry Lestable
 
Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1STIinnsbruck
 
Uncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information SystemsUncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information SystemsYiwei Cao
 
Invited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigationInvited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigationNational Taiwan Normal University
 
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...IOSRJVSP
 
e-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local levele-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local levelVassilis Goulandris
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsMustafa Jarrar
 
Analyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their ReuseAnalyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their ReuseEURECOM
 
Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012Nuno Carvalho
 
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...Mathieu d'Aquin
 
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE ijmpict
 
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology EngineeringGOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology EngineeringChristophe Debruyne
 
The EM network October 2010 Newsletter
The EM network October 2010 NewsletterThe EM network October 2010 Newsletter
The EM network October 2010 Newsletterphovdenak
 
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...Ionel Gabriel Niculescu
 

Ähnlich wie Textometry and Information Discovery : A New Approach to Mining Textual Data on the Web (20)

Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...Semantically-aware Networks and Services for Training and Knowledge Managemen...
Semantically-aware Networks and Services for Training and Knowledge Managemen...
 
Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...Keynote speech at COST 292 final workshop on future of multimedia search and ...
Keynote speech at COST 292 final workshop on future of multimedia search and ...
 
Computing for Human Experience and Wellness
Computing for Human Experience and WellnessComputing for Human Experience and Wellness
Computing for Human Experience and Wellness
 
Network of future 2011 befemto
Network of future 2011   befemtoNetwork of future 2011   befemto
Network of future 2011 befemto
 
Reference Ontology Presentation
Reference Ontology PresentationReference Ontology Presentation
Reference Ontology Presentation
 
Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1Ontologies dynamic networks of formally represented meaning1
Ontologies dynamic networks of formally represented meaning1
 
Uncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information SystemsUncertainty Handling in Mobile Community Information Systems
Uncertainty Handling in Mobile Community Information Systems
 
Invited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigationInvited talk: Sensor deployment strategies for indoor robot navigation
Invited talk: Sensor deployment strategies for indoor robot navigation
 
Geolinkeddata 07042011 1
Geolinkeddata 07042011 1Geolinkeddata 07042011 1
Geolinkeddata 07042011 1
 
GeoLinkedData
GeoLinkedDataGeoLinkedData
GeoLinkedData
 
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
Measuring the Effects of Rational 7th and 8th Order Distortion Model in the R...
 
e-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local levele-dialogos: An e-participation case at the local level
e-dialogos: An e-participation case at the local level
 
Pal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnetsPal gov.tutorial4.session12 2.wordnets
Pal gov.tutorial4.session12 2.wordnets
 
Analyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their ReuseAnalyzing and Ranking Multimedia Ontologies for their Reuse
Analyzing and Ranking Multimedia Ontologies for their Reuse
 
Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012Ontology Aware Applications @ YAPC::EU 2012
Ontology Aware Applications @ YAPC::EU 2012
 
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linke...
 
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
VIDEO OBJECTS DESCRIPTION IN HINDI TEXT LANGUAGE
 
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology EngineeringGOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
GOSPL: A Method and Tool for Fact-Oriented Hybrid Ontology Engineering
 
The EM network October 2010 Newsletter
The EM network October 2010 NewsletterThe EM network October 2010 Newsletter
The EM network October 2010 Newsletter
 
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...cxcxc program   ssk-cug 2010 - standardized systematization of knowledge via ...
cxcxc program ssk-cug 2010 - standardized systematization of knowledge via ...
 

KĂŒrzlich hochgeladen

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...gurkirankumar98700
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂșjo
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

KĂŒrzlich hochgeladen (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍾 8923113531 🎰 Avail...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity

Textometry and Information Discovery: A New Approach to Mining Textual Data on the Web

  • 1. Textometry and Information Discovery: A New Approach to Mining Textual Data on the Web. Erin MacMurray*, Marguerite Leenhardt**, SYLED/CLA2T EA2290, UFR ILPGA, Université Sorbonne Nouvelle Paris 3. *erin.macmurray@gmail.com **marguerite.leenhardt@gmail.com. ICAI’11 Workshop on Intelligent Linguistic Technologies
  • 2. In a nutshell • Introduction & background • Textometry and Web Mining: why? • Textometry and Web Mining: how? • Textometry and Web Mining: application? • Conclusion 22/07/2011 E. MacMurray & M. Leenhardt ICAI’11 Workshop on Intelligent Linguistic Technologies 2
  • 3. Introduction & background. Structure? Seth Grimes sees « three categories of data: (i) Quantities, whether measured, observed, or computed (ii) Content, which I’ll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. » (July 2011 – IKS Semantic Workshop, France). Man versus machine? Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. » (May 2011 – Sentiment Analysis Symposium review)
  • 4. Textometry and Web Mining: why? • Improving Linguistic Models – Semantic complexity of simple units such as named entities (NE) – Identifying paraphrases of NE. NE, a heterogeneous category: INTEL, Paris, Gone with the wind, Harry Potter, JIF Peanut Butter, lyzozym, The 4th of July, 20GB, Dulles International Airport, Le Tour de France, www.nytimes.com (Ehrmann M. (2008), Les EN de la linguistique au TAL : statut théorique et méthodes de désambiguïsation). Paraphrases of a single NE: Le président de la République, Sarko, Nicolas Sarkozy, Sarkoland, M. Sarkozy, Sarkozyste, Mr Sarkozy, Sarkozysme.
  • 5. Textometry and Web Mining: why? • Text is considered to have its own internal structure • Statistical and probabilistic calculations are applied directly to the textual units of comparable texts in a corpus
  • 6. Textometry and Web Mining: how? Hypergeometric distribution. [Tables: Form / Specificness. July 4th 2011: b 23.43; b 12.68; b 5.57; b 5.66. July 5th 2011: d 13.73; d 21.86; d 7.75; d 6.55.]
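A minimal Python sketch of how a specificness score can be derived from the hypergeometric distribution. The slides do not give the exact formula, so this assumes a common textometric convention: the score is the base-10 logarithm of the tail probability, signed positive when a form is over-represented in a part of the corpus and negative when it is under-represented. The names `hypergeom_pmf`, `specificity`, `T`, `t`, `f` and `k` are illustrative, not taken from the presentation.

```python
import sys
from math import comb, log10


def hypergeom_pmf(i, T, t, f):
    """P(X = i): a form with f occurrences in a corpus of T tokens
    appears exactly i times in a part of t tokens."""
    return comb(f, i) * comb(T - f, t - i) / comb(T, t)


def specificity(k, T, t, f):
    """Signed log10 tail probability of observing k occurrences in the part.

    Assumed convention: positive = over-represented, negative = under-represented.
    """
    lowest = max(0, t - (T - f))   # smallest possible count in the part
    highest = min(f, t)            # largest possible count in the part
    expected = f * t / T
    if k >= expected:
        p = sum(hypergeom_pmf(i, T, t, f) for i in range(k, highest + 1))
        sign = -1.0
    else:
        p = sum(hypergeom_pmf(i, T, t, f) for i in range(lowest, k + 1))
        sign = 1.0
    p = max(p, sys.float_info.min)  # guard against log10(0) on underflow
    return sign * log10(p)


# Toy figures (invented): corpus of 1000 tokens, part of 100 tokens,
# a form occurring 20 times overall, so about 2 occurrences are expected.
over = specificity(10, T=1000, t=100, f=20)   # far above expectation
under = specificity(0, T=1000, t=100, f=20)   # below expectation
```

Exact integer binomials keep the sketch dependency-free; for corpora of the size mentioned in the slides a statistics library would likely be faster.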
  • 7. Textometry and Web Mining: how? Co-occurrence: two or more words that appear together within a predetermined span of text, forming lexical relationships around a pivot form (William Martinez, 2003). Result: a network of associative relationships. Example spans: A C B D. B C H E. B C A E. E B D F. C A D H. F C B D. E B D A. [Figure: network diagram with nodes A, B, C and E.]
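The idea of co-occurrents around a pivot form can be sketched in a few lines of Python, using the lettered spans shown on the slide. This is only an illustration: it counts raw co-presence within a span and does not apply any statistical filtering, which a full co-occurrence calculation would normally add. The function name `cooccurrents` is hypothetical.

```python
from collections import Counter


def cooccurrents(spans, pivot):
    """Count, for each form, the number of spans it shares with the pivot."""
    counts = Counter()
    for span in spans:
        forms = set(span)              # presence per span, not raw frequency
        if pivot in forms:
            counts.update(forms - {pivot})
    return counts


# The seven example spans from the slide, one list of forms per span.
spans = [s.split() for s in [
    "A C B D", "B C H E", "B C A E", "E B D F",
    "C A D H", "F C B D", "E B D A",
]]

network = cooccurrents(spans, "B")
# Forms sharing the most spans with the pivot are the strongest candidates
# for edges in the associative network.
strongest = network.most_common(3)
```

Here the span plays the role of the "predetermined span of text"; a sentence, a paragraph or a fixed window of tokens could be used instead.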
  • 8. Textometry and Web Mining: how? 1/ Point of entry: NE (companies and people); Company NE = Xerox, People NE = Nicolas Sarkozy. 2/ Corpus (article selection): 184,761 occurrences / 13,075 forms / 5,194 hapax (160 articles); 197,341 occurrences / 17,807 forms / 9,416 hapax (103 articles). 3/ Textometric analysis: hypergeometric distribution, specificness, co-occurrences. 4/ Interpretation of results: quantitative information to formulate qualitative interpretations.
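The corpus figures above (occurrences, forms, hapax) are standard lexicometric counts. A rough Python sketch of how they can be computed follows; the tokenization (lowercasing, runs of word characters) is an assumption and almost certainly differs from the segmentation used by the authors, and `corpus_profile` is a hypothetical name.

```python
import re
from collections import Counter


def corpus_profile(text):
    """Return the number of occurrences (tokens), forms (distinct tokens)
    and hapax (forms occurring exactly once) in a text."""
    tokens = re.findall(r"\w+", text.lower())   # assumed tokenization
    freq = Counter(tokens)
    return {
        "occurrences": len(tokens),
        "forms": len(freq),
        "hapax": sum(1 for count in freq.values() if count == 1),
    }


profile = corpus_profile("the cat saw the dog")
```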
  • 9. Textometry and Web Mining: results? Observing forms and repeated segments of « Nicolas Sarkozy » makes it possible to identify polarities of opinion in paraphrases, providing clues to how the NE is perceived. [Figure: paraphrases bracketed as contextually dependent or negative.]
  • 10. Textometry and Web Mining: results? Figure: Monthly variation of specificness for paraphrases of the NE « Nicolas Sarkozy ».
  • 11. Textometry and Web Mining: results? When a current event is being discussed in the media, the lexical network produced by the co-occurrence calculation is larger than during periods of calm or low activity of the NE (« buzz effect »).
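One way to picture the « buzz effect » is to compare the size of the pivot's co-occurrence network across time periods. The Python sketch below is an assumption-laden illustration, not the authors' procedure: network size is taken as the number of distinct forms sharing a span with the pivot, and a period is flagged when its network is at least `factor` times the median size. The data, the period labels and the names `network_size` and `buzz_periods` are invented.

```python
from statistics import median


def network_size(spans, pivot):
    """Number of distinct forms that share at least one span with the pivot."""
    neighbours = set()
    for span in spans:
        forms = set(span)
        if pivot in forms:
            neighbours |= forms - {pivot}
    return len(neighbours)


def buzz_periods(spans_by_period, pivot, factor=2.0):
    """Flag periods whose network is at least `factor` times the median size."""
    sizes = {period: network_size(spans, pivot)
             for period, spans in spans_by_period.items()}
    baseline = median(sizes.values())
    return [period for period, size in sizes.items()
            if baseline > 0 and size >= factor * baseline]


# Invented toy data: the pivot "NE" attracts more distinct forms in P2.
toy = {
    "P1": [["NE", "x"]],
    "P2": [["NE", "x", "y"], ["NE", "z", "w"]],
    "P3": [["NE", "y"]],
}
flagged = buzz_periods(toy, "NE")
```

The median baseline and the threshold of 2 are arbitrary choices; in practice the analyst would inspect the networks rather than rely on a fixed cut-off.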
  • 12. Textometry and Web Mining: results?
  • 13. Textometry and Web Mining: results?
  • 14. Conclusion • Two intelligence use cases on Le Monde and The New York Times • Two complementary approaches: specificness and co-occurrence analysis • Three main contributions: – Building corpus-driven linguistic resources (time and cost-cutting) – Identifying trends with specificness calculation – Targeting zones of activity or events through co-occurrence networks • In sum, this method: – Helps derive knowledge from corpora without predefined information models – Provides adequate functions enabling interaction between the expertise of the user and processing tools
  • 15. References
      Bloom K., Stein S. & Argamon S., Appraisal extraction for news opinion analysis at NTCIR-6, Proceedings of NTCIR-6, 2007, p. 279-289.
      Bollier, D. The Promise and Peril of Big Data. Washington, DC: The Aspen Institute, 2010.
      Delanoë, A. Statistique textuelle et séries chronologiques sur un corpus de presse écrite. Le cas de la mise en application du principe de précaution. Proceedings, JADT’2010, 2010.
      Delaplace R., Leenhardt M. & Wu L-C., Méthode de conception d’une application de veille et d’Analyse Linguistique Assistée par Ordinateur, VSST Conference, Toulouse, France, 2010.
      Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
      Feldman R. & Sanger J., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, 422 p.
      Firth, J.R. A Synopsis of Linguistic Theory 1930-1955, Linguistic Analysis Philological Society, Oxford, 1957.
      Grishman, R. & Sundheim, B. Message Understanding Conference-6: A Brief History. Proceedings of the 16th International Conference on Computational Linguistics (COLING), I. Copenhagen, 1996, p. 466–471.
      Kodratoff, Y. Knowledge discovery in texts: A definition and applications, Proceedings of the International Symposium on Methodologies for Intelligent Systems, 1999, volume LNAI 1609, p. 16–29.
      Lebart, L. & Salem, A. Statistique textuelle. Paris, Dunod, 1994.
      Lent, B., Agrawal, R., & Srikant, R. Discovering trends in text databases, Proceedings KDD’1997, AAAI Press, 14–17 p. 227–230.
      MacMurray E. & Shen L., Textual Statistics and Information Discovery: Using Co-occurrences to Detect Events, VSST Conference, Toulouse, France, 2010.
      Martin J.R. & White P.R.R., The language of evaluation: appraisal in English, Palgrave, London, 2005.
      Martinez, W. Contribution à une méthodologie de l’analyse des cooccurrences lexicales multiples dans les corpus textuels, Thèse pour le doctorat en Sciences du Langage, Université de la Sorbonne nouvelle - Paris 3, 2003.
      Née, E. Insécurité et élections présidentielles dans le journal Le Monde, Lexicometrica numéro thématique « Explorations Textuelles », S. Fleury, A. Salem, 2008.
      Poibeau T. Extraction automatique d’information. Du texte brut au web sémantique. Paris: Hermès Sciences, 2003.
      Poibeau, T. Sur le statut référentiel des entités nommées, Proceedings TALN’05. Dourdan, France, 2005.
      Salem A., Introduction à la résonance textuelle, In Actes des JADT 2004 (7èmes Journées internationales d’Analyse Statistique des Données Textuelles), 2004, p. 986-992.
      Sandhaus, E. The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium, 2008.
      Tufféry, S. Data mining et statistique décisionnelle: l'intelligence des données. Paris: Editions Technip, 2007.
      Wright, K. Using Open Source Common Sense Reasoning Tools in Text Mining Research, the International Journal of Applied Management and Technology, 2006, vol. 4, n°2, p. 349-387.