2. In a nutshell
• Introduction & background
• Textometry and Web Mining: why?
• Textometry and Web Mining: how?
• Textometry and Web Mining: application?
• Conclusion
22/07/2011 E. MacMurray & M. Leenhardt ICAI'11 Workshop on Intelligent Linguistic Technologies 2
3. Introduction & background
Structure?
Seth Grimes sees « three categories of data: (i) Quantities, whether measured, observed, or computed (ii) Content, which I'll characterize as non-quantitative information (iii) Metadata describing quantities and content. Structured/unstructured is a false dichotomy. »
(July 2011 – IKS Semantic Workshop, France)
Man versus machine?
Neil Glassman: « between those on one side who feel the accuracy of automated [content analysis] is sufficient and those on the other side who feel we can only rely on human analysis […] most in the field concur with the idea that we need to define a methodology where the software and the analyst collaborate to get over the noise and deliver accurate analysis. »
(May 2011 – Sentiment Analysis Symposium review)
5. Textometry and Web Mining: why?
• Text is considered to have its own internal structure
• Statistical and probabilistic calculations are applied directly to the textual units of comparable texts in a corpus
6. Textometry and Web Mining: how?
Hypergeometric Distribution
July 4th 2011
Form   Specificness
b      23.43
b      12.68
b      5.57
b      5.66
July 5th 2011
Form   Specificness
d      13.73
d      21.86
d      7.75
d      6.55
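A specificness score based on the hypergeometric distribution can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how the probability is turned into a score, so the signed negative base-10 logarithm of the tail probability used here is an assumption, and the names `tail_probability` and `specificness` are hypothetical. The counts are assumed to be mutually consistent (an observation that is actually possible in the corpus).

```python
from fractions import Fraction
from math import comb, log10


def tail_probability(k, T, f, t, upper=True):
    """Exact hypergeometric tail for one form.

    T: tokens in the whole corpus, t: tokens in the part under study,
    f: occurrences of the form in the corpus, k: occurrences in the part.
    Returns P(X >= k) if upper is True, otherwise P(X <= k).
    """
    values = range(k, min(f, t) + 1) if upper else range(0, k + 1)
    favourable = sum(comb(f, i) * comb(T - f, t - i) for i in values)
    return Fraction(favourable, comb(T, t))


def specificness(k, T, f, t):
    """Signed score: positive if the form is over-represented in the part.

    Assumption: the score is -log10 of the tail probability, with the sign
    chosen by comparing k to its expected value f * t / T.
    """
    over_represented = k * T >= f * t
    p = tail_probability(k, T, f, t, upper=over_represented)
    # -log10(p), computed on the exact integers to avoid float underflow
    magnitude = log10(p.denominator) - log10(p.numerator)
    return magnitude if over_represented else -magnitude
```

For a toy corpus of 1,000 tokens with a 100-token part, a form occurring 10 times overall and 10 times in the part gets a large positive score, while the same form with 0 occurrences in the part gets a negative one.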
7. Textometry and Web Mining: how?
Two or more words that appear together in a predetermined span of text: lexical relationships around a pivot-form (William Martinez, 2003)
Example spans:
---A---C---B---D.
---B---C---H---E.
---B---C---A---E.
---E---B---D---F.
---C---A---D---H.
---F---C---B---D.
---E---B---D---A.
Result: network of associative relationships (nodes shown: A, B, C, E)
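Counting co-occurrents around a pivot-form can be sketched with the seven toy sequences from this slide. This is a minimal illustration, not the authors' implementation: it assumes the span is the whole sequence and counts each co-occurrent at most once per span. The names `SPANS` and `cooccurrents` are hypothetical.

```python
from collections import Counter

# The seven toy sequences from the slide, one span per entry.
SPANS = [
    "A C B D",
    "B C H E",
    "B C A E",
    "E B D F",
    "C A D H",
    "F C B D",
    "E B D A",
]


def cooccurrents(spans, pivot):
    """Count, for each form, the number of spans it shares with the pivot."""
    counts = Counter()
    for span in spans:
        forms = set(span.split())
        if pivot in forms:
            # Every other form in the span co-occurs with the pivot once.
            counts.update(forms - {pivot})
    return counts


# Forms ranked by how often they share a span with the pivot "C".
ranking = cooccurrents(SPANS, "C").most_common()
```

Repeating the count for each retained co-occurrent, and keeping only the pairs above a chosen frequency threshold, gives the edges of an associative network.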
8. Textometry and Web Mining: how?
1/ POINT OF ENTRY
NE (companies and people)
Company NE = Xerox
People NE = Nicolas Sarkozy
Article selection
2/ CORPUS
160 articles: 184,761 occurrences / 13,075 forms / 5,194 hapax
103 articles: 197,341 occurrences / 17,807 forms / 9,416 hapax
3/ TEXTOMETRIC ANALYSIS
Hypergeometric Distribution
Specificness
Cooccurrences
4/ INTERPRETATION OF RESULTS
Quantitative information to formulate qualitative interpretations.
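The three corpus figures given in step 2 (occurrences, forms, hapax) can be computed as in the sketch below. The tokenization (lowercasing, then splitting on runs of word characters) is an assumption, since the slides do not describe how the corpora were segmented, and `corpus_counts` is a hypothetical name.

```python
import re
from collections import Counter


def corpus_counts(text):
    """Return the number of occurrences, distinct forms and hapax in a text."""
    # Assumed tokenization: lowercase, then runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    frequencies = Counter(tokens)
    return {
        "occurrences": len(tokens),   # running words
        "forms": len(frequencies),    # distinct word forms
        "hapax": sum(1 for n in frequencies.values() if n == 1),
    }
```

A different tokenizer (for example one that keeps case or splits on apostrophes) would change all three figures, so counts are only comparable across corpora segmented the same way.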
9. Textometry and Web Mining: results?
Observing forms and repeated segments of « Nicolas Sarkozy » allows identifying polarities of opinion in paraphrases, providing clues for determining how the NE is perceived.
[Figure: list of segments bracketed into two groups, labelled "contextually dependent" and "negative"]
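Repeated segments can be approximated as word sequences that occur more than once in a text. The sketch below is an assumption-laden illustration: the length bounds, the frequency threshold and the name `repeated_segments` are hypothetical, and the token list in the usage note is an invented toy example, not data from the corpus.

```python
from collections import Counter


def repeated_segments(tokens, min_len=2, max_len=4, min_freq=2):
    """Return every segment of min_len..max_len tokens seen at least min_freq times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {segment: c for segment, c in counts.items() if c >= min_freq}
```

On the toy sequence `"the head of state said the head of state left".split()`, the segment `("the", "head", "of", "state")` and each of its sub-segments of two or three tokens are returned with a frequency of 2.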
10. Textometry and Web Mining: results?
Figure - Monthly variation of specificness for paraphrases for the NE « Nicolas Sarkozy ».
11. Textometry and Web Mining: results?
When a current event is discussed in the media, the lexical network produced by the co-occurrence calculation is larger than during periods of calm or low activity of the NE (« buzz effect »).
12. Textometry and Web Mining: results?
13. Textometry and Web Mining: results?
14. Conclusion
• Two intelligence use-cases on Le Monde and The New York Times
• Two complementary approaches: specificness and co-occurrence analysis
• Three main contributions:
  – Building corpus-driven linguistic resources (time and cost-cutting)
  – Identifying trends with specificness calculation
  – Targeting zones of activity or events through co-occurrence networks
• In sum, this method:
  – Helps derive knowledge from corpora without predefined information models
  – Provides adequate functions enabling interaction between the expertise of the user and processing tools