Experimenting the
TextTiling Algorithm
Summary of the work done by master
students at Université Toulouse Le Mirail
Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L.,
Delpech E., El Maarouf I., Fontan L., Gotlik W.
Experimenting the TextTiling
algorithm
Part I : What is the TextTiling algorithm?
Part II : Experiments with the TextTiling
algorithm
Part III : Demo
Part I :
What is the TextTiling algorithm?
• « an algorithm for partitioning expository texts into
coherent multi-paragraph discourse units which reflect
the subtopic structure of the texts »

• developed by Marti Hearst (1997):
« TextTiling: Segmenting Text into Multi-Paragraph
Subtopic Passages », Computational Linguistics, March
1997.
http://www.ischool.berkeley.edu/~hearst/tiling-about.html
Why segment a text into multi-paragraph
units?
Computational tasks that use arbitrary windows might
benefit from using windows with motivated boundaries
Easier reading of long online texts (Reading
Assistant Tools)
IR: retrieving relevant passages instead of whole
documents
Summarization: extracting sentences according to their
position in the subtopic structure
What is the hypothesis behind TextTiling?

• « TextTiling assumes that a set of lexical items is in use
during the course of a given subtopic discussion, and
when that subtopic changes, a significant proportion of the
vocabulary changes as well »
TextTiling doesn’t detect subtopics per se, but shifts in
topic signalled by changes in vocabulary
Performs a linear segmentation (no hierarchy)
Detection of topic shift
Raw text → Tokenisation → Segmentation into pseudo-sentences (20 tokens)

A similarity score is computed at every pseudo-sentence
boundary between 2 blocks of 6 pseudo-sentences
(block A vs block B)

The more vocabulary in common, the higher the score
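The steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the students' Perl implementation: the function names (`tokenize`, `pseudo_sentences`, `gap_scores`) and the simple `\w+` tokenizer are assumptions. The defaults of 20 tokens per pseudo-sentence and 6 pseudo-sentences per block are the values quoted above.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    # Assumption: a lowercased \w+ tokenizer stands in for the real one.
    return re.findall(r"\w+", text.lower())


def pseudo_sentences(tokens, size=20):
    # Cut the token stream into fixed-length pseudo-sentences.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(a, b):
    # Cosine similarity between two token-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(text, size=20, block=6):
    # One score per gap between consecutive pseudo-sentences:
    # the block before the gap is compared with the block after it.
    ps = pseudo_sentences(tokenize(text), size)
    scores = []
    for gap in range(1, len(ps)):
        left = Counter(t for s in ps[max(0, gap - block):gap] for t in s)
        right = Counter(t for s in ps[gap:gap + block] for t in s)
        scores.append(cosine(left, right))
    return scores
```

With a toy text of 40 repetitions of one word followed by 40 of another and `block=1`, the middle gap scores 0 because the two sides share no vocabulary.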
Detection of topic shift
[Figure: similarity score (y-axis, 0 to 1) plotted against pseudo-sentence number (x-axis, 1 to 39)]
• a gap means there is a drop in vocabulary similarity
• topic shifts occur at the deepest gaps (after smoothing)
• tile boundaries will be adjusted to the nearest paragraph break
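One way to turn "deepest gaps" into code is a depth score: for each valley in the score curve, add how far the curve climbs on the left and on the right before it falls again. The sketch below makes several assumptions. The `threshold` argument and the `paragraph_breaks` list (given in the same gap indices as the scores) are hypothetical inputs, and only local minima are considered, which is a simplification.

```python
def depth_scores(scores):
    # Depth of each gap: (left peak - score) + (right peak - score),
    # where each peak is reached by climbing while values do not decrease.
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths


def is_local_min(scores, i):
    # True when neither neighbour has a lower score than gap i.
    left_ok = i == 0 or scores[i - 1] >= scores[i]
    right_ok = i == len(scores) - 1 or scores[i + 1] >= scores[i]
    return left_ok and right_ok


def pick_boundaries(scores, paragraph_breaks, threshold):
    # Keep valleys deeper than the (hypothetical) threshold and snap
    # each one to the nearest paragraph break.
    depths = depth_scores(scores)
    chosen = set()
    for gap, depth in enumerate(depths):
        if depth > threshold and is_local_min(scores, gap):
            chosen.add(min(paragraph_breaks, key=lambda p: abs(p - gap)))
    return sorted(chosen)
```

For example, with scores `[0.9, 0.5, 0.2, 0.8, 0.85, 0.4, 0.9]`, paragraph breaks at gaps 1 and 6 and a threshold of 0.5, the valleys at gaps 2 and 5 are snapped to breaks 1 and 6.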
Evaluation by Hearst (1997)
• Evaluation on 12 magazine articles annotated by 7
judges

• Judges are asked « to mark the paragraph boundary at
which the topic changed »

• In case of disagreement among judges, a boundary is
kept if at least 3 judges agree on it

• Agreement among judges (kappa measure):
kappa = 0.647
Evaluation by Hearst (1997)
                     Precision   Recall
Baseline (random)    0.43        0.42
TextTiler            0.66        0.61
Judges               0.81        0.71

Works well on long (+1800 words) expository texts with
little structural demarcation
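Precision and recall for a segmenter can be computed by comparing the set of predicted boundaries with the reference boundaries. The sketch below assumes exact matching on paragraph indices, which is an assumption rather than a description of Hearst's exact protocol. The function name and the example boundary lists are illustrative only.

```python
def precision_recall_f(predicted, reference):
    # Boundaries are paragraph indices; a hit is an exact match.
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Illustrative data: 3 predicted boundaries, 4 reference boundaries, 2 in common.
p, r, f = precision_recall_f([3, 7, 12], [3, 8, 12, 15])
```

Here 2 of the 3 predicted boundaries are correct (precision 2/3) and 2 of the 4 reference boundaries are found (recall 1/2).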
Part II : Experiments with
the TextTiling algorithm
• Work done by master's students, Université Toulouse Le
Mirail

• Implementation in Perl
• Experiments:
   - cross annotation of 3 texts
   - variation of:
      linguistic parameters
      computation parameters
Annotation of topic boundary
• No clear-cut topic shift, rather ‘regions’ of shift
Annotators felt a smaller unit (sentence) would have
been more convenient

• Our kappa: 0.56
• Hearst’s judges: 0.65

• kappa should be at least > 0.67, the best is > 0.8

• A difficult (unnatural?) task for humans
Variation of linguistic parameters
[Figure: bar chart of precision, recall and F-measure (y-axis from 0 to 0.7) for three configurations: basic, trigrams, lemmatization (TreeTagger*). Bar labels: 0.61, 0.58, 0.53, 0.35, 0.34, 0.26, 0.25, 0.23, 0.17]
* http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Variation of computation parameters
• Computation window:
   - pseudo-sentence length
   - block length
• Smoothing:
[Figure: three plots of the score (y-axis from 0 to 0.7) along the text]
Size of computation window
Rows: pseudo-sentence length (tokens); columns: block length (pseudo-sentences)

        2     4     6     8     10    12    14    16    18    20
 5      ++    +++   ++    ++    ++    ++    ++    ++    ++    ++
10      ++    ++    ++    +     +     ++    +     +     +     +
15      ++    +     +     +     +     +     +     -     -     -
20      +     +     +     -     -     -     -     -     -     --
25      +     +     -     -     -     -     -     --    --    --
30      +     -     -     -     -     --    --    --    --    --
35      +     -     -     -     -     --    --    --    --    --
40      --    --    --    --    --    --    --    --    --    --
Correlation
window size / smoothing

window size (number of tokens)   10   20   30   40   50
smoothing iteration              3    3    1    1    1
smoothing width                  2    1    2    2    1

• Correlation between window size and smoothing:
The smaller your window, the more smoothing you need
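The two smoothing parameters can be read as a moving average applied repeatedly. The sketch below is one plausible interpretation, not the students' code. It assumes `width` is the number of neighbours taken on each side of a point and `iterations` is the number of passes.

```python
def smooth(scores, width=1, iterations=1):
    # Repeated moving average; the window shrinks at the edges of the curve.
    out = list(scores)
    for _ in range(iterations):
        prev = out
        out = []
        for i in range(len(prev)):
            window = prev[max(0, i - width):i + width + 1]
            out.append(sum(window) / len(window))
    return out


# A single spike is flattened after one pass with width 1.
flattened = smooth([0.0, 1.0, 0.0], width=1, iterations=1)
```

More passes or a wider window flatten the curve further, which is why a small computation window, whose scores are noisier, tends to call for more smoothing.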
Optimal parameter set

         Nb       Nb      words /   sentences /   tokens /   smooth.     smooth.
         parag.   words   parag.    block         sentence   iteration   width
Text 1   12       2000    167       6             5          3           2
Text 2   22       2400    109       6             10         1           1
Text 3   37       1750    20        8             10         1           1

• One optimal parameter set per text
• Optimal set varies according to text/paragraph
length?
Final thoughts
• Linguistic processing:
   - lemmatization doesn’t significantly improve TextTiling
   - what about stemming?

• Computation parameters:
   - parameters are highly interdependent
   - the optimal parameter set varies from text to text

• Proposal: an adaptive TextTiler?
   - window size could be adapted to the text's intrinsic qualities
   - smoothing could then be adapted to window size
Part III :

Demo
Similarity score – Hearst (1997)

\[
\mathrm{Sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \; \sum_t w_{t,b_2}^2}}
\]

b1 : block 1
b2 : block 2
t : token
w : weight (frequency) of the token in the block
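As a worked instance of the similarity score, assume (purely for illustration) that block b1 contains the tokens "topic, topic, shift" and block b2 contains "topic, change", with w taken as the raw frequency. Only "topic" is shared, so it is the only token contributing to the numerator:

```latex
\[
\sum_t w_{t,b_1} w_{t,b_2} = 2 \cdot 1 = 2, \qquad
\sum_t w_{t,b_1}^2 = 2^2 + 1^2 = 5, \qquad
\sum_t w_{t,b_2}^2 = 1^2 + 1^2 = 2
\]
\[
\mathrm{Sim}(b_1, b_2) = \frac{2}{\sqrt{5 \cdot 2}} = \frac{2}{\sqrt{10}} \approx 0.63
\]
```

Two blocks with identical frequency profiles would score 1, and two blocks with no token in common would score 0.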
Kappa measure
http://www.musc.edu/dc/icrebm/kappa.html
                Annot 1: yes   Annot 1: no   TOTAL
Annot 2: yes    40             35            Y2 = 75
Annot 2: no     5              20            N2 = 25
TOTAL           Y1 = 45        N1 = 55       T = 100

Agreement: P(A) = 0.6
Expected agreement: P(E) = (Y1.Y2 + N1.N2) / T² = 0.475
Kappa = (P(A) – P(E)) / (1 – P(E)) = 0.24
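The kappa computation for two annotators making yes/no decisions can be checked with a short Python sketch. The function and argument names are illustrative; the counts in the call are those of the contingency table on this slide.

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    # Arguments are cell counts, annotator 1's answer first:
    # yes_no = annotator 1 said yes, annotator 2 said no.
    total = yes_yes + yes_no + no_yes + no_no
    y1, n1 = yes_yes + yes_no, no_yes + no_no   # annotator 1 totals
    y2, n2 = yes_yes + no_yes, yes_no + no_no   # annotator 2 totals
    p_a = (yes_yes + no_no) / total             # observed agreement
    p_e = (y1 * y2 + n1 * n2) / total ** 2      # agreement expected by chance
    if p_e == 1:
        return 1.0  # degenerate case: both annotators always give the same answer
    return (p_a - p_e) / (1 - p_e)


# Counts from the slide: Y1 = 45, N1 = 55, Y2 = 75, N2 = 25, T = 100.
k = kappa(yes_yes=40, yes_no=5, no_yes=35, no_no=20)
```

With these counts, P(A) = 0.6 and P(E) = 0.475, so kappa is 0.125 / 0.525, which rounds to 0.24.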
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Kürzlich hochgeladen (20)

How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Experimenting the TextTiling Algorithm

  • 1. Experimenting the TextTiling Algorithm. Summary of the work done by master's students at Université Toulouse Le Mirail: Adam C., Andreani V., Bengsston J., Bouchara N., Choucavy L., Delpech E., El Maarouf I., Fontan L., Gotlik W.
  • 2. Experimenting the TextTiling algorithm. Part I: What is the TextTiling algorithm? Part II: Experiments with the TextTiling algorithm. Part III: Demo.
  • 3. Part I: What is the TextTiling algorithm? "An algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts." Developed by Marti Hearst (1997): "TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages", Computational Linguistics, March 1997. http://www.ischool.berkeley.edu/~hearst/tiling-about.html
  • 4. Why segment a text into multi-paragraph units? Computational tasks that use arbitrary windows might benefit from using windows with motivated boundaries. Readability: long online texts are easier to read (reading assistant tools). Information retrieval: retrieve relevant passages instead of whole documents. Summarization: extract sentences according to their position in the subtopic structure.
  • 5. What is the hypothesis behind TextTiling? "TextTiling assumes that a set of lexical items is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well." TextTiling doesn't detect subtopics per se, but shifts in topic, by means of changes in vocabulary. It performs a linear segmentation (no hierarchy).
  • 6. Detection of topic shift. Pipeline: raw text, then tokenisation, then segmentation into pseudo-sentences (20 tokens), then similarity scores (block A vs block B). A similarity score is computed at every pseudo-sentence gap, between 2 blocks of 6 pseudo-sentences each. The more vocabulary the two blocks have in common, the higher the score.
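The scoring step on this slide can be sketched in Python as follows. This is a minimal illustration, not the students' Perl implementation: the function names, the regex tokeniser, and the handling of shorter blocks at the edges of the text are assumptions. The similarity is the cosine measure given on the "Similarity score" slide, with raw token frequency as the weight.

```python
import math
import re
from collections import Counter


def pseudo_sentences(text, size=20):
    """Lowercase the text, extract word tokens, and cut them into chunks of `size` tokens."""
    tokens = re.findall(r"\w+", text.lower())  # assumed tokeniser
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def block_similarity(block_a, block_b):
    """Cosine similarity between the token-frequency vectors of two blocks."""
    freq_a, freq_b = Counter(block_a), Counter(block_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm = math.sqrt(sum(v * v for v in freq_a.values())
                     * sum(v * v for v in freq_b.values()))
    return dot / norm if norm else 0.0


def gap_scores(text, sentence_size=20, block_size=6):
    """One similarity score per gap between consecutive pseudo-sentences."""
    sents = pseudo_sentences(text, sentence_size)
    scores = []
    for gap in range(1, len(sents)):
        # Blocks are simply shorter near the start and end of the text (assumption).
        left = [t for s in sents[max(0, gap - block_size):gap] for t in s]
        right = [t for s in sents[gap:gap + block_size] for t in s]
        scores.append(block_similarity(left, right))
    return scores
```

Calling `gap_scores(text)` with the defaults reproduces the 20-token pseudo-sentences and 6-sentence blocks mentioned on the slide; both values are exposed as parameters because Part II varies them.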
  • 7. Detection of topic shift. A gap means there is a drop in vocabulary similarity. Topic shifts occur at the deepest gaps (after smoothing). Tile boundaries are then adjusted to the nearest paragraph break. [Figure: similarity score plotted against pseudo-sentence number]
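A possible way to find the "deepest gaps" is sketched below. It assumes a moving-average smoother controlled by a width and an iteration count (the two smoothing parameters named later in the slides; their exact definition here is a guess) and a depth score measured against the nearest peak on each side of a gap, in the spirit of Hearst (1997). All function names are hypothetical.

```python
def smooth(scores, width=1, iterations=1):
    """Moving average: each score becomes the mean of the scores within
    `width` positions of it; the pass is repeated `iterations` times."""
    for _ in range(iterations):
        smoothed = []
        for i in range(len(scores)):
            window = scores[max(0, i - width):i + width + 1]
            smoothed.append(sum(window) / len(window))
        scores = smoothed
    return scores


def depth_scores(scores):
    """Depth of each gap: how far its score lies below the nearest peak on
    the left plus how far it lies below the nearest peak on the right."""
    depths = []
    for i, score in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1  # climb leftwards while the curve keeps rising
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1  # climb rightwards while the curve keeps rising
        depths.append((scores[left] - score) + (scores[right] - score))
    return depths


def deepest_gaps(scores, n):
    """Indices of the n deepest gaps, returned in text order."""
    depths = depth_scores(scores)
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:n])
```

Taking a fixed number `n` of boundaries is a simplification chosen for the sketch; a threshold on the depth score would work equally well. Snapping each selected gap to the nearest paragraph break is left out.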
  • 8. Evaluation by Hearst (1997). Evaluation on 12 magazine articles annotated by 7 judges. Judges were asked "to mark the paragraph boundary at which the topic changed". In case of disagreement among judges, a boundary is kept if at least 3 judges agree on it. Agreement among judges (kappa measure): kappa = 0.647.
  • 9. Evaluation by Hearst (1997):
                             Precision   Recall
      Baseline (random)        0.43       0.42
      TextTiler                0.66       0.61
      Judges                   0.81       0.71
      Works well on long (1,800+ words) expository texts with little structural demarcation.
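Precision and recall for a segmenter can be computed by comparing the set of predicted boundaries with the set of reference boundaries. The sketch below assumes exact-match scoring on paragraph indices and also returns the F-measure (harmonic mean of precision and recall); the function name and the example boundaries are illustrative, not data from the evaluation.

```python
def boundary_metrics(predicted, reference):
    """Precision, recall and F-measure of predicted boundaries against
    reference boundaries, counting only exact matches as hits."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: boundaries are paragraph numbers.
precision, recall, f_measure = boundary_metrics({3, 7, 12, 20}, {3, 12, 15})
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

Exact matching is strict: a boundary placed one paragraph away from the reference counts as both a false positive and a miss.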
  • 10. Part II: Experiments with the TextTiling algorithm. Work done by master's students, Université Toulouse Le Mirail. Implementation in Perl. Experiments: cross-annotation of 3 texts; variation of linguistic parameters and of computation parameters.
  • 11. Annotation of topic boundaries. There is no clear-cut topic shift, but rather "regions" of shift. Annotators felt a smaller unit (the sentence) would have been more convenient. Our kappa: 0.56. Hearst's judges: 0.65. Kappa should be at least 0.67, and ideally above 0.8. A difficult (unnatural?) task for humans.
  • 12. Variation of linguistic parameters. [Figure: bar chart of precision, recall and F-measure for three configurations: basic, trigrams, lemmatization (TreeTagger*)] * http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
  • 13. Variation of computation parameters. Computation window: pseudo-sentence length, block length. Smoothing. [Figure: three similarity-score plots]
  • 14. Size of computation window (rows: pseudo-sentence length; columns: block length):
              2    4    6    8   10   12   14   16   18   20
       5     ++  +++   ++   ++   ++   ++   ++   ++   ++   ++
      10     ++   ++   ++    +    +   ++    +    +    +    +
      15     ++    +    +    +    +    +    +    -    -    -
      20      +    +    +    -    -    -    -    -    -   --
      25      +    +    -    -    -    -    -   --   --   --
      30      +    -    -    -    -   --   --   --   --   --
      35      +    -    -    -    -   --   --   --   --   --
      40     --   --   --   --   --   --   --   --   --   --
  • 15. Correlation between window size and smoothing:
      Window size (number of tokens)   10   20   30   40   50
      Smoothing iteration               3    3    1    1    1
      Smoothing width                   2    1    2    2    1
      The smaller the window, the more smoothing is needed.
  • 16. Optimal parameter sets:
               Nb parag.   Nb words   Words/parag.   Sentences/block   Tokens/sentence   Smooth. iteration   Smooth. width
      Text 1      12         2000         167               6                 5                  3                 2
      Text 2      22         2400         109               6                10                  1                 1
      Text 3      37         1750          20               8                10                  1                 1
      One optimal parameter set per text. Does the optimal set vary according to text or paragraph length?
  • 17. Final thoughts. Linguistic processing: lemmatization doesn't significantly improve TextTiling; what about stemming? Computation parameters: the parameters are highly interdependent, and the optimal parameter set varies from text to text. Proposal: an adaptive TextTiler? Window size could be adapted to the intrinsic qualities of the text; smoothing could then be adapted to window size.
  • 19. Similarity score – Hearst (1997):
      $$\mathrm{sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \, \sum_t w_{t,b_2}^2}}$$
      where b1 is block 1, b2 is block 2, t is a token, and w is the weight (frequency) of the token in the block.
  • 20. Kappa measure. http://www.musc.edu/dc/icrebm/kappa.html
                      Annot 1: yes   Annot 1: no   TOTAL
      Annot 2: yes         40            35        Y2 = 75
      Annot 2: no           5            20        N2 = 25
      TOTAL             Y1 = 45       N1 = 55      T = 100
      Agreement: P(A) = 0.6
      Expected agreement: P(E) = (Y1.Y2 + N1.N2) / T² = 0.475
      Kappa = (P(A) – P(E)) / (1 – P(E)) = 0.24
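The kappa arithmetic on this slide can be checked with a few lines of Python. The function and argument names are illustrative; the counts are the ones in the table above.

```python
def kappa(both_yes, only_a2_yes, only_a1_yes, both_no):
    """Kappa for two annotators making yes/no decisions, from the four
    cells of their contingency table."""
    total = both_yes + only_a2_yes + only_a1_yes + both_no
    observed = (both_yes + both_no) / total        # P(A)
    y1 = both_yes + only_a1_yes                    # annotator 1 said yes
    n1 = total - y1                                # annotator 1 said no
    y2 = both_yes + only_a2_yes                    # annotator 2 said yes
    n2 = total - y2                                # annotator 2 said no
    expected = (y1 * y2 + n1 * n2) / total ** 2    # P(E)
    return (observed - expected) / (1 - expected)


# Counts from the slide: 40 yes/yes, 35 where only annotator 2 said yes,
# 5 where only annotator 1 said yes, 20 no/no.
print(round(kappa(40, 35, 5, 20), 2))
```

With these counts P(A) = 0.6 and P(E) = 0.475, so kappa = 0.125 / 0.525, which rounds to 0.24 as stated on the slide.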