Diversity Measure for Text Summarization
Guided by: Dr.Vasudev Verma
Mentor: Litton J Kurisinkel
Submitted By:- Group No.27
Nirat Attri (201201013)
Dhruva Das (201301151)
Siddharth Saklecha (201505570)
Introduction
➢ A summary is a brief but informative outline of a
document that conveys its essence.
➢ The goal of a text summarizer is to condense the
source text into a shorter version while preserving
its overall information and meaning.
➢ The generated summary should cover the relevant
topics in the original corpus and be diverse enough
to avoid repeating itself.
Motivation
➢ A short summary that conveys the essence of a
document helps readers find relevant information
quickly.
➢ Text summarization also provides a way to
cluster similar documents and present a
summary of them.
➢ Text summarization has become an important
and timely tool for interpreting text
information in today's fast-growing
information age.
Problem Statement :
➢ Goal: derive a method to improve the efficiency of the
task of text summarization.
➢ We design methods for this task using a class of
submodular functions.
➢ Each of these functions combines two terms: one
encourages the summary to be representative of the
corpus (coverage), and the other positively
rewards diversity.
➢ Our functions are monotone non-decreasing and
submodular, which means that an efficient, scalable
greedy optimization scheme has a constant-factor
guarantee of optimality (for example, 1 - 1/e under a
cardinality constraint).
Proposed Solution :
Submodular Function:
➢ Submodular functions are those that satisfy the
property of diminishing returns: for any A ⊆ B ⊆ V \ {v},
a submodular function F must satisfy
F(A ∪ {v}) - F(A) ≥ F(B ∪ {v}) - F(B).
That is, the incremental value of v does not increase as
the context in which v is considered grows from A
to B.
➢ An equivalent definition, useful mathematically, is
that for any A, B ⊆ V, we must have
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B).
If this is satisfied everywhere with equality, then
the function F is called modular.
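The diminishing-returns condition can be checked by brute force on a small ground set. The sketch below is only an illustration: the set-cover function `coverage` and the toy data `toy_sets` are hypothetical and are not part of the project.

```python
from itertools import combinations

def coverage(selected, sets):
    """Set-cover value F(S): number of distinct items covered by the chosen sets."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)

def is_submodular(ground, f):
    """Check F(A + v) - F(A) >= F(B + v) - F(B) for every A subset of B and v not in B."""
    ground = list(ground)
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue  # only pairs with A contained in B matter
            for v in ground:
                if v in b:
                    continue
                gain_a = f(a | {v}) - f(a)
                gain_b = f(b | {v}) - f(b)
                if gain_a < gain_b:
                    return False
    return True

# Hypothetical toy data: three "sentences" and the words each one covers.
toy_sets = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"c", "d", "e"}}

def toy_f(selected):
    return coverage(selected, toy_sets)

print(is_submodular(toy_sets, toy_f))
```

The check enumerates all pairs of subsets, so it is only practical for a handful of elements; it is meant to make the definition concrete, not to be used on a real corpus.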
Summarization Tools :
➢ CLUTO
A software package for clustering low- and
high-dimensional datasets and for
analyzing the characteristics of the resulting
clusters.
➢ ROUGE
Recall-Oriented Understudy for Gisting
Evaluation: a set of metrics and a
software package used for evaluating
automatic summarization and machine
translation software in natural language
processing.
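To make the ROUGE idea concrete, here is a minimal sketch of an n-gram overlap score between a candidate summary and a single reference. It is a simplification written for illustration, not the ROUGE package itself: it assumes whitespace tokenization and one reference, and it omits stemming, stop-word handling and the other options of the official tool. The function name `rouge_n` is hypothetical.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

recall, precision, f1 = rouge_n("the cat sat", "the cat sat on the mat")
print(recall, precision, f1)
```

Recall divides the overlap by the number of n-grams in the reference, which is why the metric family is called recall-oriented.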
Approach
➢ Two properties of a good summary are relevance and
non-redundancy.
➢ Objective functions for extractive summarization usually
measure these two separately and then combine them,
trading off a reward for relevance against a penalty for
redundancy.
➢ The redundancy penalty usually violates the monotonicity
of the objective function.
➢ Instead of penalizing redundancy, we reward diversity and
model the summary quality as
F(S) = L(S) + λ R(S)
where L(S) measures the coverage, or fidelity, of summary
set S to the document, R(S) rewards diversity in S, and λ is
a trade-off coefficient.
➢ Coverage Measure:
L(S) can be interpreted either as a set function that
measures the similarity of summary set S to the document
to be summarized, or as a function representing some
form of coverage of V by S.
L(S) should be monotone, as coverage improves with a
larger summary.
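Approach (continued)
L(S) should also be submodular; Shannon entropy is a
well-known example of a monotone submodular function.
We take our coverage function as:
L(S) = Σ_{i ∈ V} min { Ci(S), α Ci(V) }
Here Ci(S) measures how similar S is to element i, or how
much of i is covered by S, and Ci(V) is the largest value
that Ci(S) can achieve. The threshold α Ci(V) caps the credit
given for any single element, so once i is covered well enough,
further gain must come from covering other elements.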
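A minimal sketch of this coverage function follows. The slides do not define Ci(S) explicitly, so the sketch assumes Ci(S) is the sum of pairwise similarities between sentence i and the sentences in S; the similarity matrix `sim` and the function name `coverage_score` are hypothetical.

```python
def coverage_score(summary, sim, alpha):
    """L(S) = sum over i in V of min(C_i(S), alpha * C_i(V)).

    sim[i][j] is a non-negative similarity between sentences i and j.
    Assumption: C_i(S) is the sum of sim[i][j] over j in S.
    """
    total = 0.0
    for row in sim:
        c_i_s = sum(row[j] for j in summary)  # how well S covers sentence i
        c_i_v = sum(row)                      # best possible coverage of i
        total += min(c_i_s, alpha * c_i_v)    # saturate at alpha * C_i(V)
    return total

# Hypothetical 3-sentence similarity matrix.
sim = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
print(coverage_score({0}, sim, 0.5))
print(coverage_score({0, 1}, sim, 0.5))
```

With this toy matrix, adding sentence 1 to the summary {0} gains nothing for sentence 0 because its term is already saturated, which is the behaviour the min is meant to produce.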
➢ Diversity Measure:
Following Lin and Bilmes, the diversity reward is
R(S) = Σ_{k=1..K} √( Σ_{j ∈ Pk ∩ S} rj )
where Pk, k = 1, ..., K is a partition of the ground set V into
separate clusters, and rj ≥ 0 indicates the singleton reward of j
(i.e., the reward of adding j into the empty set). The value rj
estimates the importance of j to the summary.
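The diversity reward can be sketched in a few lines, assuming the square-root form from Lin and Bilmes: for each cluster, sum the singleton rewards of the selected sentences in that cluster, take the square root, and add the results. The cluster assignment and reward values below are hypothetical toy data.

```python
import math

def diversity_score(summary, clusters, reward):
    """R(S): sum over clusters of sqrt(total reward of selected members)."""
    total = 0.0
    for members in clusters:
        picked = sum(reward[j] for j in members if j in summary)
        total += math.sqrt(picked)
    return total

# Hypothetical data: two clusters; sentences 0 and 1 have high singleton rewards.
clusters = [{0, 1}, {2, 3}]
reward = {0: 4.0, 1: 4.0, 2: 1.0, 3: 1.0}

print(diversity_score({0, 1}, clusters, reward))  # both picks from one cluster
print(diversity_score({0, 2}, clusters, reward))  # one pick from each cluster
```

Because the square root is concave, a second sentence from an already used cluster earns less than a sentence from an unused one. In the toy data the mixed selection {0, 2} scores higher than {0, 1} even though its raw rewards sum to less.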
Approach 1 : K-means Clustering
➢ The dataset was fed into CLUTO to perform K-means
clustering, which groups similar sentences into clusters.
➢ We ran a grid search over the parameter values to find
the best setting for the submodular function.
➢ Algorithm:
➢ Summary ← ∅
➢ allowedClusters ← allClusters
➢ while size(Summary) ≤ 665:
• chosenCluster ← the cluster in allowedClusters
most similar to the corpus
• chosenSentence ← highest-ranking sentence of
chosenCluster based on the coverage and diversity
measures
• Summary ← Summary ∪ {chosenSentence}
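The loop above can be sketched as follows. This is an illustration under several assumptions that the slides do not state: the coverage and diversity terms take the forms used in Lin and Bilmes, "most similar to the corpus" is read as the largest total similarity to all sentences, a cluster is removed from allowedClusters once it has been used, and the size budget is passed in as a parameter. All names and the toy data are hypothetical.

```python
import math

def objective(summary, sim, clusters, reward, alpha, lam):
    """F(S) = L(S) + lam * R(S) under the assumed forms of L and R."""
    # Assumed coverage: C_i(S) is the summed similarity of sentence i to S.
    cover = sum(min(sum(row[j] for j in summary), alpha * sum(row)) for row in sim)
    # Assumed diversity: square root of the rewards picked from each cluster.
    diverse = sum(math.sqrt(sum(reward[j] for j in c if j in summary))
                  for c in clusters)
    return cover + lam * diverse

def summarize(lengths, sim, clusters, reward, alpha, lam, budget):
    """Pick one sentence per cluster, best cluster first, within the budget."""
    summary, size = [], 0
    allowed = list(range(len(clusters)))

    def closeness(k):
        # Hypothetical "similarity to corpus": total similarity of the
        # cluster's sentences to every sentence in the corpus.
        return sum(sum(sim[i]) for i in clusters[k])

    while allowed and size < budget:
        k = max(allowed, key=closeness)
        allowed.remove(k)  # assumption: each cluster contributes at most once
        fits = [j for j in sorted(clusters[k]) if size + lengths[j] <= budget]
        if not fits:
            continue
        base = objective(summary, sim, clusters, reward, alpha, lam)
        best = max(fits, key=lambda j: objective(summary + [j], sim, clusters,
                                                 reward, alpha, lam) - base)
        summary.append(best)
        size += lengths[best]
    return summary

# Hypothetical toy corpus: four sentences in two clusters.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
clusters = [{0, 1}, {2, 3}]
reward = [sum(row) / len(sim) for row in sim]
lengths = [10, 10, 10, 10]

print(summarize(lengths, sim, clusters, reward, alpha=0.5, lam=1.0, budget=20))
```

Restricting each step to a single cluster evaluates the objective for far fewer candidates than a greedy pass over all remaining sentences, at the price of possibly missing the globally best addition.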
Approach 2 : Agglomerative
Clustering
➢ Agglomerative clustering (a form of hierarchical cluster
analysis, or HCA) is a method of cluster analysis that
seeks to build a hierarchy of clusters.
➢ It is a bottom-up approach: each observation starts in its
own cluster, and pairs of clusters are successively merged
as one moves up the hierarchy.
➢ The standard algorithm takes O(n^3) time.
➢ The linkage criterion determines the metric used for the
merge strategy.
➢ In computing the clusters, we used cosine similarity as
the criterion.
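A naive version of bottom-up clustering with cosine similarity can be sketched as below. The slides do not say which linkage was used, so average linkage is an assumption here, and the sentence vectors are hypothetical toy data. The nested scan over all cluster pairs at every merge is what makes the simple algorithm expensive.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agglomerative(vectors, k):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # each item starts alone

    def linkage(a, b):
        # Assumed average linkage: mean pairwise cosine similarity.
        pairs = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > k:
        best_pair, best_sim = None, -2.0  # cosine is never below -1
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best_sim:
                    best_pair, best_sim = (x, y), s
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical 2-dimensional sentence vectors forming two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative(vectors, 2))
```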
Challenges Faced
➢ One challenge was understanding the output of the CLUTO
tool, which clusters the given sentences. By comparison,
finding the summary of a folder was a basic implementation
of the formula given in the paper.
➢ Another challenge was that the time taken to find the
summary for a single folder was quite large. This time
could be drastically reduced by not using the submodular
condition F(A ∪ {v}) - F(A) ≥ F(B ∪ {v}) - F(B), that is, the
incremental value of v does not increase as the context in
which v is considered grows from A to B. But this would be
at the cost of accuracy.
Results and Conclusion
➢ The values of λ and α for the diversity and coverage
measures that make up the submodular function were
chosen using a sweep search for the best ROUGE scores.
➢ The best values found were:
α = 15
λ = 4
➢ The clusters were summarized and their ROUGE scores
calculated to estimate the efficiency.
➢ We conclude that agglomerative clustering, with a weighted
consideration of both diversity and coverage, gave the
best scores in text summarization.
Approach        ROUGE-R   ROUGE-F
Agglomerative   0.3843    0.3792
K-Means         0.3724    0.3674
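A sweep search over α and λ can be sketched as a plain grid search. The scoring function is a placeholder: in the project it would run the summarizer and return a ROUGE score, whereas `toy_score` here is a made-up function used only so the example runs. The grids are hypothetical as well.

```python
from itertools import product

def sweep(alphas, lambdas, score_fn):
    """Return (best_score, best_alpha, best_lambda) over the full grid."""
    best = None
    for alpha, lam in product(alphas, lambdas):
        score = score_fn(alpha, lam)
        if best is None or score > best[0]:
            best = (score, alpha, lam)
    return best

def toy_score(alpha, lam):
    # Placeholder with a single peak; a real scorer would summarize the
    # corpus with these parameters and evaluate the result with ROUGE.
    return -((alpha - 2) ** 2) - ((lam - 3) ** 2)

print(sweep([1, 2, 3], [1, 2, 3, 4], toy_score))
```

The cost grows with the product of the grid sizes, since every (α, λ) pair requires a full summarization and evaluation run.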
References
➢ Vishal Gupta and Gurpreet Singh Lehal, A Survey of Text
Summarization Extractive Techniques.
➢ Hui Lin and Jeff Bilmes, A Class of Submodular Functions
for Document Summarization.
➢ Hui Lin and Jeff Bilmes, Multi-document Summarization
via Budgeted Maximization of Submodular Functions.
➢ Ying Zhao and George Karypis, Criterion Functions for
Document Clustering: Experiments and Analysis.
➢ Chin-Yew Lin, ROUGE: A Package for Automatic Evaluation
of Summaries.
➢ DUC 2004, http://duc.nist.gov/data.html
Thank You!!
