Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions, WWW '12
This work presents a new unsupervised approach to generating ultra-concise summaries of opinions focusing on three aspects: compactness, representativeness and readability.
By: Kavita Ganesan
4. Opinion Summary for iPod Touch
To know more: read many redundant sentences.
Structured summaries are useful, but insufficient!
5. Textual opinion summaries:
• big and clear screen
• battery lasts long without wifi
• great selection of apps
………………
Textual summaries can provide more
information…
6. Represent the major opinions
▪ Key complaints/praise in text
▪ Important critical information
Readable/well formed
▪ Easily understood by readers
Compact/Concise
▪ Can be viewed on all screen sizes (e.g. phones, PDAs, tablets, desktops)
▪ Maximize information conveyed
7. Generate a set of non-redundant phrases summarizing key opinions in text:
Short (2-5 words)
Readable
→ Micropinions
Micropinion summary for a restaurant:
“Good service”
“Delicious soup dishes”
The ultra-concise nature of the phrases allows flexible adjustment of summaries according to display constraints.
8.
9. Has been widely studied
[Radev et al. 2000; Erkan & Radev, 2004; Mihalcea & Tarau, 2004…]
Fairly easy to implement
Problem: not suitable for concise summaries
Bias – with a limit on summary size
▪ the selected sentence/phrase may miss critical info
Verbose – may contain irrelevant information
▪ not suitable for smaller devices
10. Understand the original text and “re-tell” the story in fewer words. Hard to achieve!
Some methods require manual effort
[DeJong 1982; Radev & McKeown 1998; Finley & Harabagiu 2002]
Need to define templates
Later filled with info using IE techniques
Some methods rely on deep NL understanding
[Saggion & Lapalme 2002; Jing & McKeown 2000]
Domain dependent
Impractical – high computational costs
11. Goal is to extract important phrases from text
Traditionally used to characterize documents
Can be potentially used to select key opinion phrases
Closest to our goal!
Problem:
Only topic phrases may be selected
▪ E.g. “battery life”, “screen size” become candidate phrases
▪ Enough to characterize documents, but we need more info!
Readability aspect is not much of a concern
▪ E.g. “Battery short life” == “Short battery life”
Most methods are supervised – need training data
[Tomokiyo & Hurst 2003; Witten et al. 1999; Medelyan & Witten 2008]
12. Unsupervised, lightweight & general approach to
generating ultra-concise summaries of opinions
Idea is to use existing words in original text to
compose meaningful summaries
Emphasis on 3 aspects:
Compactness
▪ summaries should use as few words as possible
Representativeness
▪ summaries should reflect major opinions in text
Readability
▪ summaries should be fairly well formed
13. The optimization framework:

M = \arg\max_{m_1 \dots m_k} \sum_{i=1}^{k} \left[ S_{rep}(m_i) + S_{read}(m_i) \right]

subject to

\sum_{i=1}^{k} |m_i| \le \sigma_{ss}
S_{rep}(m_i) \ge \sigma_{rep}
S_{read}(m_i) \ge \sigma_{read}
sim(m_i, m_j) \le \sigma_{sim} \quad \forall i, j \in 1 \dots k
14. Example of a micropinion summary M scored by this framework:
2.3 very clean rooms
2.1 friendly service
1.8 dirty lobby and pool
1.3 nice and polite staff
15. Each short phrase in the summary is an m_i:
m1: 2.3 very clean rooms
m2: 2.1 friendly service
m3: 1.8 dirty lobby and pool
mk: 1.3 nice and polite staff
16. In the framework, S_{rep}(m_i) is the representativeness score of m_i and S_{read}(m_i) is the readability score of m_i.
17. The argmax expression is the objective function.
18. The inequalities under “subject to” are the constraints.
19. Objective function: optimizes the representativeness & readability scores.
Ensures summaries reflect key opinions & are reasonably well formed.
Example objective function values:
m1: 2.3 very clean rooms
m2: 2.1 friendly service
m3: 1.8 dirty lobby and pool
mk: 1.3 nice and polite staff
20. Constraint 1, \sum_{i=1}^{k} |m_i| \le \sigma_{ss}: maximum length of the summary.
• User adjustable
• Captures compactness
21. Constraints 2 & 3, S_{rep}(m_i) \ge \sigma_{rep} and S_{read}(m_i) \ge \sigma_{read}: minimum representativeness & readability.
• Help improve efficiency
• Do not affect performance
22. Constraint 4, sim(m_i, m_j) \le \sigma_{sim}: maximum similarity between phrases.
• User adjustable
• Captures compactness by minimizing redundancies
23. In sum: the objective function ensures representativeness & readability, while the summary-length and similarity constraints capture compactness.
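To make the framework concrete, below is a minimal Python sketch of the selection step, assuming candidate phrases have already been scored. The names (Candidate, select_summary) and the threshold defaults are illustrative, not from the paper's implementation, and the argmax is approximated greedily, in the spirit of the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    s_rep: float   # representativeness score S_rep(m_i)
    s_read: float  # readability score S_read(m_i)

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def select_summary(candidates, sigma_ss=20, sigma_rep=1.0,
                   sigma_read=-4.0, sigma_sim=0.3):
    """Greedy approximation of argmax sum(S_rep + S_read) under the constraints."""
    # Constraints 2 & 3: drop candidates below the minimum thresholds.
    pool = [c for c in candidates if c.s_rep >= sigma_rep and c.s_read >= sigma_read]
    # Visit candidates in decreasing objective-function value.
    pool.sort(key=lambda c: c.s_rep + c.s_read, reverse=True)
    summary, used = [], 0
    for c in pool:
        n = len(c.text.split())
        if used + n > sigma_ss:  # constraint 1: max summary length
            continue
        # Constraint 4: skip phrases too similar to an already chosen one.
        if any(jaccard(c.text, m.text) > sigma_sim for m in summary):
            continue
        summary.append(c)
        used += n
    return summary
```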
24. Now, we need to know how to compute:
Similarity scores: sim(mi, mj)
Readability scores: Sread(mi)
Representativeness scores: Srep(mi)
25. Scores similarity between 2 phrases
Why important?
Allows the user (via the adjustable parameter σsim) to control redundancy
E.g. with small devices, users may desire summaries with good coverage of information and fewer redundancies!
Measure used:
Standard Jaccard similarity measure
Could use other measures (e.g. cosine) – not our focus
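A one-function sketch of the Jaccard measure over phrase word sets, as named above; the example phrases are made up.

```python
def jaccard(p1: str, p2: str) -> float:
    """Jaccard similarity between the word sets of two phrases."""
    w1, w2 = set(p1.lower().split()), set(p2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

print(jaccard("very clean rooms", "clean and tidy rooms"))  # 2 shared / 5 total = 0.4
```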
26. Purpose: Measure how well a phrase represents opinions from the original text
2 properties of a highly representative phrase:
1. Words should be strongly associated in text
2. Words should be sufficiently frequent in text
Captured by a modified pointwise mutual
information (PMI) function
27. Original PMI function:

pmi(w_i, w_j) = \log_2 \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}

Captures property 1: measures the strength of association between words.
28. Original PMI function:

pmi(w_i, w_j) = \log_2 \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}

Modified PMI function adds the frequency of co-occurrence within a window:

pmi'(w_i, w_j) = \log_2 \frac{p(w_i, w_j)\, c(w_i, w_j)}{p(w_i)\, p(w_j)}

Captures property 2: rewards well-associated words with high co-occurrences.
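A sketch of pmi' under the definition above, assuming unigram and windowed co-occurrence counts have already been collected from the review text; the count structures are illustrative.

```python
import math
from collections import Counter

def modified_pmi(wi: str, wj: str, unigram: Counter,
                 cooc: Counter, total: int) -> float:
    """pmi'(wi, wj) = log2( p(wi, wj) * c(wi, wj) / (p(wi) * p(wj)) ),
    where c(wi, wj) counts co-occurrences within a fixed window."""
    c_ij = cooc[(wi, wj)]
    if c_ij == 0:
        return float("-inf")  # words never co-occur: no association
    p_ij = c_ij / total
    p_i, p_j = unigram[wi] / total, unigram[wj] / total
    # The extra c(wi, wj) factor rewards frequent co-occurrence and
    # counters the plain-PMI bias toward rare word pairs.
    return math.log2(p_ij * c_ij / (p_i * p_j))
```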
29. To compute the representativeness of a phrase, take the average strength of association (pmi') of each word w_i in the phrase with its C neighboring words:

S_{rep}(w_1 \dots w_n) = \frac{1}{n} \sum_{i=1}^{n} pmi_{local}(w_i)

pmi_{local}(w_i) = \frac{1}{2C} \sum_{j=i-C}^{i+C} pmi'(w_i, w_j)
30. This gives a good estimate of how strongly associated the words in a phrase are (same formulas as above).
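Putting the two formulas together, a sketch of the phrase-level score; `pmi_prime` is assumed to be a modified-PMI function like the one above, and C is the empirically set window size.

```python
def s_rep(words: list, pmi_prime, C: int = 3) -> float:
    """S_rep(w_1..w_n): average over each word of its mean pmi' with
    up to C neighbors on either side."""
    n = len(words)
    total = 0.0
    for i, wi in enumerate(words):
        neighbors = [words[j] for j in range(max(0, i - C), min(n, i + C + 1))
                     if j != i]
        # pmi_local(w_i): average association over the 2C-word window
        # (boundary words simply have fewer neighbors to sum over).
        total += sum(pmi_prime(wi, wj) for wj in neighbors) / (2 * C)
    return total / n
```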
31. Purpose: Measure the well-formedness of phrases
Phrases are constructed from seed words:
▪ can have new phrases not in the original text
▪ no guarantee the phrases are well formed
Our readability scoring is based on Microsoft's Web N-gram model (publicly available). The n-gram model provides the conditional probabilities of phrases, and the chain rule gives the joint probability in terms of conditional probabilities (averaged):

S_{read}(w_1 \dots w_n) = \frac{1}{K} \sum_{k=q}^{n} \log_2 p(w_k \mid w_{k-q+1} \dots w_{k-1})

(q is the n-gram order; K is the number of terms averaged)
Intuition: a phrase is more readable if it occurs more frequently on the web.
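A sketch of the readability score, assuming a function `cond_logprob(word, context)` that returns log2 p(word | context) from an n-gram language model (the paper used Microsoft's Web N-gram service; any LM exposing conditional probabilities would slot in).

```python
def s_read(words: list, cond_logprob, q: int = 3) -> float:
    """Average log2 conditional probability of each word given up to
    q-1 preceding words (chain-rule decomposition, averaged)."""
    terms = []
    for k in range(1, len(words)):
        context = tuple(words[max(0, k - q + 1):k])  # up to q-1 previous words
        terms.append(cond_logprob(words[k], context))
    return sum(terms) / len(terms) if terms else 0.0
```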
32. Example: readability scores of phrases using a tri-gram LM

Ungrammatical                    Grammatical
“sucks life battery”     -4.51   “battery life sucks”     -2.93
“life battery is poor”   -3.66   “battery life is poor”   -2.37
33. Scoring each candidate phrase is not practical!
Why?
Phrases composed using words from original text
Potential solution space would be huge!
Our solution: greedy summarization algorithm
Explore solution space with heuristic pruning
Touch only the most promising candidates
34. Input: the text to be summarized.
Step 1: Shortlist high-frequency unigrams (count > median), e.g.: very, nice, place, clean, dirty, room, …
Step 2: Form seed bigrams by pairing unigrams, e.g.: very + nice, very + clean, very + dirty, clean + place, clean + room, dirty + place, … Shortlist by Srep, keeping only bigrams with Srep > σrep.
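A sketch of steps 1-2, assuming a scoring function `s_rep_fn` over token tuples; the median cutoff follows the slide, and the function name is illustrative.

```python
from collections import Counter
from itertools import permutations
from statistics import median

def seed_bigrams(tokens, s_rep_fn, sigma_rep=1.0):
    """Step 1: shortlist high-frequency unigrams (count > median).
    Step 2: pair shortlisted unigrams into seed bigrams, keeping
    those whose S_rep exceeds the threshold."""
    counts = Counter(tokens)
    cutoff = median(counts.values())
    unigrams = [w for w, c in counts.items() if c > cutoff]
    return [(w1, w2) for w1, w2 in permutations(unigrams, 2)
            if s_rep_fn((w1, w2)) > sigma_rep]
```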
35. Step 3: Generate higher-order n-grams.
• Concatenate existing candidates with seed bigrams that share an overlapping word:
  very clean + clean rooms = very clean rooms
  very clean + clean bed = very clean bed
  very dirty + dirty room = very dirty room
  very dirty + dirty pool = very dirty pool
  very nice + nice place = very nice place
  very nice + nice room = very nice room
• Prune non-promising candidates (Srep < σrep or Sread < σread)
• Eliminate redundancies (sim(mi, mj))
• Repeat the process on the shortlisted candidates (until no possibility of expansion)
Step 4: Final summary. Sort candidates by objective function value and add phrases until |M| ≤ σss, e.g.:
  0.9 very clean rooms
  0.8 friendly service
  0.7 dirty lobby and pool
  0.5 nice and polite staff
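A sketch of the expansion loop in steps 3-4, assuming scoring functions like those above; the redundancy check is omitted here for brevity (it appears in the selection sketch earlier), and all names are illustrative.

```python
def expand(seeds, s_rep_fn, s_read_fn, sigma_rep=1.0, sigma_read=-4.0,
           max_words=5):
    """Step 3: grow candidates by concatenating them with seed bigrams
    that share an overlapping word, pruning by S_rep and S_read."""
    candidates, frontier = set(seeds), set(seeds)
    while frontier:
        grown = set()
        for cand in frontier:
            for b1, b2 in seeds:
                if cand[-1] == b1 and len(cand) < max_words:
                    new = cand + (b2,)  # ("very","clean") + ("clean","rooms")
                    if s_rep_fn(new) >= sigma_rep and s_read_fn(new) >= sigma_read:
                        grown.add(new)
        frontier = grown - candidates  # repeat until no expansion is possible
        candidates |= grown
    return candidates

def final_summary(candidates, s_rep_fn, s_read_fn, sigma_ss=20):
    """Step 4: sort by objective value, add phrases until the size cap."""
    ranked = sorted(candidates, key=lambda c: s_rep_fn(c) + s_read_fn(c),
                    reverse=True)
    summary, used = [], 0
    for c in ranked:
        if used + len(c) <= sigma_ss:
            summary.append(c)
            used += len(c)
    return summary
```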
36.
37. Product reviews from CNET (330 products)
Each product has a minimum of 5 reviews
A CNET review has three parts: PROS, CONS, and the FULL REVIEW; only the full review content was summarized.
38. Given: textual reviews about a product.
Task: generate a micropinion summary M = argmax Σ [Srep(mi) + Sread(mi)]
Constraints: summary size σss, redundancy σsim, representativeness σrep, readability σread
Example Micropinion Summary, M:
m1: 2.3 easy to use
m2: 2.1 lens is not clear
m3: 1.8 too big for pocket
mk: 1.3 expensive batteries
39. Human-composed summaries
Two human summarizers
▪ Each summarized 165 product reviews (330 total)
Top 10 phrases from pros & cons provided as hints
▪ Hints help with topic coverage and reduce bias
Summarizers were asked to compose a set of short phrases (2-5 words) summarizing key opinions on the product
40. 3 representative baselines:
Tf-Idf – unsupervised method commonly used for key phrase extraction tasks
▪ Selected only adjective-containing n-grams (for performance reasons)
▪ Redundancy removal used to generate non-redundant phrases
KEA – state-of-the-art supervised key phrase extraction model [Witten et al. 1999]
▪ Uses a Naive Bayes model
▪ Trained using 100 review documents withheld from the dataset
Opinosis – unsupervised abstractive summarizer designed to generate textual opinion summaries [Ganesan et al. 2010]
▪ Shown to be effective in generating concise opinion summaries
▪ Designed for highly redundant text
41. ROUGE – to determine the quality of a summary (ROUGE-1 & ROUGE-2)
Standard measure for summarization tasks
Measures precision & recall of overlapping units between the computer-generated summary & human summaries
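For intuition, a rough sketch of ROUGE-N recall (the fraction of reference n-grams recovered by the system summary); the actual evaluation used the standard ROUGE toolkit, so this is illustrative only.

```python
from collections import Counter

def rouge_n_recall(system: str, reference: str, n: int = 2) -> float:
    """Fraction of reference n-grams that also appear in the system summary."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    sys_ng, ref_ng = ngrams(system), ngrams(reference)
    overlap = sum(min(c, sys_ng[g]) for g, c in ref_ng.items())
    return overlap / sum(ref_ng.values()) if ref_ng else 0.0
```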
42. Questionnaire – to assess the potential utility of generated summaries to users
Answered by 2 assessors; original reviews provided as reference
Each aspect rated from [1] Very Poor to [5] Very Good:
▪ Grammaticality [DUC 2005]: Are the phrases readable?
▪ Non-redundancy [DUC 2005]: Are the phrases in the summary unique?
▪ Informativeness [Filippova 2010]: Do the phrases convey important information about the product?
44. [Plot: ROUGE-2 recall vs. summary size (5-30 max words) for WebNGram, Opinosis, KEA and Tfidf]
WebNgram: performs the best for this task
Opinosis: much better than KEA & tfidf
KEA: slightly better than tfidf
Tfidf: worst performance
45. Intuition: well-formed phrases tend to be longer in general
very clear screen vs. very clear
good battery life vs. good battery
screen is bright vs. screen is
A few longer phrases are more desirable than many fragmented (i.e. short) phrases
46. [Plot: average # of phrases vs. summary size (15-30 max words) for WebNgram, Tfidf, Opinosis and KEA]
KEA: generates the most phrases (i.e. favors short phrases)
WebNGram: generates the fewest phrases on average (each phrase is longer)
WebNGram phrases are generally more well-formed
47. In our algorithm, n-grams are generated from seed words
▪ potential of forming new phrases not in the original text
Why not use existing n-grams?
▪ With redundant opinions, using seen n-grams may be sufficient
▪ Performed a run forcing only seen n-grams to appear as candidate phrases
49. Example: unseen n-gram (Acer AL2216 Monitor)
“wide screen lcd monitor is bright”
readability: -1.88
representativeness: 4.25
Related snippets in the original text:
“…plus the monitor is very bright…”
“…it is a wide screen, great color, great quality…”
“…this lcd monitor is quite bright and clear…”
50. Purpose of σrep & σread:
Control minimum representativeness & readability
Help prune non-promising candidates, improving the efficiency of the algorithm
Without σrep & σread we would still arrive at a solution, however:
▪ Time to convergence would be much longer
▪ Results could be skewed
These parameters need to be set correctly!
51. [Plots: ROUGE-2 scores at various σread settings (-4 to -1) and at various σrep settings (1 to 5)]
Performance is stable except in extreme conditions, where the thresholds are too restrictive.
52. [Same plots as above]
Ideal setting:
σread: between -2 and -4
σrep: between 1 and 4
54. Questionnaire [Score 1-5]: score < 3 = poor, score > 3 = good

            Grammaticality   Non-redundancy   Informativeness
            [DUC 2005]       [DUC 2005]       [Filippova 2010]
WebNgram    4.2/5            3.9/5            3.2/5
TfIDF       2.0/5            2.3/5            1.7/5
Human       4.7/5            4.5/5            3.6/5

TFIDF: poor scores on all 3 aspects (all below 3.0)
55. [Same questionnaire table as above]
WebNGram: all scores above 3.0, close to the human scores
56. [Same questionnaire table as above]
Informativeness is a subjective aspect; human summaries scored slightly better than WebNgram
57. Semantic redundancies
“Good sound quality” ≠ “Excellent audio”, although semantically similar
▪ Due to directly using surface words
▪ Solution: opinion/phrase normalization before summarizing
Some phrases are not opinion phrases
“I bought this for Christmas” is grammatical & representative, but not an opinion
▪ Due to our candidate selection strategy
  ▪ Plus side: very general approach, can summarize any text
  ▪ Down side: summaries may be too general
▪ Solution: stricter selection of phrases
  ▪ E.g. select only opinion-containing phrases
58. Example summary for a Canon Powershot SX120 IS:
Easy to use
Good picture quality
Crisp and clear
Good video quality
Useful for pushing opinions to devices where the screen is small: e-readers, smartphones, tablets
59. Optimization framework to generate ultra-concise
summaries of opinions
Emphasis: representativeness, readability & compactness
Evaluation shows our summaries are:
Well-formed and convey essential information
More effective than other competing methods
Our approach is unsupervised, lightweight &
general
Can summarize any other textual content
(e.g. news articles, tweets, user comments, etc.)
Editor's notes
Most current work in opinion summarization focuses on predicting the aspect-based ratings for an entity. For example, for an iPod, you may predict that the appearance is 5 stars, ease of use is 3 stars, etc.
The problem is that non-informative phrases may be selected as candidates. Phrases such as “battery life” and “screen size” may become candidate phrases, since key phrase extraction has traditionally been used only to characterize or tag documents. This is, however, not informative enough for an opinion summary.
We try to capture these three criteria using the following optimization framework.
M represents a micropinion summary consisting of a set of short phrases
m1 to mk represent the short phrases in the summary.
Srep(mi) is the representativeness score of mi; Sread(mi) is the readability score of mi.
This equation is the objective function
These are the constraints of the optimization framework
The objective function ensures that the summaries reflect opinions from the original text and are reasonably well formed. This is an example of the objective function values.
The first constraint controls the maximum length of the summary. This is a user-adjustable parameter and helps capture compactness.
This constraint controls the similarity between phrases, which is a user-adjustable parameter. It also captures compactness by minimizing the amount of redundancy.
To score similarity between phrases, we simply use the Jaccard similarity measure.
Next is representativeness. In scoring representativeness, we have defined two properties of highly representative phrases: (1) the words in each phrase should be strongly associated within a narrow window in the original text, and (2) the words in each phrase should be sufficiently frequent in the original text. These two properties are captured by a modified pointwise mutual information function. For readability, the intuition is that if a generated phrase occurs frequently on the web, then the phrase is readable.
This is the original PMI formula. It captures property 1 by measuring the strength of association between two words.
This is the modified PMI formula. We add the frequency of occurrence of words within a narrow window. This captures property 2 by rewarding highly co-occurring word pairs. It also addresses the problem of sparseness, where low-frequency words tend to yield high PMI scores.
So, to actually score the representativeness of a phrase, we take the average strength of association (pmi') of a given word wi with C of its neighboring words. C is a window size that is empirically set. We will later show the influence of C on the generated summaries.
Readability scoring determines how well formed the constructed phrases are. Since the phrases are constructed from seed words, we can have new phrases that may not have occurred in the original text.
Now let's look at some examples of readability scores using the tri-gram LM. Let us say we want to summarize some facts about the battery life….
Now we have all the scoring components that we need for the optimization framework. But we cannot score each candidate phrase, because we are composing phrases using words from the original text, so the potential solution space can be huge. Our solution is to use a greedy algorithm where we explore the solution space with some heuristic pruning so that we only touch the most promising candidates.
From the input text to be summarized, which is basically some review texts, we shortlist a set of high-frequency unigrams where the count is larger than the median count (not counting 1's). This acts as an initial reduction of the search space. Then, from these unigrams, we generate seed bigrams by pairing each unigram with every other unigram. We compute the representativeness scores of the seed bigrams and keep only those where Srep > σrep. The next step is to grow the seed bigrams into promising higher-order n-grams. This is done by concatenating existing candidates with bigrams that share an overlapping word. At any point, if the representativeness or readability score falls below the corresponding threshold, the n-gram is pruned. We also check the redundancy of the generated candidates: if two phrases have a similarity higher than the sim threshold, we discard the one with the lower combined score of Srep + Sread.
The third step is where we generate higher-order n-grams that act as candidate summaries. This is done by concatenating existing candidates with seed bigrams that share an overlapping word, which gives the new candidates. Then non-promising candidates are pruned based on their representativeness and readability scores, further reducing the overall candidate search space. We also check for redundancy: if two candidates have high similarity scores, we discard one of them. This process repeats on the shortlisted candidates until there are no other expansion possibilities. Then, to get the final summary, we sort the candidate n-grams by their objective function values (i.e., the sum of Srep and Sread) and gradually add phrases until the maximum summary length is reached.
Moving into the evaluation part….
To evaluate this summarization task, we use product reviews from CNET for 330 different products, each with a minimum of 5 reviews. A CNET review has 3 parts: pros, cons, and the full review. We ignored the pros and cons and only focused on summarizing the full reviews.
As our first baseline, we use TF-IDF, an unsupervised, language-independent method commonly used for keyphrase extraction tasks. To make the TF-IDF baseline competitive, we limit the n-grams in consideration to those that contain at least one adjective (i.e. favoring opinion-containing n-grams); the performance is much worse without this selection step. We then used the same redundancy removal technique used in our approach to allow a set of non-redundant phrases to be generated. For our second baseline, we use KEA, a state-of-the-art keyphrase extraction method. KEA builds a Naive Bayes model with known keyphrases, and then uses the model to find keyphrases in new documents. The KEA model was trained using 100 review documents withheld from our dataset. With KEA as a baseline, we gain insights into the applicability of existing keyphrase extraction approaches to the task of micropinion generation; strictly speaking, the comparison is unfair. As our final representative baseline, we use Opinosis, an abstractive summarizer designed to generate textual summaries of opinions.
3 key questions rated on a scale from 1 (Very Poor) to 5 (Very Good). Grammaticality: Are the phrases in the summary readable, with no obvious errors? Non-redundancy: Are the phrases in the summary unique, without unnecessary repetitions such as repeated facts or repeated use of noun phrases? Informativeness: Do the phrases convey important information regarding the product? This can be positive/negative opinions about the product or some critical facts about it. Note that the first two aspects, grammaticality and non-redundancy, are linguistic questions used at the 2005 Document Understanding Conference (DUC) [4]. The last aspect, informativeness, has been used in other summarization tasks and is key in measuring how much users would learn from the summaries.
This graph shows the ROUGE scores of the different summarization methods for different summary lengths. KEA is a supervised keyphrase extraction model; TF-IDF is a simple keyphrase extraction method; then we have Opinosis. TF-IDF performs the worst, likely due to insufficient redundancies of meaningful n-grams. KEA does only slightly better than TF-IDF even though KEA is a supervised approach; while KEA is suitable for characterizing documents, such a supervised approach proves insufficient for the task of generating micropinions. The model may need a more sophisticated set of features to generate more meaningful summaries. Opinosis performs better than both KEA and TF-IDF. For this task, WebNgram performs the best.
Phrases that are well formed tend, in general, to be longer. Thus, a few readable phrases are more desirable than many fragmented (or very short) phrases. For example, given the two phrases "very clear screen" and "very clear", it is obvious that the more readable phrase is the first one, as it is not fragmented and is well formed; this well-formed phrase is also longer than the short fragmented one. Next we will look at the average number of phrases generated by each summarization method.
This graph shows the average number of phrases generated for different summary lengths using the different summarization methods. Here we can see that KEA generates the most phrases, so it tends to favor very short phrases. While the phrases may represent characteristics of the document, they are not well formed enough to convey essential opinions.
In our candidate generation approach, we form n-grams from seed words from the original text, so there is a potential of forming new phrases not in the text. It is reasonable to think that with sufficiently redundant opinions, searching the restricted space of all seen n-grams may be enough to generate reasonable summaries. To test this hypothesis, we performed a run forcing only seen n-grams to appear as candidate summaries.
-This graph shows the ROUGE-2 recall scores using seen n-grams vs. system generated n-grams for different summary lengths. -From these results it is clear that our search algorithm actually helps discover useful new phrases
-Now we will look into the stability of the non-user dependent parameters, sigma rep and sigma read.-the purpose of these two parameters is to control the min rep and readability scores. -This actually helps us prune non-promising candidates which eventually helps improve the efficiency of the algorithm.-Without these two parameters we would still arrive at a reasonable solution but the time to convergence would be much longer and the results could be skewed…e.g. very representative but not readable…so these parameters need to be correctly set.
-This first graph shows the ROUGE-2 recall scores at different sigma read settings-This second graph shows the ROUGE-2 recall scores at different sigma rep settings-So here we can see that the performance is actually stable at different settings except when the conditions are extreme – when the sigma read or sigma rep scores are too restrictive.
The ideal setting is thus…
-To assess potential utility of the generated summaries to real users, a subset of the summaries were manually evaluated.-As mentioned earlier, we look at 3 aspects: grammaticality, non-redundancy and informativeness scores -Each scored on a scale from 1 to 5-A score below 3 is considered `poor' and a score above 3 is considered `good'.-These are the results for the 3 aspects-The results on human summaries serves as an upper bound for comparison.
-Earlier we showed that TF-IDF had lowest ROUGE scores amongst all the approaches, indicating that the summaries may not be very useful. -Human assessors agree with this conclusion as TF-IDF summaries received poor scores (all below 3) on all three dimensions compared to WebNgram and human summaries.
In summary, in this work we proposed an optimization framework to generate ultra-concise summaries of opinions, with special emphasis on the representativeness, readability and compactness of the generated summaries. Our evaluation shows that the summaries are well formed, convey essential information, and are more effective than those of competing methods.