App stores allow users to submit feedback for downloaded apps in the form of star ratings and text reviews. Recent studies analyzed this feedback and found that it includes information useful for app developers, such as user requirements, ideas for improvements, user sentiments about specific features, and descriptions of experiences with these features. However, for many apps, the number of reviews is too large to be processed manually, and their quality varies widely. Star ratings are given to the whole app, and developers have no means of analyzing the feedback for single features. In this paper we propose an automated approach that helps developers filter, aggregate, and analyze user reviews. We use natural language processing techniques to identify fine-grained app features in the reviews. We then extract the user sentiments about the identified features and give them a general score across all reviews. Finally, we use topic modeling techniques to group fine-grained features into more meaningful high-level features. We evaluated our approach with 7 apps from the Apple App Store and Google Play Store and compared its results with a manual, peer-conducted analysis of the reviews. On average, our approach has a precision of 0.59 and a recall of 0.51. The extracted features were coherent and relevant to requirements evolution tasks. Our approach can help app developers systematically analyze user opinions about single features and filter irrelevant reviews.
5. Reviews Include Useful Information for Development Teams
Bug reports, improvement ideas, usage scenarios, feature requests, feature feedback
About one third of the reviews contain information related to requirements
[Pagano and Maalej, RE '13] [Galvis and Winbladh, ICSE '13]
6. Users Submit Many Reviews, Regularly!
• iOS users submit on average 22 reviews per day per app
• Facebook on iOS receives more than 4000 per day
[Pagano & Maalej, RE '13]
7. The Quality of Reviews Varies
"Freaking awesome!"
"Worst mistake of my life after dating my ex was downloading this app…"
"Synching files takes forever after the new release, help!"
"I love the that it lets me share files with my family easily"
8. Star Rating: Limited Usefulness
• Rating for the whole app, not the features
• The review text mentions the sentiment about the single features
"Sharing files is great, but the white background is horrible. Please put the background from before!"
9. Our Goal: Extract Features and Analyze their Sentiment
[Figure: bar chart of Dropbox features (upload photo, open file, file name, view file, pdf view, delete photo, take photo, move file, want upload, update time) by appearance frequency (0 to 150), split into positive and negative sentiment. Example feature scores shown: -3.9, -2.5, -2.5, -4.2, +4.5, +4.0, +4.2, +3.7, +2.8, +3.6]
10. Outline of the Talk
1 Motivation
2 Approach
3 Evaluation
4 Summary
11. Feature Extraction and Sentiment Analysis
[Pipeline diagram]
Input: user reviews
Sentiment Analysis: lexical sentiment analysis (SentiStrength) [Thelwall et al. JASIST 2010] → sentiment scores for each review
Feature Extraction: POS tagging, removal of stopwords and sentiment words, lemmatization → nouns, verbs and adjectives → collocation finding (NLTK), synonyms (Wordnet) → fine-grained features
Feature-sentiment estimation → feature-sentiment scores
Topic modeling (LDA) and weighted average → high-level features with sentiment score
12. Feature Extraction and Sentiment Analysis
Sentence: had fun using it before but now it's really horrible :( help!
Word scores: had fun[2] using it before but now its really horrible[-4] [-1 booster word] :( [-1 emoticon] help!![-1 punctuation emphasis]
Sentence score: {2, -5}
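The word scores in this example can be combined into the sentence score along the following lines. This is a minimal sketch and not SentiStrength itself: the tiny lexicon, the booster and emoticon tables, and the function name `sentence_score` are hypothetical and chosen only to reproduce the example. It assumes that the sentence score is the pair (strongest positive score, strongest negative score) with neutral defaults of 1 and -1.

```python
# Toy lexicon-based scorer; all word lists below are illustrative assumptions.
LEXICON = {"fun": 2, "horrible": -4}   # word -> sentiment score
BOOSTERS = {"really": 1}               # strengthens the next sentiment word
EMOTICONS = {":(": -1, ":)": 1}

def sentence_score(tokens):
    """Return (strongest positive, strongest negative) for a list of tokens."""
    pos, neg = 1, -1                   # neutral defaults
    boost = 0
    for tok in tokens:
        word = tok.lower().rstrip("!")
        if word in BOOSTERS:
            boost = BOOSTERS[word]     # remember the boost for the next word
            continue
        score = LEXICON.get(word, EMOTICONS.get(tok, 0))
        if score > 0:
            score += boost
        elif score < 0:
            score -= boost
        # Assumption: repeated exclamation marks on a neutral word count as
        # negative emphasis, mirroring the "help!!" example.
        if score == 0 and tok.endswith("!!"):
            score = -1
        boost = 0
        pos, neg = max(pos, score), min(neg, score)
    return pos, neg

tokens = "had fun using it before but now its really horrible :( help!!".split()
print(sentence_score(tokens))
```

With the tables above, "fun" yields the positive score and the boosted "horrible" yields the negative score, matching the sentence score shown on the slide.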
13. Feature Extraction and Sentiment Analysis
"The pdf viewer is great, but the white background is horrible. Please put the background from before!"
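The preprocessing and collocation steps can be sketched on this example review. The code below is a simplified stand-in that uses only the standard library: the slides name NLTK for collocation finding, whereas here the stopword and sentiment-word lists are tiny hand-made assumptions, part-of-speech filtering and lemmatization are skipped, and the hypothetical function `candidate_features` simply counts word pairs that occur close to each other.

```python
from collections import Counter

# Hand-made lists for illustration only (assumptions, not the study's lists).
STOPWORDS = {"the", "is", "but", "please", "put", "from", "before", "a", "it"}
SENTIMENT_WORDS = {"great", "horrible"}
IGNORED = STOPWORDS | SENTIMENT_WORDS

def content_words(sentence):
    """Lowercase, strip punctuation, drop stopwords and sentiment words."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return [w for w in words if w and w not in IGNORED]

def candidate_features(sentences, window=3, min_count=2):
    """Count unordered word pairs that appear within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        words = content_words(sentence)
        for i, first in enumerate(words):
            for second in words[i + 1 : i + window]:
                if first != second:
                    counts[frozenset((first, second))] += 1
    # Keep only pairs that occur often enough to be feature candidates.
    return {pair: n for pair, n in counts.items() if n >= min_count}

reviews = [
    "The pdf viewer is great, but the white background is horrible.",
    "Please put the white background from before!",
    "The pdf viewer is great.",
]
for pair, n in candidate_features(reviews).items():
    print(sorted(pair), n)
```

On these three toy sentences only "pdf viewer" and "white background" pass the frequency threshold; a real collocation finder would additionally apply a statistical association measure.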
14. Feature Extraction and Sentiment Analysis
<pdf view> = <view pdf>
<picture view> = <photo view> = <photo see>
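One way to read these equivalences is that word order is ignored and synonyms are mapped to a shared representative. The sketch below uses a small hand-written synonym table in place of a Wordnet lookup; the table and the function names `normalize_feature` and `group_features` are hypothetical.

```python
# Hypothetical synonym table standing in for a Wordnet lookup.
SYNONYMS = {"picture": "photo", "see": "view"}

def normalize_feature(feature):
    """Map a feature to an order-independent key with synonyms replaced."""
    return frozenset(SYNONYMS.get(w, w) for w in feature.lower().split())

def group_features(features):
    """Group feature strings that normalize to the same key."""
    groups = {}
    for feature in features:
        groups.setdefault(normalize_feature(feature), []).append(feature)
    return list(groups.values())

features = ["pdf view", "view pdf", "picture view", "photo view", "photo see"]
print(group_features(features))
```

The five features collapse into two groups, one for viewing PDFs and one for viewing photos.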
15. Feature Extraction and Sentiment Analysis
The pdf viewer is great {+3,-1}
pdf viewer: 3
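This estimation step can be sketched under the assumption that a feature inherits the score of the sentence it appears in and that the stronger of the two sentence scores wins. The tie-breaking rule and the function name `feature_sentiment` are assumptions made for illustration and are not stated on the slides.

```python
def feature_sentiment(sentence_score):
    """Collapse a (positive, negative) sentence score into one feature score.

    Assumption: the score with the larger absolute value wins, and ties go to
    the negative score.
    """
    positive, negative = sentence_score
    return positive if positive > abs(negative) else negative

print(feature_sentiment((3, -1)))
```

For the sentence score {+3,-1} this gives the pdf viewer feature a score of 3, as in the example.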
16. Feature-Sentiment Scores
[Figure: two bar charts of app features by appearance frequency, split into positive and negative sentiment.
Pinterest (Android), frequency axis 0 to 140: use easy, find thing, pin board, pin thing, idea recipe, force close, update search, look pin, search something, show pin.
Dropbox (iOS), frequency axis 0 to 150: upload photo, open file, file name, view file, pdf view, delete photo, take photo, move file, want upload, update time.]
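The last pipeline step, topic modeling (LDA) and weighted average, groups fine-grained features into high-level features and gives each group one sentiment score. The sketch below shows only the averaging part. The grouping is taken as given, and the weights (imagined here as how strongly each feature belongs to the topic) and all numbers are made-up assumptions rather than LDA output.

```python
def topic_sentiment(features):
    """Weighted average of feature scores; `features` holds (score, weight) pairs."""
    total_weight = sum(weight for _, weight in features)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(score * weight for score, weight in features) / total_weight

# Hypothetical high-level feature made of three fine-grained features.
photo_topic = [(4.0, 0.5), (2.0, 0.25), (-2.0, 0.25)]
print(topic_sentiment(photo_topic))
```

Features with a higher weight pull the high-level score toward their own sentiment, so a single weakly related negative feature does not dominate the group.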
20. Evaluation Questions
Accuracy
• Feature Extraction: Does the extracted text represent app features?
• Sentiment: Is the automated sentiment estimation similar to human assessment?
Relevance
• Are the extracted and grouped features coherent and relevant for app developers and analysts?
21. Evaluation Method
Collect data → Build a truth set → Run the approach → Compare results
[Figure: Dropbox feature chart (appearance frequency with positive and negative sentiment) shown as an example result]
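Comparing the results of the approach with the truth set can be expressed as precision and recall. This is a minimal sketch, assuming that both the approach and the coders yield a set of features per review and that a match means identical feature strings. The matching rule, the function name `precision_recall`, and the example data are hypothetical.

```python
def precision_recall(extracted, truth):
    """Compute precision and recall over per-review feature sets.

    `extracted` and `truth` map a review id to a set of feature strings.
    """
    tp = fp = fn = 0
    for review_id in extracted.keys() | truth.keys():
        found = extracted.get(review_id, set())
        expected = truth.get(review_id, set())
        tp += len(found & expected)    # features both sides agree on
        fp += len(found - expected)    # extracted but not in the truth set
        fn += len(expected - found)    # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

extracted = {1: {"upload photo", "open file"}, 2: {"update time"}}
truth = {1: {"upload photo"}, 2: {"view file", "update time"}}
print(precision_recall(extracted, truth))
```

Precision drops when the approach reports features the coders did not mark, and recall drops when it misses features they did mark.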
22. Evaluation Data
Store        Category       #Reviews  ∅ Length
App Store    Games          1538      132
App Store    Productivity   2009      172
App Store    Productivity   8878      200
App Store    Travel         3165      142
Google Play  Photography    4438      51
Google Play  Social         4486      81
Google Play  Communication  7696      38
23. Truth Set Creation
• Stratified sampling of 2800 reviews
• Manual content analysis by 9 coders
• Peer coding of each review
• Coding guide to reduce disagreement
• A dedicated coding tool (CADO)
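Stratified sampling can be sketched as follows. The slides do not say which strata were used; this sketch assumes star ratings as strata with proportional allocation, and the function name `stratified_sample`, the seed, and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(reviews, key, size, seed=0):
    """Draw `size` reviews so each stratum keeps roughly its share of the data."""
    if size > len(reviews):
        raise ValueError("sample size exceeds number of reviews")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for review in reviews:
        strata[key(review)].append(review)
    total = len(reviews)
    # Proportional allocation, rounded down; leftovers go to the largest remainders.
    quotas = {k: size * len(v) / total for k, v in strata.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    sample = []
    for k, items in strata.items():
        sample.extend(rng.sample(items, counts[k]))
    return sample

# Toy data: 100 reviews, 60 with five stars, 30 with three, 10 with one.
reviews = [{"id": i, "stars": 5 if i < 60 else 3 if i < 90 else 1} for i in range(100)]
sample = stratified_sample(reviews, key=lambda r: r["stars"], size=10)
print(sorted(r["stars"] for r in sample))
```

Sampling per stratum keeps rare groups, such as one-star reviews in the toy data, represented in the sample that the coders label.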
30. Accuracy of Sentiment Analysis
Sentence-based sentiment analysis: correlation factor 0.445, p-value < 2.2e-16
Review-based sentiment analysis: correlation factor 0.592, p-value < 2.2e-16
[Illustration]
Whole review: This is the first sentence. This is the second sentence. This is the third sentence. And so on.
Per sentence: [-2] This is the first sentence. [1] This is the second sentence. [-2] This is the third sentence. [-4] This is the fourth sentence. [1] And so on.
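The agreement between automated and human sentiment scores can be measured with a correlation coefficient, as sketched below. The slides do not state which coefficient was used; Pearson's r is assumed here purely for illustration, and the human scores are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

automated = [-2, 1, -2, -4, 1]   # per-sentence scores from the illustration
human = [-1, 1, -3, -4, 2]       # hypothetical coder scores
print(round(pearson(automated, human), 3))
```

The same function applies to both settings: one score pair per sentence for the sentence-based analysis, or one pair per review for the review-based analysis.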
31. Relevance to RE and Coherence
App          Coherence  Requirements Relevance
Angrybirds   Good       Good
Dropbox      Good       Very Good
Evernote     Good       Good
Tripadvisor  Good       Very Good
Picsart      Neutral    Good
Pinterest    Good       Good
Whatsapp     Bad        Good
33. Tools to Filter, Analyze, and Aggregate Feedback: Review Analytics
• Different levels of granularity
• Understanding (specific) user needs
• Supports release planning and work prioritization
• Detection of bug reports and feature requests
[Maalej, SEIF Award 2014] [Guzman et al. VISSOFT '14]
34. Summary
• Reviews include important information about features but a lot of noise
• Tools for reviews analytics and classification
• We can extract features and their sentiments in the reviews
• Satisfactory accuracy, high difference between apps, good RE relevance
35.
Emitzá Guzmán, TU München, Germany, emitza.guzman@mytum.de
Prof. Dr. Walid Maalej, Uni Hamburg, Germany, maalej@informatik.uni-hamburg.de, maalejw