Detecting Product Aspect Categories
using SVMs
Author:
Andrew E. Hagens
Supervisor:
Dr. Flavius Frasincar
Co-reader:
Kim Schouten
A thesis submitted in fulfilment of the requirements for the degree of
Master of Econometrics and Management Science
at the
Department of Econometrics
Erasmus School of Economics
Erasmus University Rotterdam
July 2015
Abstract
Department of Econometrics
Erasmus School of Economics
Erasmus University Rotterdam
Detecting Product Aspect Categories using SVMs
by Andrew E. Hagens
Consumer reviews are becoming increasingly important to potential buyers of a product.
To determine what is important in a review, we must find the product features being
discussed. With the rise of the World Wide Web, online reviews have become a valuable
source of information for consumers when they are deciding on the purchase of a
product. In this work we present two methods that detect product aspect categories
in a review sentence. We propose to tackle the problem using machine learning
algorithms, support vector machines in our case. As with some methods in the
SemEval-2014 competition, we propose two methods that use linguistic patterns such as
word n-grams to find possible aspect categories. In this thesis we want to gain insight
into the effects of different patterns. The results from the proposed methods show that
we can extract a large amount of information using relatively simple machine learning
methods.
Acknowledgements
First I would like to thank my supervisors Dr. Flavius Frasincar and Kim Schouten
for the inspiring experience of writing this thesis. They showed a level of knowledge in the
area of data mining that was both impressive and plentiful, and their support was inspiring.
I would also like to thank my parents Eric and Karina and my brother Emill for the unconditional
support they showed in all my endeavors. This has given me the freedom to learn and
grow and to follow my interests. I would like to thank my girlfriend Birgitt for her never-ending
encouragement. This has helped me to stay focused and relaxed through the
whole experience of doing scientific research, an experience I will carry with me in all
my future endeavors.
Contents
Abstract iii
Acknowledgements v
Contents vi
List of Figures ix
List of Tables xi
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Research Goal 7
2.1 Research Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Related Work 9
3.1 Implicit Aspect Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Aspect Category Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Method Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4 Methodology 15
4.1 Method Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Feature-Space Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 Word Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Lexicon and Lemmatization . . . . . . . . . . . . . . . . . . 17
N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Part-of-speech tagging . . . . . . . . . . . . . . . . . . . . . 18
Chunk Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Aspect Category Detection Methods . . . . . . . . . . . . . . . . . . . . . 21
4.4.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . 21
4.4.2 Multi-Class Support Vector Machines . . . . . . . . . . . . . . . . 22
4.4.3 Strict One-Vs-All Support Vector Machines Method . . . . . . . . 23
4.4.4 Two-Stage Classification Scheme Support Vector Machines Method 28
5 Evaluation 33
5.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Spell checker . . . . . . . . . . . . . . . . . . . . . . . . . . 34
POS Tagger . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Word Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chunker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 36
SVM Classification Algorithm . . . . . . . . . . . . . . . . . 36
5.2 Restaurant Review Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4.1 Part Of Speech Filter . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4.2 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.4.3 Threshold vs. No Threshold . . . . . . . . . . . . . . . . . . . . . . 43
5.4.4 OVA Scheme based vs. Two-Stage Scheme based . . . . . . . . . . 45
5.4.5 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Algorithm Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.1 Dominant Aspect Category Tagger . . . . . . . . . . . . . . . . . . 47
5.5.2 Random Aspect Category Detector . . . . . . . . . . . . . . . . . . 47
5.5.3 Algorithm Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 48
6 Conclusion and Future Work 53
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
A Part Of Speech Filter Annotation 57
Bibliography 59
List of Figures
1.1 Review summary for Apple MacBook (2015) with scores for aspect cate-
gories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
4.1 General framework for aspect category detection using machine learning . 16
4.2 Flowchart showing an example of the OVA scheme based method . . . . . 27
4.3 Flowchart showing an example of the two-stage classification scheme based
method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1 A general overview of training and prediction processes implemented in
this thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Overview of the process of converting a sentence into a set of instances . . 35
5.3 Distribution of the number of aspect categories in a sentence . . . . . . . . 37
5.4 Distribution of the number of aspect categories in a sentence . . . . . . . 37
5.5 Distribution of the number of aspect categories in a sentence . . . . . . . . 38
5.6 Results for The Strict OVA Aspect Category Detection method with 1-,2-
and 3-grams without a trained threshold for each individual aspect category 41
5.7 Results for The Strict OVA Aspect Category Detection method with 1-,2-
and 3-gram with a trained threshold for each individual aspect category . 41
5.8 Results for the Two-Stage Classification Scheme method with 1-,2- and
3-grams with a trained threshold for each individual aspect category . . . 42
5.9 Results for Two-Stage Classification Scheme method with 1-,2- and 3-
grams without a trained threshold for each individual aspect category . . 42
5.10 Arithmetic difference of F1 scores for the OVA based method and the
two-stage method with and without threshold . . . . . . . . . . . . . . . . 43
5.11 Arithmetic difference of Precision scores for the OVA based method and
the two-stage method with and without threshold . . . . . . . . . . . . . . 44
5.12 Arithmetic difference of Recall scores for the OVA based method and the
two-stage method with and without threshold . . . . . . . . . . . . . . . . 44
5.13 Arithmetic difference of F1 scores for the OVA based method and the
two-stage method with and without threshold . . . . . . . . . . . . . . . . 45
5.14 Arithmetic difference of precision scores for the OVA based method and
the two-stage method with and without threshold . . . . . . . . . . . . . . 46
5.15 Arithmetic difference of recall scores for the OVA based method and the
two-stage method with and without threshold . . . . . . . . . . . . . . . . 46
List of Tables
3.1 An overview of the results of the related work that is discussed. . . . . . . 13
5.1 Confusion table for classification problems . . . . . . . . . . . . . . . . . . 38
5.2 Final parameters for the OVA scheme based method and the method
based on a Two-Stage classification scheme . . . . . . . . . . . . . . . . . 47
5.3 F1, Recall and Precision scores for different methods when evaluation is
done on the test set provided by SemEval-2014 . . . . . . . . . . . . . . . 50
A.1 All part-of-speech filters applied to the parameter tuning in this thesis . . 57
Chapter 1
Introduction
In this chapter we will introduce the subject of this thesis. Next we introduce some
terminology that is used in this thesis to gain a better understanding of the subject at
hand. Finally we present the structure of this thesis.
1.1 Problem Definition
When someone forms an opinion, a key part in this process is the influence from the
opinions of others (Liu, 2012). Not long ago people relied on the opinions of family and
friends to form an opinion about a product. Another way of forming an opinion on a
product is to read specialized magazines or books. With the rise of the World Wide Web,
the importance of online shopping has increased. An important part of online commerce
is the ability for a consumer to write a review for a product. When a consumer decides
to buy a product online, he/she most likely will read through the reviews by other
consumers for that product to get an idea of what the overall sentiment is towards the
particular product (Bickart and Schindler, 2001; Feldman, 2013). Reading through all
reviews for one product can be a hassle. For this reason, it would be beneficial to find
an efficient way of giving the consumer an overview of the overall sentiment based on
product reviews.
The task of creating a relevant overview of the opinions expressed in a review can
be divided into four subtasks (Popescu and Etzioni, 2007):
1. Identify product aspects;
2. Identify opinions regarding product aspects;
3. Determine the polarity of opinions;
4. Rank opinions based on their strength;
A product aspect is also called a feature. The number of aspects in a collection of reviews
can become large. To create an overview of the sentiment towards the product, we can
define aspect categories. Aspect categories are a way of summarizing aspects that are
closely related. An example of a curated summary of a laptop review, with some aspect
categories, can be seen in Figure 1.1.
Figure 1.1: Review summary for Apple MacBook (2015) with scores for aspect cate-
gories.
The research presented in this thesis does not tackle all four subtasks presented in
Popescu and Etzioni (2007), but tackles a variation on subtask 1. In this thesis we will
address the task of finding aspect categories in consumer review sentences. Obviously,
the review summary presented in Figure 1.1 is composed by a human who scores certain
categories using the full review as a reference. To understand how an overview of the
aspect categories can be constructed using the reviews from consumers, we can look at
Sentence 1 which is extracted from a restaurant review:
(1) “Best of all is the warm vibe, the owner is super friendly and service is fast.”
In Sentence 1 we find that there are three aspects: vibe, owner, and service. We can further
reduce this list to two aspect categories, namely ambiance and service. This classification
gives a coarser overall view of the product's aspects and enables us to better score these
aspects. The methods presented in this research were specifically developed to determine
aspect categories in a sentence.
Given that aspect categories are assigned to already existing aspects, it may be of
interest to explore the methods concerned with the subtask of finding product aspects.
Some early methods that were developed to extract product aspects were already pre-
sented in (Ding et al., 2009; Hu and Liu, 2004; Kobayashi et al., 2005; Mei et al., 2007;
Popescu and Etzioni, 2007). The methods proposed in (Ding et al., 2009; Hu and Liu,
2004; Kobayashi et al., 2005; Mei et al., 2007; Popescu and Etzioni, 2007) use relatively
few linguistic attributes (e.g., lexical and semantic features) to find product aspects.
The feature that these methods have in common is that they are based on some form of
word/phrase co-occurrences.
Methods based on co-occurrences have been shown to be adequate for modeling
specific word/phrase relations. To find aspect categories, the context in which word-
s/phrases are used is an important source of information to determine the right aspect
category. To better understand the context of a word/phrase, surrounding word/phrase
patterns are used to determine the correct aspect categories in a sentence. The pat-
terns that arise when decomposing a sentence can be used to determine which aspect
categories are addressed in a sentence.
The number of words/phrases in a set of reviews can be quite large. Also, there are
many more combinations possible for sequences of words/phrases neighboring a specific
word/phrase. Solving such large-scale classification problems is crucial in areas such as
text classification. The method proposed in this thesis can be described as a method
for solving a text-classification problem. This problem can be characterized by having
large sparse data with a huge number of instances and features (Fan et al., 2008). An
efficient and promising method to solve the classification problem for large datasets with
high-dimensionality is support vector machines (SVM). For this reason we propose to use
SVM as the classification algorithm of choice for the aspect category detection methods
presented in this thesis. This thesis will also concentrate on the numerical representation
of a word given the context in which the word appears. An example of a method to give a
numerical representation is the method developed in Mikolov et al. (2013a,b,c), namely
word2vec. In this thesis we propose a similar numerical representation where words
are converted to a vector form. The vector dimensions are determined by important
sentence context features.
1.2 Terminology
In Section 1.1 we introduced some terminology but provided little explanation. In this
section we introduce the terms most commonly used in this thesis.
Aspect An aspect is a word or a collection of words that describes a specific feature of
the subject being discussed in a sentence. Sentences can contain zero or more aspects. Sentence 2
is an example of the type of sentence commonly found in a review.
(2) “I can barely use any usb devices because they will not stay connected properly.”
The term ‘usb devices’ in Sentence 2 is tagged as an aspect term. This is because we
know that all opinions in the sentence are related to ‘usb devices’.
In this research we distinguish between two types of aspects, namely explicit and implicit
aspects. The reason for this distinction is that explicit aspects are relatively easy
to find while implicit aspects are relatively hard to determine. In Sentence 2 we
found the aspect ‘usb devices’ because the aspect was explicitly mentioned. In Liu et al.
(2005) the authors argued that some aspects are not explicitly mentioned but rather
can be inferred from the sentence. Such an aspect is implicitly
mentioned (Liu et al., 2005). Sentence 3 gives an example of a sentence with an implied
aspect.
(3) “When we went to use it again, there was sound but no picture.”
The aspect tagged for Sentence 3 is “camera”. Notice here that the aspect was never
explicitly mentioned. The words “sound” and “picture” together imply that we are
reading about a camera. Although most aspects appear explicitly in sentences, the
number of implicit aspects can reach up to 30% of the total number of aspects (Wang
et al., 2013).
Aspect Category In Section 1.1 we introduced the concept of aspect categories. The
categories and their respective aspects are determined in advance. In this thesis we
use “aspect category” and “category” interchangeably. At its basis the aspect category
serves the same purpose as aspects, that is to describe a product or entity. The number
of unique aspects in a set of reviews is generally quite large. Aspect categories are a
convenient way of labeling aspects that are closely related. This enables a consumer to
see an overview of what the overall opinion is on a certain group of aspects. An example
of such an overview can be seen in Figure 1.1.
We previously mentioned that aspects can appear explicitly or implicitly. Categories
can also be mentioned either explicitly or implicitly. But because a category represents
a group of aspects with a single label, we assume that categories do not always appear
explicitly but appear implicitly with the mention of an aspect. To illustrate this
property and to give a general insight into aspect categories, we can look at Sentence 4
and Sentence 5 as example sentences from reviews about restaurants. The predefined
categories are ‘food’ and ‘price’.
(4) “Great food at REASONABLE prices, makes for an evening that can’t be beat!”
(5) “He has visited Thailand and is quite expert on the cuisine.”
In Sentence 4 ‘price’ and ‘food’ are tagged as aspects. These tags have the same label
as the categories, thus the categories appear explicitly. The tagged aspect in Sentence 5
is ‘cuisine’. The corresponding tagged category is ‘food’. Here the category ‘food’ was
never mentioned but can be inferred from the fact that the aspect ‘cuisine’ describes some
aspect that belongs to the category ‘food’. Sentence 5 is a great example of how aspects
are related to their predefined category.
Feature-space In Section 1.1 we proposed to use SVM to detect aspect categories
in a review sentence. At its basis, SVM is a classification technique that determines a
decision boundary for a binary classification problem. Text classification is notorious for
having high-dimensional data, thus the trained SVM for the defined problem has a high-
dimensional problem space. In our case each dimension represents a feature (attribute)
of the sentence. In this thesis we will refer to the problem space as a ’feature-space’ and
each dimension as a feature. This enables us to represent a word/phrase numerically
as a vector. The number of features (feature-space dimensions) is determined by the
vocabulary. As mentioned before, a similar vector representation of words is presented
in (Mikolov et al., 2013a,b,c). This advanced method is designed for large clusters of
computers to process a large amount of data with a neural network as the learning algorithm.
1.3 Thesis Structure
In Chapter 2 we will formally present the goal and scope of the research presented in
this thesis. In Chapter 3 we discuss previous work that is related to aspect and category
detection. In Chapter 4 we first introduce a general framework for category detection.
From this framework we present two methods to perform the task of category detection.
For both methods we will present the pseudo-code and an example. In Chapter 5
we introduce the dataset we use to evaluate our methods. After the data has been
introduced we present the evaluation metrics we will use to measure the performance of
the methods proposed in this thesis. The performance of our methods will be compared
to some baseline methods and to a couple of methods from the literature. Chapter 6 presents the conclusions
we arrived at after evaluation. Lastly we suggest some work that can be done in the
future.
Chapter 2
Research Goal
Although the concept of detecting explicit entity (product) aspects is not new, there
has been relatively little research in the area of extracting aspects that are implied.
The research presented in Su et al. (2008) was one of the first to attempt to tackle the
problem of detecting implicit entity aspects. The authors base their model on inter- and
intra-word relations. These relations are used for clustering and mutual reinforcement to
create a set of association rules. The association rules depict the mapping of an opinion
word to the associated feature word. Although this seems to be a reasonable method, it
fails to capture important information with regard to the context of the opinion word.
Another problem encountered in methods that are based on association rules and/or
co-occurrences is that if a particular opinion word was never associated with a feature
then it will not be discovered, possibly negatively affecting the performance. This is
often due to the sparseness of co-occurrences.
The goal of this thesis is not to detect the entity aspects, but rather their category.
Because aspects are direct children of their categories, it is of interest to us to look at
previous attempts at detecting aspects that appear either explicitly or implicitly. An
interesting question arises when we look at most of the current research in detecting
implied aspects. The question goes as follows:
“How can you leverage the available information in a corpus to learn patterns that lead
to an accurate model for extracting aspect categories? ”
In order to provide an answer to the previous question, the following questions need
to be answered as well:
• What lexical features in a sentence are important for determining aspect categories?
• How important are the patterns of words and/or lexical features for determining
aspect categories?
• What algorithm suits pattern recognition for aspect category detection?
• How can we compare the performance of a proposed method to already existing
methods?
2.1 Research Scope
The focus of this research is primarily on extracting aspect categories with the help of
classification algorithms. More specifically this research concerns itself with all steps
of building a classification system for detecting aspect categories in consumer reviews.
The preprocessing steps implemented here will make use of several off-the-shelf tools
from the Natural Language Processing field. This research will not actively improve on
these existing methods, but mostly leverage these methods to improve performance of
the overall system. The focal point of this research is to present a method to extract
contextual information from sentences and to discover patterns in this information to
find aspect categories.
2.2 Methodology
The first part of this research is a literature survey on methods previously devised for ex-
tracting features (aspects) from a corpus of text. Next a general framework is presented
to detect aspect categories. In this research we assume that aspects are just specializa-
tions of their categories. Therefore we assume that the methods for aspect detection
and category detection are relatively similar. After the framework is introduced, two
methods for detecting aspect categories are presented. Both methods are based on the
framework presented in this thesis. These methods will use a sentence as their input and
output a list of predicted aspect categories. Last some baseline algorithms are presented
to form a reference to which we can compare the methods presented in this research.
We will also compare our methods to existing methods for category detection.
Chapter 3
Related Work
Aspect detection is a relatively new area of research in the Natural Language Processing
domain. It is related to the fields of Opinion mining and Sentiment Analysis (Liu,
2012). Aspect detection is used in these areas to extract opinion/sentiment about a
certain aspect of a product. This chapter discusses the current approaches that directly
or indirectly tackle aspect detection. As mentioned before, aspects and categories can
appear explicitly or implicitly. In Section 3.1 we discuss the research done on finding
aspects that are implied. Section 3.2 will present the research done on the task of
determining aspect categories.
3.1 Implicit Aspect Detection
Finding aspects can be a challenge in itself. Most methods to find aspects concentrate
on aspects that are explicitly mentioned in a document or sentence (Ding et al., 2009;
Hu and Liu, 2004; Kobayashi et al., 2005; Mei et al., 2007; Popescu and Etzioni, 2007).
OPINE is a review-mining system introduced in Popescu and Etzioni (2007) to find
semantic orientation of words in the context of given product features and sentences.
The research goal of Popescu and Etzioni (2007) comes really close to the research
question proposed in this research. The authors present a thorough system for opinion
mining. Specifically they present methods for detecting aspects that take into account
implicitly and explicitly mentioned aspects. The explicit aspect detector is discussed in
more detail than the implicit aspect detector. They use opinion words and patterns to
extract implicit features. More specifically they use neighborhood features of a word
to determine if an aspect appears in a sentence. The authors also developed a method
to find patterns of the semantic orientation of an opinion word in the context of an
associated aspect and the input sentence. The experiment the authors constructed was
geared towards opinion mining. The research presents interesting ideas, but the
authors do not present results on implicit aspect detection.
One of the earlier attempts at extracting implicit aspects is done in Su et al. (2006).
At its basis the authors propose to use a method that analyzes semantic associations,
based on Point-wise Mutual Information (PMI), to determine if a word represents an
aspect. It is easy to understand the logic that the semantic association of an opinion
word, with a corresponding aspect, will help us determine the correct aspect implied in
the sentence that contains the opinion word. However, the results cannot be verified as
the authors did not include any tangible results in the presentation of their method.
In the field of opinion mining, the authors in Su et al. (2008) proposed a method that
clusters words with a high semantic similarity, to detect implicit aspects. The words
used for clustering are words that have been tagged as aspects and opinion words. The
thinking behind this is that words that appear together often have a high similarity.
By this reasoning we can estimate the aspect by looking at the given opinion word
in the context of the sentence it appears in. To model the complicated relationships
between product aspects and opinion words, the authors consider two sets of words: a
set of product aspect words and a set of opinion words. After the definition of the sets,
the clusters and the inter- and intra-relationships of the aspect and opinion words are
iteratively determined. To calculate the similarity between two words the authors propose to
combine a traditional approach for calculating similarity with a similarity metric based
on the retrospective relationships between certain words. The limitation in this research
is that the authors only consider adjectives as opinion words. In practice, adjectives do
not cover the wide range of opinions that are expressed. The authors did not provide any
numerical results, which means that we cannot verify the performance of this method.
In Hai et al. (2011), the authors also tackled the problem of identifying implicit entity
aspects. They proposed to identify the implicitly mentioned aspects using co-occurrence
association rule mining. The method is based on the co-occurrence count of an explicitly
mentioned aspect and an opinion word. Explicitly mentioned aspects can be extracted
using existing methods. In Hai et al. (2011), the explicitly mentioned aspects can be
detected using dependency relations. Opinion words are extracted using part-of-
speech tags. After building the aspect and opinion words sets, the co-occurrence matrix
between these two word sets is generated. The association rules are mined based on the
co-occurrence matrix. Based on this mapping one can predict an implied aspect. The
authors report that the method yields an F1-measure of 74% on a dataset of Chinese
reviews of mobile phones. The performance of this method is heavily dependent on words
co-occurring often, which results in poor performance on sparse datasets.
Wang et al. (2013) uses the same basic idea as Hai et al. (2011). The proposed method
goes beyond the idea of mining for rules by simply mapping opinion words and explicit
features. There are three important extensions used in this research. First, the authors add
substring rules to a basic set of rules. This means that they build new rules from the
substrings of an existing rule. Secondly, they use the syntactic dependency between
lexical units to mine for potential rules. Lastly, they use a constrained topic model
to expand the word co-occurrence. The results presented in Wang et al. (2013) seem
to improve on those given in Hai et al. (2011). The authors reported an F1-measure of
75.51%, which is a slight improvement on the F1-measure reported for the method presented
in Hai et al. (2011).
In Zhang and Zhu (2013) the authors proposed a method that uses co-occurrences
similarly to the method in Hai et al. (2011). They also use a concept called double
propagation. Even though double propagation is a method employed for explicit aspect
detection, we will briefly discuss the method. We will then discuss the full method used
in Zhang and Zhu (2013).
To understand double propagation we look at the research of Qiu et al. (2009). The
researchers in Qiu et al. (2009) set out to find an efficient way of doing sentiment
analysis on text within a certain domain. Opinion expressions can vary wildly from
one domain to another. The proposed method exploits the relation between sentiment
words and the product features that modify the sentiment. The relations are used for
the propagation of information through both the sentiment and feature words. This is
called double propagation. The method proposed by the authors in Qiu et al. (2009)
performed favorably when the authors compared their method to several other methods (e.g.,
conditional random fields). It performed especially well when a relatively small training
corpus was used for training. The reason for this performance boost can be attributed
to the fact that the method in Qiu et al. (2009) finds implicit opinion words that modify
the aspect words (modifiers). An example of a modifier word is ‘small’ in Sentence 6.
This word describes an implied aspect, namely the aspect ‘size’, of the entity ‘mackerel’.
(6) ‘Lee caught a small mackerel.’
The researchers in Zhang and Zhu (2013) bring together the ideas from the methods
proposed by Hai et al. (2011) and Qiu et al. (2009). All previous work on detecting
implied entity aspects has two things in common. First, they extract opinion words
and explicit aspect words to create a mapping between the two. Secondly, they use
the co-occurrence between opinion words and the extracted aspect words to create the
mapping. The method presented in Zhang and Zhu (2013) uses co-occurrences and the
idea of double propagation to calculate the average correlations between an aspect word
and the notional words in a sentence. The feature with the highest average correlation
is selected as the implicit feature. The authors reported an F1-measure score of 80% on
a dataset of Chinese phone reviews.
3.2 Aspect Category Detection
The task of detecting aspect categories was introduced at the International Workshop on
Semantic Evaluation (SemEval-2014) as a subtask of the general task of ‘Aspect Based
Sentiment Analysis’.
As part of the SemEval-2014 task, Schouten and Frasincar (2014) present a method
that computes a score for the likelihood that a certain word
is a description for an aspect and/or its category. To train the method the authors use
a training-set that contains sentences that have been manually annotated by humans,
namely with aspects and aspect categories for each sentence. A co-occurrence matrix
is then constructed with the frequency that words co-occur with a predefined aspect or
category in a sentence. After the co-occurrence matrix is defined, the authors propose to
train a threshold for all aspect categories to decide when to choose which category is
most likely. To detect a category in a sentence, the score is calculated for each category
in the given sentence. If the score exceeds the threshold the category is chosen. The
authors presented an F1-measure of 59% on a dataset containing sentences from
restaurant reviews from the SemEval-2014 competition.
The method presented in Brychcın et al. (2014) uses a binary Maximum Entropy
classifier with term frequency–inverse document frequencies (tf-idf) and bag-of-words
as the feature-space. The authors reported an F1-measure score of 81.0%, which makes
this method the best performing constrained method in the SemEval-2014 workshop.
Another method that is based on a machine learning algorithm is proposed in Kir-
itchenko et al. (2014). The proposed method uses a (one-vs-all) SVM scheme for n pre-
defined aspect categories for classification. The feature-space for the SVMs is defined
by various n-grams and information from a lexicon learned from an unlabeled
dataset of restaurant reviews from Yelp. The sentences that have not been assigned
an aspect category are passed through a post-processing step that calculates a posterior
probability P(c|d) for category c given sentence d. The category with the highest prob-
ability is chosen as the most likely category for the sentence. Only if the probability
of the preliminary category exceeds a certain trained threshold is the sentence labeled
as referring to the category. This method achieves an F1-measure score of 88.6%. The method that was
actually submitted to the SemEval-2014 workshop did not use Yelp to learn the lexicon;
this constrained method had an F1-score of 82.2%. The Yelp-based method was therefore not
part of the official submissions, but it did outperform all other methods from
the SemEval-2014 workshop that participated in the task of ‘Aspect Based Sentiment
Analysis’.
3.3 Method Overview
In this section we present an overview of the methods we introduced in this chapter.
Table 3.1 summarizes the methods that are most relevant to the methods used in this thesis.
Method Type                  Method                          Task                              Result
Machine Learning-based       Kiritchenko et al. (2014)       Detect Aspect Categories          F1-score: 81%
                             Brychcın et al. (2014)          Detect Aspect Categories          F1-score: 82%
                             Kiritchenko et al. (2014)*      Detect Aspect Categories          F1-score: 89%
Frequency- and Rule-based    Wang et al. (2013)              Detect Implicit Product Aspects   F1-score: 75.51%
Frequency-based              Hai et al. (2011)               Detect Implicit Product Aspects   F1-score: 74%
                             Wang et al. (2013)              Detect Implicit Product Aspects   F1-score: 75.51%
                             Zhang and Zhu (2013)            Detect Implicit Product Aspects   F1-score: 80%
                             Schouten and Frasincar (2014)   Detect Aspect Categories          F1-score: 59%
Table 3.1: An overview of the results of the related work that is discussed.
* indicates a constrained method where the algorithm is trained using only the training set as a resource.
Chapter 4
Methodology
In this chapter we introduce two methods for detecting product aspect categories in
review sentences. First, we introduce a general framework which forms the basis for
both methods we are going to introduce. Next we present a method to define a multi-
dimensional feature-space to convert a word into a vector representation of the word
given a sentence. The last part of this chapter will be dedicated to discussing two methods
for aspect category detection using the framework and a proposed method for defining
the feature-space. Both methods are based on some form of the one-versus-all classifi-
cation scheme for multi-class SVM classification.
4.1 Method Framework
The two methods presented in this thesis are based on machine learning algorithms. The
choice for using machine learning as the foundation is based on the intuition that contextual
patterns exist around words that describe an aspect in a sentence. To illustrate
this we can look at the categories service and food. The word ‘horrible’ can be associated
with either ‘horrible service’ or ‘horrible food’. Methods based on association rule mining
(Hai et al., 2011; Wang et al., 2013) propose to tackle this by choosing the category
with the highest association probability. The statistical methods presented in (Zhang
and Zhu, 2013; Schouten and Frasincar, 2014) propose to solve this problem by using
a co-occurrence matrix to calculate the probability of choosing a category related to a
word.
The advantage of using association rule mining is that the algorithms are fast and
the rules are relatively easy for humans to understand. They give us insight in some
prevalent patterns. The disadvantage of this approach is that the trained algorithms
can miss some less obvious patterns that may appear in a sentence. Furthermore, when
a word appears that was not previously seen in the training set, we have to navigate a
non-intuitive set of steps to be able to generate a prediction.
In this thesis we choose to develop a method that enables us to convert any chosen
word in a sentence into vector form given the sentence it appears in. This enables us to
predict the category related to a word (represented as a point in the feature-space) by
looking at the neighboring words. The features in the multidimensional problem space
are defined by a selection of lexical and semantic attributes that have been selected by
off-the-shelf feature selection algorithms.
The spatial representation of a word enables us to use some advanced machine learn-
ing algorithms as the classification step in our framework. The important insight here
is that we aim to learn how the context of a word impacts its meaning. In this thesis
we assume that two similar words, used in the same context, are close to each other in
the problem space and that the two words are far apart when mentioned in different
contexts (Mikolov et al., 2013a). Based on this assumption, we can now choose classification
algorithms that are based on the spatial representation of data points. Figure 4.1
presents an overview of the general framework. This framework serves as the
structure of the methods presented in this thesis.
Figure 4.1: General framework for aspect category detection using machine learning (pipeline: training set → feature-space definition → classification algorithm training → determine categories)
4.2 Feature-Space Definition
In this section we will present a supervised learning algorithm to define the dimensions of
the feature-space. The reason for using a spatial approach is that it enables us to convert
a word into a vector representation of the word. First we will present a method for
defining the context in which a word appears. Next we define the features that are
included in the feature-space. Last, we will discuss a method to reduce the number of
dimensions in the feature-space.
4.2.1 Word Context
In this thesis we assume that the input data is in the form of individual review sentences.
The choice of using sentences as the input form stems from the fact that we want to
capture the information contained in the words that are around a word wi in a given
sentence s. As an alternative to a single sentence as the input string, we could also use a collection
of sequential sentences, e.g., a paragraph, as input. The disadvantage of using multiple
sentences is that many categories are mentioned in a paragraph, which leads to
over-generalization of the input context. In this thesis the context of a word is defined as the
parts of a sentence s that precede or follow a specific word wi at the ith word-index in
sentence s. The context of a word influences its meaning or effect.
4.2.2 Features
The researchers Flekova et al. (2014) use machine learning algorithms to determine what
makes a good biography. The authors present a list of nine classes of numerical features to
construct a feature-space that is well suited for text-classification problems. The feature-
space constructed in this thesis is based on three of the nine classes from Flekova et al.
(2014). We chose to use only three classes because most other classes in Flekova et al.
(2014) are geared more toward quality analysis. The three classes are discussed below.
Lexicon and Lemmatization The first step in developing the feature-space is to
construct a set of the words used in a training corpus. The set of words can grow to become
quite large because of the many grammatical forms a word can appear in. In the set
of words, there are related words with similar meanings that differ in grammatical form.
Examples of such words are democracy, democratic and democratization. To reduce
the size of the set of unique words, we propose to represent a word in its most basic
form possible. The process of finding the root word of an input word is called stemming.
This method is usually very crude in that it just cuts the end off a word and
hopes for the best. The most common and effective algorithm for stemming is presented
in Porter (1980).
A related, and more advanced, method for finding the base form of a word is called
‘Lemmatization’. The advantage of lemmatization is that it stems words based on vo-
cabulary and an analysis of the morphological properties of words. For an example
of the process of stemming/lemmatizing the words in a sentence we can look at the
words in Sentence 7 for the original sentence and Sentence 8 for the same sentence but
lemmatized.
(7) “It took half an hour to get our check, which was perfect since we could sit, have
drinks and talk!”
(8) “It take half an hour to get our check , which be perfect since we could sit , have
drink and talk !”
Here we see that a verb like ‘took’ has been transformed to its base word ‘take’. This
small example shows us the potential for word-set size reduction. If we now encounter
a word such as ‘taken’ in another sentence in the training set, the word-set will not
grow but will already contain the stemmed form of ‘taken’, namely ‘take’.
In this research lemmatization is done with the Stanford CoreNLP (Manning et al.,
2014) Java implementation.
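To make the lemmatization step concrete, the sketch below uses NLTK's WordNetLemmatizer as a stand-in for the Stanford CoreNLP lemmatizer; it is only an illustration of the idea under that assumption, not the implementation used in this thesis.

```python
# Illustrative sketch of lemmatization (assumption: NLTK with the WordNet data
# installed, e.g. via nltk.download('wordnet'); the thesis itself uses the
# Stanford CoreNLP Java lemmatizer).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

tokens = ["It", "took", "half", "an", "hour", "to", "get", "our", "check"]
# The WordNet lemmatizer needs a coarse POS hint; trying the verb reading maps
# 'took' to 'take', as in Sentence 8. Words it does not know are returned as-is.
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]
print(lemmas)  # ['it', 'take', 'half', 'an', 'hour', 'to', 'get', 'our', 'check']
```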
N-Grams In Section 4.2.1 the context of a word was defined as ‘the parts in a sentence that
precede or follow a word’. To capture the context of a word we propose to first construct
the set of contiguous sequences of n words from a sentence s. In this thesis we will
define one such sequence as an n-gram. To illustrate, we can define a simple sentence
s = {w1, w2, w3, w4}, where wi is the word at position i in sentence s and n denotes the
number of words in sentence s. Say we want to use a 2-gram model to get the context
of a word wi. The set of 1-grams C1 of sentence s is defined as C1 = {w1, w2, w3, w4}.
The set of 2-grams C2 extracted from sentence s is C2 = {w1w2, w2w3, w3w4}. In this
example the contexts of word w2 would be {w2, w1w2, w2w3}. In this research we will
define one feature as one element from the set of n-grams. In this example the feature
set will become F = C1 ∪ C2. For a more practical example of the n-gram set building,
we can look at Sentence 8. Below, the 1-gram set and the 2-gram set are constructed from
the words in the (lemmatized) first part of Sentence 8.
1-grams = {It, take, half, an, hour, to, get, our, check}
2-grams = {It take, take half, half an, an hour, hour to, to get, get our, our check}
The sets of n-grams for all sentences in the training set are initially added to the
feature set F. Although the number of features can grow quite large, later in this
thesis we will alleviate this problem by doing feature selection (Section 4.3).
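The construction above can be sketched in a few lines of code. The helper names below (ngrams, context_features) are illustrative and not taken from the thesis implementation; the snippet only reproduces the C1/C2 example for Sentence 8.

```python
# Minimal sketch of n-gram context extraction, assuming a lemmatized sentence
# is available as a list of tokens. Function names are hypothetical.
def ngrams(tokens, n):
    """All contiguous n-word sequences of the sentence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def context_features(tokens, index, max_n=2):
    """Return the n-grams (n = 1..max_n) that contain the word at `index`."""
    feats = []
    for n in range(1, max_n + 1):
        for start in range(max(0, index - n + 1), min(index + 1, len(tokens) - n + 1)):
            feats.append(" ".join(tokens[start:start + n]))
    return feats

sentence = "It take half an hour to get our check".split()
print(ngrams(sentence, 2))            # ['It take', 'take half', ..., 'our check']
print(context_features(sentence, 1))  # ['take', 'It take', 'take half']
```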
Part-of-speech tagging In the field of linguistics words can be labeled (tagged) such
that the label corresponds to a so-called part-of-speech (POS). This process is called
POS-tagging. The basic POS-tags are familiar ones such as noun and verb. The process
of tagging a POS to a word often involves advanced learning algorithms to detect hidden
relations between words in sentences/paragraphs to assign the correct tag given all these
properties. One such example is the POS-tagger presented in Toutanova and Manning
(2000), which is based on a maximum-entropy model. In most cases the supervised
tagging algorithms are trained on an annotated text corpus (e.g., the Penn Treebank
and the British National Corpus (Marcus et al., 1993; Leech et al., 1994)). To illustrate
part-of-speech tagging, we again use Sentence 7 to show how tagging works:
(9) “It/PRP took/VBD half/NN an/DT hour/NN to/TO get/VB our/PRP check/NN
which/WDT was/VBD perfect/JJ since/IN we/PRP could/MD sit/VB
have/VB drinks/NNS and/CC talk/VB”
In this thesis the parts-of-speech in a sentence are used to construct sets of n-grams
for the word tags in a sentence s. This will help in finding important linguistic patterns.
A simple example of such a pattern is an adjective following a noun. This tells us that
the noun is being modified by the adjective and that the noun is potentially referring to
an aspect (category) of the product.
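As an illustration of POS-tag n-grams, the snippet below tags a sentence with NLTK's off-the-shelf tagger (the thesis uses the Stanford maximum-entropy tagger) and builds tag bigrams; the exact tags depend on the tagger, so the output shown is indicative only.

```python
# Hedged sketch: POS tagging with NLTK's default tagger and building POS-tag
# bigrams in the same way as word n-grams. Assumes the NLTK models are
# installed (nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')).
import nltk

tokens = nltk.word_tokenize("It took half an hour to get our check")
tagged = nltk.pos_tag(tokens)             # e.g. [('It', 'PRP'), ('took', 'VBD'), ...]
pos_tags = [tag for _, tag in tagged]

pos_bigrams = [" ".join(pos_tags[i:i + 2]) for i in range(len(pos_tags) - 1)]
print(pos_bigrams)                         # e.g. ['PRP VBD', 'VBD NN', ...]
```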
Chunk Parsing Although n-grams can detect certain linguistic patterns, they do
not capture lexical patterns that appear with words outside the scope of n. That is,
they do not detect relations that may exist with words that are more than n positions away from the related
word. According to the research presented in Gee and Grosjean (1983) a sentence can
be parsed into so-called performance structures. The parsing method presented in (Abney,
1992) defines performance structures as structures of word clustering that emerge from
a variety of types of experimental data, such as pause durations in reading and naive
sentence diagramming. Although the presentation of performance structures makes
some general assumptions about the syntax rules, Abney (1992) uses the performance
structures to form a basis for a method that builds syntactic subgraphs of a sentence.
To capture the disjoint lexical patterns we will employ a method for shallow parsing
introduced in Abney (1992). According to the author, a sentence can be read in chunks.
Again we use Sentence 7 as an example. Sentence 10 shows a possible set of chunks
when we chunk Sentence 7. The chunks in this sentence serve as a fictional
example.
(10) “[It took] [half an hour] [to get] [our check], [which was perfect] [since we
could sit], [have drinks], [and talk]!”
The author Abney (1992) uses such an example to construct a method to parse a
sentence based on chunks. The author called this shallow parsing. Shallow parsing will
split a sentence into so-called phrases or chunks. These small phrases can give us further
insight into what information is contained in which part of a sentence. This can be
seen as a more advanced variable length n-gram generator. Building on the previous
example, the chunks created for Sentence 10 with the method presented in Abney (1992)
can be seen in Sentence 11.
(11) “[NP It] [VP took] [NP half an hour] [VP to get] [NP our check], [NP which]
[VP was] [ADJP perfect] [SBAR since] [NP we] [VP could sit], [VP have] [NP
drinks and talk]!”
In this example we get that “drinks and talk” is parsed as a Noun Phrase (NP). For
a full list of chunk tags we refer you to the tagging guidelines presented in Santorini
(1990).
In this research we will use the chunks in the same way we use the n-grams. We will
use this as a way of getting context from a word using lexical and semantic relations
that exist in the sentence. We will also include chunks built from the corresponding POS-tags.
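To illustrate what a chunker produces, the sketch below runs a small regular-expression chunk grammar over a POS-tagged version of Sentence 7; the grammar is an assumption made purely for this example and is much cruder than the off-the-shelf chunker used in the evaluation.

```python
# Hedged sketch of shallow parsing (chunking) with NLTK's RegexpParser.
# The chunk grammar below is illustrative only.
import nltk

tagged = [("It", "PRP"), ("took", "VBD"), ("half", "PDT"), ("an", "DT"),
          ("hour", "NN"), ("to", "TO"), ("get", "VB"), ("our", "PRP$"),
          ("check", "NN")]

grammar = r"""
  NP: {<PDT>?<DT>?<PRP\$>?<JJ>*<NN.*>+}   # simple noun phrases
  VP: {<TO>?<VB.*>+}                      # (simplified) verb phrases
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)   # chunks such as (VP took), (NP half an hour), (VP to get), (NP our check)
```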
4.3 Feature Selection
Large datasets are more and more common in many areas of research. Both the number
of instances and the number of features are growing with the increased ability to measure data points
with a large number of features. In the field of natural language processing
the datasets contain a large number of instances and a large number of features. Most
statistical methods have a hard time handling this high dimensionality. For this reason
we choose to incorporate a feature selection step. The benefits of this step are two-
fold. First we get a significant reduction in the size of the feature set. This lowers the overall
computation time of the algorithm. It also makes the method more robust, as without
feature selection we run the risk of not being able to generalize. The other benefit of this
step is that it can give us an insight into what the most important features are and
into what patterns matter most for which category in which context.
In this research we extract specific information related to a word, and the context
in which a word is mentioned. We extract the context information on a sentence level.
If we maintain the original feature set we run the risk of creating word vectors that are
too sparse, which can hurt performance of the method. To prune the feature space we
propose to use the Information Gain approach presented in Kullback (2012).
The Information Gain method is based on measures that give a numerical value to
the uniformity of a set of multidimensional points. To illustrate the idea let us
imagine we have a dataset with points that can be labeled as being either of class a or
class b. The goal is to find features that best split the data in such a way that we get the
best split between class a and b. The measure for this best split is called information. In
terms of features we can say that information is a measure to show how many features
are needed to correctly classify an instance as being class a or class b.
To choose if a feature is included in the set or not we must determine the information
gained by including a feature f in the feature-set F, where f ∉ F. To measure this
influence, the Information Gain method uses a measure for information entropy that is
presented in Schneider (1995) and shown in Equation 4.1.
H = -\sum_{i \in F} P_i \log_2 P_i \qquad (4.1)
where H denotes the information entropy, F denotes the set of features to analyze, and
Pi is the probability that a successful classification is made given the set of features.
Now that we have a measure to determine the importance of a feature for classification,
Equation 4.2 follows naturally to determine the Information gained from adding a feature
f to the feature-set F. We will denote this new set as F̄, where F̄ = F ∪ {f}. The gained
information IG(F, f) for feature f is determined by subtracting the information entropy
H(F̄) of the feature-set with feature f included from the information entropy H(F)
of the feature-set F without feature f:

IG(F, f) = H(F) - H(\bar{F}) \qquad (4.2)
The final feature-set is constructed by selecting those features that maximize the information gained
by adding them to the set.
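A minimal sketch of this selection criterion is given below, using the standard split-based formulation of information gain for a single binary feature; the toy data and the function names are assumptions for illustration only, not the thesis implementation.

```python
# Entropy (Equation 4.1) and information gain (in the spirit of Equation 4.2)
# for a single binary feature over labeled sentences. Illustrative toy data.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum_i P_i * log2(P_i) over the class distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - weighted entropy after splitting on the binary feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: feature = 1 if the n-gram "great food" occurs in the sentence.
has_ngram = [1, 1, 0, 0, 1, 0]
category  = ["food", "food", "service", "service", "food", "service"]
print(information_gain(has_ngram, category))  # 1.0 bit: the feature splits perfectly
```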
4.4 Aspect Category Detection Methods
In this section we will first discuss the SVM classification algorithm. Then we introduce
a method for solving multi-class classification problems with SVM algorithms. Then we
will introduce two methods for aspect category detection based on the general frame-
work presented in Section 4.1. Both methods use SVM algorithms as the classification
algorithm.
4.4.1 Support Vector Machines
The Support Vector Machine (SVM) algorithm was proposed in Cortes and Vapnik
(1995). The root of SVM is in statistical learning theory. SVM classification has been
used on real-world problems with good results. At the basic level SVM is a method
to find a hyperplane in a feature-space such that the hyperplane forms a separation
between instances that are labeled either −1 or +1 (Tan et al., 2006).
The method for determining a hyperplane depends on a so called kernel function.
Kernel functions can be classified as linear functions or non-linear functions. The authors
in Fan et al. (2008) argue that for datasets with a large feature-space (e.g., 4464 unique
features in one of our cases) and a large number of sparse instances, the benefits of non-
linear kernel functions are minimal while the time-complexity is very high. In the case
that the number of features is very large and the data sparse, the authors in Fan et al.
(2008) propose to use an SVM algorithm with a linear kernel function, as opposed to a
more complex non-linear kernel function. For an overview of the performance difference
between linear and non-linear kernel functions we refer to the research in Fan et al.
(2008).
Now that we have chosen the kernel function type we can proceed with discussing
the SVM algorithm with a linear kernel function. Assume we know that all sentences
s in the training data are labeled as having either category a or category b. Suppose
we want to detect the category of a sentence s by extracting a word wi and classifying the word
as related to either category a or category b. First we convert wi in sentence s into
a vector whose length is the number of features n = |F|. This word
vector is called a classification instance xi. We do this for all words in all sentences to
construct the input dataset. Given training instances xi ∈ Rn, i = 1, . . . , l, and a binary
class vector y ∈ Rl such that yi ∈ {+1, −1}, we can now train the SVM algorithm by solving
the optimization problem in Equation 4.3:
\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \bigl( \max(0, 1 - y_i w^T x_i) \bigr)^2 \qquad (4.3)
where solving this problem gives us an optimal weight vector w. This weight vector can be
seen as the separating hyperplane for the problem. Given the trained weight vector w
we can classify a vector according to the classifier in Equation 4.4:

\tilde{y} = \operatorname{sign}(w^T x) \qquad (4.4)
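As a concrete illustration, the sketch below trains a linear SVM with the squared hinge loss of Equation 4.3 on a few toy binary word-context vectors, using scikit-learn's LinearSVC (a LIBLINEAR wrapper) as a stand-in for the solver discussed in Fan et al. (2008); the data and the feature dimension are made up for the example.

```python
# Hedged sketch: linear SVM on sparse binary word-context vectors.
# LinearSVC with the (default) squared hinge loss optimizes an objective of the
# same form as Equation 4.3. Toy data only.
import numpy as np
from sklearn.svm import LinearSVC

# Four toy instances over a 5-dimensional feature-space (n = |F| = 5),
# labeled +1 (category a) or -1 (category b).
X = np.array([[1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 0]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = LinearSVC(C=1.0, loss="squared_hinge")  # linear kernel, squared hinge loss
clf.fit(X, y)

# Classification follows Equation 4.4 (here with the fitted intercept included).
w = clf.coef_.ravel()
x_new = np.array([1, 0, 0, 0, 0], dtype=float)
print(int(np.sign(w @ x_new + clf.intercept_[0])))  # predicted label, likely +1
```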
4.4.2 Multi-Class Support Vector Machines
The task of extracting an aspect category boils down to the problem of assigning a
label to an extracted word wi in a sentence s. An extracted word can be labeled by
its corresponding feature, if present. We leverage the information within a sentence to
determine whether a word might imply the presence of categories c or not. This means
that the number of categories can be expressed as |c| ≥ 1.
We already mentioned that SVM is actually a binary-classification algorithm. Aspect
category detection can be seen as a multi-class classification problem. The methods in
Freund and Schapire (1997) and Schapire and Singer (1999) are examples of algorithms
that tackle multi-class classification problems by combining binary classifiers. These methods are
mostly based on a boosting scheme to train multiple binary classifiers and use some
classification scheme to process an instance.
One of the simplest schemes for multi-class classification is to build N classifiers with
N denoting the number of categories in the category set C, each classifier distinguishing
between one category and the rest (Rifkin and Klautau, 2004). This scheme is known as
the “one-vs-all” (OVA) scheme. Another quite simple scheme is to build a classifier that
distinguishes between each pair of classes. In this scheme we build N(N − 1)/2 classifiers (Rifkin
and Klautau, 2004). This scheme is also known as the “all-vs-all” (AVA) scheme.
There have been several attempts at developing a true multi-class SVM algorithm
(Crammer and Singer, 2002a; Weston et al., 1999; Vapnik, 1998). In general an OVA
scheme does not offer a theoretical advantage over other multi-class classification schemes.
From a practical point of view OVA performs just as well as other schemes (Rifkin and Klautau,
2004). Because of its relative simplicity, the OVA scheme is the scheme we choose to
use.
In this thesis we chose to use an OVA scheme implementation that incorporates a
method presented in (Crammer and Singer, 2002b,a). The details for these methods are
presented in Keerthi et al. (2008).
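A hedged sketch of the OVA scheme is shown below using scikit-learn's OneVsRestClassifier around a linear SVM; the thesis relies on a LIBLINEAR-based implementation following Keerthi et al. (2008), so this is only meant to show the structure of the scheme, with made-up data.

```python
# One-vs-all scheme: one binary linear SVM per aspect category, each separating
# that category from the rest. Illustrative data and labels.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[1, 0, 0, 1],    # toy word-context vectors
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0]], dtype=float)
y = np.array(["food", "food", "service", "ambience"])  # category per instance

ova = OneVsRestClassifier(LinearSVC(C=1.0))
ova.fit(X, y)                  # trains N = |C| binary classifiers
print(len(ova.estimators_))    # 3, one classifier per category
print(ova.predict(X[:1]))      # predicted category for the first instance
```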
4.4.3 Strict One-Vs-All Support Vector Machines Method
The first method we present is based on the framework we presented in
Section 4.1. In this method we choose to use a simple multi-class SVM algorithm with
an OVA classification scheme. We previously stated that in order to use SVM algorithms
we must convert the words in all sentences into a set of instances I before we can
train the SVM algorithm. The pseudo-code for the process is presented in Algorithm 1.
The first thing to note about the instances produced by Algorithm 1 is that they are
very sparse: the number of non-zero entries in an instance xi, with entries in {0, 1}, is much
smaller than n, where n denotes the number of attributes. For this reason scaling of an instance is very important. One
advantage of this is that it reduces the complexity of the SVM calculations (Hsu et al., 2003).
Algorithm 1 Instance builder
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Ensure: I: a set of instances
1: procedure instanceBuilder(Si)
2: L ← array of lemmas for all words in Si
3: A ← array of pre-labeled aspect terms in Si
4: Initialize set of instances I
5: for all lemma l ∈ L do
6: initialize set of n-grams N
7: p ← pos tag for lemma l
8: c ← chunk tag for lemma l
9: if p ∈ P then    ▷ check the POS tag against the list defined by the filter
10: for j = 1 to G do
11: Nj ← buildNGrams(l,p,c,L)
12: add Nj to N
13: for all aspects a ∈ A do
14: Na ← buildNGrams(a,p,c,L)
15: add Na to N
16: end for
17: end for
18: Il ← define instance Il with Il(j) = 1 if X(j) ∈ N, for j = 1, . . . , |X|
19: if Il ≠ ∅ then
20: Scale Il and add to I
21: end if
22: end if
23: end for
24: return I
25: end procedure
To get a scaled instance we use its unit vector ex, calculated as follows:
\[
e_x = \frac{x}{\lVert x \rVert}
\]
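A minimal Java sketch of this scaling step, assuming a sparse instance stored as a map from attribute index to value:

    import java.util.Map;

    final class InstanceScaler {
        // Scale a sparse instance to unit length: e_x = x / ||x||.
        static void scaleToUnitLength(Map<Integer, Double> instance) {
            double norm = 0.0;
            for (double v : instance.values()) norm += v * v;
            norm = Math.sqrt(norm);
            if (norm == 0.0) return;                 // empty instance, nothing to scale
            for (Map.Entry<Integer, Double> e : instance.entrySet()) {
                e.setValue(e.getValue() / norm);
            }
        }
    }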
To further keep the number of attributes at a minimum, we apply a so-called part-
of-speech filter. Only when a word is tagged with a POS tag defined in the filter will it
be considered for instance building. This limits the number of words used
for instance building. The part-of-speech filters that are considered in this research are
presented in Appendix A. The final part-of-speech filter is selected in Section 5.4.
We know that SVM classification is done for each lemma l in the sentence s. This
gives us an array of predicted aspect categories ˜c with |˜c| = |l|. The final set of predictions
for sentence s is defined as ¯c = {˜c|˜c ∈ C}. We also use n¯c to denote the number of words
we predicted that imply sentence s contains category c. Here we see that the possibility
exists to predict an aspect category in a sentence based on only one word. This will
lead to a higher number of false positives. To avoid this behavior, we introduce a
threshold to limit the number of predictions of category c relative to the length of the
sentence. Equation 4.5 shows how the threshold tc for class c is evaluated:
\[
t_c \le \frac{n_{\bar{c}}}{|l|} \tag{4.5}
\]
Equation 4.5 tells us that a sentence is labeled as mentioning category c only if the
relative number of lemmas that were classified as related to category c is larger than some
threshold. To train this threshold, we apply a simple linear algorithm that incrementally
raises the value of tc for category c ∈ C and chooses the value for tc that maximizes an
evaluation metric for the classification process over all sentences in the review training
set. In this thesis we use the F1-measure as the performance measure to maximize. The
definition of the F1-measure is presented in Section 5.3.
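A sketch of this linear threshold search is given below; the evaluateF1 callback, which computes the F1-measure over the training sentences for a candidate threshold of category c, and the step size are illustrative assumptions.

    import java.util.function.DoubleUnaryOperator;

    final class ThresholdTrainer {
        // Hypothetical linear search for the per-category threshold t_c that maximizes F1.
        static double train(DoubleUnaryOperator evaluateF1, double step) {
            double bestThreshold = 0.0;
            double bestF1 = evaluateF1.applyAsDouble(0.0);
            for (double t = step; t <= 1.0; t += step) {  // t_c is a fraction of the sentence length
                double f1 = evaluateF1.applyAsDouble(t);
                if (f1 > bestF1) {
                    bestF1 = f1;
                    bestThreshold = t;
                }
            }
            return bestThreshold;
        }
    }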
Now that we know how to build instances for training the SVM classifier, we present
the algorithm to train the SVM classifiers using the OVA scheme for multi-class classi-
fication in Algorithm 2.
Algorithm 2 Strict OVA SVM classifier training algorithm
Require: S: set of annotated sentences
Require: X: list of Attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Ensure: M: the trained SVM classifier
Ensure: thresholds: vector with threshold values for the relative number of times
category c ∈ C was classified in sentence s
1: procedure SVM classifier training(S)
2: Initialize training dataset D
3: for all sentence s ∈ S do
4: Y ← list of unique aspect categories for sentence s
5: for all aspect categories y ∈ Y do
6: I ← instanceBuilder(s)
7: add [y, I] to D
8: end for
9: end for
10: M ← trainClassifier(D)
11: thresholds ← trainThreshold(S,M)    ▷ simple linear search algorithm
12: end procedure
Algorithm 3 introduces the prediction process for the method employing the strict
OVA scheme.
Algorithm 3 Strict OVA SVM classifier prediction algorithm
Require: S: set of test sentences
Require: X: list of Attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Require: F: set of predefined aspect categories
1: procedure process OVA Classification scheme on test set(S)
2: for all sentence s ∈ S do
3: initialize fy = 0 with y ∈ F
4: I ← instanceBuilder(s)
5: for all instance i ∈ I do
6: y ← M(i)    ▷ M is the classifier trained in Algorithm 2
7: fy = fy + 1
8: end for
9: for all y ∈ F do
10: if fy/|s| ≥ thresholds(y) then    ▷ |s| denotes the number of words in sentence s
11: Annotate y as an aspect category for sentence s
12: end if
13: end for
14: end for
15: end procedure
The output of Algorithm 3 is a classification vector y with |y| ≥ 1 for each sentence.
Example Figure 4.2 shows a flowchart of an example of the OVA method. The figure
for the SVM Classifier is not representative of a true SVM hyperplane for the example
problem.
Figure 4.2: Flowchart showing an example of the OVA scheme based method
4.4.4 Two-Stage Classification Scheme Support Vector Machines Method
The second method we present in this thesis is an extension of the first method presented
in Section 4.4.3. The disadvantage of the method based on OVA is that it uses the same
classifier scheme with the same features to predict all predefined aspect categories which
can lead to a higher probability of wrongly classifying a sentence with the most common
category.
The extension presented next will include a binary classifier with the sole job of
predicting whether a sentence s contains aspect categories or not. If the classifier predicts
that the sentence s contains ≥ 1 categories, we proceed by applying the OVA scheme
introduced in Algorithm 3 on the sentences that are predicted to contain an aspect
category. The proposed extension of the method presented in Section 4.4.3 is presented
in Algorithm 6.
Algorithm 4 Instance builder two-stage classification step 1
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Ensure: I: a set of instances
1: procedure instanceBuilder(L)
2: L ← array of lemmas for all words in Si
3: A ← array of pre-labeled aspect terms in Si
4: Initialize set of instances I
5: initialize set of n-grams N
6: p ← pos tag for lemma l
7: c ← chunk tag for lemma l
8: if p ∈ P then    ▷ check the POS tag against the list defined by the filter
9: for j = 1 to G do
10: Nj ← buildNGrams(l,p,c,L)
11: add Nj to N
12: end for
13: Il ← define instance Il with Il(j) = 1 if X(j) ∈ N, for j = 1, . . . , |X|
14: if Il ≠ ∅ then
15: Scale Il and add to I
16: end if
17: end if
18: return I
19: end procedure
The instances created to train the first classifier (C0) are just a rough collection of all
n-grams formed by all words in a given sentence s. With this additional step, the hope
is that the Two-Stage approach will further reduce the number of false positive predictions when
Algorithm 7 is used to process a sentence.
Algorithm 5 Instance builder two-stage classification step 2
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Ensure: I: a set of instances
1: procedure instanceBuilder(Si)
2: L ← array of lemmas for all words in Si
3: A ← array of pre-labeled aspect terms in Si
4: Initialize set of instances I
5: for all lemma l ∈ L do
6: initialize set of n-grams N
7: p ← pos tag for lemma l
8: c ← chunk tag for lemma l
9: if p ∈ P then    ▷ check the POS tag against the list defined by the filter
10: for j = 1 to G do
11: Nj ← buildNGrams(l,p,c,L)
12: add Nj to N
13: end for
14: Il ← define instance Il with Il(j) = 1 if X(j) ∈ N, for j = 1, . . . , |X|
15: if Il ≠ ∅ then
16: Scale Il and add to I
17: end if
18: end if
19: end for
20: return I
21: end procedure
The classifier C0 in Algorithm 7 functions as a filter to only classify sentences that
possibly contain aspect categories. Just as before, the output of this algorithm is a
classification vector y with |y| ≥ 1 for each sentence.
Algorithm 6 Two-Stage Classification Scheme training algorithm
Require: S: set of annotated sentences
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Ensure: C0: the trained SVM classifier for the first stage
Ensure: C1: the trained SVM classifier for the second stage
Ensure: thresholds: vector with threshold values for the relative number of times
category c ∈ C was classified in sentence s
1: procedure Training Two-Stage Classification Scheme method on anno-
tated sentences(S)
2: Initialize training dataset D0
3: Initialize training dataset D1
4: for all sentence s ∈ S do
5: Y ← list of unique aspect categories for sentence s
6: if Y ≠ {“miscellaneous”} then
7: I0 ← instanceBuilder(L)    ▷ build an instance from the set of all n-grams in the sentence (Algorithm 4)
8: add [“OTHER”, I0] to D0
9: for all aspect category y ∈ Y do
10: I1 ← instanceBuilder(s)
11: add [y, I1] to D1
12: end for
13: else
14: add [“miscellaneous”, I0] to D0
15: end if
16: end for
17: C0 ← trainClassifier(D0)
18: C1 ← trainClassifier(D1)
19: thresholds ← trainThreshold(S,C0,C1)    ▷ the simple linear search algorithm discussed earlier
20: end procedure
Algorithm 7 Two-Stage Classification Scheme method prediction algorithm
Require: S: set of test sentences
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by POS filter
Require: G: integer for number of grams to extract
Require: F: set of predefined aspect categories
Require: C0, C1: the classifiers trained in Algorithm 6
1: procedure process Two-Stage Classification scheme on test set(S)
2: for all sentence s ∈ S do
3: y0 ← C0(instanceBuilder(L))    ▷ sentence-level instance built as in Algorithm 4
4: if y0 ≠ “miscellaneous” then
5: initialize fy = 0 with y ∈ F
6: I ← instanceBuilder(s)
7: for all instance i ∈ I do
8: y1 ← C1(i)
9: fy1 ← fy1 + 1
10: end for
11: for all y ∈ F do
12: if fy/|s| ≥ thresholds(y) then
13: Annotate y as an aspect category for sentence s
14: end if
15: end for
16: else
17: Annotate “miscellaneous” as an aspect category for sentence s
18: end if
19: end for
20: end procedure
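The control flow of Algorithm 7 can be summarized in a few lines of Java. The sketch below is a simplified illustration in which the two trained classifiers and the instance builders are passed in as plain functions; it is not the actual implementation used in this thesis.

    import java.util.*;
    import java.util.function.Function;

    final class TwoStagePredictor {
        // stage1 maps the sentence-level instance to "miscellaneous" or "OTHER";
        // stage2 maps a word-level instance to one of the predefined categories.
        static Set<String> predict(double[] sentenceInstance, List<double[]> wordInstances,
                                   Function<double[], String> stage1, Function<double[], String> stage2,
                                   Map<String, Double> thresholds, int sentenceLength) {
            Set<String> annotations = new HashSet<>();
            if (!"miscellaneous".equals(stage1.apply(sentenceInstance))) {
                Map<String, Integer> counts = new HashMap<>();
                for (double[] i : wordInstances) counts.merge(stage2.apply(i), 1, Integer::sum);
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    double relative = (double) e.getValue() / sentenceLength;
                    if (relative >= thresholds.getOrDefault(e.getKey(), 0.0)) annotations.add(e.getKey());
                }
            } else {
                annotations.add("anecdotes/miscellaneous");
            }
            return annotations;
        }
    }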
Example Figure 4.3 shows a flowchart of an example of the Two-Stage Classification
scheme method. The figure for the SVM Classifier is not representative of a true SVM
hyperplane for the example problem.
Figure 4.3: Flowchart showing an example of the two-stage classification scheme
based method
Chapter 5
Evaluation
In this chapter we give an overview of the experiment setup. First, we will present
the system architecture of the experiment in Section 5.1. Then in Section 5.2 we present
the consumer review data presented at SemEval 2014 (Pontiki et al., 2014). The data
consists of a corpus of consumer reviews for restaurants from Citysearch New York Ganu
et al. (2009). To validate the results from the experiment we use a training and test
set as provided by SemEval 2014 (Pontiki et al., 2014). The two methods we proposed
require some form of parameter selection and tuning. In Section 5.4 we give an overview
of the parameters that need to be tuned. The performance of the proposed methods
will be compared to some baselines. The two baselines are the Dominant Aspect Category
Tagger and the Random Aspect Category Tagger, which are formally introduced
in Sections 5.5.1 and 5.5.2. We will also compare the performance of our methods with some
methods from the literature. These methods are those that have been developed for the
SemEval 2014 (Pontiki et al., 2014) competition.
5.1 System Architecture
In this section we will give a visual overview of the implementation of the methods
presented in the previous section. In this thesis we used the Java programming lan-
guage to implement the methods we proposed. To discuss the Java libraries used in the
implementation of the methods, we will use the visual representation as a reference.
A summarized overview of the proposed methods is presented in Figure 5.1. The
process will draw the data used for evaluation and split the data into a set of training
sentences and a set of test sentences. The first process in Figure 5.1 is the process of
converting the words in a sentence into instances. The output is a dataset where the
Figure 5.1: A general overview of training and prediction processes implemented in
this thesis.
(targeted) words represent a point in the feature-space, as defined in Section 4.2, with
their corresponding labeled category. Figure 5.2 shows the process of converting the
words in a given sentence into instances.
Spell checker The process in Figure 5.2 performs spell checking. The method used
in this thesis is presented in (Naber, 2003; Milkowski, 2010). The Java implementation
of the method presented in (Naber, 2003; Milkowski, 2010) is called JLanguageTool1.
The advantage of using the JLanguageTool is in that it not only checks for the best
word match given a dictionary of correctly spelled words, but the JLanguageTool also
uses a corpus of grammatical pattern rules to determine the correct word to replace a
misspelled word with.
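A minimal usage sketch of the JLanguageTool Java API is given below; class and method names follow the API as distributed around the time of writing and may differ between releases, so this should be read as an illustration rather than as the exact code used.

    import java.util.List;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.AmericanEnglish;
    import org.languagetool.rules.RuleMatch;

    final class SpellCheckExample {
        public static void main(String[] args) throws Exception {
            JLanguageTool tool = new JLanguageTool(new AmericanEnglish());
            // check() applies both the spelling dictionary and the grammatical pattern rules.
            List<RuleMatch> matches = tool.check("The fod was delicius but the service was slow.");
            for (RuleMatch match : matches) {
                System.out.println("Possible error at " + match.getFromPos() + "-" + match.getToPos()
                        + ", suggestions: " + match.getSuggestedReplacements());
            }
        }
    }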
POS Tagger One of the core methods to employ in natural language processing is
part-of-speech tagging. There are many part-of-speech taggers available that are ready
to use. Most taggers are trained on an annotated text corpus (e.g., the Penn Treebank
and the British National Corpus Marcus et al. (1993); Leech et al. (1994)). The POS
1 The JLanguageTool API can be found at http://wiki.languagetool.org/java-api
Figure 5.2: Overview of the process of converting a sentence into a set of instances
tagger used in this thesis is the tagger included in the Stanford CoreNLP (Manning et al.,
2014) Java API2.
Word Lemma To bring a word down to its lemma form we use the lemmatizer
available in the Stanford CoreNLP (Manning et al., 2014) Java API.
2 The Stanford CoreNLP Java API can be found at http://nlp.stanford.edu/software/corenlp.shtml
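A sketch of how the POS tagger and lemmatizer of the Stanford CoreNLP Java API can be invoked is shown below; the annotator names follow the CoreNLP 3.x conventions, and the example sentence is only illustrative.

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    final class TagAndLemmatize {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("The scallops had a great taste to them.");
            pipeline.annotate(document);

            for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                    System.out.println(token.word() + "\t" + pos + "\t" + lemma);
                }
            }
        }
    }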
Chunker In this thesis we perform chunking with the chunker available in the
OpenNLP project Baldridge (2005) Java API. In this thesis we use the default model
for the chunker in the OpenNLP Java API. The model is trained on the data presented
in Tjong Kim Sang and Buchholz (2000).
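A usage sketch of the OpenNLP chunker is given below, assuming the pre-trained en-chunker.bin model file is available locally; the file name and the example tokens are assumptions for illustration.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;

    final class ChunkExample {
        public static void main(String[] args) throws Exception {
            try (InputStream modelIn = new FileInputStream("en-chunker.bin")) {   // assumed model path
                ChunkerME chunker = new ChunkerME(new ChunkerModel(modelIn));
                String[] tokens = {"The", "scallops", "had", "a", "great", "taste", "."};
                String[] posTags = {"DT", "NNS", "VBD", "DT", "JJ", "NN", "."};
                String[] chunkTags = chunker.chunk(tokens, posTags);              // e.g. B-NP, I-NP, B-VP, ...
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "\t" + chunkTags[i]);
                }
            }
        }
    }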
Feature Selection In this thesis we apply the Information Gain method to do feature
selection. The Java implementation of the method we used is the Information Gain
Feature selection method in the Weka Machine Learning library (Hall et al., 2009).
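A sketch of Information Gain based attribute ranking with the Weka API follows; the ARFF file name and the number of attributes to keep are assumptions for illustration.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    final class InfoGainExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("instances.arff");      // assumed input file
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection selection = new AttributeSelection();
            selection.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(1000);                              // illustrative cut-off
            selection.setSearch(ranker);
            selection.SelectAttributes(data);

            for (int index : selection.selectedAttributes()) {
                System.out.println(data.attribute(index).name());
            }
        }
    }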
SVM Classification Algorithm In Section 4.3 we determined that the number of
features can be quite large. For this reason the authors in Fan et al. (2008) propose to
use a linear SVM. Fan et al. (2008) also show the difference in training time between
linear and non-linear SVM kernel functions for problems with a large number of features
and instances. For this reason an SVM with a linear kernel is used in both methods. In
this thesis we use a Java version of the C++ API presented in Fan et al. (2008). For
multi-class classification we use the default OVA scheme used in Fan et al. (2008), which
is an implementation of the OVA method discussed in Keerthi et al. (2008).
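A minimal training and prediction sketch with a Java port of the LIBLINEAR library of Fan et al. (2008) is shown below; the package name (de.bwaldvogel.liblinear) is an assumption about the specific distribution used, and the tiny toy problem is only illustrative.

    import de.bwaldvogel.liblinear.Feature;
    import de.bwaldvogel.liblinear.FeatureNode;
    import de.bwaldvogel.liblinear.Linear;
    import de.bwaldvogel.liblinear.Model;
    import de.bwaldvogel.liblinear.Parameter;
    import de.bwaldvogel.liblinear.Problem;
    import de.bwaldvogel.liblinear.SolverType;

    final class LinearSvmExample {
        public static void main(String[] args) {
            Problem problem = new Problem();
            problem.l = 2;                                    // number of training instances
            problem.n = 3;                                    // number of features
            problem.x = new Feature[][] {
                { new FeatureNode(1, 1.0), new FeatureNode(3, 1.0) },
                { new FeatureNode(2, 1.0) }
            };
            problem.y = new double[] { 1, -1 };               // class labels

            // L2-regularized L2-loss SVC corresponds to the primal problem in Equation 4.3.
            Parameter parameter = new Parameter(SolverType.L2R_L2LOSS_SVC, 1.0, 0.01);
            Model model = Linear.train(problem, parameter);

            Feature[] instance = { new FeatureNode(1, 1.0) };
            System.out.println("Predicted label: " + Linear.predict(model, instance));
        }
    }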
5.2 Restaurant Review Corpus
The restaurant review dataset (Ganu et al., 2009; Pontiki et al., 2014) consists of a col-
lection of reviews for restaurants in New York. In this thesis we will use a training
set of approximately 3000 review sentences and a test set of approximately 800 review
sentences. The sentences are manually annotated with aspect terms. Each sentence is
also annotated with aspect categories.
The training dataset from (Ganu et al., 2009; Pontiki et al., 2014) has 5 predefined
aspect categories: ‘service’, ‘ambiance’, ‘food’, ‘price’ and ‘anecdotes/miscellaneous’.
The distribution of the number of each category in the set of review sentences in the
dataset is presented in Figure 5.3. The category with the highest frequency is ‘food’.
The frequencies of ‘food’ and ’anecdotes/miscellaneous’ are about twice the frequency
of the other categories. This will have an effect on the number of false positives when
we run experiments on the methods presented in the previous chapter.
A sentence can also have more than one labeled aspect category. This complicates
matters because the proposed system must be able to predict up to the number
of predefined categories. As mentioned before this problem is solved by using multi-
class SVM classification schemes and that all words are processed to construct a set of
Figure 5.3: Distribution of the aspect categories in the training dataset
predicted categories. Figure 5.4 gives an overview of the distribution of the number of
aspect categories in a sentence.
Figure 5.4: Distribution of the number of aspect categories in a sentence
The distribution of the aspect categories in the test dataset provided by (Ganu et al.,
2009; Pontiki et al., 2014) is presented in Figure 5.5.
Here it is obvious that the portion of aspects tagged with the ‘anecdotes/miscellaneous’
label is much lower when compared to the training set. The impact of this shift is that the
Figure 5.5: Distribution of the aspect categories in the test dataset
methods might over classify the ‘anecdotes/miscellaneous’ category to the words in a
sentence.
5.3 Evaluation Metrics
To evaluate the output of the presented methods and the comparative algorithms, some
evaluation metrics are defined. Table 5.1 introduces the four prediction vs. actual
outcomes. In this research the outcomes are defined as follows:
                                    predicted class
                              true                    false
actual class     true         TP                      FN (Type II error)
                 false        FP (Type I error)       TN

Table 5.1: Confusion table for classification problems
• True Positive (TP): the algorithm has correctly predicted a category in a sentence.
• False Negative (FN): the algorithm has not predicted a category that is present in
the annotated sentence (Type II error).
• False Positive (FP): the algorithm has predicted an aspect category that is not
present in the sentence (Type I error).
A TP is given only when the algorithm predicts the same category as the annotated
category in a sentence. This means that wrongly predicting an aspect category is not
only affecting the FP count but also the FN count. This formulation of FP and FN dictates that
we must count such a prediction as both an FP and an FN: an FP because the predicted category
is not in the sentence, and FN because the aspect of the annotated sentence was not
predicted. For this reason the performance measures such as precision and recall will
be affected. Precision and recall are presented in Equations (5.1) and (5.2):
\[
\text{precision} = \frac{TP}{TP + FP} \tag{5.1}
\]
\[
\text{recall} = \frac{TP}{TP + FN} \tag{5.2}
\]
When looking at equations (5.1) and (5.2) we can see the original definitions of FP
and FN can lead to lower values for these performance measures. This is because one
misclassification of the algorithm can increase the FP and FN counts, thus lowering
both precision and recall.
In this research we want to maximize both of these performance metrics. A very high
precision score may result from an algorithm that is too conservative in its predictions.
This means that if the algorithm does not classify a category there will be no effect on
the precision, which results in low values for recall. Vice versa, if we have a high recall
score, the algorithm can be too liberal in its predictions. To maximize both measures
we look at the harmonic mean of recall and precision, known as the F1-measure (Tan
et al., 2006).
\[
F_1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{5.3}
\]
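As a minimal illustration of Equations (5.1)-(5.3), the following Java method computes the three scores from the raw counts:

    final class EvaluationMetrics {
        // Compute precision, recall, and F1 from the TP, FP, and FN counts.
        static double[] compute(int tp, int fp, int fn) {
            double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
            double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
            double f1 = (2.0 * tp + fp + fn) == 0 ? 0.0 : 2.0 * tp / (2.0 * tp + fp + fn);
            return new double[] { precision, recall, f1 };
        }
    }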
5.4 Parameter Selection
This section discusses the parameters that are pre-selected or tuned in both methods
presented in this thesis. First a part-of-speech filter is applied to words extracted from a
sentence. An example of a part-of-speech filter is one in which only the nouns are extracted
from a given tagged sentence. In this research this filter rule is denoted as ”NN”.
Appendix A lists all part-of-speech filters considered.
The two proposed methods extract information by constructing a set of 1-, 2- and 3-
grams of the neighboring words of the word wi being processed. The neighboring words
selected to construct the n-gram sets of word wi are not subject to filtering based on their
part-of-speech tags. Section 4.2.2 introduced the n-grams to the attribute space. In this
section the optimal value for n in the n-grams is determined by comparing the F1 values
for both methods using 1-, 2- and 3-grams as input parameters. Further we will determine
if thresholds must be set for each pre-defined category as is described in Algorithms 3
for the method based on a strict OVA classification scheme and 7 for the method based
on a two-stage classification scheme.
To determine the optimal parameter setup for the proposed methods, we will run
the trained models for the two methods on the test set. The results are presented in
Figures 5.6 and 5.9 for the Strict OVA method and the Two-Stage approach, respectively,
without threshold training. When no threshold is trained we use a default value of 0 for
all threshold values. Figures 5.7 and 5.8 present the results for the Strict OVA method and
the Two-Stage approach, respectively, with threshold training.
5.4.1 Part Of Speech Filter
The results in Figures 5.7 - 5.9 show that any POS-filter that allows nouns to be extracted
seems to result in higher performance for the Strict OVA Aspect Category Detection
method. This seems to reinforce the research presented in Nakagawa and Mori (2002).
Next to nouns the most important word type seems to be the adjective. This makes
sense in that the role of an adjective is defined as ”a describing word, the main syntactic
role of which is to qualify a noun or noun phrase”. This tells us that extracting an
adjective results in some indication that an aspect category is being discussed. The word
type that contains the least amount of information about aspect categories is the adverb.
An adverb is generally used as a modifier for verbs, adjectives, nouns, and noun phrases.
The adverb is principally used with verbs. This knowledge combined with the results
from Nakagawa and Mori (2002) explains the result that adverbs perform poorly for
detecting aspects that appear implicitly.
5.4.2 N-Grams
The influence of the size of n-grams extracted can be seen in Figures 5.6 - 5.9.
For the Strict OVA Aspect Category Detection method the n-grams seem to follow
the reasoning developed in Section 4.2.2. Only when the ‘only NN JJ’ filter is applied,
as shown in Figure 5.6, does the 1-gram seem to perform better than the other n-grams. This can
be explained by the fact that nouns are often used for describing aspects that appear
implicitly Nakagawa and Mori (2002), and that adjectives describe nouns.
Figure 5.6: Results for The Strict OVA Aspect Category Detection method with 1-,2-
and 3-grams without a trained threshold for each individual aspect category
Figure 5.7: Results for The Strict OVA Aspect Category Detection method with 1-,2-
and 3-gram with a trained threshold for each individual aspect category
Figure 5.8: Results for the Two-Stage Classification Scheme method with 1-,2- and
3-grams with a trained threshold for each individual aspect category
Figure 5.9: Results for Two-Stage Classification Scheme method with 1-,2- and 3-
grams without a trained threshold for each individual aspect category
The results for the Two-Stage Classification Scheme method paint a different picture
for the importance of the length of an extracted n-gram. When nouns are filtered out
of the extracted lemmas the result behaves more or less according to the reasoning
presented in Section 4.2.2 and observed for the basic OVA based method. This could
be due to the fact that the first classification stage discriminates better between sentences
with and without aspect categories, and thus extracting information from the neighboring
words is less important. For the Two-Stage Classification Scheme based
method, the best results seem to be obtained with unigrams.
5.4.3 Threshold vs. No Threshold
To test the effect of training a threshold on the number of predictions for a certain
category in a sentence we look at the arithmetic difference of the F1-score of the methods
with and without threshold. Figures 5.10a and 5.10b present the arithmetic difference of
F1, recall and precision scores for the two methods with and without threshold trained.
(a) OVA scheme based (b) Two-Stage scheme based
Figure 5.10: Arithmetic difference of F1 scores for the OVA based method and the
two-stage method with and without threshold
When it comes to threshold training, the threshold step seems to increase the
performance when nouns are included in the filter. This can be attributed to the fact
that nouns are important in aspect category detection. In the case of the method based
on the OVA scheme, this would mean that there would be many predictions for the
dominant feature when nouns are extracted. Limiting the number of FP hits with the
threshold improves precision and in turn improves the F1 measure. Figure 5.11 shows
the difference in precision when threshold training is applied and when it is not. The addition
of the trained threshold generally improves the precision score. The reasoning for the
threshold is really reflected in these two figures.
(a) OVA scheme based (b) Two-Stage scheme based
Figure 5.11: Arithmetic difference of Precision scores for the OVA based method and
the two-stage method with and without threshold
To test how restrictive the threshold would be for category detection the difference
between recall scores, for the algorithms with and without threshold, are presented in
Figure 5.12. From the results in Figure 5.12 we can see that adding a trained threshold
(a) OVA scheme based (b) Two-Stage scheme based
Figure 5.12: Arithmetic difference of Recall scores for the OVA based method and
the two-stage method with and without threshold
filters out some TP hits to compensate for the number of FP hits. This results in a
decrease in the recall scores, which is to be expected given the fact that the threshold is trained
by maximizing the F1 score. From the results in Section 5.4 and Figures 5.11 and 5.12
we can conclude that, overall, training a threshold to reduce the FP count improves
the performance of the proposed methods. Even though a higher FN count is expected and
observed, the performance increase with respect to the F1-measure is mostly due to the
reduction of FP. We can also see that not all part-of-speech filters show an increase
in performance; these decreases happen when nouns are omitted from the sentence by
the filter. We do not consider these results further in our explanation of the results.
5.4.4 OVA Scheme based vs. Two-Stage Scheme based
Sections 4.4.3 and 4.4.4 proposed two methods for aspect category detection. The algorithm
proposed in Section 4.4.3 is one where a single classification step is performed; this step
uses the same instances to classify whether a sentence is labeled as having an aspect category
or not. The second algorithm, proposed in Section 4.4.4, adds an extra classification
step to first predict whether a sentence has ≥ 1 aspect categories or none. Figure 5.13
shows the results of $F_1^t - F_1^o$, where $F_1^t$ denotes the F1 score for the method based on the
two-stage classification scheme and $F_1^o$ that of the OVA scheme based method, to show the
impact of training a separate classifier to find sentences with or without labeled aspect
categories.
(a) no trained threshold (b) trained threshold
Figure 5.13: Arithmetic difference of F1 scores for the OVA based method and the
two-stage method with and without threshold
The results in Figure 5.13 show that adding an additional classifier to specifically
predict if a sentence contains an implicit feature results in an overall improvement in
performance for the F1 score. Especially when only unigrams (1-gram) are extracted as
attributes, the performance improvement seems to be the largest. This could be due to the fact
that the binary classifier for detecting aspect categories discards sentences
that may not refer to an aspect category. This in turn lowers the FP count and thus
increases the F1-score. To show the effect on the number of FP predictions, Figure 5.14 presents
the arithmetic difference between the precision scores, $\text{precision}^t - \text{precision}^o$.
The results in Figure 5.14 suggest that adding the extra classifier has a large impact on
the number of FPs. This result confirms the reasoning for adding an extra classifier presented
in Section 4.4.4.
When we look at the arithmetic difference between the recall scores, $\text{recall}^t - \text{recall}^o$,
in Figure 5.15, we can see the effect on the number of FNs of including an extra
classifier to filter out sentences that may not contain references to aspect categories.
(a) no trained threshold (b) trained threshold
Figure 5.14: Arithmetic difference of precision scores for the OVA based method
and the two-stage method with and without threshold
(a) no trained threshold (b) trained threshold
Figure 5.15: Arithmetic difference of recall scores for the OVA based method and
the two-stage method with and without threshold
Figure 5.15a shows how adding the second classifier increases the number of FN
predictions and in turn lowers the recall score. This is due to the fact that the Two-
Stage method might more easily classify a sentence as having no references to aspect
categories thus increasing the probability that sentences with labeled aspect categories
might never be processed by the second classifier scheme in this method.
On the face of it the results in Figure 5.15b seem to go against the reasoning previ-
ously given. But on closer inspection, the threshold trained in the OVA scheme based
method can be more restrictive on whether or not to annotate a sentence as having an
aspect category. This in turn will give high FN counts, thus a low recall score. In the
method based on the Two-Stage scheme the threshold only has an effect on sentences that
are classified as having aspect categories.
5.4.5 Parameter Tuning
For comparison purposes Table 5.2 shows the parameter settings for the OVA based method
and the two-stage method that will be used when comparing the performance of the
methods presented in this research and some comparative algorithms. The parameters
have been selected by using the parameter settings that result in the highest value of
F1.
                        OVA based       Two-Stage based
Parameters
  pos-filter            NN VB JJ        NN VB JJ
  n-gram                3               1
  threshold             true            true
Results
  F1                    0.665           0.772
  precision             0.618           0.765
  recall                0.718           0.779

Table 5.2: Final parameters for the OVA scheme based method and the method based
on a Two-Stage classification scheme
5.5 Algorithm Evaluation
In this section we will evaluate how our methods perform. We will compare our methods
with some baseline category detection methods and three aspect category detection
methods from the literature. In Sections 5.5.1 and 5.5.2 we introduce the Dominant Aspect
Category Tagger and the Random Aspect Category Detector, respectively.
5.5.1 Dominant Aspect Category Tagger
The Dominant Aspect Category Tagger (a “lazy” feature extractor) is simply an algorithm trained by determining the most
frequent aspect category in the training dataset. When annotating a test sentence, it
simply assigns the aspect category determined in the training stage. The pseudo-code
for training and processing this algorithm is given in Algorithms 8 and 9, respectively.
5.5.2 Random Aspect Category Detector
Another baseline algorithm we are interested in is one that randomly assigns aspect
categories to sentences. This algorithm is trained by determining the probability of an
Algorithm 8 Dominant Aspect Category Tagger training algorithm
1: Input: S: set of annotated sentences
2: procedure Training Dominant Aspect Category Tagger on annotated
sentences(S)
3: Initialize best feature F
4: Initialize feature count vector f = 0
5: for all sentence s ∈ S do
6: Y ← list of unique aspect categories for sentence s
7: for all aspect categories y ∈ Y do
8: fy + +
9: end for
10: end for
11: F ← arg maxy∈Y fy
12: end procedure
Algorithm 9 Dominant Aspect Category Tagger prediction algorithm
1: Input: S: set of test sentences
2: F: most common aspect category from the training stage
3: procedure process Lazy Feature Extractor on test set(S)
4: for all sentence s ∈ S do
5: Annotate F as an aspect category for sentence s
6: end for
7: end procedure
aspect category by
\[
P_y = \frac{\sum_{s \in S} f_{y,s}}{n} \tag{5.4}
\]
where Py is the probability of category y, fy,s = 1 if category y is in sentence s and
n is the total number of aspect categories plus the number of sentences with no aspect
categories in training set S. The training and processing algorithms for the Weighted
Random Aspect Category Detector are presented in Algorithms 10 and 11 respectively.
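A sketch of the prediction step of this baseline is given below; it assumes the per-category probabilities Py of Equation 5.4 have already been estimated on the training set, and it assigns at most one category per sentence (the remaining probability mass corresponds to sentences without a category).

    import java.util.*;

    final class WeightedRandomTagger {
        private final Map<String, Double> categoryProbabilities;   // P_y estimated with Equation 5.4
        private final Random random = new Random();

        WeightedRandomTagger(Map<String, Double> categoryProbabilities) {
            this.categoryProbabilities = categoryProbabilities;
        }

        // Draw at most one category per sentence, weighted by the estimated probabilities.
        Set<String> annotate() {
            Set<String> annotations = new HashSet<>();
            double u = random.nextDouble();
            double cumulative = 0.0;
            for (Map.Entry<String, Double> e : categoryProbabilities.entrySet()) {
                cumulative += e.getValue();
                if (u < cumulative) {
                    annotations.add(e.getKey());
                    break;
                }
            }
            return annotations;   // may be empty: remaining probability mass corresponds to "no category"
        }
    }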
5.5.3 Algorithm Comparison
To compare against the baseline algorithms presented in the previous section, we report the
results of applying the methods to the restaurant test data provided by SemEval
2014 (Pontiki et al., 2014). The results in Table 5.3 show the
performance of all previously mentioned baseline methods and methods from literature,
compared to the methods proposed in this thesis. The settings selected for the two
methods are presented in Section 5.4.5. The best performing algorithm is used as the
benchmark.
Algorithm 10 Random Aspect Category Detector training algorithm
1: Input: S: set of annotated sentences
2: procedure Weighted Random Aspect Category Detector on annotated
sentences(S)
3: Initialize category count vector f = 0
4: for all sentence s ∈ S do
5: Y ← list of unique aspect categories for sentence s
6: for all aspect category y ∈ Y do
7: fy + +
8: end for
9: end for
10: for all aspect categories y ∈ Y do
11: Py ← fy/n    ▷ Equation 5.4
12: end for
13: end procedure
Algorithm 11 Weighted Random Aspect Category Detector prediction algorithm
1: Input: S: set of test sentences
2: P: category probabilities estimated in the training stage (Equation 5.4)
3: procedure process Weighted Random Aspect Category Detector on
test set(S)
4: for all sentence s ∈ S do
5: Annotate sentence s with at most one aspect category y, drawn with probability Py
6: end for
7: end procedure
The first thing to conclude from Table 5.3 is that the Two-Stage method outperforms
the OVA based method by roughly 11 percentage points on the F1-score. Both the precision and recall scores are
increased by adding a second stage to the method. Following the definition of precision, we
can conclude that adding an SVM classifier that detects sentences with or without aspect
categories decreases the false positive count. The Random Aspect Category Tagger
introduced in Section 5.5.2 performs the worst on all measures. This could be down to the fact
that sentences can contain multiple aspect categories (Figure 5.4). The Random Aspect
Category Tagger only classifies a sentence as having 1 or no aspect category. This means
that the number of FNs is naturally higher and thus the recall is low. The advantage of
the Two-Stage classifier scheme is that it does take into account that there can be more
than one aspect category per sentence.
The results for the Dominant Aspect Category Tagger are to be expected when you
take into account that the frequency of the aspect category ”food” is almost twice that
of the other categories. This result shows that the features used to construct the
feature space for the Two-Stage method contain information that the SVMs are
able to learn from.
                                            performance measures
Method                                      F1        recall    precision
Random Aspect Category baseline             0.306     0.305     0.308
Dominant Aspect Category baseline           0.483     0.637     0.388
Schouten and Frasincar (2014)               0.593     0.558     0.633
SemEval baseline                            0.639     -         -
OVA Scheme Based                            0.665     0.718     0.618
Two-Stage Classification Scheme Based       0.772     0.779     0.765
Brychcın et al. (2014)*                     0.810     0.774     0.851
Kiritchenko et al. (2014)                   0.822     0.783     0.865
Brychcın et al. (2014)                      0.886     0.862     0.910

Table 5.3: F1, recall, and precision scores for different methods when evaluation
is done on the test set provided by SemEval-2014
* indicates a constrained method where the algorithm is trained using only the
training set as a resource.
In this thesis we do not outperform the methods presented in (Brychcın et al., 2014;
Kiritchenko et al., 2014) when we compare our methods on the restaurant test dataset
from the SemEval-2014 competition. The research presented in this thesis shows that
we can extract a large amount of information using simple contextual information.
This contextual information enables us to build a feature space that tries to numeri-
cally represent a word given the context of the word. Our two-stage method shows
that training a classifier to filter out sentences that are labeled as “anecdotes/miscel-
laneous” benefits the performance of the classifier(s) that are specialized in detecting
more specific aspect categories. Nouns also seem to be very important, although this
was already proposed by Nakagawa and Mori (2002). The importance of contextual
information seems to decrease when we introduce a separate classifier for “anecdotes/miscellaneous”.
We can see this first stage as using the contextual information to decide whether
a sentence is worthy of more scrutiny or can be discarded
as being “anecdotes/miscellaneous”.
The constrained method proposed in Kiritchenko et al. (2014) is the method that
resembles the methods in this thesis the most. We can see that our method achieves
recall-scores that are similar to the recall-score reported in Kiritchenko et al. (2014). The
precision score presented in Kiritchenko et al. (2014) is 5% higher than the precision
score achieved by our best performing method. This could be due to the fact that the
authors in Kiritchenko et al. (2014) use a more sophisticated method for words where the
aspect category is not immediately apparent.
Chapter 6
Conclusion and Future Work
On-line consumer reviews are increasingly becoming the norm when evaluating the qual-
ity or desirability of a product. These reviews can contain a lot of information that is
relevant to other consumers. A review can be about a certain aspect of a product or
service. A set of reviews can contain many unique aspects. To further summarize the
aspects we assign the aspects to an aspect category. In this thesis we present two ma-
chine learning methods to detect the aspect categories in a given sentence. The first
method we propose is a method based on a general scheme for multi-class classification.
The other method we presented is based on a revised scheme of the general scheme. An
overview of the findings is presented in Section 6.1. Based on these findings the future
direction is presented in Section 6.2.
6.1 Conclusion
This thesis first introduced the problem of finding aspect categories in customer reviews.
A sentence can explicitly mention that “the food was great”. Here we know that it was
about the aspect category ‘food’. Now imagine the sentence reads like this “the scallops
had a great taste to them.”. Although food was never mentioned we know that we
are discussing the aspect category ‘food’ by relating the aspect ‘scallop’ to the category
‘food’. This is an example of aspect category detection.
In this thesis, two machine learning methods were introduced to tackle the problem
of detecting aspect categories. First we presented a basic framework for aspect category
detection using classification algorithms. Some preprocessing steps were proposed to
transform a sentence into a set of instances that can then be used to train or apply
the classification algorithms. The first step in preprocessing is to perform a spell check
on the review sentences.
53
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM
Aspect_Category_Detection_Using_SVM

Weitere ähnliche Inhalte

Was ist angesagt?

From sound to grammar: theory, representations and a computational model
From sound to grammar: theory, representations and a computational modelFrom sound to grammar: theory, representations and a computational model
From sound to grammar: theory, representations and a computational modelMarco Piccolino
 
dissertation_ulrich_staudinger_commenting_enabled
dissertation_ulrich_staudinger_commenting_enableddissertation_ulrich_staudinger_commenting_enabled
dissertation_ulrich_staudinger_commenting_enabledUlrich Staudinger
 
2012-02-17_Vojtech-Seman_Rigorous_Thesis
2012-02-17_Vojtech-Seman_Rigorous_Thesis2012-02-17_Vojtech-Seman_Rigorous_Thesis
2012-02-17_Vojtech-Seman_Rigorous_ThesisVojtech Seman
 
Pragmatic+unit+testing+in+c%23+with+n unit%2 c+second+edition
Pragmatic+unit+testing+in+c%23+with+n unit%2 c+second+editionPragmatic+unit+testing+in+c%23+with+n unit%2 c+second+edition
Pragmatic+unit+testing+in+c%23+with+n unit%2 c+second+editioncuipengfei
 
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesisExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesisRob Moore
 
Error correcting codes and cryptology
Error correcting codes and cryptologyError correcting codes and cryptology
Error correcting codes and cryptologyRosemberth Rodriguez
 
PhD thesis "On the intelligent Management of Sepsis"
PhD thesis "On the intelligent Management of Sepsis"PhD thesis "On the intelligent Management of Sepsis"
PhD thesis "On the intelligent Management of Sepsis"Vicente RIBAS-RIPOLL
 
Bookpart
BookpartBookpart
Bookparthasan11
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...Artur Filipowicz
 
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...Stefano Bochicchio
 
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggFundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggRohit Bapat
 
Business Mathematics Code 1429
Business Mathematics Code 1429Business Mathematics Code 1429
Business Mathematics Code 1429eniacnetpoint
 

Was ist angesagt? (19)

From sound to grammar: theory, representations and a computational model
From sound to grammar: theory, representations and a computational modelFrom sound to grammar: theory, representations and a computational model
From sound to grammar: theory, representations and a computational model
 
dissertation_ulrich_staudinger_commenting_enabled
dissertation_ulrich_staudinger_commenting_enableddissertation_ulrich_staudinger_commenting_enabled
dissertation_ulrich_staudinger_commenting_enabled
 
thesis
thesisthesis
thesis
 
Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)
 
2012-02-17_Vojtech-Seman_Rigorous_Thesis
2012-02-17_Vojtech-Seman_Rigorous_Thesis2012-02-17_Vojtech-Seman_Rigorous_Thesis
2012-02-17_Vojtech-Seman_Rigorous_Thesis
 
Pragmatic+unit+testing+in+c%23+with+n unit%2 c+second+edition
Pragmatic+unit+testing+in+c%23+with+n unit%2 c+second+editionPragmatic+unit+testing+in+c%23+with+n unit%2 c+second+edition
Pragmatic+unit+testing+in+c%23+with+n unit%2 c+second+edition
 
thesis
thesisthesis
thesis
 
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesisExamsGamesAndKnapsacks_RobMooreOxfordThesis
ExamsGamesAndKnapsacks_RobMooreOxfordThesis
 
Thesis_Prakash
Thesis_PrakashThesis_Prakash
Thesis_Prakash
 
Error correcting codes and cryptology
Error correcting codes and cryptologyError correcting codes and cryptology
Error correcting codes and cryptology
 
PhD thesis "On the intelligent Management of Sepsis"
PhD thesis "On the intelligent Management of Sepsis"PhD thesis "On the intelligent Management of Sepsis"
PhD thesis "On the intelligent Management of Sepsis"
 
Bookpart
BookpartBookpart
Bookpart
 
Oop c++ tutorial
Oop c++ tutorialOop c++ tutorial
Oop c++ tutorial
 
Bogstad 2015
Bogstad 2015Bogstad 2015
Bogstad 2015
 
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...Incorporating Learning Strategies in Training of Deep Neural  Networks for Au...
Incorporating Learning Strategies in Training of Deep Neural Networks for Au...
 
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
Backtesting Value at Risk and Expected Shortfall with Underlying Fat Tails an...
 
Machine learning-cheat-sheet
Machine learning-cheat-sheetMachine learning-cheat-sheet
Machine learning-cheat-sheet
 
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggFundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
 
Business Mathematics Code 1429
Business Mathematics Code 1429Business Mathematics Code 1429
Business Mathematics Code 1429
 

Andere mochten auch

Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Ms word thesis_082
Ms word thesis_082Ms word thesis_082
Ms word thesis_082hegazoh
 
Project report - Bengali digit recongnition using SVM
Project report - Bengali digit recongnition using SVMProject report - Bengali digit recongnition using SVM
Project report - Bengali digit recongnition using SVMMohammad Saiful Islam
 
Svm light at E-commerce Website
Svm light at E-commerce WebsiteSvm light at E-commerce Website
Svm light at E-commerce WebsiteZhang Peng
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesArshadRaja786
 
Svm implementation for Health Data
Svm implementation for Health DataSvm implementation for Health Data
Svm implementation for Health DataAbhishek Agrawal
 
report.doc
report.docreport.doc
report.docbutest
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...osify
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTrilok Sharma
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 

Andere mochten auch (14)

Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Ms word thesis_082
Ms word thesis_082Ms word thesis_082
Ms word thesis_082
 
Thesis
ThesisThesis
Thesis
 
Master thesis
Master thesisMaster thesis
Master thesis
 
Project report - Bengali digit recongnition using SVM
Project report - Bengali digit recongnition using SVMProject report - Bengali digit recongnition using SVM
Project report - Bengali digit recongnition using SVM
 
Svm light at E-commerce Website
Svm light at E-commerce WebsiteSvm light at E-commerce Website
Svm light at E-commerce Website
 
svm_AD
svm_ADsvm_AD
svm_AD
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning Techniques
 
Svm implementation for Health Data
Svm implementation for Health DataSvm implementation for Health Data
Svm implementation for Health Data
 
report.doc
report.docreport.doc
report.doc
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
 
Tweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVMTweets Classification using Naive Bayes and SVM
Tweets Classification using Naive Bayes and SVM
 
Lecture12 - SVM
Lecture12 - SVMLecture12 - SVM
Lecture12 - SVM
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 

Ähnlich wie Aspect_Category_Detection_Using_SVM

Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...Trevor Parsons
 
A Probabilistic Pointer Analysis For Speculative Optimizations
A Probabilistic Pointer Analysis For Speculative OptimizationsA Probabilistic Pointer Analysis For Speculative Optimizations
A Probabilistic Pointer Analysis For Speculative OptimizationsJeff Brooks
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign DetectionCraig Ferguson
 
Interactive Filtering Algorithm - George Jenkins 2014
Interactive Filtering Algorithm - George Jenkins 2014Interactive Filtering Algorithm - George Jenkins 2014
Interactive Filtering Algorithm - George Jenkins 2014George Jenkins
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
 
Dragos Datcu_PhD_Thesis
Dragos Datcu_PhD_ThesisDragos Datcu_PhD_Thesis
Dragos Datcu_PhD_Thesisdragos80
 
An Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing UnitsAn Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing UnitsKelly Lipiec
 
Performance Evaluation of Path Planning Techniques for Unmanned Aerial Vehicles
Performance Evaluation of Path Planning Techniques for Unmanned Aerial VehiclesPerformance Evaluation of Path Planning Techniques for Unmanned Aerial Vehicles
Performance Evaluation of Path Planning Techniques for Unmanned Aerial VehiclesApuroop Paleti
 
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmIntegrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmKavita Pillai
 

Ähnlich wie Aspect_Category_Detection_Using_SVM (20)

Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
 
A Probabilistic Pointer Analysis For Speculative Optimizations
A Probabilistic Pointer Analysis For Speculative OptimizationsA Probabilistic Pointer Analysis For Speculative Optimizations
A Probabilistic Pointer Analysis For Speculative Optimizations
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
Thesis
ThesisThesis
Thesis
 
Thesis Abstract
Thesis AbstractThesis Abstract
Thesis Abstract
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign Detection
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
Jmetal4.5.user manual
Jmetal4.5.user manualJmetal4.5.user manual
Jmetal4.5.user manual
 
Interactive Filtering Algorithm - George Jenkins 2014
Interactive Filtering Algorithm - George Jenkins 2014Interactive Filtering Algorithm - George Jenkins 2014
Interactive Filtering Algorithm - George Jenkins 2014
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Dragos Datcu_PhD_Thesis
Dragos Datcu_PhD_ThesisDragos Datcu_PhD_Thesis
Dragos Datcu_PhD_Thesis
 
Diplomarbeit
DiplomarbeitDiplomarbeit
Diplomarbeit
 
Technical report
Technical reportTechnical report
Technical report
 
An Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing UnitsAn Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing Units
 
Performance Evaluation of Path Planning Techniques for Unmanned Aerial Vehicles
Performance Evaluation of Path Planning Techniques for Unmanned Aerial VehiclesPerformance Evaluation of Path Planning Techniques for Unmanned Aerial Vehicles
Performance Evaluation of Path Planning Techniques for Unmanned Aerial Vehicles
 
energia
energiaenergia
energia
 
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based ParadigmIntegrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
Integrating IoT Sensory Inputs For Cloud Manufacturing Based Paradigm
 
Report
ReportReport
Report
 
Mak ms
Mak msMak ms
Mak ms
 

Aspect_Category_Detection_Using_SVM

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Problem Definition
  1.2 Terminology
  1.3 Thesis Structure

2 Research Goal
  2.1 Research Scope
  2.2 Methodology

3 Related Work
  3.1 Implicit Aspect Detection
  3.2 Aspect Category Detection
  3.3 Method Overview

4 Methodology
  4.1 Method Framework
  4.2 Feature-Space Definition
    4.2.1 Word Context
    4.2.2 Features (Lexicon and Lemmatization, N-Grams, Part-of-Speech Tagging, Chunk Parsing)
  4.3 Feature Selection
  4.4 Aspect Category Detection Methods
    4.4.1 Support Vector Machines
    4.4.2 Multi-Class Support Vector Machines
    4.4.3 Strict One-Vs-All Support Vector Machines Method
    4.4.4 Two-Stage Classification Scheme Support Vector Machines Method

5 Evaluation
  5.1 System Architecture (Spell Checker, POS Tagger, Word Lemma, Chunker, Feature Selection, SVM Classification Algorithm)
  5.2 Restaurant Review Corpus
  5.3 Evaluation Metrics
  5.4 Parameter Selection
    5.4.1 Part-of-Speech Filter
    5.4.2 N-Grams
    5.4.3 Threshold vs. No Threshold
    5.4.4 OVA Scheme Based vs. Two-Stage Scheme Based
    5.4.5 Parameter Tuning
  5.5 Algorithm Evaluation
    5.5.1 Dominant Aspect Category Tagger
    5.5.2 Random Aspect Category Detector
    5.5.3 Algorithm Comparison

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

A Part-of-Speech Filter Annotation

Bibliography
List of Figures

1.1 Review summary for Apple MacBook (2015) with scores for aspect categories
4.1 General framework for aspect category detection using machine learning
4.2 Flowchart showing an example of the OVA scheme based method
4.3 Flowchart showing an example of the two-stage classification scheme based method
5.1 A general overview of the training and prediction processes implemented in this thesis
5.2 Overview of the process of converting a sentence into a set of instances
5.3 Distribution of the number of aspect categories in a sentence
5.4 Distribution of the number of aspect categories in a sentence
5.5 Distribution of the number of aspect categories in a sentence
5.6 Results for the Strict OVA Aspect Category Detection method with 1-, 2- and 3-grams without a trained threshold for each individual aspect category
5.7 Results for the Strict OVA Aspect Category Detection method with 1-, 2- and 3-grams with a trained threshold for each individual aspect category
5.8 Results for the Two-Stage Classification Scheme method with 1-, 2- and 3-grams with a trained threshold for each individual aspect category
5.9 Results for the Two-Stage Classification Scheme method with 1-, 2- and 3-grams without a trained threshold for each individual aspect category
5.10 Arithmetic difference of F1 scores for the OVA based method and the two-stage method with and without threshold
5.11 Arithmetic difference of precision scores for the OVA based method and the two-stage method with and without threshold
5.12 Arithmetic difference of recall scores for the OVA based method and the two-stage method with and without threshold
5.13 Arithmetic difference of F1 scores for the OVA based method and the two-stage method with and without threshold
5.14 Arithmetic difference of precision scores for the OVA based method and the two-stage method with and without threshold
5.15 Arithmetic difference of recall scores for the OVA based method and the two-stage method with and without threshold
List of Tables

3.1 An overview of the results of the related work that is discussed
5.1 Confusion table for classification problems
5.2 Final parameters for the OVA scheme based method and the method based on a two-stage classification scheme
5.3 F1, recall and precision scores for the different methods when evaluation is done on the test set provided by SemEval-2014
A.1 All part-of-speech filters applied to the parameter tuning in this thesis
Chapter 1
Introduction

In this chapter we introduce the subject of this thesis. Next, we introduce some terminology that is used throughout this thesis to gain a better understanding of the subject at hand. Finally, we present the structure of this thesis.

1.1 Problem Definition

When someone forms an opinion, a key part of this process is the influence of the opinions of others (Liu, 2012). Not long ago, people relied on the opinions of family and friends to form an opinion about a product. Another way of forming an opinion on a product is to read specialized magazines or books. With the rise of the World Wide Web, the importance of online shopping has increased. An important part of online commerce is the ability for a consumer to write a review for a product. When a consumer decides to buy a product online, he or she will most likely read through the reviews written by other consumers for that product to get an idea of the overall sentiment towards that particular product (Bickart and Schindler, 2001; Feldman, 2013).

Reading through all reviews for one product can be a hassle. For this reason, it would be beneficial to find an efficient way of giving the consumer an overview of the overall sentiment expressed in the product reviews. The task of creating a relevant overview of the opinions expressed in a review can be divided into four subtasks (Popescu and Etzioni, 2007):

1. Identify product aspects;
2. Identify opinions regarding product aspects;
3. Determine the polarity of opinions;
4. Rank opinions based on their strength.
A product aspect is also referred to as a product feature. The number of aspects in a collection of reviews can become large. To create an overview of the sentiment towards a product, we can define aspect categories. Aspect categories are a way of summarizing aspects that are closely related. An example of a curated summary of a laptop review, with some aspect categories, can be seen in Figure 1.1.

Figure 1.1: Review summary for Apple MacBook (2015) with scores for aspect categories.

The research presented in this thesis does not tackle all four subtasks presented in Popescu and Etzioni (2007), but tackles a variation on subtask 1. In this thesis we address the task of finding aspect categories in consumer review sentences. Obviously, the review summary presented in Figure 1.1 is composed by a human who scores certain categories using the full review as a reference. To understand how an overview of the aspect categories can be constructed using the reviews from consumers, we can look at Sentence 1, which is extracted from a restaurant review:

(1) "Best of all is the warm vibe, the owner is super friendly and service is fast."

In Sentence 1 we find three aspects: vibe, owner, and service. We can further reduce this list to two aspect categories, namely ambiance and service. This classification gives a coarser overall view of the product's aspects and enables us to better score these aspects.
The methods presented in this research were specifically developed to determine aspect categories in a sentence. Given that aspect categories are assigned to already existing aspects, it is of interest to explore the methods concerned with the subtask of finding product aspects. Some early methods that were developed to extract product aspects were presented in (Ding et al., 2009; Hu and Liu, 2004; Kobayashi et al., 2005; Mei et al., 2007; Popescu and Etzioni, 2007). The methods proposed in these works use relatively few linguistic attributes (e.g., lexical and semantic features) to find product aspects. The feature that these methods have in common is that they are based on some form of word/phrase co-occurrence.

Methods based on co-occurrences have been shown to be adequate for modeling specific word/phrase relations. To find aspect categories, the context in which words/phrases are used is an important source of information for determining the right aspect category. To better understand the context of a word/phrase, surrounding word/phrase patterns are used to determine the correct aspect categories in a sentence. The patterns that arise when decomposing a sentence can be used to determine which aspect categories are addressed in that sentence.

The number of words/phrases in a set of reviews can be quite large. Also, there are many possible combinations for sequences of words/phrases neighboring a specific word/phrase. Solving such large-scale classification problems is crucial in areas such as text classification. The method proposed in this thesis can be described as a method for solving a text-classification problem. This problem can be characterized by large, sparse data with a huge number of instances and features (Fan et al., 2008). An efficient and promising method to solve the classification problem for large, high-dimensional datasets is the support vector machine (SVM). For this reason we propose to use SVM as the classification algorithm of choice for the aspect category detection methods presented in this thesis.

This thesis also concentrates on the numerical representation of a word given the context in which the word appears. An example of a method that gives such a numerical representation is word2vec, developed in Mikolov et al. (2013a,b,c). In this thesis we propose a similar numerical representation where words are converted to a vector form. The vector dimensions are determined by important sentence context features.
1.2 Terminology

In Section 1.1 we introduced some terminology but provided little explanation. In this section we introduce the terms most commonly used in this thesis.

Aspect
An aspect is a word or a collection of words that describes a specific feature of the subject being discussed in a sentence. Sentences can contain zero or more aspects. Sentence 2 is an example of the type of sentence commonly found in a review.

(2) "I can barely use any usb devices because they will not stay connected properly."

The term 'usb devices' in Sentence 2 is tagged as an aspect term. This is because we know that all opinions in the sentence are related to 'usb devices'.

In this research we distinguish between two types of aspects, namely explicit and implicit aspects. The reason for this distinction is that explicit aspects are relatively easy to find, while implicit aspects are relatively hard to determine. In Sentence 2 we found the aspect 'usb devices' because the aspect was explicitly mentioned. In Liu et al. (2005) the authors argued that some aspects are not explicitly mentioned but can instead be inferred from the sentence. Such an aspect is implicitly mentioned (Liu et al., 2005). Sentence 3 gives an example of a sentence with an implied aspect.

(3) "When we went to use it again, there was sound but no picture."

The aspect tagged for Sentence 3 is "camera". Notice that the aspect was never explicitly mentioned; the words "sound" and "picture" together imply that we are reading about a camera. Although most aspects appear explicitly in sentences, the number of implicit aspects can reach up to 30% of the total number of aspects (Wang et al., 2013).

Aspect Category
In Section 1.1 we introduced the concept of aspect categories. The categories and their respective aspects are determined in advance. In this thesis we use "aspect category" and "category" interchangeably. At its basis, an aspect category serves the same purpose as an aspect, that is, to describe a product or entity. The number of unique aspects in a set of reviews is generally quite large. Aspect categories are a convenient way of labeling aspects that are closely related. This enables a consumer to see an overview of the overall opinion on a certain group of aspects. An example of such an overview can be seen in Figure 1.1.
We previously mentioned that aspects can appear explicitly or implicitly. Categories can also be mentioned either explicitly or implicitly. However, because a category represents a group of aspects with a single label, we assume that categories do not always appear explicitly but can appear implicitly with the mention of an aspect. To illustrate this property and to give a general insight into aspect categories, we can look at Sentence 4 and Sentence 5 as example sentences from restaurant reviews. The predefined categories are 'food' and 'price'.

(4) "Great food at REASONABLE prices, makes for an evening that can't be beat!"

(5) "He has visited Thailand and is quite expert on the cuisine."

In Sentence 4, 'price' and 'food' are tagged as aspects. These tags have the same label as the categories, thus the categories appear explicitly. The tagged aspect in Sentence 5 is 'cuisine', and the corresponding tagged category is 'food'. Here the category 'food' was never mentioned, but it can be inferred from the fact that the aspect 'cuisine' describes something that belongs to the category 'food'. Sentence 5 is a good example of how aspects are related to their predefined category.

Feature-space
In Section 1.1 we proposed to use SVM to detect aspect categories in a review sentence. At its basis, SVM is a classification technique that determines a decision boundary for a binary classification problem. Text classification is notorious for having high-dimensional data, thus the trained SVM for the defined problem has a high-dimensional problem space. In our case each dimension represents a feature (attribute) of the sentence. In this thesis we refer to the problem space as a 'feature-space' and to each dimension as a feature. This enables us to represent a word/phrase numerically as a vector. The number of features (feature-space dimensions) is determined by the vocabulary. As mentioned before, a similar vector representation of words is presented in (Mikolov et al., 2013a,b,c). That advanced method is designed for large clusters of computers to process a large amount of data with a neural network as the learning algorithm.

1.3 Thesis Structure

In Chapter 2 we formally present the goal and scope of the research presented in this thesis. In Chapter 3 we discuss previous work that is related to aspect and category detection. In Chapter 4 we first introduce a general framework for category detection. From this framework we present two methods to perform the task of category detection; for both methods we present the pseudo-code and an example. In Chapter 5
we introduce the dataset we use to evaluate our methods. After the data has been introduced, we present the evaluation metrics we use to measure the performance of the methods proposed in this thesis. The performance of our methods is then compared to some baseline methods and to a couple of methods from the literature. Chapter 6 presents the conclusions we arrived at after the evaluation. Lastly, we suggest some work that can be done in the future.
Chapter 2
Research Goal

Although the concept of detecting explicit entity (product) aspects is not new, there has been relatively little research in the area of extracting aspects that are implied. The research presented in Su et al. (2008) was one of the first to attempt to tackle the problem of detecting implicit entity aspects. The authors base their model on inter- and intra-word relations. These relations are used for clustering and mutual reinforcement to create a set of association rules. The association rules depict the mapping of an opinion word to the associated feature word. Although this seems to be a reasonable method, it fails to capture important information with regard to the context of the opinion word. Another problem encountered in methods that are based on association rules and/or co-occurrences is that if a particular opinion word was never associated with a feature, then it will not be discovered, possibly negatively affecting the performance. This is often due to the sparseness of co-occurrences.

The goal of this thesis is not to detect the entity aspects, but rather their category. Because aspects are direct children of their categories, it is of interest to us to look at previous attempts at detecting aspects that appear either explicitly or implicitly. An interesting question arises when we look at most of the current research in detecting implied aspects. The question is as follows:

"How can you leverage the available information in a corpus to learn patterns that lead to an accurate model for extracting aspect categories?"

In order to provide an answer to the previous question, the following questions need to be answered as well:

• What lexical features in a sentence are important for determining aspect categories?
• How important are the patterns of words and/or lexical features for determining aspect categories?
• What algorithm suits pattern recognition for aspect category detection?
• How can we compare the performance of a proposed method to already existing methods?

2.1 Research Scope

The focus of this research is primarily on extracting aspect categories with the help of classification algorithms. More specifically, this research concerns itself with all steps of building a classification system for detecting aspect categories in consumer reviews. The preprocessing steps implemented here make use of several off-the-shelf tools from the Natural Language Processing field. This research will not actively improve on these existing methods, but mostly leverages them to improve the performance of the overall system. The focal point of this research is to present a method to extract contextual information from sentences and to discover patterns in this information in order to find aspect categories.

2.2 Methodology

The first part of this research is a literature survey on methods previously devised for extracting features (aspects) from a corpus of text. Next, a general framework is presented to detect aspect categories. In this research we assume that aspects are just specializations of their categories. Therefore we assume that the methods for aspect detection and category detection are relatively similar. After the framework is introduced, two methods for detecting aspect categories are presented. Both methods are based on the framework presented in this thesis. These methods use a sentence as their input and output a list of predicted aspect categories. Last, some baseline algorithms are presented to form a reference to which we can compare the methods presented in this research. We also compare our methods to existing methods for category detection.
Chapter 3
Related Work

Aspect detection is a relatively new area of research in the Natural Language Processing domain. It is related to the fields of opinion mining and sentiment analysis (Liu, 2012). Aspect detection is used in these areas to extract the opinion/sentiment about a certain aspect of a product. This chapter discusses the current approaches that directly or indirectly tackle aspect detection. As mentioned before, aspects and categories can appear explicitly or implicitly. In Section 3.1 we discuss the research done on finding aspects that are implied. Section 3.2 presents the research done on the task of determining aspect categories.

3.1 Implicit Aspect Detection

Finding aspects can be a challenge in itself. Most methods to find aspects concentrate on aspects that are explicitly mentioned in a document or sentence (Ding et al., 2009; Hu and Liu, 2004; Kobayashi et al., 2005; Mei et al., 2007; Popescu and Etzioni, 2007).

OPINE is a review-mining system introduced in Popescu and Etzioni (2007) to find the semantic orientation of words in the context of given product features and sentences. The research goal of Popescu and Etzioni (2007) comes close to the research question proposed in this research. The authors present a thorough system for opinion mining. Specifically, they present methods for detecting aspects that take into account implicitly and explicitly mentioned aspects. The explicit aspect detector is discussed in more detail than the implicit aspect detector. They use opinion words and patterns to extract implicit features. More specifically, they use neighborhood features of a word to determine if an aspect appears in a sentence. The authors also developed a method to find patterns of the semantic orientation of an opinion word in the context of an associated aspect and the input sentence.
The experiment the authors constructed was geared towards opinion mining. The research presents interesting ideas, but the authors do not present results on implicit aspect detection.

One of the earlier attempts at extracting implicit aspects is made in Su et al. (2006). At its basis, the authors propose a method that analyzes semantic associations, based on Point-wise Mutual Information (PMI), to determine if a word represents an aspect. It is easy to understand the logic that the semantic association of an opinion word with a corresponding aspect will help us determine the correct aspect implied in the sentence that contains the opinion word. However, the results cannot be verified, as the authors did not include any tangible results in the presentation of their method.

In the field of opinion mining, the authors in Su et al. (2008) proposed a method that clusters words with a high semantic similarity to detect implicit aspects. The words used for clustering are words that have been tagged as aspects and opinion words. The thinking behind this is that words that often appear together have a high similarity. By this reasoning we can estimate the aspect by looking at the given opinion word in the context of the sentence it appears in. To model the complicated relationships between product aspects and opinion words, the authors consider two sets of words: a set of product aspect words and a set of opinion words. After the definition of the sets, the clusters and the inter- and intra-relationships of the aspect and opinion words are iteratively determined. To calculate the similarity between two words, the authors propose to combine a traditional approach for calculating similarity with a similarity metric based on the retrospective relationships between certain words. A limitation of this research is that the authors only consider adjectives as opinion words; in practice, adjectives do not cover the wide range of opinions that are expressed. The authors did not provide any numerical results, which means that we cannot verify the performance of this method.

In Hai et al. (2011), the authors also tackled the problem of identifying implicit entity aspects. They proposed to identify the implicitly mentioned aspects using co-occurrence association rule mining. The method is based on the co-occurrence count of an explicitly mentioned aspect and an opinion word. Explicitly mentioned aspects can be extracted using existing methods; in Hai et al. (2011), they are detected using dependency relations. Opinion words are extracted using part-of-speech tags. After building the aspect and opinion word sets, the co-occurrence matrix between these two word sets is generated, and the association rules are mined based on this co-occurrence matrix. Based on this mapping one can predict an implied aspect. The authors report that the method yields an F1-measure of 74% on a dataset of Chinese reviews of mobile phones. The performance of this method is heavily dependent on words co-occurring often, which results in poor performance on sparse datasets.
Wang et al. (2013) use the same basic idea as Hai et al. (2011). The proposed method goes beyond the idea of mining for rules by simply mapping opinion words and explicit features. There are three important extensions used in this research. First, the authors add substring rules to a basic set of rules, meaning that they build new rules from the substrings of an existing rule. Second, they use the syntactic dependencies between lexical units to mine for potential rules. Last, they use a constrained topic model to expand the word co-occurrences. The results presented in Wang et al. (2013) seem to improve on those given in Hai et al. (2011): the authors reported an F1-measure of 75.51%, which is a slight improvement on the F1-measure reported for the method presented in Hai et al. (2011).

In Zhang and Zhu (2013) the authors proposed a method that uses co-occurrences similarly to the method in Hai et al. (2011). They also use a concept called double propagation. Even though double propagation is a method employed for explicit aspect detection, we will briefly discuss it before discussing the full method used in Zhang and Zhu (2013).

To understand double propagation we look at the research of Qiu et al. (2009). The researchers in Qiu et al. (2009) set out to find an efficient way of doing sentiment analysis on text within a certain domain. Opinion expressions can vary wildly from one domain to another. The proposed method exploits the relation between sentiment words and the product features that modify the sentiment. The relations are used for the propagation of information through both the sentiment and feature words. This is called double propagation. The method proposed by the authors in Qiu et al. (2009) performed favorably when they compared it to several other methods (e.g., conditional random fields). It performed especially well when a relatively small corpus was used for training. The reason for this performance boost can be attributed to the fact that the method in Qiu et al. (2009) finds implicit opinion words that modify the aspect words (modifiers). An example of a modifier word is 'small' in Sentence 6. This word describes an implied aspect, namely the aspect 'size', of the entity 'mackerel'.

(6) "Lee caught a small mackerel."

The researchers in Zhang and Zhu (2013) bring together the ideas from the methods proposed by Hai et al. (2011) and Qiu et al. (2009). All previous work on detecting implied entity aspects has two things in common. First, opinion words and explicit aspect words are extracted to create a mapping between the two. Second, the co-occurrence between opinion words and the extracted aspect words is used to create the mapping. The method presented in Zhang and Zhu (2013) uses co-occurrences and the idea of double propagation to calculate the average correlations between an aspect word and the notional words in a sentence.
The feature with the highest average correlation is selected as the implicit feature. The authors reported an F1-measure of 80% on a dataset of Chinese phone reviews.

3.2 Aspect Category Detection

The task of detecting aspect categories was introduced at the International Workshop on Semantic Evaluation (SemEval-2014) as a subtask of the general task of 'Aspect Based Sentiment Analysis'.

As part of the SemEval-2014 task, the method developed in Schouten and Frasincar (2014) computes a score for the likelihood that a certain word is a description of an aspect and/or its category. To train the method, the authors use a training set that contains sentences that have been manually annotated with the aspects and aspect categories of each sentence. A co-occurrence matrix is then constructed with the frequencies with which words co-occur with a predefined aspect or category in a sentence. After the co-occurrence matrix is defined, the authors propose to train a threshold for all aspect categories to decide when to choose which category is most likely. To detect a category in a sentence, the score is calculated for each category in the given sentence; if the score exceeds the threshold, the category is chosen. The authors presented an F1-measure of 59% on a dataset containing sentences from restaurant reviews from the SemEval-2014 competition.

The method presented in Brychcın et al. (2014) uses a binary Maximum Entropy classifier with term frequency-inverse document frequency (tf-idf) and bag-of-words as the feature-space. The authors reported an F1-measure of 81.0%, which makes this method the best performing constrained method in the SemEval-2014 workshop.

Another method that is based on a machine learning algorithm is proposed in Kiritchenko et al. (2014). The proposed method uses a one-vs-all SVM scheme over n predefined aspect categories for classification. The feature-space for the SVMs is defined by various n-grams and information from a lexicon learned from an unlabeled dataset of restaurant reviews from YELP. The sentences that have not been assigned an aspect category are passed through a post-processing step that calculates a posterior probability P(c|d) for category c given sentence d. The category with the highest probability is chosen as the most likely category for the sentence. Only if the probability of the preliminary category exceeds a certain trained threshold is the sentence labeled as referring to that category. This variant achieved an F1-measure of 88.6%. The method that was actually submitted to the SemEval-2014 workshop did not use YELP to learn the lexicon.
The constrained method they submitted had an F1-score of 82.2%. The YELP-based variant was not submitted to the SemEval-2014 workshop, but it did outperform all other methods from the SemEval-2014 workshop that participated in the task of 'Aspect Based Sentiment Analysis'.

3.3 Method Overview

In this section we present an overview of the methods introduced in this chapter. Table 3.1 lists the methods that are most relevant to the method used in this thesis.

Table 3.1: An overview of the results of the related work that is discussed. * indicates the unconstrained variant, which uses resources beyond the provided training set.

Method Type | Method | Task | Result
Machine Learning-based | Kiritchenko et al. (2014) | Detect Aspect Categories | F1-score: 81%
Machine Learning-based | Brychcın et al. (2014) | Detect Aspect Categories | F1-score: 82%
Machine Learning-based | Kiritchenko et al. (2014)* | Detect Aspect Categories | F1-score: 89%
Frequency- and Rule-based | Wang et al. (2013) | Detect Implicit Product Aspects | F1-score: 75.51%
Frequency-based | Hai et al. (2011) | Detect Implicit Product Aspects | F1-score: 74%
Frequency-based | Zhang and Zhu (2013) | Detect Implicit Product Aspects | F1-score: 80%
Frequency-based | Schouten and Frasincar (2014) | Detect Aspect Categories | F1-score: 59%
Chapter 4
Methodology

In this chapter we introduce two methods for detecting product aspect categories in review sentences. First, we introduce a general framework that forms the basis for both methods. Next, we present a method to define a multi-dimensional feature-space that converts a word into a vector representation given the sentence it appears in. The last part of this chapter is dedicated to discussing two methods for aspect category detection that use the framework and the proposed method for defining the feature-space. Both methods are based on some form of the one-versus-all classification scheme for multi-class SVM classification.

4.1 Method Framework

The two methods presented in this thesis are based on machine learning algorithms. The choice for using machine learning as the foundation is based on the intuition that contextual patterns exist in a sentence around words that describe an aspect. To illustrate this, we can look at the categories service and food. The word 'horrible' can be associated with either 'horrible service' or 'horrible food'. Methods based on association rule mining (Hai et al., 2011; Wang et al., 2013) propose to tackle this by choosing the category with the highest association probability. The statistical methods presented in (Zhang and Zhu, 2013; Schouten and Frasincar, 2014) propose to solve this problem by using a co-occurrence matrix to calculate the probability of choosing a category related to a word.

The advantage of using association rule mining is that the algorithms are fast and the rules are relatively easy for humans to understand. They give us insight into some prevalent patterns. The disadvantage of this approach is that the trained algorithms
can miss some less obvious patterns that may appear in a sentence. Furthermore, when a word appears that was not previously seen in the training set, we have to navigate a non-intuitive set of steps to be able to generate a prediction.

In this thesis we choose to develop a method that enables us to convert any chosen word in a sentence into vector form given the sentence it appears in. This enables us to predict the category related to a word (represented as a point in the feature-space) by looking at the neighboring words. The features in the multi-dimensional problem space are defined by a selection of lexical and semantic attributes that have been selected by off-the-shelf feature selection algorithms.

The spatial representation of a word enables us to use some advanced machine learning algorithms as the classification step in our framework. The important insight here is that we aim to learn how the context of a word impacts its meaning. In this thesis we assume that two similar words, used in the same context, are close to each other in the problem space, and that the two words are far apart when mentioned in different contexts (Mikolov et al., 2013a). Based on this assumption, we can choose classification algorithms that are based on the spatial representation of data points. Figure 4.1 presents an overview of the general framework. This framework serves as the structure of the methods presented in this thesis.

Figure 4.1: General framework for aspect category detection using machine learning (pipeline: training set, feature-space definition, classification algorithm training, determine categories).

4.2 Feature-Space Definition

In this section we present a supervised learning algorithm to define the dimensions of the feature-space. The reason for using a spatial approach is that it enables us to convert a word into a vector representation of that word. First we present a method for defining the context in which a word appears. Next we define the features that are included in the feature-space. Last, we discuss a method to reduce the number of dimensions in the feature-space.
4.2.1 Word Context

In this thesis we assume that the input data is in the form of individual review sentences. The choice of using sentences as the input form stems from the fact that we want to capture the information contained in the words that surround a word w_i for a given sentence s. As an alternative to a sentence as the input string, we could also use a collection of sequential sentences, e.g., a paragraph, as input. The disadvantage of using multiple sentences is that many categories are mentioned in a paragraph, which leads to overgeneralization of the input context. In this thesis the context of a word is defined as the parts of a sentence s that precede or follow a specific word w_i at the i-th word-index in sentence s. The context of a word influences its meaning or effect.

4.2.2 Features

Flekova et al. (2014) use machine learning algorithms to determine what makes a good biography. The authors present a list of nine classes of numerical features to construct a feature-space that is well suited for text-classification problems. The feature-space constructed in this thesis is based on three of the nine classes from Flekova et al. (2014). We chose to use only three classes because most other classes in Flekova et al. (2014) are geared more toward quality analysis. The three classes are discussed below.

Lexicon and Lemmatization
The first step in developing the feature-space is to construct a set of the words used in the training corpus. This set of words can grow quite large because of the many grammatical forms a word can appear in. The set contains related words that have similar meanings but differ in grammatical form; examples of such words are democracy, democratic, and democratization. To reduce the size of the set of unique words, we propose to represent a word in its most basic form possible. The process of finding the root of an input word is called stemming. This method is usually rather crude, in that it simply cuts off the end of a word and hopes for the best. The most common and effective algorithm for stemming is presented in Porter (1980). A related, and more advanced, method for finding the base form of a word is called lemmatization. The advantage of lemmatization is that it reduces words based on a vocabulary and an analysis of the morphological properties of the words. For an example of the process of stemming/lemmatizing the words in a sentence, we can look at Sentence 7 for the original sentence and Sentence 8 for the same sentence lemmatized.
(7) "It took half an hour to get our check, which was perfect since we could sit, have drinks and talk!"

(8) "It take half an hour to get our check , which be perfect since we could sit , have drink and talk !"

Here we see that a verb like 'took' has been transformed to its base form 'take'. This small example shows the potential for word-set size reduction: if we now encounter a word such as 'taken' in another sentence in the training set, the word set will not grow but will already contain the lemmatized form of 'taken', namely 'take'. In this research, lemmatization is done with the Java implementation of Stanford CoreNLP (Manning et al., 2014).

N-Grams
In Section 4.2.1 we defined the context of a word as the parts of a sentence that precede or follow that word. To capture the context of a word, we propose to first construct the set of contiguous sequences of n words from a sentence s. In this thesis we define one such sequence as an n-gram. To illustrate, we can define a simple sentence s = {w_1, w_2, w_3, w_4}, where w_i is the word at position i and the sentence contains n = 4 words. Say we want to use up to 2-grams to get the context of a word w_i. The set of 1-grams of sentence s is C_1 = {w_1, w_2, w_3, w_4}, and the set of 2-grams extracted from sentence s is C_2 = {w_1w_2, w_2w_3, w_3w_4}. In this example, the contexts of word w_2 would be {w_2, w_1w_2, w_2w_3}. In this research we define one feature as one element from the set of n-grams, so in this example the feature set becomes F = C_1 ∪ C_2. For a more practical example of n-gram set building, we can look at Sentence 8. Below, the 1-gram set and the 2-gram set are constructed from the words in the (lemmatized) first part of Sentence 8.

1-grams = {It, take, half, an, hour, to, get, our, check}
2-grams = {It take, take half, half an, an hour, hour to, to get, get our, our check}

The sets of n-grams for all sentences in the training set are initially added to the feature set F. Although the number of features can grow quite large, we alleviate this problem later by doing feature selection (Section 4.3).

Part-of-speech tagging
In the field of linguistics, words can be labeled (tagged) such that the label corresponds to a so-called part-of-speech (POS). This process is called POS-tagging. The basic POS tags are familiar ones such as noun and verb. Tagging a word with a POS often involves advanced learning algorithms that detect hidden relations between words in sentences or paragraphs and assign the correct tag given all these properties. One such example is the POS tagger presented in Toutanova and Manning
(2000), which is based on a maximum-entropy model. In most cases, supervised tagging algorithms are trained on an annotated text corpus (e.g., the Penn Treebank and the British National Corpus (Marcus et al., 1993; Leech et al., 1994)). To illustrate part-of-speech tagging, we again use Sentence 7 to show how tagging works:

(9) "It/PRP took/VBD half/NN an/DT hour/NN to/TO get/VB our/PRP check/NN which/WDT was/VBD perfect/JJ since/IN we/PRP could/MD sit/VB have/VB drinks/NNS and/CC talk/VB"

In this thesis, the parts-of-speech in a sentence are used to construct sets of n-grams over the word tags in a sentence s. This helps in finding important linguistic patterns. A simple example is an adjective following a noun: this tells us that the noun is being modified by an adjective and that the noun is potentially referring to an aspect (category) of the product.

Chunk Parsing
Although n-grams can detect certain linguistic patterns, they do not capture lexical patterns that appear with words outside the scope of n; that is, they do not detect relations with words that are more than n positions away from the related word. According to the research presented in Gee and Grosjean (1983), a sentence can be parsed into so-called performance structures. The parsing method presented in Abney (1992) describes performance structures as structures of word clustering that emerge from a variety of types of experimental data, such as pause durations in reading and naive sentence diagramming. Although the presentation of performance structures makes some general assumptions about the syntax rules, Abney (1992) uses the performance structures to form a basis for a method that builds syntactic subgraphs of a sentence. To capture the disjoint lexical patterns, we employ the method for shallow parsing introduced in Abney (1992). According to the author, a sentence can be read in chunks. Again we use Sentence 7 as an example. Sentence 10 represents a possible set of chunks for Sentence 7; the chunks in this sentence serve as a fictional example.

(10) "[It took] [half an hour] [to get] [our check], [which was perfect] [since we could sit], [have drinks], [and talk]!"

Abney (1992) uses such an example to construct a method to parse a sentence based on chunks, called shallow parsing. Shallow parsing splits a sentence into so-called phrases or chunks. These small phrases can give us further insight into what information is contained in which part of a sentence. This can be
seen as a more advanced, variable-length n-gram generator. Building on the previous example, the chunks created for Sentence 10 with the method presented in Abney (1992) can be seen in Sentence 11.

(11) "[NP It] [VP took] [NP half an hour] [VP to get] [NP our check] , [NP which] [VP was] [ADJP perfect] [SBAR since] [NP we] [VP could sit] , [VP have] [NP drinks and talk] !"

In this example, "drinks and talk" is parsed as a Noun Phrase (NP). For a full list of chunk tags, we refer the reader to the tagging guidelines presented in Santorini (1990). In this research we use the chunks in the same way we use the n-grams, that is, as a way of getting the context of a word using lexical and semantic relations that exist in the sentence. We also include the chunk tags of the corresponding POS tags; a small sketch of how such context features can be collected for a word is shown below.
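The following is a minimal, self-contained sketch (in Python, not taken from the thesis) of how the word and POS-tag n-gram context features described in this section could be collected for a target word. The tokenization, the toy tag values, and the "W:"/"P:" feature-name prefixes are illustrative assumptions; the thesis itself obtains lemmas, POS tags, and chunks from the off-the-shelf tools described in Chapter 5.

    def context_features(lemmas, pos_tags, target_index, max_n=2):
        # Collect every word n-gram and POS-tag n-gram (n = 1..max_n) that
        # contains the target word, mirroring the word-context idea of Section 4.2.1.
        features = set()
        for n in range(1, max_n + 1):
            for start in range(len(lemmas) - n + 1):
                if start <= target_index < start + n:
                    features.add("W:" + " ".join(lemmas[start:start + n]))
                    features.add("P:" + " ".join(pos_tags[start:start + n]))
        return features

    # Toy example with pre-lemmatized tokens and (assumed) POS tags.
    lemmas = ["it", "take", "half", "an", "hour", "to", "get", "our", "check"]
    tags   = ["PRP", "VBD", "NN", "DT", "NN", "TO", "VB", "PRP$", "NN"]
    print(sorted(context_features(lemmas, tags, target_index=1)))

Chunk tags would be added in exactly the same way, as an extra tag sequence alongside the POS tags.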
4.3 Feature Selection

Large datasets are more and more common in many areas of research. Both the number of instances and the number of features grow with the increased ability to measure data points with a large number of features. In the field of natural language processing, datasets contain both a large number of instances and a large number of features. Most statistical methods have a hard time handling this high dimensionality. For this reason we choose to incorporate a feature selection step. The benefits of this step are twofold. First, we get a significant reduction in the number of features. This lowers the overall computation time of the algorithm. It also makes the method more robust, as without feature selection we run the risk of not being able to generalize. The other benefit of this step is that it can give us insight into which features are most important and which patterns matter most for which category in which context.

In this research we extract specific information related to a word and the context in which the word is mentioned, and we extract this context information on a sentence level. If we maintain the original feature set, we run the risk of creating word vectors that are too sparse, which can hurt the performance of the method. To prune the feature space, we propose to use the Information Gain approach presented in Kullback (2012). The Information Gain method is based on measures that give a numerical value to the uniformity of a set of multidimensional points. To illustrate the idea, let us imagine we have a dataset with points that can be labeled as being either of class a or class b. The goal is to find the features that best split the data in such a way that we get the best split between class a and class b. The measure for this best split is called information. In terms of features, we can say that information is a measure of how many features are needed to correctly classify an instance as being of class a or class b.

To choose whether a feature is included in the set or not, we must determine the information gained by including a feature f in the feature set F, where f ∉ F. To measure this influence, the Information Gain method uses the measure for information entropy presented in Schneider (1995), shown in Equation 4.1:

    H = -\sum_{i \in F} P_i \log_2 P_i    (4.1)

where H denotes the information entropy, F denotes the set of features to analyze, and P_i is the probability that a successful classification is made given the set of features. Now that we have a measure to determine the importance of a feature for classification, Equation 4.2 follows naturally to determine the information gained from adding a feature f to the feature set F. We denote this new set as \bar{F}, where \bar{F} = F ∪ {f}. The gained information IG(F, f) for feature f is determined by subtracting the information entropy H(\bar{F}) of the feature set with feature f included from the information entropy H(F) of the feature set F without feature f:

    IG(F, f) = H(F) - H(\bar{F})    (4.2)

The final feature set is constructed by selecting those features that maximize the information gained by adding them to the set.
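As an illustration of this selection step, the sketch below ranks binary presence/absence features by information gain. It is a simplified stand-in, not the thesis implementation: it uses the standard class-entropy formulation of information gain over a labeled training set, which differs in detail from the set-based notation of Equations 4.1 and 4.2, and the toy feature names and labels are invented.

    import math
    from collections import Counter

    def entropy(labels):
        # H = -sum_c p(c) * log2 p(c) over the class distribution of `labels`.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values()) if labels else 0.0

    def information_gain(instances, labels, feature):
        # Gain from splitting the labeled instances on the presence of one feature.
        with_f = [y for x, y in zip(instances, labels) if feature in x]
        without_f = [y for x, y in zip(instances, labels) if feature not in x]
        remainder = (len(with_f) / len(labels)) * entropy(with_f) \
                  + (len(without_f) / len(labels)) * entropy(without_f)
        return entropy(labels) - remainder

    # Toy data: each instance is the set of context features that are present.
    X = [{"W:vibe", "P:NN"}, {"W:service", "W:fast"}, {"W:service"}, {"W:food"}]
    y = ["ambience", "service", "service", "food"]
    ranked = sorted({f for x in X for f in x},
                    key=lambda f: information_gain(X, y, f), reverse=True)
    print(ranked)

In a setup like this, the highest-ranked features would be the ones kept as feature-space dimensions for the classification methods in the next section.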
  • 34. Methodology 22 to find a hyperplane in a feature-space such that the hyperplane forms a separation between instances that are labeled either −1 or +1 Tan et al. (2006). The method for determining a hyperplane depends on a so called kernel function. Kernel functions can be classified as linear functions or non-linear functions. The authors in Fan et al. (2008) argue that for dataset with a large feature-space (e.g. 4464 unique features in one of our cases) and a large number of sparse instances, the benefits of non- linear kernel functions are minimal while that time-complexity is very high. In the case that the number of features is very large and the data sparse, the authors in Fan et al. (2008) propose to use a SVM algorithm with alinear kernel function, as opposed to a more complex non-linear kernel function. For an overview of the performance difference between linear and non-linear kernel functions we refer to the research in Fan et al. (2008). Now that we have chosen the kernel function type we can proceed with discussing the SVM algorithm with a leaner kernel function. Assume we know that the sentences ∀s in the training data are labeled as having either category a or category b. Suppose we want to detect the category of a sentence s by extracting word wi classify the word as related to either category a or category b. First we convert wi in sentence s into a vector where the length of the vector is the number of features n = |F|. This word vector is called a classification instance xj. We do this for all words in all sentences to construct the input dataset. Given training instances xj ∈ Rn, i = 1, . . . , n and binary class vector y ∈ Rl such that yi = 1, −1, we can now train the SVM algorithm by solving the optimization problem in Equation 4.3 min w 1 2 wT w + C l i=1 (max(0, 1 − yiwT xi))2 (4.3) where this problem will give us an optimal weight vector w. This weigth vector can be seen as the separation hyperplane for the problem. Given the trained weight vector w we can classify classify a vector according to the following classifier in Equation 4.4: ˜y = sign(wT x) (4.4) 4.4.2 Multi-Class Support Vector Machines The task of extracting an aspect category boils down to the problem of assigning a label to an extracted word wi in a sentence s. An extracted word can be labeled by its corresponding feature, if present. We leverage the information within a sentence to
  • 35. Methodology 23 determine whether a word might imply the presence of categories c or not. This means that the number of categories can be expressed as |c| ≥ 1. We already mentioned that SVM is actually a binary-classification algorithm. Aspect category detection can be seen as a multi-class classification problem. The methods in Freund and Schapire (1997) and Schapire and Singer (1999) are examples of multi-class algorithms that tackle multi-class classification problems with SVMs.These methods are mostly based on a boosting scheme to train multiple binary classifiers and use some classification scheme to process an instance. One of the simplest schemes for multi-class classification is to build N classifiers with N denoting the number of categories in the category set C. Each classifier distinguishing between one category and the rest Rifkin and Klautau (2004). This scheme is known as the “one-vs-all” (OVA) scheme. Another quite simple scheme is to build a classifier that distinguishes between all pairs of classes. In this scheme we build N 2 classifiers Rifkin and Klautau (2004). This scheme is also known as the “all-vs-all” (AVA) scheme. There have been several attempts at developing a true multi-class SVM algorithm (Crammer and Singer, 2002a; Weston et al., 1999; Vapnik, 1998). In general an OVA scheme does not offer a theoretical advantage over other multi-class classification schemes. From a practical point OVA performs just as well as other schemes Rifkin and Klautau (2004). Because of the relative simplicity of the OVA scheme it is the desired scheme to use. In this thesis we chose to use an OVA scheme implementation that incorporates a method presented in (Crammer and Singer, 2002b,a). The details for these methods are presented in Keerthi et al. (2008). 4.4.3 Strict One-Vs-All Support Vector Machines Method The first method we present is a method that is based on the framework we presented in Section 4.1. In this method we choose to use a simple multi-class SVM algorithm with a OVA classification scheme. We previously stated that in order to use SVM algorithms we must convert the words in all sentences into a set of instances I for before we can train the SVM algorithm. The pseudo-code for the process is presented in Algorithm 1. The first thing to note about an instance produced by Algorithm 1 is that they are very sparse. This is shown by the fact that xi << n where xi ∈ 0, 1 and n denotes the number of attributes. For this reason scaling of an instance is very important. One advantage of this is to reduce the complexity of the SVM calculations Hsu et al. (2003).
  • 36. Methodology 24 Algorithm 1 Instance builder Require: X: list of attributes obtained with Information Gain Require: P: set of POS tags given by POS filter Require: G: integer for number of grams to extract Ensure: I: a set of instances 1: procedure instanceBuilder(Si) 2: L ← array of lemmas for all words in Si 3: A ← array of aspect terms in Si an array of pre-labeled aspect terms 4: Initialize set of instances I 5: for all lemma l ∈ L do 6: initialize set of n-grams N 7: p ← pos tag for lemma l 8: c ← chunk tag for lemma l 9: if p ∈ P then check on POS against the list defined by the filter 10: for j = 1 to G do 11: Nj ← buildNGrams(l,p,c,L) 12: add I = Nj to N 13: for all aspects a ∈ A do 14: Na ←buildNGrams(a,p,c,L) 15: add Na to N 16: end for 17: end for 18: Il ← define instance Il(j) = 1 if N(j) = X(j)∀j = 1, . . . , k 19: if Il = empty then 20: Scale Il and add to I 21: end if 22: end if 23: end for 24: return I 25: end procedure To get a scaled instance we use its unit vector ex calculated as followed: ex = x |x| To further keep the number of attributes at a minimum, we apply a so-called part- of-speech filter. Only when a word is tagged with a POS tag defined in the filter will it be considered for instance building. This will limit the set of the number of words used for instance building. The part-of-speech filters that are considered in this research are presented in Appendix A. The final part-of-speech tag is presented in 5.4. We know that SVM classification is done for each lemma l in the sentence s. This gives us an array of predicted aspect categories ˜c with |˜c| = |l|. The final set of predictions for sentence s is defined as ¯c = {˜c|˜c ∈ C}. We also use n¯c to denote the number of words we predicted that imply sentence s contains category c. Here we see that the possibility
• 37. Methodology 25 Here we see that the possibility exists to predict an aspect category in a sentence based on only one word, which would lead to a higher number of false positives. To avoid this behavior, we introduce a threshold that limits the number of predictions of category c relative to the length of the sentence. Equation 4.5 shows the condition under which a sentence is labeled with category c, given the threshold tc:

tc ≤ nc̄ / |L|    (4.5)

Equation 4.5 tells us that a sentence is labeled as mentioning category c only if the relative number of lemmas classified as related to category c is larger than the threshold. To train this threshold, we apply a simple linear search that incrementally raises the value of tc for each category c ∈ C and chooses the value of tc that maximizes an evaluation metric over all sentences in the review training set. In this thesis we use the F1-measure as the performance measure to maximize; its definition is presented in Section 5.3. Now that we know how to build instances for training the SVM classifier, Algorithm 2 presents the algorithm that trains the SVM classifiers using the OVA scheme for multi-class classification.

Algorithm 2 Strict OVA SVM classifier training algorithm
Require: S: set of annotated sentences
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by the POS filter
Require: G: integer for the number of grams to extract
Ensure: M: the trained SVM classifier
Ensure: thresholds: vector with threshold values for the relative number of times category c ∈ C was classified in sentence s
1: procedure SVM classifier training(S)
2:   initialize training dataset D
3:   for all sentences s ∈ S do
4:     Y ← list of unique aspect categories for sentence s
5:     for all aspect categories y ∈ Y do
6:       I ← instanceBuilder(s)
7:       add [y, I] to D
8:     end for
9:   end for
10:  M ← trainClassifier(D)
11:  thresholds ← trainThreshold(S, M)    ▷ simple linear search algorithm
12: end procedure
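The trainThreshold step in Algorithm 2 is the linear search described above. A minimal sketch, assuming we have already computed, for one category c, the per-sentence ratio nc̄/|L| and the gold annotations (the method and variable names are illustrative only):

import java.util.*;

final class ThresholdTrainer {
    // Pick, for one category c, the threshold t that maximizes F1 on the training sentences.
    // ratios[i]  = fraction of lemmas in sentence i that were classified as category c
    // hasGold[i] = true if sentence i is annotated with category c
    static double trainThreshold(double[] ratios, boolean[] hasGold, double step) {
        double bestT = 0.0, bestF1 = -1.0;
        for (double t = 0.0; t <= 1.0; t += step) {          // incrementally raise the threshold
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < ratios.length; i++) {
                boolean predicted = ratios[i] >= t;           // label the sentence if the ratio reaches t
                if (predicted && hasGold[i]) tp++;
                else if (predicted) fp++;
                else if (hasGold[i]) fn++;
            }
            double f1 = (2.0 * tp) / (2.0 * tp + fp + fn + 1e-12);
            if (f1 > bestF1) { bestF1 = f1; bestT = t; }
        }
        return bestT;
    }

    public static void main(String[] args) {
        double[] ratios = {0.50, 0.10, 0.00, 0.34};
        boolean[] gold  = {true, false, false, true};
        // Picks the smallest threshold that maximizes F1 on this toy data (about 0.15).
        System.out.println(trainThreshold(ratios, gold, 0.05));
    }
}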
• 38. Methodology 26 Algorithm 3 introduces the prediction process for the method employing the strict OVA scheme.

Algorithm 3 Strict OVA SVM classifier prediction algorithm
Require: S: set of test sentences
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by the POS filter
Require: G: integer for the number of grams to extract
Require: F: set of predefined aspect categories
1: procedure process OVA classification scheme on test set(S)
2:   for all sentences s ∈ S do
3:     initialize fy = 0 for all y ∈ F
4:     I ← instanceBuilder(s)
5:     for all instances i ∈ I do
6:       y ← M(i)    ▷ M is the classifier trained in Algorithm 2
7:       fy ← fy + 1
8:     end for
9:     for all y ∈ F do
10:      if fy/|s| ≥ thresholdy then    ▷ |s| denotes the number of words in sentence s
11:        annotate y as an aspect category for sentence s
12:      end if
13:    end for
14:  end for
15: end procedure

The output of Algorithm 3 is a classification vector y with |y| ≥ 1 for each sentence.
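A condensed sketch of the per-sentence decision in Algorithm 3, reusing the OneVsAll wrapper sketched earlier; the thresholds map (one trained value per category) and the class name SentenceAnnotator are illustrative assumptions.

import java.util.*;

final class SentenceAnnotator {
    // Count per-category votes over the word-level instances of one sentence and keep the
    // categories whose relative vote count reaches the trained threshold (Algorithm 3, condensed).
    static Set<String> annotate(OneVsAll model, double[][] instances, int sentenceLength,
                                Map<String, Double> thresholds) {
        Map<String, Integer> votes = new HashMap<>();
        for (double[] instance : instances) {
            votes.merge(model.predict(instance), 1, Integer::sum);
        }
        Set<String> predicted = new HashSet<>();
        for (Map.Entry<String, Double> t : thresholds.entrySet()) {
            double ratio = votes.getOrDefault(t.getKey(), 0) / (double) sentenceLength;
            if (ratio >= t.getValue()) predicted.add(t.getKey());
        }
        return predicted;
    }
}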
• 39. Methodology 27 Example Figure 4.2 shows a flowchart of an example of the OVA method. The figure for the SVM classifier is not representative of a true SVM hyperplane for the example problem. Figure 4.2: Flowchart showing an example of the OVA scheme based method
• 40. Methodology 28 4.4.4 Two-Stage Classification Scheme Support Vector Machines Method
The second method we present in this thesis is an extension of the first method presented in Section 4.4.3. The disadvantage of the OVA-based method is that it uses the same classifier scheme with the same features to predict all predefined aspect categories, which can lead to a higher probability of wrongly classifying a sentence with the most common category. The extension presented next includes a binary classifier whose sole job is to predict whether a sentence s contains any aspect categories or not. If the classifier predicts that sentence s contains ≥ 1 categories, we proceed by applying the OVA scheme introduced in Algorithm 3 to the sentences that are predicted to contain an aspect category. The proposed extension of the method presented in Section 4.4.3 is given in Algorithm 6.

Algorithm 4 Instance builder two-stage classification, step 1
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by the POS filter
Require: G: integer for the number of grams to extract
Ensure: I: a set of instances
1: procedure instanceBuilder(Si)
2:   L ← array of lemmas for all words in Si
3:   A ← array of aspect terms in Si    ▷ an array of pre-labeled aspect terms
4:   initialize set of instances I
5:   initialize set of n-grams N
6:   for all lemmas l ∈ L do
7:     p ← POS tag for lemma l
8:     c ← chunk tag for lemma l
9:     if p ∈ P then    ▷ check the POS tag against the list defined by the filter
10:      for j = 1 to G do
11:        Nj ← buildNGrams(l, p, c, L)
12:        add Nj to N
13:      end for
14:    end if
15:  end for
16:  Is ← define instance Is(j) = 1 if N(j) = X(j) ∀j = 1, . . . , k
17:  if Is is not empty then
18:    scale Is and add it to I
19:  end if
20:  return I
21: end procedure

The instances created to train the first classifier (C0) are a rough collection of all n-grams formed by all words in a given sentence s. With this additional step, the hope is that the Two-Stage approach will further reduce the number of false positive predictions when Algorithm 7 is used to process a sentence.
• 41. Methodology 29 Algorithm 5 Instance builder two-stage classification, step 2
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by the POS filter
Require: G: integer for the number of grams to extract
Ensure: I: a set of instances
1: procedure instanceBuilder(Si)
2:   L ← array of lemmas for all words in Si
3:   A ← array of aspect terms in Si    ▷ an array of pre-labeled aspect terms
4:   initialize set of instances I
5:   for all lemmas l ∈ L do
6:     initialize set of n-grams N
7:     p ← POS tag for lemma l
8:     c ← chunk tag for lemma l
9:     if p ∈ P then    ▷ check the POS tag against the list defined by the filter
10:      for j = 1 to G do
11:        Nj ← buildNGrams(l, p, c, L)
12:        add Nj to N
13:      end for
14:      Il ← define instance Il(j) = 1 if N(j) = X(j) ∀j = 1, . . . , k
15:      if Il is not empty then
16:        scale Il and add it to I
17:      end if
18:    end if
19:  end for
20:  return I
21: end procedure

The classifier C0 in Algorithm 7 functions as a filter, so that only sentences that possibly contain aspect categories are classified further. Just as before, the output of this algorithm is a classification vector y with |y| ≥ 1 for each sentence.
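The two-stage prediction flow that Algorithm 7 formalizes below can be sketched compactly on top of the earlier OneVsAll and SentenceAnnotator sketches. The “OTHER”/“anecdotes/miscellaneous” labels follow the training setup of Algorithm 6; modelling the first-stage binary classifier as a two-category OneVsAll is a simplification for illustration only.

import java.util.*;

final class TwoStagePredictor {
    // Stage 1: a binary classifier decides whether the sentence carries any specific category.
    // Stage 2: the OVA scheme of Algorithm 3 assigns the actual categories.
    static Set<String> predict(OneVsAll stageOne, OneVsAll stageTwo,
                               double[] sentenceInstance, double[][] wordInstances,
                               int sentenceLength, Map<String, Double> thresholds) {
        if (!"OTHER".equals(stageOne.predict(sentenceInstance))) {
            // C0 predicts no specific category: fall back to the catch-all label.
            return Collections.singleton("anecdotes/miscellaneous");
        }
        return SentenceAnnotator.annotate(stageTwo, wordInstances, sentenceLength, thresholds);
    }
}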
• 42. Methodology 30 Algorithm 6 Two-Stage Classification Scheme training algorithm
Require: S: set of annotated sentences
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by the POS filter
Require: G: integer for the number of grams to extract
Ensure: C0: the trained SVM classifier for the first stage
Ensure: C1: the trained SVM classifier for the second stage
Ensure: thresholds: vector with threshold values for the relative number of times category c ∈ C was classified in sentence s
1: procedure Training Two-Stage Classification Scheme method on annotated sentences(S)
2:   initialize training dataset D0
3:   initialize training dataset D1
4:   for all sentences s ∈ S do
5:     Y ← list of unique aspect categories for sentence s
6:     I0 ← instanceBuilder(s)    ▷ sentence-level instance built from all n-grams in the sentence (Algorithm 4)
7:     if Y ≠ {“miscellaneous”} then
8:       add [“OTHER”, I0] to D0
9:       for all aspect categories y ∈ Y do
10:        I1 ← instanceBuilder(s)    ▷ word-level instances (Algorithm 5)
11:        add [y, I1] to D1
12:      end for
13:    else
14:      add [“miscellaneous”, I0] to D0
15:    end if
16:  end for
17:  C0 ← trainClassifier(D0)
18:  C1 ← trainClassifier(D1)
19:  thresholds ← trainThreshold(S, C0, C1)    ▷ the simple linear search algorithm discussed earlier
20: end procedure
• 43. Methodology 31 Algorithm 7 Two-Stage Classification Scheme method prediction algorithm
Require: S: set of test sentences
Require: X: list of attributes obtained with Information Gain
Require: P: set of POS tags given by the POS filter
Require: G: integer for the number of grams to extract
Require: F: set of predefined aspect categories
1: procedure process Two-Stage Classification scheme on test set(S)
2:   for all sentences s ∈ S do
3:     I0 ← instanceBuilder(s)    ▷ sentence-level instance (Algorithm 4)
4:     y0 ← C0(I0)
5:     if y0 = “OTHER” then    ▷ C0 predicts that s contains at least one specific category
6:       initialize fy = 0 for all y ∈ F
7:       I ← instanceBuilder(s)    ▷ word-level instances (Algorithm 5)
8:       for all instances i ∈ I do
9:         y1 ← C1(i)
10:        fy1 ← fy1 + 1
11:      end for
12:      for all y ∈ F do
13:        if fy/|s| ≥ thresholdy then
14:          annotate y as an aspect category for sentence s
15:        end if
16:      end for
17:    else
18:      annotate “miscellaneous” as an aspect category for sentence s
19:    end if
20:  end for
21: end procedure
• 44. Methodology 32 Example Figure 4.3 shows a flowchart of an example of the Two-Stage Classification Scheme method. The figure for the SVM classifier is not representative of a true SVM hyperplane for the example problem. Figure 4.3: Flowchart showing an example of the two-stage classification scheme based method
• 45. Chapter 5 Evaluation In this chapter we give an overview of the experimental setup. First, we present the system architecture of the experiment in Section 5.1. Then, in Section 5.2, we present the consumer review data released at SemEval 2014 (Pontiki et al., 2014). The data consists of a corpus of consumer reviews for restaurants from Citysearch New York (Ganu et al., 2009). To validate the results of the experiment we use the training and test sets as provided by SemEval 2014 (Pontiki et al., 2014). The two methods we proposed require some form of parameter selection and tuning; in Section 5.4 we give an overview of the parameters that need to be tuned. The performance of the proposed methods is compared to some baselines. The two baselines are a Dominant Aspect Category Tagger and a Random Aspect Category Tagger, formally introduced in Sections 5.5.1 and 5.5.2. We also compare the performance of our methods with methods from the literature, namely those developed for the SemEval 2014 (Pontiki et al., 2014) competition. 5.1 System Architecture In this section we give a visual overview of the implementation of the methods presented in the previous chapter. In this thesis we used the Java programming language to implement the proposed methods. To discuss the Java libraries used in the implementation, we will use the visual representation as a reference. A summarized overview of the proposed methods is presented in Figure 5.1. The process reads the data used for evaluation and splits it into a set of training sentences and a set of test sentences. The first process in Figure 5.1 is the process of converting the words in a sentence into instances.
• 46. Chapter 5. Evaluation 34 Figure 5.1: A general overview of the training and prediction processes implemented in this thesis. The output is a dataset where the (targeted) words represent points in the feature space, as defined in Section 4.2, with their corresponding labeled categories. Figure 5.2 shows the process of converting the words in a given sentence into instances. Spell checker The pipeline in Figure 5.2 performs spell checking. The method used in this thesis is presented in (Naber, 2003; Milkowski, 2010); its Java implementation is called JLanguageTool1. The advantage of using JLanguageTool is that it not only checks for the best word match given a dictionary of correctly spelled words, but also uses a corpus of grammatical pattern rules to determine the correct word to replace a misspelled word with. POS Tagger One of the core methods employed in natural language processing is part-of-speech tagging. There are many ready-to-use part-of-speech taggers available. Most taggers are trained on an annotated text corpus (e.g., the Penn Treebank and the British National Corpus; Marcus et al. (1993); Leech et al. (1994)). The POS 1 The JLanguageTool API can be found at http://wiki.languagetool.org/java-api
• 47. Chapter 5. Evaluation 35 Figure 5.2: Overview of the process of converting a sentence into a set of instances tagger used in this thesis is the tagger included in the Stanford CoreNLP (Manning et al., 2014) Java API2. Word Lemma To reduce a word to its lemma form we use the lemmatizer available in the Stanford CoreNLP (Manning et al., 2014) Java API. 2 The Stanford CoreNLP Java API can be found at http://nlp.stanford.edu/software/corenlp.shtml
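As an illustration of the POS tagging and lemmatization steps, a minimal Stanford CoreNLP pipeline might be configured as follows. The exact annotator configuration used in this thesis is not documented here, so this sketch is an assumption based on the standard CoreNLP API.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class PosLemmaDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Tokenization and sentence splitting are prerequisites for the POS and lemma annotators.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("The scallops had a great taste to them.");
        pipeline.annotate(document);

        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.printf("%-10s %-5s %s%n", word, pos, lemma); // e.g. "scallops  NNS  scallop"
            }
        }
    }
}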
• 48. Chapter 5. Evaluation 36 Chunker In this thesis we perform chunking with the chunker available in the OpenNLP project (Baldridge, 2005) Java API, using the default model, which is trained on the data presented in Tjong Kim Sang and Buchholz (2000). Feature Selection We apply the Information Gain method for feature selection. The Java implementation we use is the Information Gain feature selection method in the Weka machine learning library (Hall et al., 2009). SVM Classification Algorithm In Section 4.3 we established that the number of features can be quite large. For this reason the authors in Fan et al. (2008) propose to use a linear SVM; they also show the difference in running time between linear and non-linear SVM kernel functions for problems with a large number of features and instances. Therefore an SVM with a linear kernel is used in both methods. In this thesis we use a Java version of the C++ API presented in Fan et al. (2008). For multi-class classification we use the default OVA scheme of Fan et al. (2008), which is an implementation of the OVA method discussed in Keerthi et al. (2008). 5.2 Restaurant Review Corpus The restaurant review dataset (Ganu et al., 2009; Pontiki et al., 2014) consists of a collection of reviews for restaurants in New York. In this thesis we use a training set of approximately 3000 review sentences and a test set of approximately 800 review sentences. The sentences are manually annotated with aspect terms, and each sentence is also annotated with aspect categories. The training dataset from (Ganu et al., 2009; Pontiki et al., 2014) has 5 predefined aspect categories: ‘service’, ‘ambience’, ‘food’, ‘price’ and ‘anecdotes/miscellaneous’. The distribution of the aspect categories over the review sentences in the training set is presented in Figure 5.3. The category with the highest frequency is ‘food’; the frequencies of ‘food’ and ‘anecdotes/miscellaneous’ are about twice those of the other categories. This will have an effect on the number of false positives when we run experiments with the methods presented in the previous chapter. A sentence can also have more than one labeled aspect category. This complicates matters further, because the proposed system must be able to predict up to the number of predefined categories per sentence.
• 49. Chapter 5. Evaluation 37 Figure 5.3: Distribution of the aspect categories in the training set As mentioned before, this problem is solved by using multi-class SVM classification schemes and by processing all words of a sentence to construct a set of predicted categories. Figure 5.4 gives an overview of the distribution of the number of aspect categories per sentence. Figure 5.4: Distribution of the number of aspect categories in a sentence The distribution of the aspect categories in the test dataset provided by (Ganu et al., 2009; Pontiki et al., 2014) is presented in Figure 5.5. Here it is obvious that the portion of aspects tagged with the ‘anecdotes/miscellaneous’ label is much lower when compared to the training set. The impact of this shift is that the
• 50. Chapter 5. Evaluation 38 Figure 5.5: Distribution of the aspect categories in the test set methods might over-assign the ‘anecdotes/miscellaneous’ category to sentences. 5.3 Evaluation Metrics To evaluate the output of the presented methods and the comparative algorithms, some evaluation metrics are defined. Table 5.1 introduces the four possible prediction vs. actual outcomes. In this research the outcomes are defined as follows:

                  predicted true       predicted false
actual true       TP                   FN (Type II error)
actual false      FP (Type I error)    TN

Table 5.1: Confusion table for classification problems
• True Positive (TP): the algorithm has correctly predicted a category that is present in the annotated sentence.
• False Negative (FN): the algorithm has not predicted a category that is present in the annotated sentence (Type II error).
• 51. Chapter 5. Evaluation 39 • False Positive (FP): the algorithm has predicted an aspect category that is not present in the annotated sentence (Type I error).
A TP is counted only when the algorithm predicts the same category as the annotated category in a sentence. This means that wrongly predicting an aspect category affects not only the FP count but also the FN count: the prediction counts as an FP because the predicted category is not in the sentence, and as an FN because the annotated aspect category of the sentence was not predicted. For this reason performance measures such as precision and recall are both affected. Precision and recall are presented in Equations (5.1) and (5.2):

precision = TP / (TP + FP)    (5.1)
recall = TP / (TP + FN)    (5.2)

Looking at Equations (5.1) and (5.2), we can see that the above definitions of FP and FN can lead to lower values for these performance measures, because one misclassification increases both the FP and FN counts and thus lowers both precision and recall. In this research we want to maximize both performance metrics. A very high precision score may result from an algorithm that is too conservative in its predictions: if the algorithm does not predict a category, precision is not affected, but recall will be low. Vice versa, an algorithm with a high recall score can be too liberal in its predictions. To balance both measures we use the harmonic mean of recall and precision, known as the F1-measure (Tan et al., 2006):

F1 = 2TP / (2TP + FP + FN)    (5.3)

5.4 Parameter Selection This section discusses the parameters that are pre-selected or tuned in both methods presented in this thesis. First, a part-of-speech filter is applied to the words extracted from a sentence. An example of a part-of-speech filter is one in which only the nouns are extracted from a given tagged sentence; in this research this filter rule is denoted as “NN”. Appendix A lists all part-of-speech filters considered. The two proposed methods extract information by constructing a set of 1-, 2- and 3-grams of the neighboring words of the word wi being processed. The neighboring words
• 52. Chapter 5. Evaluation 40 selected to construct the n-gram sets of word wi are not subject to filtering based on their part-of-speech tags. Section 4.2.2 introduced the n-grams into the attribute space. In this section the optimal value of n is determined by comparing the F1 values for both methods using 1-, 2- and 3-grams as input parameters. Furthermore, we determine whether thresholds should be set for each predefined category, as described in Algorithm 3 for the method based on a strict OVA classification scheme and Algorithm 7 for the method based on a two-stage classification scheme. To determine the optimal parameter setup for the proposed methods, we run the trained models for the two methods on the test set. The results without threshold training are presented in Figures 5.6 and 5.9 for the Strict OVA method and the Two-Stage approach, respectively; when no threshold is trained we use a default value of 0 for all thresholds. The results with threshold training are presented in Figures 5.7 and 5.8 for the Strict OVA method and the Two-Stage approach, respectively. 5.4.1 Part-Of-Speech Filter The results in Figures 5.7 - 5.9 show that any POS filter that allows nouns to be extracted seems to result in higher performance for the Strict OVA Aspect Category Detection method. This reinforces the research presented in Nakagawa and Mori (2002). Next to nouns, the most important word type seems to be the adjective. This makes sense, given that the role of an adjective is defined as “a describing word, the main syntactic role of which is to qualify a noun or noun phrase”; extracting an adjective therefore gives some indication that an aspect category is being discussed. The word type that contains the least information about aspect categories is the adverb. An adverb is generally used as a modifier for verbs, adjectives, nouns, and noun phrases, but is principally used with verbs. This knowledge, combined with the results from Nakagawa and Mori (2002), explains why adverbs perform poorly for detecting aspects that appear implicitly. 5.4.2 N-Grams The influence of the size of the extracted n-grams can be seen in Figures 5.6 - 5.9. For the Strict OVA Aspect Category Detection method the n-grams seem to follow the reasoning developed in Section 4.2.2. Only when the ‘only NN JJ’ filter is applied, shown in Figure 5.6, does the 1-gram seem to perform better than the other n-grams. This can be explained by the fact that nouns are often used for describing aspects that appear implicitly (Nakagawa and Mori, 2002), and that adjectives describe nouns.
• 53. Chapter 5. Evaluation 41 Figure 5.6: Results for the Strict OVA Aspect Category Detection method with 1-, 2- and 3-grams, without a trained threshold for each individual aspect category Figure 5.7: Results for the Strict OVA Aspect Category Detection method with 1-, 2- and 3-grams, with a trained threshold for each individual aspect category
• 54. Chapter 5. Evaluation 42 Figure 5.8: Results for the Two-Stage Classification Scheme method with 1-, 2- and 3-grams, with a trained threshold for each individual aspect category Figure 5.9: Results for the Two-Stage Classification Scheme method with 1-, 2- and 3-grams, without a trained threshold for each individual aspect category
• 55. Chapter 5. Evaluation 43 The results for the Two-Stage Classification Scheme method paint a different picture of the importance of the length of the extracted n-grams. When nouns are filtered out of the extracted lemmas, the results behave more or less according to the reasoning presented in Section 4.2.2 and observed for the basic OVA based method. This could be because the first classification stage already discriminates between sentences with and without aspect categories, so that extracting information from the neighboring words is less beneficial. For the Two-Stage Classification Scheme based method, the best results seem to be obtained with unigrams. 5.4.3 Threshold vs. No Threshold To test the effect of training a threshold on the number of predictions for a certain category in a sentence, we look at the arithmetic difference of the F1-scores of the methods with and without a trained threshold. Figures 5.10a and 5.10b present the arithmetic difference of the F1 scores for the two methods with and without a trained threshold. (a) OVA scheme based (b) Two-Stage scheme based Figure 5.10: Arithmetic difference of F1 scores for the OVA based method and the two-stage method with and without threshold The threshold step seems to increase performance when nouns are included in the filter. This can be attributed to the fact that nouns are important in aspect category detection: in the case of the method based on the OVA scheme, many predictions are made for the dominant category when nouns are extracted. Limiting the number of FP hits with the threshold improves precision and in turn improves the F1 measure. Figure 5.11 shows the difference in precision when threshold training is and is not applied. The addition of the trained threshold generally improves the precision score; the rationale for the threshold is clearly reflected in these two figures.
• 56. Chapter 5. Evaluation 44 (a) OVA scheme based (b) Two-Stage scheme based Figure 5.11: Arithmetic difference of precision scores for the OVA based method and the two-stage method with and without threshold To test how restrictive the threshold is for category detection, the difference between the recall scores for the algorithms with and without threshold is presented in Figure 5.12. From the results in Figure 5.12 we can see that adding a trained threshold (a) OVA scheme based (b) Two-Stage scheme based Figure 5.12: Arithmetic difference of recall scores for the OVA based method and the two-stage method with and without threshold filters out some TP hits to compensate for the number of FP hits, which results in a decrease in the recall scores. This is to be expected, given that the threshold is trained by maximizing the F1 score. From the results in Section 5.4 and Figures 5.11 and 5.12 we can conclude that, overall, training a threshold to reduce the FP count improves the performance of the proposed methods; even though a higher FN count is expected and observed, the performance increase with respect to the F1-measure is mostly due to the reduction of FP. We can also see that not all part-of-speech filters show an increase in performance; these decreases happen when nouns are omitted from the sentence by the filter. We will not consider these cases in our explanation of the results.
• 57. Chapter 5. Evaluation 45 5.4.4 OVA Scheme based vs. Two-Stage Scheme based Sections 4.4.3 and 4.4.4 proposed two methods for aspect category detection. The algorithm proposed in Section 4.4.3 performs a single classification step, which uses the same instances to decide whether or not a sentence is labeled with an aspect category. The second algorithm, proposed in Section 4.4.4, adds an extra classification step that first predicts whether a sentence has ≥ 1 aspect categories or none. To see the impact of training a separate classifier to find sentences with or without labeled aspect categories, Figure 5.13 shows the results of F1^t − F1^o, where F1^t denotes the F1 score for the method based on the two-stage classification scheme and F1^o the F1 score for the OVA scheme based method. (a) no trained threshold (b) trained threshold Figure 5.13: Arithmetic difference of F1 scores for the OVA based method and the two-stage method with and without threshold The results in Figure 5.13 show that adding an additional classifier that specifically predicts whether a sentence contains an aspect category results in an overall improvement of the F1 score. Especially when only unigrams (1-grams) are extracted as attributes, the performance improves the most. This could be because the binary classifier for detecting aspect categories discards sentences that do not refer to an aspect category, which in turn lowers the FP count and thus increases the F1-score. To see the effect on the number of FP predictions, Figure 5.14 shows the arithmetic difference between the precision scores, precision^t − precision^o. The results in Figure 5.14 suggest that adding the extra classifier has a large impact on the number of FPs. This result confirms the reasoning given for adding an extra classifier in Section 4.4.4. When we look at the arithmetic difference between the recall scores, recall^t − recall^o, in Figure 5.15, we can see the effect on the number of FNs of including an extra classifier to filter out sentences that may not contain references to aspect categories.
• 58. Chapter 5. Evaluation 46 (a) no trained threshold (b) trained threshold Figure 5.14: Arithmetic difference of precision scores for the OVA based method and the two-stage method with and without threshold (a) no trained threshold (b) trained threshold Figure 5.15: Arithmetic difference of recall scores for the OVA based method and the two-stage method with and without threshold Figure 5.15a shows how adding the second classifier increases the number of FN predictions and in turn lowers the recall score. This is because the Two-Stage method might more easily classify a sentence as having no references to aspect categories, thus increasing the probability that sentences with labeled aspect categories are never processed by the second classifier in this method. On the face of it, the results in Figure 5.15b seem to go against the reasoning previously given. On closer inspection, however, the threshold trained in the OVA scheme based method can be more restrictive on whether or not to annotate a sentence as having an aspect category, which gives high FN counts and thus a low recall score. In the method based on the Two-Stage scheme the threshold only has an effect on sentences that are classified as having aspect categories.
• 59. Chapter 5. Evaluation 47 5.4.5 Parameter Tuning For comparison purposes, Table 5.2 shows the parameter settings for the OVA based method and the two-stage method that are used when comparing the performance of the methods presented in this research with some comparative algorithms. The parameters have been selected as the settings that result in the highest value of F1.

                 OVA based     Two-Stage based
Parameters
  pos-filter     NN VB JJ      NN VB JJ
  n-gram         3             1
  threshold      true          true
Results
  F1             0.665         0.772
  precision      0.618         0.765
  recall         0.718         0.779

Table 5.2: Final parameters for the OVA scheme based method and the method based on a Two-Stage classification scheme

5.5 Algorithm Evaluation In this section we evaluate how our methods perform. We compare our methods with some baseline category detection methods and with aspect category detection methods from the literature. In Sections 5.5.1 and 5.5.2 we introduce the Dominant Aspect Category Tagger and the Random Aspect Category Detector, respectively. 5.5.1 Dominant Aspect Category Tagger The Dominant Aspect Category Tagger is a simple algorithm trained by determining the most frequent aspect category in the training dataset. When annotating a test sentence, it simply assigns the aspect category determined in the training stage. The pseudo-code for training and processing this algorithm is given in Algorithms 8 and 9, respectively. 5.5.2 Random Aspect Category Detector Another baseline algorithm we are interested in is one that randomly assigns aspect categories to sentences. This algorithm is trained by determining the probability of an
• 60. Chapter 5. Evaluation 48 Algorithm 8 Dominant Aspect Category Tagger training algorithm
1: Input: S: set of annotated sentences
2: procedure Training Dominant Aspect Category Tagger on annotated sentences(S)
3:   initialize most frequent category F
4:   initialize category count vector f = 0
5:   for all sentences s ∈ S do
6:     Y ← list of unique aspect categories for sentence s
7:     for all aspect categories y ∈ Y do
8:       fy ← fy + 1
9:     end for
10:  end for
11:  F ← arg maxy fy
12: end procedure

Algorithm 9 Dominant Aspect Category Tagger prediction algorithm
1: Input: S: set of test sentences
2:        F: most common category from the training stage
3: procedure process Dominant Aspect Category Tagger on test set(S)
4:   for all sentences s ∈ S do
5:     annotate F as an aspect category for sentence s
6:   end for
7: end procedure

aspect category by:

Py = ( Σs∈S fy,s ) / n    (5.4)

where Py is the probability of category y, fy,s = 1 if category y is in sentence s (and 0 otherwise), and n is the total number of aspect category annotations plus the number of sentences with no aspect categories in training set S. The training and processing algorithms for the Weighted Random Aspect Category Detector are presented in Algorithms 10 and 11, respectively. 5.5.3 Algorithm Comparison To compare the baseline algorithms presented in the previous section with our methods, we run all methods on the restaurant test data provided by SemEval 2014 (Pontiki et al., 2014). The results in Table 5.3 show the performance of all previously mentioned baseline methods and methods from the literature, compared to the methods proposed in this thesis. The settings selected for the two methods are those presented in Section 5.4.5. The best performing algorithm is used as the benchmark.
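Before turning to the comparison, a minimal sketch of the Weighted Random Aspect Category Detector defined by Equation 5.4 and Algorithms 10 and 11 below. The interpretation that at most one category (or none) is drawn per sentence follows the description in Section 5.5.3; the class and method names are illustrative assumptions.

import java.util.*;

final class WeightedRandomBaseline {
    private final Map<String, Double> probabilities = new HashMap<>();
    private final Random random = new Random();

    // Estimate P_y as in Equation 5.4: category counts divided by the total number of
    // category annotations plus the number of sentences without any category.
    void train(List<Set<String>> trainingSentenceCategories) {
        Map<String, Integer> counts = new HashMap<>();
        int n = 0;
        for (Set<String> categories : trainingSentenceCategories) {
            if (categories.isEmpty()) { n++; continue; }
            for (String y : categories) { counts.merge(y, 1, Integer::sum); n++; }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probabilities.put(e.getKey(), e.getValue() / (double) n);
        }
    }

    // Annotate a sentence: draw at most one category, where category y is chosen with
    // probability P_y and no category is assigned with the remaining probability mass.
    Optional<String> annotate() {
        double r = random.nextDouble();
        double cumulative = 0.0;
        for (Map.Entry<String, Double> e : probabilities.entrySet()) {
            cumulative += e.getValue();
            if (r < cumulative) return Optional.of(e.getKey());
        }
        return Optional.empty();   // no aspect category for this sentence
    }
}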
• 61. Chapter 5. Evaluation 49 Algorithm 10 Weighted Random Aspect Category Detector training algorithm
1: Input: S: set of annotated sentences
2: procedure Training Weighted Random Aspect Category Detector on annotated sentences(S)
3:   initialize category count vector f = 0
4:   for all sentences s ∈ S do
5:     Y ← list of unique aspect categories for sentence s
6:     for all aspect categories y ∈ Y do
7:       fy ← fy + 1
8:     end for
9:   end for
10:  for all aspect categories y do
11:    Py ← ( Σs∈S fy,s ) / n    ▷ Equation 5.4
12:  end for
13: end procedure

Algorithm 11 Weighted Random Aspect Category Detector prediction algorithm
1: Input: S: set of test sentences
2:        P: category probabilities from the training stage
3: procedure process Weighted Random Aspect Category Detector on test set(S)
4:   for all sentences s ∈ S do
5:     annotate category y as an aspect category for sentence s with probability Py
6:   end for
7: end procedure

The first thing to conclude from Table 5.3 is that the Two-Stage method outperforms the OVA based method by 11% on the F1-score. Both the precision and recall scores are increased by adding a second stage to the method. Following the definitions of precision and recall, we can conclude that adding an SVM classifier that detects whether a sentence contains aspect categories decreases both the false positive and false negative counts. The Random Aspect Category Detector introduced in Section 5.5.2 performs the worst on all measures. This could be down to the fact that sentences can contain multiple aspect categories (Figure 5.4), while the Random Aspect Category Detector classifies a sentence as having one or no aspect category. This means that the number of FNs is naturally higher, and thus the recall is low. The advantage of the Two-Stage classifier scheme is that it does take into account that there can be more than one aspect category per sentence. The results for the Dominant Aspect Category Tagger are to be expected when you take into account that the frequency of the aspect category ‘food’ is almost twice that of the other categories. These results show that the features used to construct the feature space for the Two-Stage method carry information that the SVMs are able to learn from.
• 62. Chapter 5. Evaluation 50

Method                                     F1      recall   precision
Random Aspect Category baseline            0.306   0.305    0.308
Dominant Aspect Category baseline          0.483   0.637    0.388
Schouten and Frasincar (2014)              0.593   0.558    0.633
SemEval baseline                           0.639   -        -
OVA Scheme Based                           0.665   0.718    0.618
Two-Stage Classification Scheme Based      0.772   0.779    0.765
Brychcín et al. (2014) *                   0.810   0.774    0.851
Kiritchenko et al. (2014)                  0.822   0.783    0.865
Brychcín et al. (2014)                     0.886   0.862    0.910

Table 5.3: F1, recall and precision scores for the different methods when evaluated on the test set provided by SemEval-2014. * indicates a constrained method, where the algorithm is trained using only the training set as a resource.

In this thesis we do not outperform the methods presented in (Brychcín et al., 2014; Kiritchenko et al., 2014) on the restaurant test dataset from the SemEval 2014 competition. The research presented in this thesis does show that we can extract a large amount of information using simple contextual information. This contextual information enables us to build a feature space that numerically represents a word given its context. Our two-stage method shows that training a classifier to filter out sentences that are labeled as “anecdotes/miscellaneous” benefits the performance of the classifier(s) that are specialized in detecting more specific aspect categories. Nouns also seem to be very important, although this was already proposed by Nakagawa and Mori (2002). The importance of contextual information seems to decrease when we use a separate classifier for “anecdotes/miscellaneous”: we can see this first stage as deciding whether the contextual information makes a sentence worthy of more scrutiny or whether it can be discarded as “anecdotes/miscellaneous”. The constrained method proposed in Kiritchenko et al. (2014) is the method that resembles the methods in this thesis the most. Our method achieves recall scores similar to the recall score reported in Kiritchenko et al. (2014), but the precision score presented in Kiritchenko et al. (2014) is 5% higher than the precision score achieved by our best performing method. This could be due to the fact that the
• 63. Chapter 5. Evaluation 51 authors in Kiritchenko et al. (2014) use a more sophisticated method for words where the aspect category is not immediately apparent.
• 65. Chapter 6 Conclusion and Future Work On-line consumer reviews are increasingly becoming the norm when evaluating the quality or desirability of a product. These reviews can contain a lot of information that is relevant to other consumers. A review can be about a certain aspect of a product or service, and a set of reviews can contain many unique aspects. To further summarize the aspects, we assign them to aspect categories. In this thesis we presented two machine learning methods to detect the aspect categories in a given sentence. The first method is based on a general scheme for multi-class classification; the second method is based on a revised version of that scheme. An overview of the findings is presented in Section 6.1, and based on these findings the future research directions are presented in Section 6.2. 6.1 Conclusion This thesis first introduced the problem of finding aspect categories in customer reviews. A sentence can explicitly mention that “the food was great”; here we know that it is about the aspect category ‘food’. Now imagine the sentence reads “the scallops had a great taste to them.”. Although food is never mentioned, we know that the sentence discusses the aspect category ‘food’ by relating the aspect ‘scallops’ to the category ‘food’. This is an example of aspect category detection. In this thesis, two machine learning methods were introduced to tackle the problem of detecting aspect categories. First we presented a basic framework for aspect category detection using classification algorithms. Some preprocessing steps were proposed to transform a sentence into a set of instances that can then be used to train or apply the classification algorithms. The first step in preprocessing is to perform a