
Web Opinion Mining


Dupré Marc-Antoine, Alexander Patronas, Erhard Dinhobl, Ksenija Ivekovic, Martin Trenkwalder
TU Wien, Wintersemester 2009/10
marcantoine.dupre@gmail.com, e0425487@student.tuwien.ac.at, e0525938@student.tuwien.ac.at, xenia.ivekovic@gmail.com, trenkwaldermartin@gmail.com

Abstract. This paper gives an overview of Web Opinion Mining, covering the structure of an opinion, several different approaches, opinion spam and its analysis, and existing tools that use sentiment analysis techniques to gather opinions from different sources. Web 2.0 has dramatically changed the way people communicate with each other. People write their point of view about every topic imaginable on the web, for example about people, products, websites or specific services. The need for good opinion mining is therefore growing rapidly. Market analysts and companies capitalize on these techniques: a very interesting question for a company is what people, or the market, currently think about a product it has just released. Gathering opinions from several product reviews is of course also very useful for individuals.

Keywords: Data Mining, Opinion Mining, Sentiment Analysis, Opinion Mining Tools, Sentiment Analysis Tools

Introduction

Think about everything that is posted on blogs, Facebook feeds, Twitter and so on. Users express what they think there: their opinions, and perhaps their political or religious point of view. There are also websites such as Wikipedia or research information sites that describe facts. So we can distinguish between opinions and facts on the web [14]. Data that is read and declared as a fact must be assumed to be true. Current search engines search and index facts; facts can be associated with keywords and tags and can be grouped by topic [14]. Opinions, however, are a more complex matter.
Opinions usually arise from a question such as "What do people think of Motorola cell phones?" or "What do people in America think about Barack Obama?" [14]. Today's search algorithms are not designed to retrieve opinions. In most cases it is also very difficult to locate such data, and user opinion data is mostly part of the deep web [14] (Bing Liu calls this user-generated content, which is exactly what it is, and it mostly resides in the deep web, though there is also other content). It is not part of the global scope of the web but rather of one's circle of friends. Most of this data lies in
review sites, forums, blogs, message boards and so on. This type of information is also called "word of mouth". Mining opinions expressed in such content requires some kind of artificial-intelligence algorithm [14], which is not easy. But in practice it would be very useful, for example in market intelligence, helping organisations and companies serve better product and service advertising. People may also be interested in others' opinions when purchasing products or discussing political topics. It is also interesting for general search queries such as "Opinions: Motorola cell phones" or "BMW vs. Porsche". From these data types, two kinds of opinions crystallize: direct opinions and comparisons. The former is an expression about a single object such as a product, event or person. The latter describes a relation between objects, usually an ordering of them, such as "product x is more expensive than y" [14]. These relations can be objective, like prices, but also subjective.

Opinion mining concept

To arrive at a workable approach to opinion mining, the process must be formalized. The basic components of an opinion are [14]:
• Opinion holder: the person or organization that has written an opinion on the web
• Object: the object on which the opinion holder expressed the opinion
• Opinion: the content the opinion holder expressed about the object

Model

An object is an entity such as a product or event and is represented as a hierarchy of components, where each component is associated with attributes [14]. O is the root node, and there can also be sub-components or sub-topics. Each node of this component tree is called a "feature"; expressing opinions on features makes it unnecessary to distinguish between components and attributes. In this sense the object itself is also a feature. The object O is thus defined by a finite set of features F = {f1, f2, f3, …, fn}. Every feature fi ∈ F has an associated set of synonym words or phrases Wi, with W = {W1, W2, W3, …, Wn}.
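The object/feature model just described can be sketched as a small data structure. The class and the example feature sets below are hypothetical, for illustration only:

```python
# Sketch of the object/feature model: an object O has a finite set of
# features F = {f1, ..., fn}; each feature fi has a set of synonym
# words/phrases Wi that reviewers use to refer to it.

class ObjectModel:
    def __init__(self, name, feature_synonyms):
        # feature_synonyms: dict mapping feature name -> set of synonym phrases
        self.name = name
        self.feature_synonyms = feature_synonyms

    def feature_for(self, phrase):
        """Return the feature a review phrase refers to, or None."""
        phrase = phrase.lower()
        for feature, synonyms in self.feature_synonyms.items():
            if phrase in synonyms:
                return feature
        return None

camera = ObjectModel("camera", {
    "picture": {"picture", "photo", "image"},
    "battery": {"battery", "battery life", "power"},
    # the object itself is also treated as a feature
    "camera": {"camera", "it"},
})

print(camera.feature_for("photo"))         # -> picture
print(camera.feature_for("battery life"))  # -> battery
```

A real system would populate the synonym sets from data (e.g. WordNet or review corpora) rather than by hand.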
Now an opinion holder j comments on a subset Sj of the features F of O. Each feature fk ∈ Sj is commented on by j using a word or phrase from Wk to identify the feature, together with a positive, negative or neutral opinion on fk.

Task

The opinion mining task, seen as sentiment classification, is performed on three levels [14]. First it can be done at the document level, under the assumption that one document contains only a single opinion from a single opinion holder. In many
cases, such as forums, this is not true, so the document must be split up. At this level the opinion is assigned the class it belongs to: positive, negative or neutral. This level is too coarse-grained for most applications. The second level is mining at the sentence level, which involves two tasks: first, determining the sentence type (objective or subjective); second, determining the sentence class (positive, negative or neutral). The assumption is that a sentence contains only one opinion, which is often not the case, so working with clauses or phrases and focusing on identifying subjective sentences can be useful. The third and finest level of the mining task is the feature level. Overall, the focus is on sentiment words such as great, excellent, horrible, bad and worst, whereas in topic-based classification topic words are what matter.

Summary
1. document level - class determination (one opinion from one opinion holder)
2. sentence level (one opinion per sentence)
   a. sentence type determination (objective or subjective)
   b. sentence class determination (neutral, positive, negative)
3. feature level - determining words and phrases

Words and Phrases

The basic question is how to perform sentiment classification at the document and sentence level [14]. A negative sentiment does not mean that the opinion holder dislikes every feature of the product, and a positive one does not mean that he or she likes everything. Sentiment words are often context dependent, for example "long": a long runtime of a benchmark on a graphics card would be very bad, but a long runtime of a battery would be very nice. To obtain such word and phrase lists there are three approaches:
1. manual approach: manual creation of the list; a one-time effort
2. corpus-based approach: text is analyzed via co-occurrence patterns; domain dependent
3.
dictionary-based approach: constraints on connectives between words are used to identify opinion words. For example, in "This camera is beautiful AND spacious" the "and" implies the same orientation for both adjectives. Such constraints can also be applied to OR, BUT, EITHER-OR and NEITHER-NOR. For this learning approach there is a database that already contained 21 million words in 1987, and "WordNet" is a good online resource.

Document-level sentiment analysis

In order to analyse the general opinion of documents, most research studies use classifiers. A classifier is an algorithm, or a program based on one. Given a set of
documents, a sentiment classifier classifies each document into two classes: positive or negative (a neutral class is seldom used). A document classified as positive expresses a generally positive opinion, and a document classified as negative a generally negative one. Such a classifier cannot determine who the opinion holders are or which objects the opinions target, so the set of documents has to be chosen wisely; for example, all documents could be about a single object. It is assumed that a single document expresses the opinion of a single holder. Several approaches exist for sentiment classification at the document level; we describe three of them below [14, 30].

Classification based on sentiment phrases

This approach comes from the research of Turney [28]. It can be divided into three steps. First the document is tagged using part-of-speech (POS) tagging [30], which basically assigns each word a linguistic category according to its syntactic or morphological behavior. For instance, JJ means adjective and VBN means past-participle verb. It has been shown [29] that, for sentiment classification purposes, adjectives are the most relevant words. Nevertheless, an adjective may have several semantic orientations depending on the context: "unpredictable" might be negative in an automotive review but positive in a movie review [29]. That is why, thanks to the POS tagging, pairs of words are extracted according to precise patterns in order to determine the semantic orientation of the adjectives. The following table contains some of the patterns used for extracting two-word phrases (NN are nouns, RB adverbs, VB verbs and JJ adjectives):

First word | Second word | Third word (not extracted)
JJ         | NN          | anything
RB         | JJ          | not NN
JJ         | JJ          | not NN
NN         | JJ          | not NN
RB         | VB          | anything

The table above presents a simplified version of the extraction patterns.
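A minimal sketch of this pattern-based extraction, applying the simplified pattern table above to already POS-tagged input (the function name is hypothetical, and a real implementation would run a POS tagger first):

```python
# Extract candidate two-word phrases from a POS-tagged sentence using
# simplified Turney-style patterns. Tags follow Penn Treebank conventions
# (JJ adjective, NN noun, RB adverb, VB verb); startswith() also covers
# variants such as NNS, VBZ, RBR.

PATTERNS = [
    # (first tag, second tag, tag the THIRD word must NOT have)
    ("JJ", "NN", None),
    ("RB", "JJ", "NN"),
    ("JJ", "JJ", "NN"),
    ("NN", "JJ", "NN"),
    ("RB", "VB", None),
]

def extract_phrases(tagged):
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else ""
        for first, second, not_third in PATTERNS:
            if t1.startswith(first) and t2.startswith(second) and \
               (not_third is None or not t3.startswith(not_third)):
                phrases.append(f"{w1} {w2}")
                break
    return phrases

sentence = [("This", "DT"), ("camera", "NN"), ("produces", "VBZ"),
            ("beautiful", "JJ"), ("pictures", "NNS")]
print(extract_phrases(sentence))  # -> ['beautiful pictures']
```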
For example, in the sentence "This camera produces beautiful pictures", the phrase "beautiful pictures" is extracted (first pattern: JJ + NN). The second step is based on a measure called pointwise mutual information (PMI). The idea is to test whether a given phrase is more likely to co-occur with the word "excellent" or with the word "poor" on the web.
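Turney estimates the semantic orientation (SO) of a phrase from search-engine hit counts as SO(phrase) = log2( hits(phrase NEAR "excellent") · hits("poor") / (hits(phrase NEAR "poor") · hits("excellent")) ). A minimal sketch, with invented hit counts standing in for real search-engine queries:

```python
import math

# Turney's hit-count estimate of semantic orientation. hits() would normally
# query a search engine with the NEAR operator; the counts used below are
# made up purely for illustration.

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor):
    # A small constant avoids division by zero for rare phrases.
    eps = 0.01
    return math.log2(
        ((hits_near_excellent + eps) * (hits_poor + eps)) /
        ((hits_near_poor + eps) * (hits_excellent + eps))
    )

# A phrase co-occurring mostly with "excellent" gets a positive SO:
so = semantic_orientation(hits_near_excellent=2000, hits_near_poor=100,
                          hits_excellent=10000, hits_poor=8000)
print(so > 0)  # -> True

def classify_review(phrase_sos):
    """Average SO over all extracted phrases: > 0 means a positive review."""
    return "positive" if sum(phrase_sos) / len(phrase_sos) > 0 else "negative"
```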
The PMI of two terms is defined as

PMI(term1, term2) = log2( Pr(term1 ∧ term2) / (Pr(term1) Pr(term2)) )

where Pr(term1 ∧ term2) is the probability that term1 and term2 co-occur, and Pr(term1) Pr(term2) is the probability that they would co-occur if they were statistically independent. The ratio thus measures the statistical dependence between the two terms. Turney proposes to compute the semantic orientation (SO) of a phrase as

SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")

By using the number of hits returned by a search engine to estimate the probabilities, the SO equation becomes

SO(phrase) = log2( hits(phrase NEAR "excellent") · hits("poor") / (hits(phrase NEAR "poor") · hits("excellent")) )

The last step of Turney's algorithm is, given a review, to compute the average SO of all phrases in the review. If it is greater than zero the review expresses a positive opinion; otherwise it expresses a negative one. Final classification accuracies on reviews from various domains range from 84% for automobile reviews down to 66% for movie reviews [29, 30].

Classification using text classification methods

Sentiment classification can also be tackled as a topic-based text classification problem, so all the usual text classification algorithms can be used, e.g. naïve Bayes, SVM, kNN, etc. This approach was tested by Pang et al. [31]. They classified 1400 movie reviews from IMDb.com against a random-choice baseline of 50%, using three algorithms: SVM, naïve Bayes and maximum entropy. Each of these algorithms usually produces good results on text classification problems. With various pre-processing options and 3-fold cross-validation, the results ranged from 72.8% to 82.9%; the best result was achieved by the SVM algorithm on unigram data. All results are above the random-choice baseline and the human bag-of-words experiments (58% and 64%), and superior to Turney's PMI-IR algorithm on movie reviews (66%).
Still, the three algorithms used are expected to reach results around 90% on topic-based text classification problems. Sentiment classification is thus a harder task, because of the varied semantic values and uses of sentiment phrases.

Classification using a score function

Another approach, by Dave et al. [32], uses a score function. The first step is to score each term t of the learning set with

score(t) = ( Pr(t|C) − Pr(t|C') ) / ( Pr(t|C) + Pr(t|C') )

The score lies between −1 and 1 and indicates toward which class, C or C', the term is more likely to belong. A learning set is a set of reviews that have been labeled manually, which makes it possible to compute statistics such as Pr(t|C), the probability that term t appears in a review belonging to class C. A document is then classified according to the sum of the scores of all its terms. On a large set of reviews from the web (more than 13000), working with bigrams and trigrams, the classification rate is between 84.6% and 88.3%.

Sentence-level sentiment analysis

Sentiment classification at the document level is the most important field of web opinion mining. However, for most applications the document level is too coarse, so it is possible to perform finer analysis at the sentence level. Research studies in this field mostly focus on classifying whether sentences hold objective or subjective speech; the aim is to recognise subjective sentences in news articles, not to extract them. Sentiment classification as described in the document-level part also exists at the sentence level, with the same approaches as Turney's algorithm, based on likelihood ratios. Because that approach has already been described in this paper, this part focuses on objective/subjective sentence classification and presents two methods for tackling this issue. The first method is based on a bootstrapping approach using learned patterns.
It means that this method is self-improving and is based on phrase patterns that are learned automatically. The method comes from the study of Wiebe & Riloff [33]; their schema illustrates the bootstrapping
process. The input of this method is a known subjective vocabulary and a collection of unannotated texts.
• The high-precision (HP) classifiers decide whether sentences are objective or subjective based on the input vocabulary. High precision means their behaviour is stable and reproducible: they cannot classify all the sentences, but they make almost no errors.
• Then the phrase patterns that are supposed to represent a subjective sentence are extracted and applied to the sentences the HP classifiers have left unlabeled.
• The system is self-improving, as the newly found subjective sentences and patterns are fed back in a loop over the unlabeled data.
This algorithm was able to recognise 40% of the subjective sentences in a test set of 2197 sentences (59% subjective) with 90% precision. For comparison, the HP subjective classifier alone recognises 33% of the subjective sentences with 91% precision. Alongside this original method, more classical data mining algorithms are used, such as the naïve Bayes classifier in the research of Yu & Hatzivassiloglou [34]. Naïve Bayes is a supervised learning method that is simple and efficient, especially for text classification problems (i.e. when the number of attributes is huge). To cope with an important and unavoidable approximation in their training data (avoiding human labeling of an enormous data set), they use a multiple naïve Bayes classifiers method. The general concept is to split each sentence into features (such as the presence of words, presence of n-grams, and heuristics from other studies in the field) and to use
the statistics of the training data set about those features to classify new sentences. Their results show that the more features, the better. They achieved at best 80-90% recall and precision for subjective/opinion sentences and around 50% recall and precision for objective/fact sentences. Sentence-level sentiment classification methods are improving; these results from research studies in 2003 show that they were already quite efficient then and that the task is feasible.

Feature-based opinion mining

The main objective of feature-based opinion mining is to find out what reviewers (opinion holders) like and dislike about the observed object. The process consists of the following tasks:
1. extract the object features that have been commented on in each review
2. determine whether the opinions on the features are positive, negative or neutral
3. group feature synonyms
4. produce a feature-based opinion summary

There are three main review formats on the web, which may need different techniques to perform the above tasks:
1. Format 1 - Pros and Cons: the reviewer is asked to describe Pros and Cons separately. Example: C|net.com
2. Format 2 - Pros, Cons and detailed review: the reviewer is asked to describe Pros and Cons separately and also write a detailed review. Example: Epinions.com
3. Format 3 - free format: the reviewer can write freely; there is no separation of Pros and Cons. Example: Amazon.com

Analysing reviews of formats 1 and 3: the summarization is performed in three main steps:

1) Mining the product features that have been commented on by customers:
• part-of-speech tagging: product features are usually nouns or noun phrases in review sentences. Each review text is segmented into sentences and a part-of-speech tag is produced for each word. Each sentence is saved in the review database along with the POS tag information of each of its words. Example of a sentence with POS tags:
<S> <NG> <W C='PRP' L='SS' T='w' S='Y'> I </W> </NG> <VG> <W C='VBP'> am </W> <W C='RB'> absolutely </W> </VG> <W C='IN'> in </W> <NG> <W C='NN'> awe </W> </NG> <W C='IN'> of </W> <NG> <W C='DT'> this </W> <W C='NN'> camera </W> </NG> <W C='.'> . </W> </S>

• frequent feature identification: frequent features are those that many customers talk about. To identify them, association mining is used. However, not all candidate frequent features generated by association mining are genuine features, so two types of pruning are used to remove unlikely ones. Compactness pruning checks features that contain at least two words (called feature phrases) and removes those that are likely to be meaningless. In redundancy pruning, redundant single-word features are removed. Redundancy is described with the concept of p-support (pure support): the p-support of a feature ftr is the number of sentences in which ftr appears as a noun or noun phrase and which contain no feature phrase that is a superset of ftr. A minimum p-support value is used to prune redundant features.
• infrequent feature generation: to generate infrequent features, the following algorithm is applied:

for each sentence in the review database
    if (it contains no frequent feature but one or more opinion words) {
        find the nearest noun/noun phrase around the opinion word;
        store that noun/noun phrase in the feature set as an infrequent feature
    }

2) Identifying the orientation of an opinion sentence: to determine the orientation of a sentence, the dominant orientation of the opinion words (e.g. adjectives) in the sentence is used. If positive opinion prevails, the sentence is regarded as positive, and vice versa.

3) Summarizing the results: the following example shows a summary for the feature "picture" of a digital camera.

Feature: picture
Positive: 12
• Overall this is a good camera with a really good picture clarity.
• The pictures are absolutely amazing - the camera captures the minutest of details.
• After nearly 800 pictures I have found that this camera takes incredible pictures.
…
Negative: 2
• The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
• Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

Analysing reviews of format 2: features are extracted based on the principle that each sentence segment contains at most one product feature. Sentence segments are separated by ',', '.', 'and' and 'but'. For extracting product features, supervised rule discovery is used. First a training dataset has to be prepared, in the following steps:
• perform part-of-speech tagging, e.g. <N> Battery <N> usage <V> included <N> MB <V> is <Adj> stingy
• replace actual feature words in a sentence with [feature], e.g. <N> [feature] <N> usage <V> included <N> [feature] <V> is <Adj> stingy
• use n-grams to produce shorter segments from long ones, e.g. <V> included <N> [feature] <V> is, and <N> [feature] <V> is <Adj> stingy

After these steps, rule generation can be performed, i.e. the definition of extraction patterns. Examples of extraction patterns: <JJ> <NN> [feature], and easy to <VB> [feature]. The resulting patterns are used to match and identify features in new reviews. Sometimes mistakes made during extraction have to be corrected, e.g. when there are
two or more candidate features in one sentence segment, or when there is a feature in the sentence segment that no pattern extracts. The first problem can be solved with an iterative algorithm that remembers occurrence counts. The orientation (positive or negative) of extracted features is easy to determine, since we know whether the feature comes from the Pros or the Cons of a review. These features are usually used to compare consumers' opinions of different products.

Opinion Spam and Analysis

The web has dramatically changed the way people express themselves and interact with others. They can now post reviews of products at merchant sites and interact with others via blogs and forums. Reviews contain rich user opinions on products and services. They are used by potential customers to find the opinions of existing users before deciding to purchase a product, and they are also helpful for product manufacturers to identify product problems and to gather marketing intelligence about their competitors. Since there is no quality control, anyone can write anything on the web, which results in many low-quality reviews and review spam.
It is now very common for people to read opinions on the web for many purposes. For example, if someone wants to buy a product and sees that its reviews are mostly positive, they are very likely to buy it; if the reviews are mostly negative, they are very likely to choose another product. There are generally three types of spam reviews:
1. Untruthful opinions: the reviewer gives an unjustly positive review to an object in order to promote it (hype spam), or gives a wrongly negative comment on an object in order to damage its reputation (defaming spam).
2. Reviews on brands only: comments that concern only the brand, the seller or the manufacturer but not the specific product. In some cases such comments are useful, but they are considered spam because they do not target the specific product.
3. Non-reviews: comments that are not related to the product, for example advertisements, questions, answers and random text.

In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. However, due to the specific nature of the different spam types, they have to be handled differently. Spam reviews of types 2 and 3 can be detected with traditional classification learning using manually labeled spam and non-spam reviews, because these two types are recognizable manually. Quite a lot of reviews of these two types are duplicates and easy to detect. To detect the remaining spam reviews, a model is built from the following features:
• the content of the review: e.g. number of helpful feedbacks, length of the review title, length of the review body, position of the review, textual features, etc.
• the reviewer who wrote the review: e.g.
number of reviews by the reviewer, average rating given by the reviewer, standard deviation in ratings
• the product being reviewed: e.g. price of the product, average rating, standard deviation in ratings

Using this model with logistic regression produces a probability estimate of each review being spam. It was evaluated on 470 spam reviews found on amazon.com, with the following results:

Spam type    | Num. reviews | AUC   | AUC - text features only | AUC - w/o feedbacks
Types 2 & 3  | 470          | 98.7% | 90%                      | 98%
Type 2 only  | 221          | 98.5% | 88%                      | 98%
Type 3 only  | 249          | 99.0% | 92%                      | 98%
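As noted above, many type-2 and type-3 spam reviews are duplicates or near-duplicates. One common way to catch near-duplicates is word-shingle Jaccard similarity; this is an illustrative sketch of that general technique, not the exact method used in the cited study:

```python
# Near-duplicate detection for reviews via word shingles and Jaccard
# similarity (an illustrative technique; the review texts are invented).

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

r1 = "great product buy it now from our store today"
r2 = "great product buy it now from our shop today"
r3 = "the battery died after two weeks of light use"

print(jaccard(r1, r2) > 0.5)  # near-duplicates -> True
print(jaccard(r1, r3) < 0.1)  # unrelated reviews -> True
```

Pairs scoring above a chosen similarity threshold would be flagged as duplicate spam candidates.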
The logistic regression was performed with the statistical package R (http://www.r-project.org/). The AUC (area under the ROC curve) is a standard measure used in machine learning for assessing model quality. Without the feedback features, almost the same result is reached as when including them; this is important because feedbacks can be spammed too. For the first type of spam, however, manual labeling by simply reading the reviews is practically impossible: the point is to distinguish the untruthful review of a spammer from an innocent review. The only way is to build a logistic regression model using duplicates as positive training examples and the rest of the reviews as negative training examples. The model was evaluated on a total of 223002 reviews, of which 4488 were duplicate spam reviews and 218514 were other reviews.

Features used             | AUC
All features              | 78%
Only review features      | 75%
Only reviewer features    | 72.5%
Without feedback features | 77%
Only text features        | 63%

The table shows that review-centric features are the most helpful. Using only text features gives merely 63% AUC, which demonstrates that it is very difficult to identify spam reviews from text content alone; combining all the features gives the best result.

Opinion mining tools

Below is a categorized list of several tools that can be used for opinion mining, with a short review of each.

APIs

Evri [15]

Evri is a semantic search engine. It automatically reads web content in a way similar to how humans do, performing a deep linguistic analysis of many millions of documents, which is then built up into a large set of semantic relationships expressing grammatical subject-verb-object style clause-level relationships. Evri offers a comprehensive API for developers, with which it is easy to analyze text, get recommendations, discover relationships, mine facts and get popularity data automatically, cost-effectively and in a fully scalable manner.
Furthermore it is possible to get widgets with different purposes, one of which uses the sentiment aspect. An example sentiment widget displays, in a percentage bar, the positive and negative aspects of the opinion on the new Linux-kernel-based mobile operating system "Android" [16].

OpenDover [17]

OpenDover is a Java-based web service that allows semantic features to be integrated easily into a blog, content management system, website or application. Basically it works as follows: your content is sent through a web service to their (OpenDover) servers, which process and analyze the content using their linguistic processing technologies. After processing, the content is sent back, tagged with emotions along with a value indicating how positive or negative the content is. The service can be tested without any effort at a live demo site on their website [17]. As an example, an arbitrary camera review from amazon.com was chosen:

"...the L20 is unisex and it's absolutely right in line with the jeweled quality of Nikon. I was able to use the camera right out of the box without having to read the instruction manual, it's that easy to use.... The camera feels good in my hands and the controls are easy to find without having to take your eyes off your subject... The Nikon L20 comes with a one year manufactures warranty - "Not that you would need a warranty for a Nikon camera" - Impressive warranty details, I was amazed that any camera manufacturer would offer a one year on a point and shoot but Nikon has such a good reputation and so I doubt very much that you would even need to use it. In a nutshell, I love this camera so much that I would recommend this Nikon L20 to my friends, family and anyone else looking to buy. It's a real beauty!"

The first BaseTag was set to "camera", the second to "Nikon L20", the product the review was about. The mode was set to "Accurate" and the selected subject domain was "camera".
The output is the emotion-tagged text: positive and negative words as well as the object are recognized. The result of their algorithm is good; for example, positive words like "easy to use", "good", "impressive" and "love" are marked in green.

Twitter/Blogosphere

RankSpeed [18]

RankSpeed is a sentiment search tool for the blogosphere/twittersphere. It finds the best websites, the most useful web apps, the most secure web services and so on with the help of sentiment analysis. It is possible to search any website category using tags and rank the results by any
desired criterion, such as good, useful, easy or secure. A statistical analysis computes the percentage of bloggers/users who agree with the desired criterion, and the resulting list of links is sorted in descending order of that percentage.

Twittratr [19]

Twittratr is a simple search tool for answering questions like "Are tweets about Obama generally positive or negative?". Its functionality is kept simple: it is based on a list of positive and negative keywords. Twitter is searched for these keywords and the results are cross-referenced against the adjective lists, then displayed accordingly.

TwitterSentiment [20]

"Twitter Sentiment is a graduate school project from Stanford University. It started as a Natural Language Processing class project in Spring 2009 and will continue as a Natural Language Understanding CS224U project in Winter 2010." Twitter Sentiment was created by three computer science graduate students at Stanford University: Alec Go, Richa Bhayani and Lei Huang. It is an academic project performing sentiment analysis on tweets from Twitter [27]. Their approach differs from that of other sentiment analysis sites for the following reasons:
• They use classifiers built with machine learning algorithms. Other sites tend to use a keyword-based approach, which is much simpler; it may have higher precision but lower recall.
• They are transparent about how individual tweets are classified. Other sites often do not display the classification of individual tweets and show only aggregated numbers, which makes it almost impossible to assess how accurate their classifiers are.

WE twendz pro [21]

Waggener Edstrom's twendz pro service is a Twitter monitoring and analytics web application. It enables the user to easily measure the impact of a specific message within key audiences. It uses a keyword-based approach to determine general emotion.
Meaningful words are compared against a dictionary of thousands of words associated with positive or negative emotion. Each word has a specific score; combined with the other scored words, this yields an educated guess at the overall emotion.
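This kind of keyword-based scoring can be sketched in a few lines; the dictionary entries and scores below are made up for illustration and are far smaller than the thousands of words such a service would use:

```python
# Minimal keyword-based emotion scorer: sum the scores of all words found
# in a sentiment dictionary and guess the overall emotion from the sign.
# The lexicon here is a tiny invented example.

LEXICON = {
    "love": 3, "great": 2, "easy": 1,
    "hate": -3, "bad": -2, "blurry": -1,
}

def emotion_guess(text):
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(emotion_guess("I love this camera, it is so easy to use!"))  # -> positive
print(emotion_guess("The pictures come out blurry and bad."))      # -> negative
```

As the paper notes for Twittratr and twendz pro, this approach is simple and can have decent precision, but it misses sentiment expressed without the listed keywords.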
Newspaper

Newssift [22]

Newssift is a sentiment search tool for newspapers and a product of the Financial Times. It indexes content from major news and business sources. The query, for example brands, legal risks or environmental impact, is matched against business topics. This gives you information about changing issues over time for a company or product.

Applications

LingPipe [23]

"LingPipe is a state-of-the-art suite of natural language processing tools written in Java that performs tokenization, sentence detection, named entity detection, coreference resolution, classification, clustering, part-of-speech tagging, general chunking, fuzzy dictionary matching. These general tools support a range of applications." The idea of how sentiment analysis is done using LingPipe's language classification framework is to perform two classification tasks:
• separating subjective from objective sentences
• separating positive from negative reviews
A tutorial describing how to use LingPipe for sentiment analysis is available on their website [23].

Radian6 [24]

Radian6 is a commercial social media monitoring application with rich functionality, such as dashboards and widgets. Radian6 gathers discussions and opinions from blogs, comments, multimedia, forums and communities like Twitter, and gives businesses the ability to analyze, manage, track and report on their social media engagement and monitoring efforts.

RapidMiner [25]

RapidMiner is an open-source system, at least in its Community Edition, for data mining and machine learning. It is available as a stand-alone application for data analysis and as a data-mining engine for integration into one's own products. Sentiment analysis is also supported. It is used both for real-world data mining and in research.
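The two-stage classification idea described above for LingPipe (first separating subjective from objective sentences, then separating positive from negative ones) can be sketched with a tiny Naive Bayes classifier written from scratch. This is not LingPipe's actual Java API, and the training sentences are invented toy data, purely for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over whitespace tokens."""

    def fit(self, samples):  # samples: list of (text, label) pairs
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.label_counts:
            # Log prior plus Laplace-smoothed log likelihoods.
            lp = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Stage 1: separate subjective from objective sentences (toy data).
subjectivity = NaiveBayes().fit([
    ("i really love this camera", "subjective"),
    ("what an awful boring film", "subjective"),
    ("the camera has a 12 megapixel sensor", "objective"),
    ("the film runs for two hours", "objective"),
])

# Stage 2: classify the remaining subjective sentences by polarity.
polarity = NaiveBayes().fit([
    ("love great excellent wonderful", "positive"),
    ("awful boring terrible hate", "negative"),
])

def review_sentiment(sentences):
    """Filter out objective sentences, then vote on the polarity."""
    subjective = [s for s in sentences if subjectivity.predict(s) == "subjective"]
    votes = Counter(polarity.predict(s) for s in subjective)
    return votes.most_common(1)[0][0] if votes else "neutral"
```

The filtering step is what distinguishes this design from a plain polarity classifier: factual sentences such as spec listings never reach the polarity stage, so they cannot dilute the overall judgment of a review.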
References / Further Readings

1. Liu, Bing, Mining Opinion Features in Customer Reviews, Department of Computer Science, University of Illinois at Chicago.
2. Liu, Bing, Mining and Summarizing Opinions on the Web, Department of Computer Science, University of Illinois at Chicago.
3. Liu, Bing, From Web Content Mining to Natural Language Processing, Department of Computer Science, University of Illinois at Chicago.
4. Liu, Bing, Mining and Searching Opinions in User-Generated Contents, Department of Computer Science, University of Illinois at Chicago.
5. Hu, Minqing, Liu, Bing, Mining and Summarizing Customer Reviews, Department of Computer Science, University of Illinois at Chicago.
6. Ding, Xiaowen, Liu, Bing, Zhang, Lei, Entity Discovery and Assignment for Opinion Mining Applications, Department of Computer Science, University of Illinois at Chicago.
7. Liu, Bing, Opinion Mining, Department of Computer Science, University of Illinois at Chicago.
8. Liu, Bing, Opinion Mining and Search, Department of Computer Science, University of Illinois at Chicago.
9. Ding, Xiaowen, Liu, Bing, Yu, Philip S., A Holistic Lexicon-Based Approach to Opinion Mining, Department of Computer Science, University of Illinois at Chicago.
10. Liu, Bing, Opinion Mining & Summarization – Sentiment Analysis, Department of Computer Science, University of Illinois at Chicago.
11. Jindal, Nitin, Liu, Bing, Opinion Spam and Analysis, Department of Computer Science, University of Illinois at Chicago.
12. Liu, Bing, Web Content Mining, Department of Computer Science, University of Illinois at Chicago.
13. Liu, Bing, Hu, Minqing, Cheng, Junsheng, Opinion Observer: Analyzing and Comparing Opinions on the Web, Department of Computer Science, University of Illinois at Chicago.
14. Liu, Bing, Web Data Mining – Exploring Hyperlinks, Contents and Usage Data – Lecture Slides, Springer, Dec. 2006.
15. Evri, Semantic Web Search Engine; [cited 2010 Jan 19]. <http://www.evri.com/>.
16. Evri, Widget Sentiment Analysis Example on "Android"; [cited 2010 Jan 19]. <http://www.evri.com/widget_gallery/single_subject?widget=sentiment&entity_uri=/product/android-0xf14fe&entity_name=Android>.
17. OpenDover, Sentiment Analysis Webservice; [cited 2010 Jan 19]. <http://www.opendover.nl/>.
18. RankSpeed, Sentiment Analysis on Blogosphere and Twittersphere; [cited 2010 Jan 19]. <http://www.rankspeed.com/>.
19. Twittratr; [cited 2010 Jan 19]. <http://twitrratr.com/>.
20. Twitter Sentiment, a sentiment analysis tool; [cited 2010 Jan 19]. <http://twittersentiment.appspot.com/>.
21. WE twendz pro service, influence analytics for Twitter; [cited 2010 Jan 19]. <https://wexview.waggeneredstrom.com/twendzpro/default.aspx>.
22. Newssift, sentiment analysis based on newspapers; [cited 2010 Jan 19]. <http://www.newssift.com/>.
23. LingPipe, Java libraries for the linguistic analysis of human language; [cited 2010 Jan 19]. <http://alias-i.com/lingpipe/index.html>.
24. Radian6, social media monitoring and engagement; [cited 2010 Jan 19]. <http://www.radian6.com/>.
25. Sysomos, Business Intelligence for Social Media; [cited 2010 Jan 19]. <http://sysomos.com/>.
26. RapidMiner, environment for machine learning and data mining experiments; [cited 2010 Jan 19]. <http://rapid-i.com/>.
27. Go, Alec, Bhayani, Richa, Huang, Lei, Twitter Sentiment Classification using Distant Supervision, Stanford University; [cited 2010 Jan 19]. Available from: <http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf>.
28. Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), 2002.
29. Liu, Bing, Web Data Mining – Exploring Hyperlinks, Contents and Usage Data, Springer, 2007.
30. Santorini, B. Part-of-speech Tagging Guidelines for the Penn Treebank Project. Technical Report MS-CIS-90-47, Department of Computer and Information Science, University of Pennsylvania, 1990.
31. Pang, B., Lee, L., Vaithyanathan, S. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In Proc. of EMNLP'02, 2002.
32. Dave, K., Lawrence, S., Pennock, D. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. In WWW'03, 2003.
33. Wiebe, J., Riloff, E. Learning Extraction Patterns for Subjective Expressions.
34. Yu, H., Hatzivassiloglou, V. Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the Polarity of Opinion Sentences. In Proc. of EMNLP'03, pp. 129-136, 2003.
