NE7012- SOCIAL NETWORK ANALYSIS

NE7012 SOCIAL NETWORK ANALYSIS
PREPARED BY: A. RATHNADEVI, A.V.C COLLEGE OF ENGINEERING
UNIT 5 - TEXT AND OPINION MINING
UNIT V: TEXT AND OPINION MINING
Text Mining in Social Networks – Opinion extraction – Sentiment classification and clustering – Temporal sentiment analysis – Irony detection in opinion mining – Wish analysis – Product review mining – Review classification – Tracking sentiments towards topics over time

5.1 Text Mining in Social Networks
5.1.1 Text mining definition
• The objective of text mining is to exploit the information contained in textual documents in various ways, including the discovery of patterns and trends in data, associations among entities, predictive rules, etc.
• The results can be important both for:
  - the analysis of the collection, and
  - providing intelligent navigation and browsing methods.
5.1.2 Text mining pipeline
5.1.3 Motivation for Text Mining
• Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation).
• Information-intensive business processes demand that we move beyond simple document retrieval to "knowledge" discovery.
• The justification for the interest in text mining is the same as for the interest in knowledge retrieval (search and categorization).
• The sheer amount of unstructured data (mostly textual) out there calls for more than just document retrieval. Tools and techniques exist to mine this data and realize value in the same way that data mining taps structured data for business intelligence and knowledge discovery.
5.1.4 Text mining process
• Text preprocessing
  - Syntactic/semantic text analysis
• Feature generation
  - Bag of words
• Feature selection
  - Simple counting
  - Statistics
• Text/data mining
  - Classification (supervised learning)
  - Clustering (unsupervised learning)
• Analyzing results
  - Mapping/visualization
  - Result interpretation
A minimal end-to-end sketch of this pipeline is given below.
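As an illustration of the process above, the following sketch runs preprocessing, bag-of-words feature generation, unsupervised mining (clustering) and result inspection over a toy corpus with scikit-learn. The tiny corpus, the choice of k = 2 clusters, and the use of max_features as a crude stand-in for feature selection are assumptions made for the example only.

```python
# Minimal text-mining pipeline sketch: preprocessing -> bag of words ->
# unsupervised mining (clustering) -> result interpretation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy corpus (assumed for illustration only).
documents = [
    "the camera of this phone takes sharp photos",
    "battery life of the phone is excellent",
    "the phone screen is bright and sharp",
    "the football match ended with a late goal",
    "the team won the league after a great season",
    "fans celebrated the goal in the stadium",
]

# Preprocessing + feature generation: lowercasing, stop-word removal and
# simple counting are handled by the vectorizer; max_features acts as a
# very crude feature-selection step.
vectorizer = CountVectorizer(lowercase=True, stop_words="english", max_features=50)
X = vectorizer.fit_transform(documents)

# Text/data mining step: unsupervised clustering with k-means (k assumed = 2).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Analyzing results: most frequent terms in each cluster.
terms = vectorizer.get_feature_names_out()
for cluster_id in range(2):
    rows = np.flatnonzero(kmeans.labels_ == cluster_id)
    counts = np.asarray(X[rows].sum(axis=0)).ravel()
    top = counts.argsort()[::-1][:3]
    print(f"cluster {cluster_id}:", [terms[i] for i in top])
```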
5.1.5 Challenges in text mining
• The data collection is "free text" and is not well organized (semi-structured or unstructured).
• There is no uniform access over all sources; each source has its own storage and algebra. Examples: email, databases, applications, the web.
• Heterogeneity on five levels: semantics, language, structure, format, and size of the unit of information.
• Learning techniques for processing text typically need annotated training data.
• XML as the common model allows:
  o manipulating data with standards,
  o mining to become more like data mining,
  o RDF to emerge as a complementary model.
• The more structure you can exploit, the better you can mine.
5.1.6 Text mining actors
5.1.7 Text mining tasks
5.1.8 Applications of Text Mining
• Keyword Search
• Classification
• Clustering
• Linkage-based Cross-Domain Learning
5.1.8.1 Keyword Search
• A simple but user-friendly interface for information retrieval on the Web.
• It also proves to be an effective method for accessing structured data.
• The challenges lie in three aspects:
  o query semantics,
  o ranking strategy,
  o query efficiency.
Keyword search algorithms:
• Query semantics and answer ranking
• Keyword search over XML and relational data
• Keyword search over graph data
5.1.8.2 Classification Algorithms
• Content-based text classification
  o Naive Bayes classifier, TF-IDF classifier and Probabilistic Indexing classifier
• Challenges in the context of text classification:
  o Social networks contain a much larger and non-standard vocabulary.
  o The labels in social networks may often be quite sparse.
  o The use of content can greatly improve the effectiveness of the link-based classification process.
A small illustration of two of these content-based classifiers follows.
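The notes name the Naive Bayes classifier and the TF-IDF classifier among the content-based text classifiers. The sketch below pairs scikit-learn's MultinomialNB with a centroid classifier over TF-IDF vectors (one common, Rocchio-style reading of a "TF-IDF classifier"; this interpretation and the tiny labeled corpus are assumptions for demonstration).

```python
# Content-based text classification sketch: Naive Bayes over word counts
# versus a Rocchio-style centroid classifier over TF-IDF vectors.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import NearestCentroid

# Toy labeled posts (assumed labels: "sports" vs "tech").
posts = [
    "new graphics card benchmarks released today",
    "the laptop update improves battery and performance",
    "open source library adds support for the new chipset",
    "the striker scored twice in the derby",
    "coach praises the defence after a clean sheet",
    "the club signed a new goalkeeper this winter",
]
labels = ["tech", "tech", "tech", "sports", "sports", "sports"]

test_posts = ["benchmarks show the new chipset is faster",
              "the goalkeeper saved a penalty in the derby"]

# Naive Bayes classifier on raw term counts.
count_vec = CountVectorizer(stop_words="english")
nb = MultinomialNB().fit(count_vec.fit_transform(posts), labels)
print("Naive Bayes:", nb.predict(count_vec.transform(test_posts)))

# "TF-IDF classifier": nearest class centroid in TF-IDF space.
tfidf_vec = TfidfVectorizer(stop_words="english")
centroid = NearestCentroid().fit(tfidf_vec.fit_transform(posts), labels)
print("TF-IDF centroid:", centroid.predict(tfidf_vec.transform(test_posts)))
```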
5.1.8.3 Clustering Algorithms
• Clustering in social networks is related to the traditional problem of graph partitioning.
• The problem of graph partitioning is NP-hard and often does not scale very well to large networks.
• Methods:
  o the Kernighan-Lin algorithm,
  o link-based clustering,
  o clustering of graph streams.
• These methods use only the structure of the network for the clustering process.
• The quality of clustering can be improved by also using the text content in the nodes of the social network.
• Such approaches use a number of variants of traditional clustering algorithms for multi-dimensional data.
• Most of these methods are variants of the k-means method:
  o start off with a set of k seeds and build the clusters iteratively around these seeds;
  o the seeds and the cluster memberships are iteratively defined with respect to each other, until convergence to an effective solution.
• Clustering can also be performed with the use of both content and structure information:
  o construct a new graph which takes into account both the structure and the attribute information;
  o such a graph has two kinds of edges: structure edges from the original graph, and attribute edges, which are based on the nature of the attributes in the different nodes;
  o a random-walk approach is used over this graph in order to define the underlying clusters;
  o each edge is associated with a weight, which is used to control the probability of the random walk across the different nodes;
  o these weights are updated during an iterative process, and the clusters and the weights are successively used to refine each other;
  o the weights and the clusters naturally converge as the clustering process progresses (a simplified sketch of combining structure and content is shown below).
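The notes describe an iterative random-walk scheme over a graph with structure and attribute edges. As a much simpler stand-in that captures the same idea of mixing both signals, the sketch below blends a structural adjacency matrix with a text-similarity matrix and applies spectral clustering to the combined affinity; the toy graph, the node texts and the mixing weight alpha are assumptions, and this is not the random-walk algorithm itself.

```python
# Clustering with both structure and content: blend the adjacency matrix
# with a text-similarity matrix and cluster the combined affinity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

# Toy social network: node texts and undirected friendship edges (assumed).
node_texts = [
    "loves football and match highlights",
    "posts about football transfers",
    "shares stadium photos and goals",
    "writes about python and machine learning",
    "enjoys coding and open source projects",
    "discusses machine learning papers",
]
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]  # one bridge edge

n = len(node_texts)
A = np.zeros((n, n))
for i, j in edges:                       # structure edges
    A[i, j] = A[j, i] = 1.0

tfidf = TfidfVectorizer(stop_words="english").fit_transform(node_texts)
S = cosine_similarity(tfidf)             # attribute (content) similarity

alpha = 0.5                              # weight between structure and content
W = alpha * A + (1 - alpha) * S          # combined affinity matrix

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(W)
print("cluster labels per node:", labels)
```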
5.2 Sentiment analysis
5.2.1 Introduction
• Sentiment analysis (opinion mining): the computational and automatic study of people's opinions expressed in written language or text.
• Two types of information are present in text data:
  - objective information: facts;
  - subjective information: opinions.
• The focus of sentiment analysis is the subjective part of text: identifying opinionated information rather than mining and retrieving factual information.
• Sentiment analysis brings together various fields of research: text mining, natural language processing, and data mining.
5.2.2 Applications
• Review summarization.
  - Review-oriented search engines.
  - Search for people's opinions: what do people think about the iPhone 5s?
• Recommendation systems.
  - If you can do sentiment analysis, then the recommendation system can recommend items with positive feedback and avoid recommending items with negative feedback.
• Information extraction systems.
  - These systems focus on objective parts to extract factual information.
  - They can discard subjective sentences.
• Question-answering systems.
  - Different types of questions: definitional and opinion-oriented questions.
  - Both individuals and organizations can take advantage of sentiment analysis.
5.2.3 Levels of Sentiment Analysis
• Document level
  - Identify the opinion orientation of the whole document.
• Sentence level
  - Identify whether the sentence is subjective or objective.
  - Identify the opinion orientation of subjective sentences.
• Aspect level
  - Identify the aspects that the users are commenting on.
  - Identify the opinion orientation about each aspect.
5.2.4 System process
5.2.5 Aspect Identification
• Use clustering to find similar sentences; it is likely that similar sentences are about similar aspects.
• For sentence clustering, the method used to represent each sentence is important.
• The major reason that regular clustering algorithms did not work (Gamon et al., 2005) is the lack of a proper method for representing each sentence.
• Sentence representation:
  - BOW representation: considers all terms in the sentence.
  - BON representation: considers only the nouns of the sentence.
5.2.6 Sentiment Identification
• The machine learning approach treats sentiment identification as a classification problem and makes use of manually labeled training data.
• Two major tasks in designing a classifier:
  - Feature extraction: come up with a set of features that represents your problem properly.
  - Classifier selection: choose a classifier among KNN, Naive Bayes, SVM, Maximum Entropy.
• Our approaches are related to the feature extraction step.
• Support Vector Machines are widely used in text classification, so we use SVM as well; a small sketch follows.
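As a minimal illustration of SVM-based sentiment identification, the sketch below trains a linear SVM on TF-IDF features extracted from a handful of labeled sentences. The example sentences, labels and n-gram settings are assumptions for demonstration; in practice the feature extraction step (for instance a noun-only BON representation) would be tuned.

```python
# Sentiment identification as supervised classification: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Manually labeled training sentences (assumed for illustration).
sentences = [
    "the battery life is amazing and lasts all day",
    "great camera, I love the photo quality",
    "fast delivery and excellent build quality",
    "the screen cracked after one week, terrible",
    "battery drains quickly, very disappointing",
    "awful customer support and slow shipping",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# Feature extraction: unigrams + bigrams weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(sentences)

# Classifier selection: a linear SVM.
clf = LinearSVC().fit(X, labels)

new_reviews = ["the camera quality is excellent", "terrible battery, drains fast"]
print(clf.predict(vectorizer.transform(new_reviews)))
```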
5.2.7 Sentiment classification
• Classify sentences, documents (e.g. reviews) or features based on the overall sentiment expressed by the author:
  o positive, negative and (possibly) neutral.
• Similar to topic-based text classification:
  o topic-based classification: topic words are important;
  o sentiment classification: sentiment words are more important (e.g. great, excellent, horrible, bad, worst).
• In summary, the approaches used in sentiment classification are:
  o unsupervised, e.g. NLP patterns with a lexicon (a lexicon-based sketch is given below);
  o supervised, e.g. SVM, Naive Bayes, etc. (with varying features such as POS tags and word phrases);
  o semi-supervised, e.g. lexicon + classifier.
1) Supervised Learning
• Supervised learning (also called classification) is one of the major tasks in research areas such as machine learning, artificial intelligence, data mining, and so forth.
• A supervised learning algorithm commonly first trains a classifier (or inferred function) by analyzing the given training data and then classifies (assigns class labels to) the test data.
• One typical example of supervised learning in web mining: given many web pages with known labels (i.e., topics in Yahoo!), automatically assign labels to new web pages.
• In this section, we briefly introduce the most commonly used techniques for supervised learning; more strategies and algorithms can be found in the literature:
  o Nearest Neighbor classifiers
  o Decision Trees
  o Bayesian classifiers
  o Neural Network classifiers
2) Unsupervised Learning
• In this section, we introduce the major techniques of unsupervised learning (or clustering).
• Among the large number of approaches that have been proposed, there are three representative unsupervised learning strategies: k-means, hierarchical clustering and density-based clustering.
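Returning to the sentiment-classification approaches listed at the start of this subsection, the unsupervised, lexicon-based route can be sketched with nothing more than two small word lists. The toy lexicon below is an assumption; a real system would use a full sentiment lexicon together with NLP patterns (negation handling, intensifiers, etc.).

```python
# Unsupervised (lexicon-based) sentiment classification sketch.
POSITIVE = {"great", "excellent", "good", "amazing", "love", "wonderful"}
NEGATIVE = {"horrible", "bad", "worst", "terrible", "awful", "disappointing"}

def classify_sentiment(sentence: str) -> str:
    """Count lexicon hits and decide by the sign of the score."""
    tokens = sentence.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for s in ["the camera is excellent and I love it",
          "worst phone ever, the screen is horrible",
          "the package arrived on monday"]:
    print(s, "->", classify_sentiment(s))
```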
3) Semi-supervised Learning
• In the previous two sections, we introduced learning on labeled data (supervised learning, or classification) and on unlabeled data (unsupervised learning, or clustering).
• Here we present the basic learning techniques for when both kinds of data are given.
• The intuition is that a large amount of unlabeled data is easy to obtain (e.g., pages crawled by Google), yet only a small part of it can be labeled due to resource limitations.
• This line of research is called semi-supervised learning (or semi-supervised classification), which aims to address the problem by using a large amount of unlabeled data, together with the labeled data, to build better classifiers.
• Many approaches have been proposed for semi-supervised classification; the representatives are self-training, co-training, generative models and graph-based methods. A self-training sketch is shown below.
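Self-training, the first representative approach listed above, can be sketched as a loop that repeatedly adds the classifier's most confident predictions on unlabeled data back into the training set. The toy data, the confidence threshold of 0.9 and the choice of Naive Bayes over TF-IDF features are assumptions for illustration.

```python
# Self-training sketch: grow the labeled set with high-confidence predictions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["love this phone", "great battery", "terrible screen", "awful support"]
labels = ["pos", "pos", "neg", "neg"]
unlabeled = ["great camera and battery", "screen is terrible and dim",
             "awful battery life", "love the great design"]

vec = TfidfVectorizer()
vec.fit(labeled + unlabeled)            # shared vocabulary for both pools

for _ in range(3):                      # a few self-training rounds
    clf = MultinomialNB().fit(vec.transform(labeled), labels)
    if not unlabeled:
        break
    proba = clf.predict_proba(vec.transform(unlabeled))
    confident = np.where(proba.max(axis=1) >= 0.9)[0]
    if len(confident) == 0:
        break
    for i in sorted(confident, reverse=True):   # move confident items over
        labeled.append(unlabeled.pop(i))
        labels.append(clf.classes_[proba[i].argmax()])

print("final training size:", len(labeled))
```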
5.3 Temporal sentiment analysis
5.3.1 Overview
• The method produces a topic graph and a sentiment graph by using sentiment phrases, which are patterns of sentiment expression such as "happy" or "delighted at".
• 383 sentiment phrases were extracted manually from Japanese news articles and classified into eight categories: anxiety, sorrow, anger, happiness, suffering, fatigue, complaint, and shock.
5.3.2 Procedure for Making a Topic Graph
The following is the procedure for making a topic graph.
Given: a sentiment category S specified by the user and a period of time D = (d1, d2, ..., dl).
Step 1: For each day di in D, retrieve the articles containing sentiment phrases of sentiment S.
Step 2: Extract keywords from the retrieved articles using a keyword extraction system called GENSEN-Web, which can extract compound nouns as keywords.
Step 3: For each extracted keyword wj (j = 1, 2, ..., N), calculate the average correlation c between wj and the sentiment phrases contained in S. The Dice coefficient is used for calculating the correlation.
Step 4: Extract the top n keywords according to the score defined by the product of (1) the number of days in which the keyword appears, (2) the inverse frequency of the number of days, and (3) the score provided by GENSEN-Web.
Step 4' (optional): Put keywords into clusters based on the correlation coefficient over the timeline and the Dice coefficient within an article.
Step 5: Generate a temporal graph for each of the n keywords (or clusters). For viewability of the graph, a moving average is applied.
5.3.4 Procedure for Making a Sentiment Graph
The following is the procedure for making a sentiment graph.
Given: a keyword w specified by the user and a period of time D = (d1, d2, ..., dl).
Step 1: Retrieve the articles containing keyword w for each day di (i = 1, 2, ..., l).
Step 2: For each article, calculate the sum of the frequencies of sentiment phrases for all sentiment categories.
Step 3: Generate a temporal graph of the frequency of sentiment phrases for each sentiment category. Then a moving average is applied to the graph.
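Two building blocks of these procedures, the Dice coefficient used in Step 3 of the topic-graph procedure and the moving average used to smooth the temporal graphs, can be written in a few lines. The article-id sets and the window size below are assumptions for illustration.

```python
# Building blocks for the topic/sentiment graphs: Dice coefficient and moving average.
def dice_coefficient(articles_with_keyword: set, articles_with_phrase: set) -> float:
    """Dice(w, p) = 2 * |A_w & A_p| / (|A_w| + |A_p|), over sets of article ids."""
    total = len(articles_with_keyword) + len(articles_with_phrase)
    if total == 0:
        return 0.0
    return 2 * len(articles_with_keyword & articles_with_phrase) / total

def moving_average(daily_counts, window: int = 3):
    """Smooth a daily frequency series for plotting the temporal graph."""
    smoothed = []
    for i in range(len(daily_counts)):
        chunk = daily_counts[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Toy data: article ids for a keyword and a sentiment phrase, and daily counts (assumed).
print(dice_coefficient({1, 2, 5, 7}, {2, 3, 5, 9}))   # correlation of keyword vs phrase
print(moving_average([0, 4, 2, 8, 6, 1], window=3))   # smoothed daily frequencies
```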
5.4 Irony detection in opinion mining
In video/spoken discourse, especially in a conversational context, we are usually able to detect a variety of external clues (e.g. facial expression, intonation, pause duration) that enable the perception of irony. In written text, a set of more or less explicit linguistic strategies is also used to express irony. In the next subsections, we describe eight linguistic patterns that have previously been identified as related to the expression of irony (Table 1). Some are specific to Portuguese (e.g. morphological patterns), while others seem to be language independent (e.g. emoticons).
1. Pdim: Diminutive Forms
Diminutives are commonly used in Portuguese, often with the purpose of expressing positive sentiments such as affection, tenderness and intimacy. However, they can also be used sarcastically and ironically to express an insult or depreciation towards the entity they represent. This is especially so when diminutives are found in named entities (NEs) mentioning well-known personalities, such as political figures (e.g. "Socratezinho" for the then Portuguese prime minister, José Sócrates).
2. Pdem: Demonstrative Determiners
In Portuguese, the occurrence of any demonstrative form, namely "este" (this), "esse" and "aquele" (that), before a human NE usually indicates that the entity is being mentioned negatively or pejoratively. In some cases, demonstratives (DEM) are the only explicit clue that signals the presence of irony (e.g. "Este Sócrates é muito amigo do Sr. Jack" / "This Sócrates is a very good friend of Mr. Jack").
3. Pitj: Interjections
Interjections abound in subjective texts, particularly in UGC, and carry valuable information about authors' emotions, feelings and attitudes. We believe that some interjections can be used as potential clues for irony detection when they appear in specific contexts, such as the ones represented in pattern Pitj. Since we are especially interested in recognizing irony in prior positive text, we confined our analysis to a small set of interjections that are commonly used to express positive sentiments, namely: "bravo", "força", "muito obrigado/a", "obrigado/a", "obrigadinho/a", "parabéns", "muitos parabéns" and "viva".
4. Pverb: Verb Morphology
The type of pronoun used for addressing people can also be an important clue for irony detection in UGC, especially in languages like Portuguese, where the choice of a specific pronoun or form of address (e.g. "tu" vs. "você", both translatable as "you") may depend on the degree of proximity/familiarity between the speaker and the NE it refers to. The pronoun "tu" is used in a familiar context (e.g. with friends and family). In our experiments, we analyze to what extent the use of the pronoun "tu" for addressing a well-known named entity can be used as a clue for irony detection in UGC. As represented in Pverb, the pronoun can either be explicitly present in the text or be embedded in the morphology of the verb (which is in the second-person singular). We confined the analysis to the verb "ser" (to be).
5. Pcross: Cross-constructions
In Portuguese, evaluative adjectives with a prior positive or neutral polarity usually take a negative or ironic interpretation whenever they appear in cross-constructions, where the adjective relates to the noun it modifies through the preposition "de" (e.g. "O comunista do ministro" / "The communist of the minister") [2]. Pattern Pcross recognizes cross-constructions headed by a positive or neutral adjective (ADJpos or ADJneut, respectively) which modifies a human NE. Adjectives are preceded by a demonstrative (DEM) or an article (ART) determiner.
6. Ppunct: Heavy Punctuation
In UGC, punctuation is frequently used both for verbalizing the user's immediate emotions and feelings and for intentionally signaling humorous or ironic text. We assume that the presence in a sentence of a sequence composed of more than one exclamation point and/or question mark can be used as a clue for irony detection.
7. Pquote: Quotation Marks
Quotation marks are also frequently used to express and emphasize ironic content, especially if the content has a prior positive polarity (e.g. a positive adjective qualifying an entity). In our experiments, we tried to find possibly ironic sentences by searching for quoted sequences composed of one or two words, at least one of which is a positive adjective or noun.
8. Plaugh: Laughter Expressions
Internet slang contains a variety of widespread expressions and symbols that typically represent a sensory expression, suggesting different attitudes or emotions. In our experiments, we considered (i) the acronym "lol" and its variations (LOL), (ii) onomatopoeic expressions such as "ah", "eh" and "hi" (AH), and (iii) the prior positive emoticons ":)", ";-)" and ":P" (EMO+). In this particular case, we did not constrain the polarity of the elements contained in the sentence; we assume that laughter expressions are intrinsically positive or ironic.
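The language-independent patterns above (heavy punctuation, short quoted positive words, laughter expressions) lend themselves to simple surface matching. The sketch below is an illustrative simplification: the regular expressions and the tiny positive-word list are assumptions, and the Portuguese-specific morphological patterns are not covered.

```python
# Surface-pattern irony clues: heavy punctuation, quoted positive words, laughter.
import re

POSITIVE_WORDS = {"great", "genius", "hero", "brilliant", "wonderful"}  # assumed toy lexicon

def irony_clues(sentence: str) -> list:
    """Return the names of the surface patterns matched in the sentence."""
    clues = []
    # P_punct: more than one exclamation point and/or question mark in a row.
    if re.search(r"[!?]{2,}", sentence):
        clues.append("P_punct")
    # P_quote: a quoted sequence of one or two words containing a positive word.
    for quoted in re.findall(r'"([^"]+)"', sentence):
        words = quoted.lower().split()
        if len(words) <= 2 and any(w in POSITIVE_WORDS for w in words):
            clues.append("P_quote")
            break
    # P_laugh: "lol" variants, onomatopoeic laughter, or positive emoticons.
    if re.search(r"\b(?:lol+|[ae]h(?:[ae]h)+)\b", sentence, re.IGNORECASE) \
            or re.search(r"(?::\)|;-\)|:P)", sentence):
        clues.append("P_laugh")
    return clues

for s in ['What a "genius" decision!!!',
          'He fixed the economy, lol :)',
          'The minister presented the new budget.']:
    print(s, "->", irony_clues(s))
```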
5.5 Product review mining
5.5.1 Motivation
• A rapid expansion of e-commerce, where more and more products are sold via online portals (Amazon, eBay, …).
• Online product reviews have thus become an important resource:
  o for customers, to share and find opinions about products easily;
  o for producers, to get a certain degree of feedback.
5.5.2 Related works
• Single-document summarization
  o Extractive-based approach
    - Sentence score + ranking
    - Machine learning techniques
  o Abstractive-based approach
    - Templates
    - Concept hierarchies
• Multi-document summarization
  o Extractive-based approach
    - Sentence score + ranking + MMR + ordering
  o Abstractive-based approach
    - Templates
    - Concept hierarchies
    - Sentence fusion with paraphrasing rules
• Sentiment analysis
  o Review polarity classification
  o PROS/CONS identification
  o Mining review opinions
    - Identify product facets
    - Identify opinion orientation on each facet
5.5.3 Process
5.5.4 Product facets identification
o Association rule mining
  - Each transaction consists of the nouns/noun phrases from a single sentence.
  - The frequent itemsets are the candidate product facets.
o Redundancy pruning
  - Removing redundant facets that contain only single words (e.g. "life" when "battery life" is already a facet).
o Compactness pruning
  - Removing meaningless facets that contain multiple words.
A minimal sketch of this facet-mining step is given below.
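To make the facet-mining step concrete, the sketch below treats each sentence's nouns as a transaction, counts frequent 1- and 2-item sets as candidate facets, and applies a simplified redundancy pruning. The pre-extracted noun lists and the minimum-support threshold are assumptions; a real system would obtain the nouns from a POS tagger and use a full association-rule miner.

```python
# Simplified product-facet mining: frequent noun itemsets + redundancy pruning.
from collections import Counter
from itertools import combinations

# Nouns per review sentence (assumed to come from a POS tagger).
transactions = [
    ["battery", "life"], ["battery", "life", "charger"], ["screen"],
    ["battery", "life"], ["screen", "resolution"], ["screen"],
    ["camera"], ["camera", "zoom"], ["battery"],
]
MIN_SUPPORT = 2  # minimum number of supporting sentences (assumed)

# Count candidate 1-itemsets and 2-itemsets.
counts = Counter()
for nouns in transactions:
    unique = sorted(set(nouns))
    for item in unique:
        counts[(item,)] += 1
    for pair in combinations(unique, 2):
        counts[pair] += 1

frequent = {items for items, c in counts.items() if c >= MIN_SUPPORT}

# Redundancy pruning (simplified): drop a single-word facet if its support is
# entirely accounted for by a frequent multi-word facet containing it
# (e.g. "life" inside "battery life").
pruned = set(frequent)
for (word,) in [f for f in frequent if len(f) == 1]:
    superset_support = max((counts[f] for f in frequent
                            if len(f) > 1 and word in f), default=0)
    if superset_support == counts[(word,)]:
        pruned.discard((word,))

print("candidate facets:", sorted(pruned))
```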