Tutorial: Text Data Mining and Analytics: Part 2 HICSS 44 – January 2011 Dave King
Text Mining: Payoff from Simple Approaches Many of the applications of data mining to text “have proved remarkably successful without understanding specific properties of text such as the concepts of grammar or the meaning of words. Strictly low-level frequency information is used, such as the number of times a word appears in a document, and then well-known methods of machine learning are applied.” Source: S. Weiss et al., Text Mining: Predictive Methods for Analyzing Unstructured Information, 2005
Text Mining:Here’s a fun job! News Articles ??
Text Mining:Here’s a fun job! Google News is a computer-generated news site that aggregates headlines from news sources worldwide, groups similar stories together and displays them according to each reader's personalized interests…Google News has no human editors …
Text Mining:Text Categorization (Classification) Probably the most frequently used TM technique. Often employed in applications where there is a flow of dynamic information (emails, news articles, blogs, scientific articles, patents, medical claims, legal data …) requiring automated handling and routing. [Diagram: News Articles → ? → Category]
Text Mining:Text Categorization (Classification) An inductive, supervised machine learning process that classifies a given document instance (of unknown classification) into one of a set of predetermined categories. [Diagram: docs with known classification (the training corpus) are used to train, validate, and test through feature extraction and learning; documents with unknown classification pass through feature extraction and the classification algorithm, which assigns each to one of the predetermined categories 1, 2, 3, …, n]
Text Mining:Classification Algorithm Naïve Bayes; Decision Trees; Nearest Neighbor (k-NN); Support Vector Machine; Neural Nets (e.g. SOM)
Text Categorization:An Example Who is Gary Thuerk?
Text Categorization:An Example “We invite you to come see the 2020 and hear about the DECSystem-20 family.” Gary Thuerk, DEC Marketing, 1978. DECSYSTEM-2020: a bit-slice processor with up to 512 kilowords of solid state RAM. Source: http://www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter#ixzz16zE3E2zO
Text Categorization:An Example Answer: He’s the father of Spam – not the Hormel type but the Email type
Spam Detection:Size of the Problem 90 trillion – The number of emails sent on the Internet in 2009. 247 billion – Average number of email messages per day. 1.4 billion – The number of email users worldwide. 100 million – New email users since the year before. 81% – The percentage of emails that were spam. 92% – Peak spam levels late in the year. 24% – Increase in spam since last year. 200 billion – The number of spam emails per day (assuming 81% are spam).
Spam Detection:Size of the Problem Estimated Annual Costs of Spam in the US (in $billions) Source: blog.epostmarks.com/team-blog/2009/3/21/the-true-corporate-and-consumer-costs-of-spam.html
Spam Detection:Size of the Problem (Yale Univ.) Measured in millions http://www.yale.edu/its/metrics/email/index.html
Spam Detection:General Approaches [Diagram: spam detection/filter, with two approaches labeled 1 and 2]
Spam Detection:General Approaches Rules: Is this email from someone@spam.com? (blacklists and whitelists). Check the subject and body of the message for particular words or phrases. Problem: new rules are needed to handle dynamic data, because there are many ways to alter the data (add spaces at random, non-alpha characters, misspellings, composite words, …)
Spam Detection:Problem with Rules
Beginning Example:Yale University Spam Management Blocks messages from known spammers using a service called SpamHaus, a real-time database of IP addresses of verified spam sources. Content-based, central spam detection using  SpamAssassin. Messages scored as spam are moved away from a user’s inbox to the Tagged-Spam folder on the server.  Rules used for tagging spam are conservative.  For that reason some spam gets through the first two levels of filtering. End users should train email clients to recognize and manage spam.  Mail clients like Eudora or Outlook have built-in spam filters that you can train to filter messages you consider spam.
Spam Detection: Yale University Spam Management SpamAssassin: a set of Perl programs that uses the combined score from multiple types of checks, including Bayesian filtering, to determine whether a given message is spam. Microsoft Outlook: uses its SmartScreen technology, a machine-learning approach that employs a probability-based (Bayesian) algorithm, to determine whether email is legitimate or spam.
Spam Detection:Genesis of Content-Based Control “I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles’ heel of the spammers is their message. They can circumvent any other barrier you set up. But they have to deliver their message, whatever it is. There is no way they can get around that… I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.” Paul Graham, A Plan for Spam, 2002
Spam Detection:The Goal Confusion Matrix: Precision = TP / (TP + FP); Recall = TP / (TP + FN); Accuracy = (TP + TN) / N; Error = (FP + FN) / N; F1 = 2 * Recall * Precision / (Recall + Precision), where N = TP + FP + FN + TN. Goal: minimize false positives, i.e. the false positive rate FPR = FP / (FP + TN)
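The measures on this slide can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are made up for this note and do not come from the tutorial.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the confusion-matrix measures listed on the slide.

    tp/fp/fn/tn are raw counts; no guard against empty denominators
    is included, so every class must appear at least once.
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / n,
        "error": (fp + fn) / n,
        "f1": 2 * recall * precision / (recall + precision),
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts for a 1,000-message test set (not from the slides):
# "positive" means "flagged as spam".
metrics = classification_metrics(tp=780, fp=5, fn=30, tn=185)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For a spam filter the false positive rate matters most, because a false positive is a legitimate message that the user never sees.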
Spam Detection:Naïve Bayesian Classifier Bayes' theorem (Thomas Bayes): P(H | D) = P(D | H) * P(H) / P(D), where H is the hypothesis and D is the data. P(H) is the prior probability of H: the probability that H is correct before the data D are seen. P(D | H) is the conditional probability of seeing the data D given that the hypothesis H is true. This conditional probability is called the likelihood. P(D) is the marginal probability of D. P(H | D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis.
Spam Detection:Naïve Bayesian Classifier Compare P(Spam | Message) with P(Not Spam | Message), both estimated from the training set. For a single word: P(Spam | Word) = P(S) * P(W1 | S) / P(M). The denominator P(M) is the same for both classes, so it can be dropped when the two are compared. P(Spam | quick) ∝ P(Spam) * P(quick | Spam) = .4 * .5 = .2. Likewise, P(Not Spam | Word) = P(NS) * P(W1 | NS) / P(M), so P(Not Spam | quick) ∝ P(Not Spam) * P(quick | Not Spam) = .6 * .67 ≈ .4
Spam Detection:Naïve Bayesian Classifier Compare P(Spam | Message) with P(Not Spam | Message), both estimated from the training set. For several words, the per-word likelihoods are multiplied (the “naïve” assumption is that words are independent given the class), and the common denominator is dropped because it does not affect the comparison. P(Spam | Words) ∝ P(S) * P(W1 | S) * P(W2 | S) * ... P(Spam | quick & money) ∝ P(Spam) * P(quick | Spam) * P(money | Spam) = .4 * .5 * .5 = .1. P(Not Spam | Words) ∝ P(NS) * P(W1 | NS) * P(W2 | NS) * ... P(Not Spam | quick & money) ∝ P(Not Spam) * P(quick | Not Spam) * P(money | Not Spam) = .6 * .67 * 0 = 0
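The two worked examples can be reproduced in a few lines. This is a minimal sketch that takes the priors and word likelihoods quoted on the slides as given (the training set itself is not in the transcript); the dictionary layout and function names are assumptions made for illustration.

```python
from math import prod

# Probabilities quoted in the slides' worked example.
PRIOR = {"spam": 0.4, "not_spam": 0.6}
LIKELIHOOD = {
    "spam": {"quick": 0.5, "money": 0.5},
    "not_spam": {"quick": 0.67, "money": 0.0},
}


def score(label, words):
    """Unnormalized posterior: prior times the product of word likelihoods."""
    return PRIOR[label] * prod(LIKELIHOOD[label][w] for w in words)


def classify(words):
    """Return the higher-scoring label and the scores for both labels."""
    scores = {label: score(label, words) for label in PRIOR}
    return max(scores, key=scores.get), scores


label_one, scores_one = classify(["quick"])
label_two, scores_two = classify(["quick", "money"])
print(label_one, scores_one)
print(label_two, scores_two)
```

Note how a single zero likelihood (here "money" under Not Spam) forces the whole product to zero. Practical filters commonly smooth the counts, for example with add-one (Laplace) smoothing, so that an unseen word does not decide the outcome alone.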
Sentiment Analysis:The Issues and Payoffs Every hour of every day they share their opinions, issues, thoughts and sentiments about products, brands, services and companies.
Sentiment Analysis:Some Survey Data Activity: 81% of Internet users (or 60% of Americans) have done online research on a product at least once; 20% (15% of all Americans) do so on a typical day; 32% have provided a rating on a product, service, or person via an online ratings system; and 30% (including 18% of online senior citizens) have posted an online comment or review regarding a product or service. Impact: Among readers of online reviews of restaurants, hotels, and various services (e.g., travel agencies or doctors), between 73% and 87% report that reviews had a significant influence on their purchase. Consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item (the variance stems from what type of item or service is considered). Pew Internet & American Life Project Report, 2008.
Sentiment Analysis:The Issues and Payoff This evaluative text data is extremely valuable to customer-facing organizations. Marketing: inform targeted marketing and help determine which marketing messages resonate with customers. Service: provide more rapid response to perceived customer issues and determine the steps to take to satisfy customers. Products: quickly determine whether there are emerging product issues, how to position products, and where development dollars should be focused. The data is also very voluminous, beyond what armies of staff could handle by manually sifting through it.
Sentiment Analysis:What is it? Also called opinion mining or voice of the customer (VOC). Involves using text mining to classify subjective opinions in text into categories like “positive” or “negative”, and to extract various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
Sentiment Analysis: How do you know if the review is “-” or “+” Review 1: plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it's simply too jumbled . Review 2: having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides that i liked it then , i decided to rent it recently . watching it i was struck by just how brilliant a film it is . aside from the fact that it's a milestone in animation in movies ( it's the first film to combine real actors and cartoon characters , have them interact , and make it convincingly real ) and a great entertainment it's also quite an effective comedy/mystery . while the plot may be somewhat familiar the characters are original , especially baby herman , and watching them together is a lot of fun . … `who framed roger rabbit' is a rare film . one that not only presented a great challenge to the filmmakers but one that can be enjoyed by the whole family ( although some very young viewers may be a little scared by judge doom ) . do yourself a favor and rent it , `p-p-p-p-please . "
Sentiment Analysis:Underlying Assumption There are opinion words (aka polar words, opinion-bearing words, and sentiment words) used to express state.  Positive opinion words are used to express desired states (e.g. beautiful, wonderful, good, and amazing) Negative opinion words are used to express undesired states (bad, poor, and terrible) There are also opinion phrases and idioms ( e.g. cost someone an arm and a leg) Collectively, they are called the Opinion Lexicon.
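As a rough illustration of how an opinion lexicon can be applied, the sketch below counts positive and negative opinion words in a text. The word sets contain only the examples listed on the slide; the scoring rule (positives minus negatives) and the function name are assumptions for illustration, not a method described in the tutorial.

```python
# Opinion words taken from the slide's examples; a real lexicon is far larger.
POSITIVE = {"beautiful", "wonderful", "good", "amazing"}
NEGATIVE = {"bad", "poor", "terrible"}


def lexicon_score(text):
    """Return (#positive words) - (#negative words) found in the text."""
    # Lowercase, split on whitespace, and trim surrounding punctuation.
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


print(lexicon_score("A wonderful cast and a good script."))
print(lexicon_score("presents it in a very bad package"))
```

A count like this ignores negation ("not good"), opinion phrases, and idioms, which is one reason the slides go on to treat sentiment as a classification problem.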
Sentiment Analysis:Types Sentiment Classification – document level, classified as positive or negative Feature-based opinion – sentence level, determines which aspects of an object people like or dislike Comparative sentence and relationship mining – sentence level comparisons of one object against another (to determine which is better than the other)
Sentiment Analysis:Which type is best? From one type to the next (classification, features, comparisons), it becomes more complex to extract the information needed to perform the analysis. However, once extracted, standard text mining techniques can be used to classify and compare the opinions expressed in the documents, statements, sentences, and phrases. Simple techniques (like naïve Bayesian) often produce excellent results (e.g. 80+% accuracy)
Text Mining and Analytics:Applications JetBlue Airways: Uses Attensity to analyze the large volume of e-mail messages it receives from customers. By matching specific comments and comment patterns with structured data, airline personnel can solve problems rapidly, before they jeopardize the carrier's satisfaction rating. Rosetta Stone: Uses IBM SPSS text analytics software to analyze answers to open-ended questions in surveys of current and potential customers. Combines text analysis with other identification information (e.g. products purchased, demographics) to drive decisions on advertising, marketing and product development as well as strategic planning. Gaylord Hotels: Uses Clarabridge software to make sense of thousands of customer satisfaction surveys gathered each day. Spots positive and negative comments that help track trends in customer satisfaction and spot problems, as well as best practices, tied to particular properties, departments or employees.
Text Mining:Clustering (Setting the Stage) A common problem:  Establishing categories or topic structures for Free-form survey data Customer complaints/comments, incident reports and warranty claims Blogs and discussion forums Search results Common answer: Clustering
Text Mining:Clustering (Defined) The unsupervised, automated grouping of records, observations, or cases into classes of similar objects called clusters. Similarities are stronger within clusters than between them (i.e. distances are shorter). [Diagram: a document collection 1, 2, 3, …, n passes through a clustering algorithm to produce clusters C1, C2, C3, plotted on axes Freq W1 and Freq W2]
Text Mining:Clustering (Measuring Distance) In a term-doc matrix, treat the docs as vectors and the terms as variables, and measure the distance/similarity between them. Euclidean Distance: SQRT(Sum((Xi - Yi)^2)). [Plot: documents D1, D2, D3 positioned on axes T1 and T2, each running from 0 to 3]
Text Mining:Clustering (Measuring Distance) Squared Euclidean: sum of squared differences. City Block or Manhattan: sum of absolute differences. Minkowski: hth root of the sum of absolute differences raised to the hth power. Matching Distance: for binary data, number of (mis)matches divided by number of comparisons (like Jaccard Similarity). Correlation: 1 – r where r is the correlation coefficient. Cosine: cosine of the angle between the vectors (a similarity; 1 – cosine serves as a distance)
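The measures listed on these two slides are short enough to write out. The sketch below uses plain Python lists as document vectors; the two example vectors are hypothetical term counts, and the correlation distance is taken as 1 − r, which is the common convention and an assumption on the part of this note.

```python
from math import sqrt


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return sqrt(squared_euclidean(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, h):
    """h-th root of the sum of absolute differences raised to the h-th power."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def correlation_distance(x, y):
    """1 - r, where r is the Pearson correlation of the two vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1 - cov / sqrt(var_x * var_y)


# Hypothetical term-frequency vectors for two documents over three terms.
d1 = [3, 0, 1]
d2 = [0, 4, 1]
print(euclidean(d1, d2), manhattan(d1, d2), cosine_similarity(d1, d2))
```

Minkowski distance generalizes the others: h = 1 gives Manhattan and h = 2 gives Euclidean. Cosine is often preferred for text because it ignores document length.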
Text Mining:Clustering Methods Hierarchical: Produces a Tree-Like Structure of Clusters (Divisive and Agglomerative) Partitioning: Organizes objects into k partitions (k<=n) where each partition is a cluster
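Text Mining:Clustering Methods [Diagram: hierarchical clustering, either divisive (start with one cluster and split) or agglomerative (start with individual items and merge), compared with partitioning into clusters 1, 2, 3, …, K]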
Text Mining:Clustering (Simple Example) T1 - The Neatest Guide to Stock Market Investing; T2 - Investing For Dummies, 4th Edition; T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns; T4 - The Book of Value Investing; T5 - Value Investing: From Graham to Buffett and Beyond; T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!; T7 - Investing in Real Estate, 5th Edition; T8 - Stock Investing For Dummies; T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss. Focused on (exact) indexed words: a word is indexed if it appears in at least 2 titles and is not a stop word
Text Mining:Clustering Method - Hierarchical 1) Calculate distances between docs. 2) Select the 2 closest docs and put them into a cluster. 3) Determine the closest doc among the remaining individual docs and existing clusters, using single (nearest), complete (farthest), or average linkage. 4) Repeat the process until a single cluster is formed. [Figure: level plot]
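The agglomerative procedure on this slide can be sketched in pure Python. This version uses single (nearest) linkage and Euclidean distance; the function names and the five two-dimensional points are hypothetical, chosen only so the merge order is easy to follow. It is a brute-force illustration, not an efficient implementation.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b, points):
    """Distance between clusters a and b: their closest pair of members."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, single_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical document vectors (not from the slides).
docs = [(0, 0), (0, 1), (4, 0), (4, 1.5), (10, 0)]
merges = agglomerate(docs)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The merge history is the information a dendrogram (level plot) draws: each merge becomes a join at the height of its distance. Swapping `min` for `max` inside `single_linkage` would give complete linkage.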
Text Mining:Clustering Method - Hierarchical
Text Mining:Clustering Method – K-Means 1) Determine the number of clusters “k” <= n. 2) Randomly choose k docs to be the initial cluster center locations (centroids). 3) Repeat until termination: for each doc, calculate the (Euclidean) distance to each center and assign the doc to the cluster with the nearest center; for every cluster, recompute the centroid based on its current members; check for termination (minimal or no changes in doc assignments). 4) Return the list of clusters
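The K-Means steps above translate almost line for line into code. In this sketch the function name, the `initial` and `seed` parameters, and the six example points are all hypothetical; the `initial` argument exists only so the demonstration run is reproducible, and when it is omitted the centroids are chosen at random as the slide describes.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Step 2: pick k docs as the starting centroids.
    centroids = list(initial) if initial else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3a: assign every doc to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 3c: stop when no doc changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3b: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical two-term document vectors forming two obvious groups.
docs = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
assignment, centroids = kmeans(docs, k=2, initial=[docs[0], docs[3]])
print(assignment)
print(centroids)
```

Because the starting centroids are normally random, different runs can produce different clusterings; running several times and keeping the tightest result is a common precaution.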
Text Mining:Clustering (K-Means Example) Cluster 1: T1, T3. T1 - The Neatest Guide to Stock Market Investing; T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns. Cluster 2: T6, T7, T9. T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!; T7 - Investing in Real Estate, 5th Edition; T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss. Cluster 3: T2, T4, T5, T8. T2 - Investing For Dummies, 4th Edition; T4 - The Book of Value Investing; T5 - Value Investing: From Graham to Buffett and Beyond; T8 - Stock Investing For Dummies
Text Mining:Clustering (RSS Feeds Example) http://feeds.reuters.com/reuters/entertainment http://feeds.reuters.com/reuters/technologyNews http://feeds.foxnews.com/foxnews/scitech http://feeds.foxnews.com/foxnews/entertainment http://rss.cnn.com/rss/cnn_showbiz.rss http://rss.cnn.com/rss/cnn_tech.rss
Text Mining:Clustering Example (RSS Newsfeeds) RSS Feed-Stem Matrix
Text Mining:RSS Newsfeeds Dendrogram
Text Mining:RSS Newsfeeds K-Means Clusters
Text Mining:Clustering Process Many people imagine that it will produce neatly separated clusters like those that (appear in relatively simple examples), but it almost never does.  Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from “find the natural clusters in the data” to “organize the cases into groups that are similar in some way.” Cook and Swayne, Interactive and Dynamic Graphics for Data Analysis
Text Mining:Real World Clustering Example “Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness” (Wallace & Cermack, SUGI29, 2004) Goal: Develop a system that would enable an early warning, alerting system for product quality problems (for American Honda Motors) Problem – most of the information is in text documents Warranty: when dealers complete warranty service claims, a comment field is available to further describe the problem.  Customer Relations: the call center logs parts of conversations and written communications with customers. Techline: calls from dealer service technicians to specialized mechanics create more text data.
Text Mining:Real World Clustering Example
Text Mining:Real World Clustering Example Changes in cluster size Appearance of new words Changes in Shape Alerts
Text Mining:Real World Clustering Example Integrated warranty business rules. Emerging issues. Drill-to from emerging issues. Drill on multiple points. Analyze by alert. Ad hoc analysis. Advanced warranty analysis. SAS Warranty Analysis 4.2
Text Mining:Another Clustering Example
Text Mining:Information Extraction (Goals) A type of IR. The goal is to automatically extract structured information (e.g. entities, concepts and topics) from unstructured text, usually contextually and semantically well-defined data from a well-defined domain (sometimes called content analysis). Named-Entity Recognition: a subtask of IE, also known as entity identification and entity extraction. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons, organizations, locations, dates, quantities, monetary values, percentages and so on). The end goal is usually to fill in templates codifying the extracted information (e.g. entity relationship structures <entity><rel><entity>)
Information Extraction:Common Uses Competitive Intelligence Counter-Terrorism & Criminal Intelligence Resume Harvesting Patent Search Scientific Literature Search (biology & medicine) Email Scanning
Information Extraction:Named Entity Recognition
Text Mining:Information Extraction (Process) [Diagram: 1. Linguistic Processing; 2. Information Extraction]
Information Extraction:Process (Part-of-Speech Tagging) Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag).  The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.  Variety of tagging strategies, most of which are “trainable.”
Information Extraction:Process (Part-of-Speech Tagging) The pilot had to bank the plane because it was headed right for the downtown branch bank which was located next to the river bank. Taggers (examples) Training for N-Gram Taggers (sequences of N words): Trigram, Bigram, Unigram Employs training and test sets like other classification systems Utilizes various classification algorithms for training then actual classification
Information Extraction:Process (Part-of-Speech Tagging) Sample sentence: CVS Caremark Corporation agreed to buy the Medicare Part D unit of Universal American Financial Corporation for about $1.25 billion. Tagged sentence: [('CVS', 'NNP'), ('Caremark', 'NNP'), ('Corporation', 'NNP'), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), ('Medicare', 'NNP'), ('Part', 'NNP'), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), ('Universal', 'NNP'), ('American', 'NNP'), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')]
Information Extraction:Process (Entity Recognition) Chunking Basic technique which segments and labels multi-token sequences Sequences are non-overlapping  Usually employs a combination of a “templated” grammar couched as regular expressions along with tagger & classification processes to do the segmenting Simple Example – NP Chunker grammar = "NP:{<DT>?<JJ.*>*<NN.*>+}"
Information Extraction:Process (Entity Recognition) (S   (NP CVS/NNP Caremark/NNP Corporation/NNP)   agreed/VBD   to/TO   buy/VB (NP the/DT Medicare/NNP Part/NNP D/NNP unit/NN)   of/IN   (NP Universal/NNP American/NNP Financial/NNP Corporation/NNP)   for/IN   about/IN   $/$   1.25/CD   billion/CD)
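To show what the NP grammar does, the sketch below applies the same pattern to the tagged sentence from the earlier slide using only Python's `re` module. It does not use NLTK itself (the slide's grammar string follows NLTK's chunk-grammar syntax); the translation into an ordinary regular expression over a string of tags, and the function name, are assumptions made for illustration.

```python
import re

# The slide's grammar NP: {<DT>?<JJ.*>*<NN.*>+} as a regex over "<TAG>" strings:
# an optional determiner, any adjectives, then one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)+")


def np_chunks(tagged):
    """Return the noun-phrase chunks of a [(word, tag), ...] sentence."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Map each character offset where a tag starts back to its token index.
    index_of, offset = {}, 0
    for i, (_, tag) in enumerate(tagged):
        index_of[offset] = i
        offset += len(tag) + 2  # +2 for the angle brackets
    index_of[offset] = len(tagged)  # offset just past the last tag
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first, last = index_of[match.start()], index_of[match.end()]
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


sentence = [
    ("CVS", "NNP"), ("Caremark", "NNP"), ("Corporation", "NNP"),
    ("agreed", "VBD"), ("to", "TO"), ("buy", "VB"), ("the", "DT"),
    ("Medicare", "NNP"), ("Part", "NNP"), ("D", "NNP"), ("unit", "NN"),
    ("of", "IN"), ("Universal", "NNP"), ("American", "NNP"),
    ("Financial", "NNP"), ("Corporation", "NNP"), ("for", "IN"),
    ("about", "IN"), ("$", "$"), ("1.25", "CD"), ("billion", "CD"),
]
for chunk in np_chunks(sentence):
    print("NP:", " ".join(chunk))
```

The three chunks found match the NP subtrees in the parse shown on the slide. Chunks never overlap because `finditer` consumes the tag string left to right.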
Information Extraction:Process (Entity Recognition) Named Entity Recognition: identify all textual mentions of the named entities. It is hard to rely on precompiled lists of names, locations, … especially in dynamically changing domains. A starting point is provided by the “named” entity chunkers found in toolkits like NLTK
Information Extraction:Process (Entity Recognition) Example of Entity Recognition Tree('S', [Tree('ORGANIZATION', [('CVS', 'NNP')]), Tree('PERSON', [('Caremark', 'NNP'), ('Corporation', 'NNP')]), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), Tree('ORGANIZATION', [('Medicare', 'NNP'), ('Part', 'NNP')]), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Universal', 'NNP'), ('American', 'NNP')]), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')])
Information Extraction:Sample System (Xanalys)
Text Mining & Analysis:Tools kdnuggets.com/software/text.html digitalresearchtools.pbworks.com/
Text Mining and Analysis:Lessons Learned There are practical applications in business, scientific and government arenas with substantial payback. Text can be analyzed with many of the same analytical (data mining) techniques applied to structured data, although the text must first be transformed into structured data for this to occur. Many practical applications of text analysis and mining rest on treating documents as a “bag of words” and on utilizing simpler rather than more complex mining techniques. These simpler techniques often have the same payoffs as more complex techniques

Weitere ähnliche Inhalte

Was ist angesagt?

Socail Influence & Homophilly
Socail Influence & HomophillySocail Influence & Homophilly
Socail Influence & HomophillyNitish Upreti
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Discovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionDiscovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionFinalyear Projects
 
Recommenders Systems
Recommenders SystemsRecommenders Systems
Recommenders SystemsTariq Hassan
 
Identifying Prominent Life Events on Twitter - K-Cap 2015
Identifying Prominent Life Events on Twitter - K-Cap 2015Identifying Prominent Life Events on Twitter - K-Cap 2015
Identifying Prominent Life Events on Twitter - K-Cap 2015Tom Dickinson
 
The journal of statistical software
The journal of statistical softwareThe journal of statistical software
The journal of statistical softwareAjay Ohri
 
A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...
A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...
A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...Cataldo Musto
 
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEEFINALYEARSTUDENTPROJECTS
 
Semantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsSemantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsPasquale Lops
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISrathnaarul
 
09 Respondent Driven Sampling and Network Sampling with Memory
09 Respondent Driven Sampling and Network Sampling with Memory09 Respondent Driven Sampling and Network Sampling with Memory
09 Respondent Driven Sampling and Network Sampling with Memorydnac
 
Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)SocialMediaMining
 
Collaborative Filtering
Collaborative FilteringCollaborative Filtering
Collaborative FilteringTayfun Sen
 
Practical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaPractical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaDiana Maynard
 

Was ist angesagt? (16)

Socail Influence & Homophilly
Socail Influence & HomophillySocail Influence & Homophilly
Socail Influence & Homophilly
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Discovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionDiscovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detection
 
Recommenders Systems
Recommenders SystemsRecommenders Systems
Recommenders Systems
 
Identifying Prominent Life Events on Twitter - K-Cap 2015
Identifying Prominent Life Events on Twitter - K-Cap 2015Identifying Prominent Life Events on Twitter - K-Cap 2015
Identifying Prominent Life Events on Twitter - K-Cap 2015
 
The journal of statistical software
The journal of statistical softwareThe journal of statistical software
The journal of statistical software
 
A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...
A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...
A Multi-Criteria Recommender System Exploiting Aspect-based Sentiment Analysi...
 
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
 
Semantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsSemantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender Systems
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
Link prediction
Link predictionLink prediction
Link prediction
 
Twitter Analytics
Twitter AnalyticsTwitter Analytics
Twitter Analytics
 
09 Respondent Driven Sampling and Network Sampling with Memory
09 Respondent Driven Sampling and Network Sampling with Memory09 Respondent Driven Sampling and Network Sampling with Memory
09 Respondent Driven Sampling and Network Sampling with Memory
 
Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)
 
Collaborative Filtering
Collaborative FilteringCollaborative Filtering
Collaborative Filtering
 
Practical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaPractical Opinion Mining for Social Media
Practical Opinion Mining for Social Media
 

Text mining and analytics v6 - p2

  • 1. Tutorial: Text Data Mining and Analytics: Part 2 HICSS 44 – January 2011 Dave King
  • 2. Text Mining: Payoff from Simple Approaches Many of the applications of data mining to text “have proved remarkably successful without understanding specific properties of text such as the concepts of grammar or the meaning of words. Strictly low-level frequency information is used, such as the number of times a word appears in a document, and then well-known methods of machine learning are applied.” Source: S. Weiss et al., Text Mining: Predictive Methods for Analyzing Unstructured Information, 2005
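The "low-level frequency information" in the quote above can be as simple as a per-document word count. A minimal sketch, assuming a lowercase regular-expression tokenizer; the names `tokenize` and `term_frequencies` and the sample text are illustrative and do not come from the tutorial:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(document):
    """Map each word to the number of times it appears in the document."""
    return Counter(tokenize(document))


# Hypothetical document, used only to show the shape of the result.
doc = "Quick money! Make money quick, quick."
tf = term_frequencies(doc)
print(tf.most_common(2))
```

Counts like these become the feature vector that a standard learning algorithm consumes; no grammar or word meaning is involved.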
  • 3. Text Mining: Here’s a fun job! (diagram labels: News Articles, ??)
  • 4. Text Mining: Here’s a fun job! Google News is a computer-generated news site that aggregates headlines from news sources worldwide, groups similar stories together and displays them according to each reader's personalized interests…Google News has no human editors …
  • 5. Text Mining: Text Categorization (Classification) Probably the most frequently used TM technique. Often employed in applications where there is a flow of dynamic information (emails, news articles, blogs, scientific articles, patents, medical claims, legal data …), requiring automated handling and routing. (diagram labels: News Articles, ?, Category)
  • 6. Text Mining: Text Categorization (Classification) An inductive, supervised machine learning process that classifies or categorizes a given document instance (of unknown classification) into one of a set of predetermined categories. (Diagram: documents with known classification, the training corpus, pass through feature extraction/learning to train, validate, and test a classification algorithm; documents with unknown classification pass through feature extraction and the classification algorithm, which assigns them to predetermined categories 1, 2, 3, …, n.)
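The train/validate/test stages on this slide presuppose that the labeled corpus is partitioned first. A minimal sketch of such a partition; the 60/20/20 proportions, the fixed seed, and the name `split_corpus` are assumptions for illustration and are not specified in the tutorial:

```python
import random


def split_corpus(labeled_docs, train=0.6, validate=0.2, seed=0):
    """Shuffle (text, label) pairs and cut them into three disjoint parts.

    The proportions are illustrative defaults; whatever remains after the
    train and validate slices becomes the test set.
    """
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # fixed seed keeps the split repeatable
    n_train = int(len(docs) * train)
    n_validate = int(len(docs) * validate)
    return (
        docs[:n_train],
        docs[n_train:n_train + n_validate],
        docs[n_train + n_validate:],
    )


# Hypothetical labeled corpus of ten tiny documents.
corpus = [(f"document {i}", "spam" if i % 2 else "not spam") for i in range(10)]
train_set, validate_set, test_set = split_corpus(corpus)
print(len(train_set), len(validate_set), len(test_set))
```

The classifier is fit on the training part, tuned on the validation part, and scored once on the held-out test part.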
  • 7. Text Mining: Classification Algorithms: Naïve Bayes; Decision Trees; Nearest Neighbor (k-NN); Support Vector Machine; Neural Nets (e.g. SOM)
  • 8. Text Categorization: An Example Who is Gary Thuerk?
  • 9. Text Categorization: An Example “We invite you to come see the 2020 and hear about the DECSystem-20 family.” Gary Thuerk, DEC Marketing, 1978. DECSYSTEM-2020: a bit-slice processor with up to 512 kilowords of solid state RAM. Source: http://www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter#ixzz16zE3E2zO
  • 10. Text Categorization: An Example Answer: He’s the father of spam – not the Hormel type but the email type
  • 11. Spam Detection: Size of the Problem 90 trillion – The number of emails sent on the Internet in 2009. 247 billion – Average number of email messages per day. 1.4 billion – The number of email users worldwide. 100 million – New email users since the year before. 81% – The percentage of emails that were spam. 92% – Peak spam levels late in the year. 24% – Increase in spam since last year. 200 billion – The number of spam emails per day (assuming 81% are spam).
  • 12. Spam Detection: Size of the Problem Estimated Annual Costs of Spam in the US (in $billions) Source: blog.epostmarks.com/team-blog/2009/3/21/the-true-corporate-and-consumer-costs-of-spam.html
  • 13. Spam Detection: Size of the Problem (Yale Univ.) Measured in millions. Source: http://www.yale.edu/its/metrics/email/index.html
  • 14. Spam Detection: General Approaches (diagram labels: 1, 2, Spam Detection/Filter)
  • 15. Spam Detection: General Approaches Rules: Is this email from someone@spam.com? Blacklists and whitelists. Check the subject and body of the message for particular words or phrases. Problem: new rules are needed to handle dynamic data. Ways to alter the data: adding spaces at random, non-alpha characters, misspellings, composite words, …
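A rule-based filter of the kind described on this slide can be sketched in a few lines, which also makes its weakness visible. The whitelist entry, the phrase list, and the function name `is_spam` are hypothetical; only the address someone@spam.com comes from the slide:

```python
BLACKLIST = {"someone@spam.com"}          # sender named on the slide
WHITELIST = {"colleague@example.edu"}     # hypothetical trusted sender
SPAM_PHRASES = ("free money", "act now")  # hypothetical phrase rules


def is_spam(sender, subject, body):
    """Apply whitelist, blacklist, then phrase rules, in that order."""
    sender = sender.lower()
    if sender in WHITELIST:
        return False
    if sender in BLACKLIST:
        return True
    text = f"{subject} {body}".lower()
    return any(phrase in text for phrase in SPAM_PHRASES)


plain = is_spam("x@example.com", "Hello", "Get FREE MONEY today")
obfuscated = is_spam("x@example.com", "Hello", "Get F R E E  M0NEY today")
print(plain, obfuscated)
```

The second message carries the same pitch but slips past the phrase rule because of the inserted spaces and the digit substitution, which is the maintenance problem the slide points to.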
  • 17. Beginning Example: Yale University Spam Management Blocks messages from known spammers using a service called SpamHaus, a real-time database of IP addresses of verified spam sources. Content-based, central spam detection using SpamAssassin. Messages scored as spam are moved away from a user’s inbox to the Tagged-Spam folder on the server. Rules used for tagging spam are conservative. For that reason some spam gets through the first two levels of filtering. End users should train email clients to recognize and manage spam. Mail clients like Eudora or Outlook have built-in spam filters that you can train to filter messages you consider spam.
  • 18. Spam Detection: Yale University Spam Management SpamAssassin: a set of Perl programs that uses the combined score from multiple types of checks, including Bayesian filtering, to determine whether a given message is spam. Microsoft Outlook uses its SmartScreen technology, a machine-learning approach based on a probability-based Bayesian algorithm, to determine whether email is legitimate or spam.
  • 19. Spam Detection: Genesis of Content-Based Control “I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles’ heel of the spammers is their message. They can circumvent any other barrier you set up. But they have to deliver their message, whatever it is. There is no way they can get around that… I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.” Paul Graham, A Plan for Spam, 2002
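The "Bayesian combination of the spam probabilities of individual words" in the quote can be sketched as follows. The combining formula is the one given in Graham's essay; the per-word probabilities below are made-up values for illustration, and the function name is an assumption:

```python
from math import prod


def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities into one message-level probability.

    Computes prod(p) / (prod(p) + prod(1 - p)). Inputs should stay strictly
    between 0 and 1, otherwise the denominator can become zero.
    """
    p_spam = prod(word_probs)
    p_not_spam = prod(1.0 - p for p in word_probs)
    return p_spam / (p_spam + p_not_spam)


# Hypothetical per-word probabilities: two spammy words, one innocent one.
message_score = combined_spam_probability([0.99, 0.95, 0.20])
print(round(message_score, 4))
```

A few strongly spam-associated words dominate the result even when an innocent word is present, which is why such a simple combination can work acceptably.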
  • 20. Spam Detection: The Goal Confusion Matrix: Precision = TP / (TP + FP); Recall = TP / (TP + FN); Accuracy = (TP + TN) / N; Error = (FP + FN) / N; F1 = 2 * Recall * Precision / (Recall + Precision), where N = TP + FP + FN + TN. Goal: minimize false positives, FPR = FP / (FP + TN)
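The formulas on this slide translate directly into code. A minimal sketch; the function name and the confusion-matrix counts are hypothetical, chosen only to exercise the formulas:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / n,
        "error": (fp + fn) / n,
        "f1": 2 * recall * precision / (recall + precision),
        "fpr": fp / (fp + tn),  # share of legitimate mail flagged as spam
    }


# Hypothetical filter results on 1000 messages (100 spam, 900 legitimate).
metrics = classification_metrics(tp=80, fp=5, fn=20, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a spam filter, a false positive is a legitimate message lost, so the false positive rate is usually watched more closely than overall accuracy.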
  • 21. Spam Detection: Naïve Bayesian Classifier P(H | D) = P(D | H) * P(H) / P(D), where H is the hypothesis and D is the data. P(H) is the prior probability of H: the probability that H is correct before the data D are seen. P(D | H) is the conditional probability of seeing the data D given that the hypothesis H is true. This conditional probability is called the likelihood. P(D) is the marginal probability of D. P(H | D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis. (Portrait: Thomas Bayes)
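Bayes' rule as stated on this slide can be evaluated directly once the marginal P(D) is expanded over H and not-H. A minimal sketch; the function name and the numbers are hypothetical and chosen only to show the prior being updated:

```python
def posterior(prior_h, likelihood_d_given_h, likelihood_d_given_not_h):
    """Return P(H | D) = P(D | H) * P(H) / P(D).

    P(D) is expanded by the law of total probability:
    P(D) = P(D | H) * P(H) + P(D | not H) * (1 - P(H)).
    """
    marginal_d = (likelihood_d_given_h * prior_h
                  + likelihood_d_given_not_h * (1.0 - prior_h))
    return likelihood_d_given_h * prior_h / marginal_d


# Hypothetical values: a 20% prior, data 9 times likelier under H.
p = posterior(prior_h=0.2, likelihood_d_given_h=0.9, likelihood_d_given_not_h=0.1)
print(round(p, 4))
```

Seeing data that is far more likely under H than under its alternative raises the belief in H from the 20% prior to roughly 69%.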
  • 22. Spam Detection: Naïve Bayesian Classifier Compare P(Spam | Message) with P(Not Spam | Message), both estimated from the training set. P(Spam | Word) = P(S) * P(W1 | S) / P(M). Because P(M) is the same for both classes, it can be dropped when comparing them: P(Spam | quick) ∝ P(Spam) * P(quick | Spam) = .4 * .5 = .2. P(Not Spam | Word) = P(NS) * P(W1 | NS) / P(M), so P(Not Spam | quick) ∝ P(Not Spam) * P(quick | Not Spam) = .6 * .67 ≈ .4
  • 23. Spam Detection: Naïve Bayesian Classifier Compare P(Spam | Message) with P(Not Spam | Message), both estimated from the training set (scores are shown up to the common factor 1 / P(M)). P(Spam | Words) ∝ P(S) * P(W1 | S) * P(W2 | S) * ... so P(Spam | quick & money) ∝ P(Spam) * P(quick | Spam) * P(money | Spam) = .4 * .5 * .5 = .1. P(Not Spam | Words) ∝ P(NS) * P(W1 | NS) * P(W2 | NS) * ... so P(Not Spam | quick & money) ∝ P(Not Spam) * P(quick | Not Spam) * P(money | Not Spam) = .6 * .67 * 0 = 0
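The arithmetic on the two Naïve Bayes slides can be reproduced with a short script. The priors and word likelihoods are the values shown on the slides; the dictionary layout and the names `class_scores` and `classify` are assumptions made for this sketch:

```python
# Values read off the slides: P(Spam) = .4, P(Not Spam) = .6, and the
# per-class word likelihoods estimated from the training set.
PRIORS = {"spam": 0.4, "not_spam": 0.6}
LIKELIHOODS = {
    "spam": {"quick": 0.5, "money": 0.5},
    "not_spam": {"quick": 0.67, "money": 0.0},
}


def class_scores(words):
    """Return the unnormalized score P(class) * product of P(word | class)."""
    scores = {}
    for label, prior in PRIORS.items():
        score = prior
        for word in words:
            score *= LIKELIHOODS[label][word]
        scores[label] = score
    return scores


def classify(words):
    """Pick the class with the larger unnormalized score."""
    scores = class_scores(words)
    return max(scores, key=scores.get)


print(class_scores(["quick"]), classify(["quick"]))
print(class_scores(["quick", "money"]), classify(["quick", "money"]))
```

With "quick" alone the message leans toward not spam; adding "money" flips the decision because "money" never appeared in a non-spam training message. A single zero likelihood wipes out the whole product, which is why practical implementations commonly smooth the counts (for example with add-one smoothing), a step not shown on these slides.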
  • 24. Sentiment Analysis: The Issues and Payoffs Every hour of every day they share their opinions, issues, thoughts and sentiments about products, brands, services and companies.
  • 25. Sentiment Analysis: Some Survey Data Activity: 81% of Internet users (or 60% of Americans) have done online research on a product at least once; 20% (15% of all Americans) do so on a typical day; 32% have provided a rating on a product, service, or person via an online ratings system; and 30% (including 18% of online senior citizens) have posted an online comment or review regarding a product or service. Impact: Among readers of online reviews of restaurants, hotels, and various services (e.g., travel agencies or doctors), between 73% and 87% report that reviews had a significant influence on their purchase. Consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item (the variance stems from what type of item or service is considered). Pew Internet & American Life Project Report, 2008.
  • 26. Sentiment Analysis: The Issues and Payoff This evaluative text data is extremely valuable to customer-facing organizations. Marketing: inform targeted marketing and help determine which marketing messages resonate with customers. Service: provide more rapid response to perceived customer issues and determine the steps to take to satisfy customers. Products: quickly determine whether there are emerging product issues, how to position products and where development dollars should be focused. The data is also far too voluminous to be handled by armies of staff manually sifting through it.
  • 27. Sentiment Analysis:What is it? Also called opinion mining or voice of the customer (VOC). Involves using text mining to classify subjective opinions in text into categories like "positive" or "negative" and to extract various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
  • 28. Sentiment Analysis: How do you know if the review is “-” or “+” plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it's simply too jumbled . having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides that i liked it then , i decided to rent it recently . watching it i was struck by just how brilliant a film it is . aside from the fact that it's a milestone in animation in movies ( it's the first film to combine real actors and cartoon characters , have them interact , and make it convincingly real ) and a great entertainment it's also quite an effective comedy/mystery . while the plot may be somewhat familiar the characters are original , especially baby herman , and watching them together is a lot of fun . … `who framed roger rabbit' is a rare film . one that not only presented a great challenge to the filmmakers but one that can be enjoyed by the whole family ( although some very young viewers may be a little scared by judge doom ) . do yourself a favor and rent it , `p-p-p-p-please . "
  • 29. Sentiment Analysis:Underlying Assumption There are opinion words (aka polar words, opinion-bearing words, and sentiment words) used to express state. Positive opinion words are used to express desired states (e.g. beautiful, wonderful, good, and amazing) Negative opinion words are used to express undesired states (bad, poor, and terrible) There are also opinion phrases and idioms ( e.g. cost someone an arm and a leg) Collectively, they are called the Opinion Lexicon.
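One way to see how an opinion lexicon is put to work is a simple word-counting scorer. The sketch below is an illustration only: the two word sets contain just the example words from the slide, and the names `lexicon_score` and `polarity` are assumptions, not part of any particular toolkit. Real lexicons are far larger and also cover phrases and idioms.

```python
import re

# Tiny illustrative lexicon built from the example words on the slide
POSITIVE = {"beautiful", "wonderful", "good", "amazing"}
NEGATIVE = {"bad", "poor", "terrible"}

def lexicon_score(text):
    """Number of positive opinion words minus number of negative ones."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def polarity(text):
    """Map the score to '+', '-' or '0' (no opinion words, or a tie)."""
    score = lexicon_score(text)
    return "+" if score > 0 else "-" if score < 0 else "0"

print(polarity("a wonderful film with amazing characters"))
print(polarity("a very cool idea in a very bad package , with terrible execution"))
```

A scorer this simple ignores negation ("not good") and context, which is one reason trained classifiers are often preferred over raw lexicon counts.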
  • 30. Sentiment Analysis:Types Sentiment Classification – document level, classified as positive or negative Feature-based opinion – sentence level, determines which aspects of an object people like or dislike Comparative sentence and relationship mining – sentence level comparisons of one object against another (to determine which is better than the other)
  • 31. Sentiment Analysis:Which type is best? From one type to the next (classification, features, comparisons), it becomes more complex to extract the information needed to perform the analysis. However, once extracted, standard text mining techniques can be used to classify and compare the opinions expressed in the documents, statements, sentences, and phrases. Simple techniques (like naïve Bayesian) often produce excellent results (e.g. 80+% accuracy)
  • 32. Text Mining and Analytics:Applications JetBlue Airways Uses Attensity to analyze the large volume of e-mail messages it receives from customers. By matching specific comments and comment patterns with structured data, airline personnel can solve problems rapidly, before they jeopardize the carrier's satisfaction rating. Rosetta Stone Uses IBM SPSS text analytics software to analyze answers to open-ended questions in surveys of current and potential customers. Combines text analysis with other identification information (e.g. products purchased, demographics) to drive decisions on advertising, marketing and product development as well as strategic planning. Gaylord Hotels Uses Clarabridge software to make sense of thousands of customer satisfaction surveys gathered each day. Spots positive and negative comments that help track trends in customer satisfaction and spot problems -- as well as best practices -- tied to particular properties, departments or employees.
  • 33. Text Mining:Clustering (Setting the Stage) A common problem: Establishing categories or topic structures for Free-form survey data Customer complaints/comments, incident reports and warranty claims Blogs and discussion forums Search results Common answer: Clustering
  • 34. Text Mining:Clustering (Defined) The unsupervised, automated grouping of records, observations, or cases into classes of similar objects called clusters. Similarities are stronger within clusters than between them (i.e. distances are shorter). [Figure: document collection → clustering algorithm → clusters C1, C2, C3, plotted on axes Freq W1 and Freq W2]
  • 35. Text Mining:Clustering (Measuring Distance) In a term-doc matrix treat the docs as vectors and the terms as variables and measure the distance/similarity between them. Euclidean Distance: SQRT(Sum((Xi-Yi)^2)) [Figure: docs D1, D2 and D3 plotted on axes T1 and T2]
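The Euclidean formula translates directly into code. In this minimal sketch the coordinates given to D1, D2 and D3 are hypothetical, since the positions plotted in the original figure are not recoverable from the transcript.

```python
from math import sqrt

def euclidean(x, y):
    """SQRT(Sum((Xi - Yi)^2)) over the paired components of two vectors."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical (T1, T2) term frequencies for three docs
docs = {"D1": (1, 3), "D2": (2, 2), "D3": (3, 1)}

print(euclidean(docs["D1"], docs["D2"]))  # D1 is closer to D2 ...
print(euclidean(docs["D1"], docs["D3"]))  # ... than it is to D3
```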
  • 36. Text Mining:Clustering (Measuring Distance) Squared Euclidean: sum of squared differences. City Block or Manhattan: sum of absolute differences. Minkowski: hth root of the sum of absolute differences raised to the hth power. Matching Distance: for binary data, number of (mis)matches divided by number of comparisons (like Jaccard Similarity). Correlation: 1 - r, where r is the correlation coefficient. Cosine: based on the angle between the vectors.
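Several of these measures are short enough to write out by hand. The following is a sketch under the definitions listed on the slide; the function names are illustrative, and the cosine measure is returned as a similarity (the cosine of the angle) rather than as the angle itself.

```python
from math import sqrt

def manhattan(x, y):
    """City block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, h):
    """h-th root of the sum of absolute differences raised to the h-th power."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def matching_distance(x, y):
    """Share of positions at which two binary vectors disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

print(manhattan((1, 3), (3, 1)))
print(minkowski((0, 0), (3, 4), 2))
print(matching_distance((1, 0, 1, 1), (1, 1, 0, 1)))
print(cosine_similarity((1, 2), (2, 4)))
```

With h = 1 the Minkowski distance reduces to Manhattan and with h = 2 to Euclidean. Cosine is popular for text because it ignores document length: a doc and a doubled copy of it point in the same direction.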
  • 37. Text Mining:Clustering Methods Hierarchical: Produces a Tree-Like Structure of Clusters (Divisive and Agglomerative) Partitioning: Organizes objects into k partitions (k<=n) where each partition is a cluster
  • 38. Text Mining:Clustering Methods [Figure: hierarchical clustering (divisive and agglomerative directions) contrasted with partitioning into clusters 1, 2, 3, … K]
  • 39. Text Mining:Clustering (Simple Example) T1 - The Neatest Guide to Stock Market Investing T2 - Investing For Dummies, 4th Edition T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns T4 - The Book of Value Investing T5 - Value Investing: From Graham to Buffett and Beyond T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not! T7 - Investing in Real Estate, 5th Edition T8 - Stock Investing For Dummies T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss Focused on (exact) indexed words: words that appear in at least 2 titles and are not stop words
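The indexing rule (a word must appear in at least 2 titles and must not be a stop word) can be sketched as follows. The stop-word list here is a small hypothetical one chosen for this example, so the resulting vocabulary may differ slightly from the one used on the original slides.

```python
import re
from collections import Counter

TITLES = {
    "T1": "The Neatest Guide to Stock Market Investing",
    "T2": "Investing For Dummies, 4th Edition",
    "T3": "The Book of Common Sense Investing: The Only Way to Guarantee "
          "Your Fair Share of Stock Market Returns",
    "T4": "The Book of Value Investing",
    "T5": "Value Investing: From Graham to Buffett and Beyond",
    "T6": "Rich Dad's Guide to Investing: What the Rich Invest in, "
          "That the Poor and the Middle Class Do Not!",
    "T7": "Investing in Real Estate, 5th Edition",
    "T8": "Stock Investing For Dummies",
    "T9": "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
          "The Secrets of Finding Hidden Profits Most Investors Miss",
}
# Hypothetical stop-word list, just large enough for this example
STOP_WORDS = {"the", "to", "for", "of", "and", "in", "a", "an", "that", "what"}

def tokenize(title):
    """Lowercase the title and split it into exact word tokens."""
    return re.findall(r"[a-z0-9']+", title.lower())

# Document frequency: in how many titles does each word occur?
doc_freq = Counter(w for t in TITLES.values() for w in set(tokenize(t)))
indexed = sorted(w for w, n in doc_freq.items() if n >= 2 and w not in STOP_WORDS)

# Term-document counts: one frequency vector per title, ordered like `indexed`
matrix = {tid: [tokenize(t).count(w) for w in indexed] for tid, t in TITLES.items()}

print(indexed)
print(matrix["T4"])
```

Each title is now a numeric vector, so the distance measures and clustering methods in this section can be applied to it directly.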
  • 40. Text Mining:Clustering Method - Hierarchical (1) Calculate distances between docs. (2) Select the 2 closest docs and put them into a cluster. (3) Determine the closest doc among the remaining individual docs and existing clusters, utilizing either single (nearest), complete (farthest) or average linkage. (4) Repeat the process until a single cluster is formed. [Figure: level plot]
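The agglomerative procedure above can be sketched in plain Python. This is an illustrative, unoptimized version: the function name `agglomerative`, the `linkage` parameter and the sample points are assumptions made for this example. Passing `min` gives single (nearest) linkage and `max` gives complete (farthest) linkage.

```python
from itertools import combinations
from math import dist

def agglomerative(points, linkage=min):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(points))]   # every doc starts alone
    history = []

    def cluster_distance(pair):
        a, b = pair
        # Linkage over all cross-cluster point distances
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, cluster_distance((a, b))))
    return history

# Hypothetical 2-D doc vectors
docs = [(0, 0), (0, 1), (5, 5), (5, 7), (10, 0)]
history = agglomerative(docs)
for left, right, d in history:
    print(left, right, round(d, 2))
```

The merge history, read top to bottom with its distances, is the information a dendrogram or level plot displays.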
  • 41. Text Mining:Clustering Method - Hierarchical
  • 42. Text Mining:Clustering Method – K-Means (1) Determine the number of clusters k <= n. (2) Randomly select k docs as the initial cluster centers (centroids). (3) Repeat until termination: for each doc, calculate the (Euclidean) distance to each center and assign the doc to the cluster with the nearest center; for every cluster, recompute the centroid based on its current members; check for termination (minimal or no changes in doc assignments). (4) Return the list of clusters.
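The k-means loop can be sketched as below. The function signature, the optional `initial` argument (added so the example is reproducible) and the sample points are assumptions made for this illustration. When `initial` is omitted, k docs are sampled at random as the slide describes.

```python
import random
from math import dist

def k_means(points, k, initial=None, seed=0, max_iter=100):
    """Assign points to k clusters; returns (assignment, centroids)."""
    # Pick k docs as starting centroids (random unless supplied)
    centroids = list(initial) if initial else random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each doc to the cluster with the nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:   # terminate: no doc changed cluster
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids

# Hypothetical 2-D doc vectors forming two obvious groups
docs = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
assignment, centroids = k_means(docs, 2, initial=[docs[0], docs[3]])
print(assignment)
print(centroids)
```

Because the result depends on the starting centroids, it is common to run k-means several times from different random starts and keep the best grouping.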
  • 43. Text Mining:Clustering (K-Means Example) Cluster 1: T1, T3 T1 - The Neatest Guide to Stock Market Investing T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns Cluster 2: T6, T7, T9 T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not! T7 - Investing in Real Estate, 5th Edition T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss Cluster 3: T2, T4, T5, T8 T2 - Investing For Dummies, 4th Edition T4 - The Book of Value Investing T5 - Value Investing: From Graham to Buffett and Beyond T8 - Stock Investing For Dummies
  • 44. Text Mining:Clustering (RSS Feeds Example)
  • 45. Text Mining:Clustering (RSS Feeds Example)
  • 46. Text Mining:Clustering (RSS Feeds Example)
  • 47. Text Mining:Clustering (RSS Feeds Example) http://feeds.reuters.com/reuters/entertainment http://feeds.reuters.com/reuters/technologyNews http://feeds.foxnews.com/foxnews/scitech http://feeds.foxnews.com/foxnews/entertainment http://rss.cnn.com/rss/cnn_showbiz.rss http://rss.cnn.com/rss/cnn_tech.rss
  • 48. Text Mining:Clustering Example (RSS Newsfeeds) RSS Feed-Stem Matrix
  • 50. Text Mining:RSS Newsfeeds K-Means Clusters
  • 51. Text Mining:Clustering Process Many people imagine that it will produce neatly separated clusters like those that (appear in relatively simple examples), but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from “find the natural clusters in the data” to “organize the cases into groups that are similar in some way.” Cook and Swayne, Interactive and Dynamic Graphics for Data Analysis
  • 52. Text Mining:Real World Clustering Example “Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness” (Wallace & Cermack, SUGI29, 2004) Goal: Develop a system that would enable an early warning, alerting system for product quality problems (for American Honda Motors) Problem – most of the information is in text documents Warranty: when dealers complete warranty service claims, a comment field is available to further describe the problem. Customer Relations: the call center logs parts of conversations and written communications with customers. Techline: calls from dealer service technicians to specialized mechanics create more text data.
  • 53. Text Mining:Real World Clustering Example
  • 54. Text Mining:Real World Clustering Example Changes in cluster size Appearance of new words Changes in Shape Alerts
  • 55. Text Mining:Real World Clustering Example Integrated warranty business rules. Emerging issues. Drill-to from emerging issues. Drill on multiple points. Analyze by alert. Ad hoc analysis. Advanced warranty analysis. SAS Warranty Analysis 4.2
  • 59. Text Mining:Information Extraction (Goals) Type of IR. Goal is to automatically extract structured information (e.g. entities, concepts and topics) from unstructured text, usually contextually and semantically well-defined data from a well-defined domain (sometimes called content analysis). Named-Entity Recognition: Subtask of IE, also known as entity identification and entity extraction. Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons, organizations, locations, dates, quantities, monetary values, percentages and so on). The end goal is usually to fill in templates codifying the extracted information (e.g. entity relationship structures <entity><rel><entity>)
  • 60. Information Extraction:Common Uses Competitive Intelligence Counter-Terrorism & Criminal Intelligence Resume Harvesting Patent Search Scientific Literature Search (biology & medicine) Email Scanning
  • 62. Text Mining:Information Extraction (Process) [Figure: two-stage process: (1) linguistic processing, (2) information extraction]
  • 63. Information Extraction:Process (Part-of-Speech Tagging) Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on. Variety of tagging strategies, most of which are “trainable.”
  • 64. Information Extraction:Process (Part-of-Speech Tagging) The pilot had to bank the plane because it was headed right for the downtown branch bank which was located next to the river bank. Taggers (examples) Training for N-Gram Taggers (sequences of N words): Trigram, Bigram, Unigram Employs training and test sets like other classification systems Utilizes various classification algorithms for training then actual classification
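To make the idea of a trainable N-gram tagger concrete, here is a minimal bigram tagger that backs off to a unigram tagger and then to a default tag. It is a sketch only: the two hand-tagged training sentences are hypothetical, and the functions `train` and `tag` are written for this illustration rather than taken from a toolkit.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count tags per word (unigram) and per (previous tag, word) pair (bigram)."""
    unigram, bigram = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"                      # sentence-start marker
        for word, pos in sentence:
            w = word.lower()
            unigram[w][pos] += 1
            bigram[(prev, w)][pos] += 1
            prev = pos
    return unigram, bigram

def tag(words, unigram, bigram, default="NN"):
    """Tag left to right: bigram context first, then unigram, then default."""
    tagged, prev = [], "<s>"
    for word in words:
        w = word.lower()
        if (prev, w) in bigram:
            best = bigram[(prev, w)].most_common(1)[0][0]
        elif w in unigram:
            best = unigram[w].most_common(1)[0][0]
        else:
            best = default
        tagged.append((word, best))
        prev = best
    return tagged

# Hypothetical hand-tagged training data
TRAINING = [
    [("the", "DT"), ("pilot", "NN"), ("had", "VBD"), ("to", "TO"),
     ("bank", "VB"), ("the", "DT"), ("plane", "NN")],
    [("the", "DT"), ("bank", "NN"), ("was", "VBD"), ("next", "JJ"),
     ("to", "TO"), ("the", "DT"), ("river", "NN"), ("bank", "NN")],
]
unigram, bigram = train(TRAINING)
result = tag("the pilot had to bank next to the river bank".split(), unigram, bigram)
print(result)
```

The preceding tag is what lets the tagger separate "to bank" (a verb) from "river bank" (a noun). A unigram tagger alone would give every occurrence of "bank" the same tag.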
  • 65. Information Extraction:Process (Part-of-Speech Tagging) Sample sentence: CVS Caremark Corporation agreed to buy the Medicare Part D unit of Universal American Financial Corporation for about $1.25 billion. Tagged sentence: [('CVS', 'NNP'), ('Caremark', 'NNP'), ('Corporation', 'NNP'), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), ('Medicare', 'NNP'), ('Part', 'NNP'), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), ('Universal', 'NNP'), ('American', 'NNP'), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')]
  • 66. Information Extraction:Process (Entity Recognition) Chunking Basic technique which segments and labels multi-token sequences Sequences are non-overlapping Usually employs a combination of a “templated” grammar couched as regular expressions along with tagger & classification processes to do the segmenting Simple Example – NP Chunker grammar = "NP:{<DT>?<JJ.*>*<NN.*>+}"
  • 67. Information Extraction:Process (Entity Recognition) (S (NP CVS/NNP Caremark/NNP Corporation/NNP) agreed/VBD to/TO buy/VB (NP the/DT Medicare/NNP Part/NNP D/NNP unit/NN) of/IN (NP Universal/NNP American/NNP Financial/NNP Corporation/NNP) for/IN about/IN $/$ 1.25/CD billion/CD)
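The effect of an NP grammar of the form `<DT>?<JJ.*>*<NN.*>+` can be imitated with the standard `re` module by matching over the sequence of tags. The sketch below is a stand-in written for illustration, not the toolkit chunker itself: the function `np_chunk` and its tag-string encoding are assumptions, and the tagged sentence is the one from the part-of-speech tagging slide.

```python
import re

# Optional determiner, any adjectives, then one or more nouns (tags in <...>)
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[^<>]*>)*(?:<NN[^<>]*>)+")

def np_chunk(tagged):
    """Return the word lists of non-overlapping noun-phrase chunks."""
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        start = tag_string.count("<", 0, match.start())   # index of first token
        end = start + match.group().count("<")            # one "<" per token
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

tagged = [('CVS', 'NNP'), ('Caremark', 'NNP'), ('Corporation', 'NNP'),
          ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'),
          ('Medicare', 'NNP'), ('Part', 'NNP'), ('D', 'NNP'), ('unit', 'NN'),
          ('of', 'IN'), ('Universal', 'NNP'), ('American', 'NNP'),
          ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'),
          ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')]

chunks = np_chunk(tagged)
for chunk in chunks:
    print("NP:", " ".join(chunk))
```

The three chunks found are the same noun phrases bracketed as NP in the parse shown on this slide.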
  • 68. Information Extraction:Process (Entity Recognition) Named Entity Recognition – Identify all textual mentions of the named entities Hard to rely on precompiled lists of names, locations, … especially in dynamically changing domains A starting point is provided by the “named” entity chunkers found in toolkits like NLTK
  • 69. Information Extraction:Process (Entity Recognition) Example of Entity Recognition Tree('S', [Tree('ORGANIZATION', [('CVS', 'NNP')]), Tree('PERSON', [('Caremark', 'NNP'), ('Corporation', 'NNP')]), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), Tree('ORGANIZATION', [('Medicare', 'NNP'), ('Part', 'NNP')]), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Universal', 'NNP'), ('American', 'NNP')]), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')])
  • 71. Text Mining & Analysis:Tools kdnuggets.com/software/text.html digitalresearchtools.pbworks.com/
  • 72. Text Mining and Analysis:Lessons Learned There are practical applications in business, scientific and government arenas with substantial payback. Text can be analyzed with many of the same analytical (data mining) techniques applied to structured data, although the text must first be transformed into structured data for this to occur. Many practical applications of text analysis and mining rest on treating documents as a “bag of words” and on utilizing simpler rather than more complex mining techniques. These simpler techniques often have the same payoffs as more complex techniques.