1. Tutorial: Text Data Mining and Analytics: Part 2 HICSS 44 – January 2011 Dave King
2. Text Mining: Payoff from Simple Approaches Many of the applications of data mining to text “have proved remarkably successful without understanding specific properties of text such as the concepts of grammar or the meaning of words. Strictly low-level frequency information is used, such as the number of times a word appears in a document, and then well-known methods of machine learning are applied.” Source: S. Weiss et al., Text Mining: Predictive Methods for Analyzing Unstructured Information, 2005
4. Text Mining:Here’s a fun job! Google News is a computer-generated news site that aggregates headlines from news sources worldwide, groups similar stories together and displays them according to each reader's personalized interests…Google News has no human editors …
5. Text Mining:Text Categorization (Classification) Probably the most frequently used TM technique. Often employed in applications where there is a flow of dynamic information (emails, news articles, blogs, scientific articles, patents, medical claims, legal data …) requiring automated handling and routing.
6. Text Mining:Text Categorization (Classification) An inductive, supervised machine-learning process that classifies or categorizes a given document instance (of unknown classification) into one of a set of predetermined categories. Documents with known classifications (the training corpora) are used to train and validate the classification algorithm via feature extraction/learning; the trained algorithm then extracts features from documents of unknown classification and assigns each to one of the n predetermined categories.
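The feature-extraction step above can be sketched with a minimal bag-of-words extractor: each document becomes a vector of word counts over a shared vocabulary. This is an illustrative sketch, not any particular toolkit's API; all names are made up.

```python
# Minimal bag-of-words feature extraction: documents -> word-count vectors.
from collections import Counter

def build_vocabulary(docs, stop_words=frozenset()):
    """Collect every non-stop word appearing in the corpus, in sorted order."""
    vocab = set()
    for doc in docs:
        vocab.update(w for w in doc.lower().split() if w not in stop_words)
    return sorted(vocab)

def to_feature_vector(doc, vocab):
    """Map a document to a count vector aligned with the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

docs = ["the market rose", "the market fell sharply"]
vocab = build_vocabulary(docs, stop_words={"the"})
# vocab -> ['fell', 'market', 'rose', 'sharply']
# to_feature_vector("the market rose", vocab) -> [0, 1, 1, 0]
```

Vectors like these are the "strictly low-level frequency information" the Weiss quote refers to; any standard learner can then be trained on them.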
9. Text Categorization:An Example “We invite you to come see the 2020 and hear about the DECSystem-20 family.” Gary Thuerk, DEC Marketing, 1978 DECSYSTEM-2020: a bit-slice processor with up to 512 kilowords of solid state RAM Source: http://www.newyorker.com/reporting/2007/08/06/070806fa_fact_specter#ixzz16zE3E2zO
11. Spam Detection:Size of the Problem 90 trillion – The number of emails sent on the Internet in 2009. 247 billion – Average number of email messages per day. 1.4 billion – The number of email users worldwide. 100 million – New email users since the year before. 81% – The percentage of emails that were spam. 92% – Peak spam levels late in the year. 24% – Increase in spam since last year. 200 billion – The number of spam emails per day (assuming 81% are spam).
12. Spam Detection:Size of the Problem Estimated Annual Costs of Spam in the US (in $billions) Source: blog.epostmarks.com/team-blog/2009/3/21/the-true-corporate-and-consumer-costs-of-spam.html
13. Spam Detection:Size of the Problem (Yale Univ.) Measured in millions http://www.yale.edu/its/metrics/email/index.html
15. Spam Detection:General Approaches Rules: Is this email from someone@spam.com? Blacklists & whitelists. Check the subject and body of the message for particular words or phrases. Problem: new rules are constantly needed to handle dynamic data, since spammers find ways to alter the data (add spaces at random, non-alpha characters, misspellings, composite words, …)
17. Beginning Example:Yale University Spam Management Blocks messages from known spammers using a service called SpamHaus, a real-time database of IP addresses of verified spam sources. Content-based, central spam detection using SpamAssassin. Messages scored as spam are moved away from a user’s inbox to the Tagged-Spam folder on the server. Rules used for tagging spam are conservative. For that reason some spam gets through the first two levels of filtering. End users should train email clients to recognize and manage spam. Mail clients like Eudora or Outlook have built-in spam filters that you can train to filter messages you consider spam.
18. Spam Detection: Yale University Spam Management SpamAssassin is a set of Perl programs that uses the combined score from multiple types of checks, including Bayesian filtering, to determine if a given message is spam. Microsoft Outlook utilizes its SmartScreen technology, a machine-learning Bayesian approach that employs a probability-based algorithm to determine whether email is legitimate or spam.
19. Spam Detection:Genesis of Content-Based Control “I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles’ heel of the spammers is their message. They can circumvent any other barrier you set up. But they have to deliver their message, whatever it is. There is no way they can get around that… I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.” Paul Graham, A Plan for Spam, 2002
21. Spam Detection:Naïve Bayesian Classifier P(H|D) = P(D|H) * P(H) / P(D), where H is the hypothesis and D is the data. P(H) is the prior probability of H: the probability that H is correct before the data D are seen. P(D|H) is the conditional probability of seeing the data D given that the hypothesis H is true; this conditional probability is called the likelihood. P(D) is the marginal probability of D. P(H|D) is the posterior probability: the probability that the hypothesis is true, given the data and the previous state of belief about the hypothesis. Thomas Bayes
24. Sentiment Analysis:The Issues and Payoffs Every hour of every day, consumers share their opinions, issues, thoughts and sentiments about products, brands, services and companies.
25. Sentiment Analysis:Some Survey Data Activity 81% of Internet users (or 60% of Americans) have done online research on a product at least once 20% (15% of all Americans) do so on a typical day 32% have provided a rating on a product, service, or person via an online ratings system, and 30% (including 18% of online senior citizens) have posted an online comment or review regarding a product or service. Impact Among readers of online reviews of restaurants, hotels, and various services (e.g., travel agencies or doctors), between 73% and 87% report that reviews had a significant influence on their purchase Consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item (the variance stems from what type of item or service is considered) Pew Internet & American Life Project Report, 2008.
26. Sentiment Analysis:The Issues and Payoff This evaluative text data is extremely valuable to customer-facing organizations Marketing -- Inform targeted marketing and help determine which marketing messages resonate with customers Service -- Provide more rapid response to perceived customer issues and determine the steps to take to satisfy customers Products -- Quickly determine whether there are emerging product issues, how to position products and where development dollars should be focused. It is also far too voluminous to address with armies of staff manually sifting through the data
27. Sentiment Analysis:What is it? Also called opinion mining or voice of the customer (VOC) Involves using text mining to classify subjective opinions in text into categories like "positive" or "negative” and to extract various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
28. Sentiment Analysis: How do you know if the review is “-” or “+” plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what's the deal ? watch the movie and " sorta " find out . . . critique : a mind-xxx movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it's simply too jumbled . having not seen , " who framed roger rabbit " in over 10 years , and not remembering much besides that i liked it then , i decided to rent it recently . watching it i was struck by just how brilliant a film it is . aside from the fact that it's a milestone in animation in movies ( it's the first film to combine real actors and cartoon characters , have them interact , and make it convincingly real ) and a great entertainment it's also quite an effective comedy/mystery . while the plot may be somewhat familiar the characters are original , especially baby herman , and watching them together is a lot of fun . … `who framed roger rabbit' is a rare film . one that not only presented a great challenge to the filmmakers but one that can be enjoyed by the whole family ( although some very young viewers may be a little scared by judge doom ) . do yourself a favor and rent it , `p-p-p-p-please . "
29. Sentiment Analysis:Underlying Assumption There are opinion words (aka polar words, opinion-bearing words, and sentiment words) used to express states. Positive opinion words are used to express desired states (e.g. beautiful, wonderful, good, and amazing) Negative opinion words are used to express undesired states (e.g. bad, poor, and terrible) There are also opinion phrases and idioms (e.g. cost someone an arm and a leg) Collectively, they are called the Opinion Lexicon.
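A minimal sketch of how an opinion lexicon gets used: count hits against positive and negative word lists and take the difference. The tiny word sets here are illustrative stand-ins for a real opinion lexicon, which would have thousands of entries plus phrase and negation handling.

```python
# Lexicon-based sentiment scoring: net count of opinion-word hits.
# POSITIVE and NEGATIVE are tiny illustrative stand-ins for a real lexicon.
POSITIVE = {"beautiful", "wonderful", "good", "amazing", "brilliant", "fun"}
NEGATIVE = {"bad", "poor", "terrible", "jumbled"}

def lexicon_score(text):
    """> 0 for net-positive wording, < 0 for net-negative, 0 for neutral."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

lexicon_score("a brilliant and wonderful film")   # positive (> 0)
lexicon_score("a jumbled , terrible package")     # negative (< 0)
```

Applied to the two movie reviews on the previous slide, words like "jumbled" and "terribly" versus "brilliant" and "fun" are exactly the signals this kind of scorer picks up.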
30. Sentiment Analysis:Types Sentiment Classification – document level, classified as positive or negative Feature-based opinion – sentence level, determines which aspects of an object people like or dislike Comparative sentence and relationship mining – sentence level comparisons of one object against another (to determine which is better than the other)
31. Sentiment Analysis:Which type is best? From one type to the next (classification, features, comparisons), it becomes more complex to extract the information needed to perform the analysis. However, once extracted, standard text mining techniques can be used to classify and compare the opinions expressed in the documents, statements, sentences, and phrases. Simple techniques (like naïve Bayesian) often produce excellent results (e.g. 80+% accuracy)
32. Text Mining and Analytics:Applications JetBlue Airways Uses Attensity to analyze the large volume of e-mail messages it receives from customers. By matching specific comments and comment patterns with structured data, airline personnel can solve problems rapidly, before they jeopardize the carrier's satisfaction rating. Rosetta Stone Uses IBM SPSS text analytics software to analyze answers to open-ended questions in surveys of current and potential customers. Combines text analysis with other identification information (e.g. products purchased, demographics) to drive decisions on advertising, marketing and product development as well as strategic planning. Gaylord Hotels Uses Clarabridge software to make sense of thousands of customer satisfaction surveys gathered each day Spots positive and negative comments that help track trends in customer satisfaction and spot problems -- as well as best practices -- tied to particular properties, departments or employees.
33. Text Mining:Clustering (Setting the Stage) A common problem: Establishing categories or topic structures for Free-form survey data Customer complaints/comments, incident reports and warranty claims Blogs and discussion forums Search results Common answer: Clustering
34. Text Mining:Clustering (Defined) The unsupervised, automated grouping of records, observations, or cases into classes of similar objects called clusters. Similarities are stronger within clusters than between them (i.e. distances are shorter). [diagram: a clustering algorithm maps a document collection, plotted by word frequencies, into clusters C1, C2, C3]
35. Text Mining:Clustering (Measuring Distance) In a term-document matrix, treat the docs as vectors and the terms as variables and measure the distance/similarity between them. Euclidean Distance: SQRT(Sum((Xi-Yi)^2)) [diagram: documents D1-D3 plotted against terms T1 and T2]
36. Text Mining:Clustering (Measuring Distance) Squared Euclidean: sum of squared differences. City Block or Manhattan: sum of absolute differences. Minkowski: hth root of the sum of absolute differences raised to the hth power. Matching Distance: for binary data, the number of (mis)matches divided by the number of comparisons (like Jaccard similarity). Correlation: 1 – r, where r is the correlation coefficient. Cosine: angle between the vectors.
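Several of the measures above can be sketched in a few lines of pure Python, for two equal-length term-frequency vectors (the sample vectors are illustrative):

```python
# Distance/similarity measures for term-frequency vectors of equal length.
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City block: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, h):
    """hth root of the sum of absolute differences raised to the hth power."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def cosine_similarity(x, y):
    """Cosine of the angle between the vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

d1, d2 = [3, 0], [0, 2]    # two docs over a two-term vocabulary
euclidean(d1, d2)          # sqrt(13)
cosine_similarity(d1, d2)  # 0.0 -- no shared terms, orthogonal vectors
```

Note that Minkowski with h=1 reduces to Manhattan and with h=2 to Euclidean, which is why the three are usually presented together.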
37. Text Mining:Clustering Methods Hierarchical: Produces a Tree-Like Structure of Clusters (Divisive and Agglomerative) Partitioning: Organizes objects into k partitions (k<=n) where each partition is a cluster
39. Text Mining:Clustering (Simple Example) T1 - The Neatest Guide to Stock Market Investing T2 - Investing For Dummies, 4th Edition T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns T4 – The Book of Value Investing T5 - Value Investing: From Graham to Buffett and Beyond T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not! T7 - Investing in Real Estate, 5th Edition T8 - Stock Investing For Dummies T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss Focused on (exact) indexed words – a word appears in at least 2 titles and is not a stop word
40. Text Mining:Clustering Method - Hierarchical Calculate distances between docs Select the 2 closest docs and put them into a cluster Now determine the closest among the remaining individual docs and existing clusters [utilizing either single (nearest), complete (farthest) or average linkage] Repeat the process until a single cluster is formed [diagram: dendrogram showing the level of each merge]
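The agglomerative procedure above can be sketched in pure Python with single (nearest-neighbor) linkage: start with each doc as its own cluster and repeatedly merge the two closest clusters until one remains, recording each merge (the merge sequence is what a dendrogram draws). The 2-term frequency vectors are illustrative.

```python
# Agglomerative hierarchical clustering with single linkage (sketch).
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2, points):
    """Cluster distance = distance between the two closest members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    clusters = [frozenset([i]) for i in range(len(points))]  # one doc per cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], points))
        merged = clusters[a] | clusters[b]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Four docs as 2-term frequency vectors forming two obvious groups.
merges = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)])
# First two merges pair up the near neighbors; the last merge joins everything.
```

Swapping `min` for `max` inside `single_linkage` gives complete (farthest) linkage, and averaging gives average linkage, matching the three options named above.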
42. Text Mining:Clustering Method – K-Means Determine the number of clusters “k”<=n Randomly assign k docs to be the initial cluster center locations (centroids) Repeat until termination For each doc calculate the (Euclidean) distance from the center locations and assign it to the cluster with the nearest center For every cluster, recompute the centroid based on current members Check for termination – minimal or no changes in doc assignments Return the list of clusters
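The loop above translates almost line for line into code. In this sketch the initial centroids are passed in explicitly rather than chosen at random, so runs are repeatable; the sample points are illustrative.

```python
# Sketch of the k-means loop: assign to nearest centroid, recompute centroids,
# stop when assignments (and hence centroids) no longer change.
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each doc joins the cluster with the nearest center.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: recompute each centroid from its current members.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # termination: nothing moved
            break
        centroids = new_centroids
    return clusters, centroids

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
clusters, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
# clusters -> [[(1, 1), (1, 2)], [(8, 8), (9, 8)]]
```

With random initialization (as the slide describes), different runs can converge to different partitions, which is why k-means is often restarted several times and the best result kept.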
43. Text Mining:Clustering (K-Means Example) Cluster 1: T1, T3 T1 - The Neatest Guide to Stock Market Investing T3 - The Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns Cluster 2: T6, T7, T9 T6 - Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not! T7 - Investing in Real Estate, 5th Edition T9 - Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss Cluster 3: T2, T4, T5, T8 T2 - Investing For Dummies, 4th Edition T4 – The Book of Value Investing T5 - Value Investing: From Graham to Buffett and Beyond T8 - Stock Investing For Dummies
51. Text Mining:Clustering Process Many people imagine that it will produce neatly separated clusters like those that appear in relatively simple examples, but it almost never does. Such ideal clusters are rarely encountered in real data, so we often need to modify our objective from “find the natural clusters in the data” to “organize the cases into groups that are similar in some way.” Cook and Swayne, Interactive and Dynamic Graphics for Data Analysis
52. Text Mining:Real World Clustering Example “Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness” (Wallace & Cermack, SUGI29, 2004) Goal: Develop a system that would enable an early warning, alerting system for product quality problems (for American Honda Motors) Problem – most of the information is in text documents Warranty: when dealers complete warranty service claims, a comment field is available to further describe the problem. Customer Relations: the call center logs parts of conversations and written communications with customers. Techline: calls from dealer service technicians to specialized mechanics create more text data.
54. Text Mining:Real World Clustering Example Alerts: changes in cluster size, appearance of new words, changes in cluster shape
55. Text Mining:Real World Clustering Example SAS Warranty Analysis 4.2: integrated warranty business rules; emerging issues; drill-to from emerging issues; drill on multiple points; analyze by alert; ad hoc analysis; advanced warranty analysis
59. Text Mining:Information Extraction (Goals) A type of IR whose goal is to automatically extract structured information (e.g. entities, concepts and topics) from unstructured text – contextually and semantically well-defined data, usually from a well-defined domain (sometimes called content analysis) Named-Entity Recognition A subtask of IE, also known as entity identification and entity extraction Seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons, organizations, locations, dates, quantities, monetary values, percentages and so on) The end goal is usually to fill in templates codifying the extracted information (e.g. entity relationship structures <entity><rel><entity>)
63. Information Extraction:Process (Part-of-Speech Tagging) Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on. Variety of tagging strategies, most of which are “trainable.”
64. Information Extraction:Process (Part-of-Speech Tagging) An ambiguous example: The pilot had to bank the plane because it was headed right for the downtown branch bank which was located next to the river bank. Taggers (examples) N-gram taggers (trained on sequences of N words): trigram, bigram, unigram Employ training and test sets like other classification systems Utilize various classification algorithms for training, then actual classification
65. Information Extraction:Process (Part-of-Speech Tagging) Sample sentence: CVS Caremark Corporation agreed to buy the Medicare Part D unit of Universal American Financial Corporation for about $1.25 billion. Tagged sentence: [('CVS', 'NNP'), ('Caremark', 'NNP'), ('Corporation', 'NNP'), ('agreed', 'VBD'), ('to', 'TO'), ('buy', 'VB'), ('the', 'DT'), ('Medicare', 'NNP'), ('Part', 'NNP'), ('D', 'NNP'), ('unit', 'NN'), ('of', 'IN'), ('Universal', 'NNP'), ('American', 'NNP'), ('Financial', 'NNP'), ('Corporation', 'NNP'), ('for', 'IN'), ('about', 'IN'), ('$', '$'), ('1.25', 'CD'), ('billion', 'CD')]
66. Information Extraction:Process (Entity Recognition) Chunking A basic technique that segments and labels multi-token sequences Sequences are non-overlapping Usually employs a combination of a “templated” grammar couched as regular expressions along with tagger & classification processes to do the segmenting Simple Example – NP Chunker grammar = "NP:{<DT>?<JJ.*>*<NN.*>+}"
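The NP rule above can be sketched without a toolkit by encoding the tag sequence as a string and matching the `<DT>?<JJ.*>*<NN.*>+` pattern with a plain regular expression. This is a simplified stand-in for what a grammar-driven chunker (such as the ones in NLTK) does; the example sentence is illustrative.

```python
# Regex NP chunker over POS-tagged input: an optional determiner, any number
# of adjectives, then one or more nouns, per the grammar on the slide.
import re

NP_PATTERN = re.compile(r"(<DT>)?(<JJ[^>]*>)*(<NN[^>]*>)+")

def np_chunks(tagged):
    """tagged: list of (word, tag) tuples. Returns noun-phrase word groups."""
    tag_string = "".join(f"<{t}>" for _, t in tagged)
    chunks = []
    for m in NP_PATTERN.finditer(tag_string):
        # Map the matched character span back to token indices: every tag
        # contributes exactly one '<', so counting '<' gives token positions.
        start = tag_string[:m.start()].count("<")
        end = start + m.group().count("<")
        chunks.append([w for w, _ in tagged[start:end]])
    return chunks

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
np_chunks(tagged)   # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

Because matches are found left to right and never overlap, the chunks come out as non-overlapping segments, as the slide requires.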
68. Information Extraction:Process (Entity Recognition) Named Entity Recognition – Identify all textual mentions of the named entities Hard to rely on precompiled lists of names, locations, … especially in dynamically changing domains A starting point is provided by the “named” entity chunkers found in toolkits like NLTK
71. Text Mining & Analysis:Tools kdnuggets.com/software/text.html digitalresearchtools.pbworks.com/
72. Text Mining and Analysis:Lessons Learned There are practical applications in business, scientific and government arenas with substantial payback Text can be analyzed with many of the same analytical (data mining) techniques applied to structured data, although the text must first be transformed into structured data for this to occur. Many practical applications of text analysis and mining rest on treating documents as a “bag of words” and on utilizing simpler versus more complex mining techniques. These simpler techniques often have the same payoffs as the more complex ones