Search vs Text Classification

Is search always the right solution? There are many things you can do with a hammer, but it’s not so great if you need to turn a screw.

Text Classification is an alternative to search that may be more appropriate for social media data analysis. Text classification is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. Using text classification as the foundation for analysis – i.e., teaching a machine to categorize posts the way humans do – can dramatically improve your ability to gather the right data and, ultimately, increase the chances that you’ll uncover what you need to know.

  1. White Paper: Search vs. Text Classification. Increasing the signal, decreasing the noise. 1 West Street, New York, NY 10004 | 646-545-3900 | info@networkedinsights.com | networkedinsights.com
  2. Search vs. Text Classification: Increasing the signal, decreasing the noise

     Since the advent of the World Wide Web, businesses and consumers have used a variety of ways to find information. These various methods of discovery have trained us to think and behave in ways that make understanding analytics challenging. In fact, what makes retrieving information easy for individuals is not the manner in which we should examine social data. Confused?

     In the infancy of the commercial public Web, navigation was nearly impossible without directories and then information portals. With the explosion of the Web in the late 1990s, keyword searching and search engines have become as ubiquitous as the Internet itself. While the underlying methods of search have evolved over the years, its primary use has stayed constant since the early days of companies like Yahoo!, Altavista, Lycos, Excite and Google. Reflecting its mass popularity and understanding, search is often the first tool applied to a wide variety of data challenges.

     But is search always the right solution? There are many things you can do with a hammer, but it's not so great if you need to turn a screw.

     To learn what customers think about your products and services, you may need to apply sentiment analysis across millions of social media posts. Or, to guide your media buying, you might use topic discovery to uncover market trends in the social conversation.

     In either case, using search to identify the set of posts you'll submit to scrutiny could send your social media analysis down the wrong path from the start. Your approach to conducting sentiment analysis or topic discovery could be spot on. But if it's based on a number of posts that aren't actually about what you think they are, which typically happens with search, the noise created can flaw the inferences and conclusions you ultimately draw.

     Text classification is an alternative to search that may be more appropriate for social media data analysis. Text classification is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. Using text classification as the foundation for analysis (i.e., teaching a machine to categorize posts the way humans do) can dramatically improve your ability to gather the right data and, ultimately, increase the chances that you'll uncover what you need to know.

     Sidebar: Topic discovery, letting data speak for itself

     Topic discovery is a valuable type of semantic analysis based on text classification. Whereas sentiment analysis simply reveals people's likes and dislikes, semantic analysis refers to a group of methods that allow machines to discover the fundamental patterns of words or phrases that act as building blocks in a large set of text. Topics, themes, sentiment and similar elements of meaning appear as intricate weavings of those fundamental patterns. So semantic analysis is the summarization of large amounts of text by automatically discovering the topics and themes within.

     By grouping social media posts based on semantic similarity, rather than preset sentiment categories such as positive, negative and neutral, topic discovery can help companies uncover important information: for example, what exactly people are saying about a product or service; where and how they use it; the features they use most; and the enhancements or new offerings they're interested in. All of this information can ultimately drive product development, new revenue streams and strategies for marketing, advertising and media planning.
  3. The impact of bad data

     A look at several related but distinct topics illustrates how seriously the problems of search can impact analysis. A Networked Insights analyst designed search queries for five topics that moms typically discuss: pregnancy and newborns; school-aged children; food, nutrition and health; shopping and money; and illness and injury. Searches were run on the five topics, then another analyst reviewed the results under two test scenarios to determine how well the search delivered posts fitting the intended criteria as defined by the query.

     In the first test, the analyst reviewed only the top 20 results returned by each search as ordered by the search engine. In the second test, the analyst reviewed a random sample of 200 results returned by the search. In each case, the analyst was asked to judge whether each resulting post was appropriate for the intended category or if it fit better in a different one. The percentage of appropriate posts is a measure of the "precision" of the search.

     The test results (Table 1) reveal search's severe limitations. Precision was high when only the top 20 results were examined (90 percent or higher), but fell precipitously when a larger number of randomly sampled posts was examined. In only one search, pregnancy and newborns, did the results yield a somewhat reliable level of precision (86.5 percent). In three of the five searches, precision rates were under 50 percent.

     In practical terms, these results mean there's a greater chance that a randomly selected search result will not meet the intended criteria than that it will. Said another way, search might be used to support other analyses by returning a large number of posts assumed to cover the same basic topic. The problem: the majority of the data isn't relevant to the topic you want to understand.

     Table 1. Keyword Search Precision

     Desired Topic             Top 20 Results Only   Random Sample
     Pregnancy and newborns    95%                   86.5%
     School-aged children      95%                   19.5%
     Food, nutrition, health   90%                   39.5%
     Shopping and money        100%                  57.5%
     Illness and injury        100%                  41%
     Overall                   96%                   48.8%
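The two test scenarios can be illustrated with a short sketch. It is not the analysts' actual tooling: the function names, the synthetic `ranked` list and the fixed random seed are all assumptions made for illustration. Each entry stands for one analyst judgment (True if the post fit the intended topic), listed in search-engine rank order. The last lines check that the "Overall" row of Table 1 is consistent with an unweighted mean of the five topic rows.

```python
import random


def precision(judgments):
    """Fraction of reviewed posts judged appropriate for the intended topic."""
    if not judgments:
        raise ValueError("no judged posts")
    return sum(judgments) / len(judgments)


def top_k_precision(ranked_judgments, k=20):
    """Test 1: review only the first k results as ordered by the search engine."""
    return precision(ranked_judgments[:k])


def sampled_precision(ranked_judgments, sample_size=200, seed=0):
    """Test 2: review a random sample drawn from all returned results."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    size = min(sample_size, len(ranked_judgments))
    return precision(rng.sample(ranked_judgments, size))


# Hypothetical result list: 19 of the top 20 posts are on topic,
# but only every fifth post in the remaining 980 is.
ranked = [True] * 19 + [False] + [i % 5 == 0 for i in range(980)]

print("top-20 precision:", top_k_precision(ranked))
print("random-sample precision:", sampled_precision(ranked))

# Random-sample column of Table 1, in row order (percent).
table1_random = [86.5, 19.5, 39.5, 57.5, 41.0]
overall = round(sum(table1_random) / len(table1_random), 1)
print("mean of topic rows:", overall)  # matches the 48.8% "Overall" row
```

The gap between the two numbers is the point of the test: a ranking can look excellent at the top while most of what it returns is off topic.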
  4. The shortcomings of search

     By definition, the intent of search is to uncover the best responses to a query. A search engine goes out and grabs hundreds of thousands of posts that match the word or phrase programmed into the query and attempts to rank them in order of relevance. Its goal is to put the post most likely to be the one you're looking for at the top of the list. The search engine does this effectively, as seen in the first column of results in Table 1.

     Significant problems arise with search when you're after a broad collection of similar posts, not a handful of the best ones. This is often the case in social media analysis, when the goal is to analyze millions of posts to identify trends that can inform marketing decisions or uncover insights that can reveal business opportunities. Simply stated, more data points are sometimes much better than a few. In these cases, search will undermine your efforts. The first 20, or even 200, posts might be great matches. But the last 20 or 200 might not match at all, as seen in the second results column of Table 1.

     Search methodology has other significant shortcomings, which are more apparent when it's applied to social media data than when used with other, more structured forms of text. For example, search struggles when you're looking for something more complicated than whether or not a document contains a particular word or phrase. Search cannot contemplate the context of how words and phrases are used in relationship to one another; it can simply identify whether or not that word or phrase is present.

     Search also suffers from a bias problem. If the searcher uses words that are not a direct reflection of the words that millions of other people use for a given topic, search can't accommodate the differences.

     To sum up the problems, search does not inherently provide a mechanism for determining which results should belong to the desired group and which should not. The norm is to simply say that all posts that match a query belong to the desired topic and use all of them in further analyses.

     A better way: the power of classification

     In contrast to search, text classification uses machine-learning algorithms to learn from a set of examples how to separate posts into topics. If an algorithm, or program, is presented with examples of how a human would separate posts based on topic, it can learn to mimic that person's process on new, previously unseen posts. One major advantage of this approach is that the program can scale up to perform its process on millions of documents. People do not scale up so easily.

     Classification offers the potential to produce a dataset in which all of the posts are relevant to the topics being analyzed. The last 20 are as valuable to the analysis as the first 20.

     © 2011 Networked Insights, Inc. All rights reserved.
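The contrast between matching a keyword and learning from labeled examples can be sketched in a few lines. This is a toy illustration, not the method used in the white paper: the posts, the labels and the choice of a small multinomial Naive Bayes model with add-one smoothing are all assumptions, and a real system would need far more training data and tuning.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_search(posts, keyword):
    """Return every post containing the keyword, with no notion of context."""
    return [post for post in posts if keyword in tokenize(post)]


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)  # number of posts per label
        self.word_counts = {}                # label -> Counter of tokens
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_posts = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_posts in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_posts / total_posts)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples chosen by a human analyst.
training = [
    ("our baby arrived last night at the hospital", "relevant"),
    ("the little one is due in march and we are so excited", "relevant"),
    ("newborn sleeping through the night finally", "relevant"),
    ("pregnancy week thirty and the baby keeps kicking", "relevant"),
    ("baby back ribs on the grill tonight", "irrelevant"),
    ("new single baby one more time stuck in my head", "irrelevant"),
    ("great deal on ribs and steak at the store", "irrelevant"),
    ("listening to my favorite song on the way to work", "irrelevant"),
]
model = NaiveBayes().fit([t for t, _ in training], [l for _, l in training])

new_posts = [
    "the baby kicked all night before we went to the hospital",
    "grill night with baby back ribs and my favorite song",
]

# Keyword search treats both posts as matches for "baby".
print(keyword_search(new_posts, "baby"))
# The classifier weighs the surrounding words and separates the two posts.
for post in new_posts:
    print(model.predict(post), "<-", post)
```

Both new posts contain the word "baby", so a keyword query returns both. The classifier labels only the first one as relevant, because the other words in each post shift the balance of evidence.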
  5. The classification process

     The classification process begins with a human analyst selecting a sampling of posts that relate to a specific topic, such as pregnancy and newborns. The analyst also selects posts that are irrelevant, so the algorithm being used can detect the difference. These posts serve as the training examples from which the machine will learn. A variety of algorithms can be used for classification, including artificial neural networks, support vector machines and Naive Bayes algorithms. Selecting the right algorithm and tuning it are critical, as some do well at certain problems and not so well at others.

     In the next step, the algorithm learns how to categorize new posts by reading the example posts and identifying general rules that differentiate the relevant and irrelevant posts. For example, when the program sees the phrases "little one" and "hospital" together in a post, it might notice that the probability the post belongs to the pregnancy and newborns category increases significantly. It then uses this knowledge in categorizing other posts. The goal is not to memorize the training examples, but to find general characteristics that help the algorithm categorize new posts.

     Table 2 adds a third column to Table 1 that shows the result of using classification instead of search to identify posts presumably related to the five mom topics. The analysis approach for classification was the same as that applied to the search precision test. An independent analyst reviewed 200 randomly sampled results from classification and determined whether or not they matched the intended topic. The improvement over the search precision test is dramatic. The overall precision of using classification was 86 percent vs. 49 percent using search across all posts. For one topic (food, nutrition and health) precision rose from 39.5 percent with search to 100 percent through classification.

     Table 2. Precision of Using Classification to Identify Posts in Comparison to Search

     Desired Topic             Top 20 Results Only   Random Sample   Classification
     Pregnancy and newborns    95%                   86.5%           88.0%
     School-aged children      95%                   19.5%           72%
     Food, nutrition, health   90%                   39.5%           100%
     Shopping and money        100%                  57.5%           87%
     Illness and injury        100%                  41%             83%
     Overall                   96%                   48.8%           86%

     Classification clearly provides greater precision in social data analysis. It offers deeper insights, both on a broad scale and when drilling into specific topics, than can be gleaned from standard search techniques.

     Sidebar: Creating a stronger signal

     Millions of people use search every day to find what they're looking for online. But search can send you off into the social media wilderness if you're using traditional monitoring tools to discover conversations and trends. So stop searching. Instead, start asking how real-time data can support your existing decision-making processes and then use classification techniques to cut through the noise and sharpen your social analysis.

     Questions about this report? Want a free consultation on how social data can improve your media planning and other marketing? Contact us: 646-545-3900 | info@networkedinsights.com | networkedinsights.com
