This document summarizes Otis Gospodnetić's presentation on content analytics for better search. It discusses using techniques like key phrases, named entity recognition, and statistically improbable phrases to enhance search results. These analytics can be applied to tasks like clustering search results, facilitating related searches, and identifying trends in news articles or customer reviews. The presentation provides examples of how Sematext applies these content analytics techniques for clients to improve search and customer experience.
14. Copyright 2010 Sematext Int'l. All rights reserved.
Definitions: Collocations
● Collocations are phrases whose words appear
together more often than you would expect
given how frequent each individual word is
in the same text.
● Source: http://sematext.com/demo/kpe/
● See: http://en.wikipedia.org/wiki/Collocation
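The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the Sematext demo: the slide does not name a statistic, so pointwise mutual information (PMI) over adjacent word pairs is assumed here, and the function name and the min_count parameter are hypothetical.

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often a pair is actually seen together with how
    often it would be seen if the two words occurred independently.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("new york is big . new york is loud . "
        "the cat is big . the dog is loud .")
ranked = collocations(text.split())
```

On this toy text the pair ("new", "york") ranks first, because both words are rare on their own yet always appear together. The min_count cutoff is a design choice: PMI tends to overrate pairs that were seen only once.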
Definitions: SIPs
● Statistically Improbable Phrases (SIPs) are phrases
that appear in a text more often than you would
expect given how often they appear in another
text. In this demo we extract SIPs by comparing
texts from two different time periods.
● Source: http://sematext.com/demo/kpe/
● See: http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases
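A rough sketch of SIP extraction as defined above: count phrases in a foreground text and in a background text, then rank foreground phrases by how much more frequent they are there. The scoring (a smoothed ratio of relative frequencies), the function names, and the parameters are assumptions made for illustration; the slides do not say how the demo scores phrases.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def sips(foreground, background, n=2, min_count=2, smoothing=1.0):
    """Rank foreground phrases by over-representation vs. the background."""
    fg = Counter(ngrams(foreground, n))
    bg = Counter(ngrams(background, n))
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    vocab = len(set(fg) | set(bg)) or 1

    scores = {}
    for phrase, count in fg.items():
        if count < min_count:
            continue
        p_fg = count / fg_total
        # Add-one style smoothing so phrases unseen in the background
        # do not cause a division by zero.
        p_bg = (bg[phrase] + smoothing) / (bg_total + smoothing * vocab)
        scores[phrase] = p_fg / p_bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

new_text = "oil spill in the gulf oil spill cleanup in the gulf".split()
old_text = "in the news today in the city in the gulf".split()
ranked = sips(new_text, old_text)
```

Here ("oil", "spill") ranks first because it never occurs in the older text, while ("in", "the") ranks last because it is common in both.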
SIPs at Amazon
● Amazon SIPs are the most distinctive phrases in the text of
books in the Search Inside!™ program. To identify SIPs, our
computers scan the text of all books in the Search Inside!
program. If they find a phrase that occurs a large number of
times in a particular book relative to all Search Inside! books,
that phrase is a SIP in that book.
SIPs are not necessarily improbable within a particular book,
but they are improbable relative to all books in Search Inside!.
For example, most SIPs for a book on taxes are tax related. But
because we display SIPs in order of their improbability score,
the first SIPs will be on tax topics that this book mentions more
often than other tax books. For works of fiction, SIPs tend to be
distinctive word combinations that often hint at important plot
elements.
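The Amazon description above compares one book against the whole collection rather than against a single other text. A minimal sketch of that idea follows, assuming word bigrams as phrases and a plain frequency ratio as the improbability score; Amazon's actual scoring is not public, and book_sips and the sample texts are hypothetical.

```python
from collections import Counter

def bigrams(text):
    """Lowercase, split on whitespace, and pair adjacent words."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def book_sips(books, title, top_n=3):
    """Rank one book's bigrams by their rate in the book vs. all books."""
    per_book = {t: Counter(bigrams(text)) for t, text in books.items()}
    collection = Counter()
    for counts in per_book.values():
        collection.update(counts)

    book = per_book[title]
    book_total = sum(book.values())
    coll_total = sum(collection.values())

    scored = []
    for phrase, count in book.items():
        # (rate in this book) / (rate across the collection). The
        # collection includes this book, so the divisor is never zero.
        score = (count * coll_total) / (book_total * collection[phrase])
        scored.append((phrase, score, count))
    # Highest score first; ties go to the phrase seen more often.
    scored.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return [phrase for phrase, _, _ in scored[:top_n]]

books = {
    "taxes": "capital gains tax on capital gains and the estate tax",
    "novel": "the white whale and the sea and the white whale",
    "guide": "the tax office and the sea",
}
top = book_sips(books, "novel", top_n=2)
```

For the sample "novel", the two top phrases are ("the", "white") and ("white", "whale"), while ("and", "the"), which occurs in every book, scores lowest.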
News Content Analysis
● Source: http://sematext.com/demo/kpe/
SIPs & News Topic Trending
● The new (or "current") period covers the most
recent 7 days, counting back from now. The old
(or "past") period covers the 7 days before that.
now ← new text → (now - 7 days) ← old text → (now - 14 days)
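The two periods can be built from timestamped articles with a small helper. This sketch assumes articles arrive as (published_at, text) pairs and that each window is open at its older end and closed at its newer end; both the input shape and the boundary handling are assumptions, as is the name split_periods.

```python
from datetime import datetime, timedelta

def split_periods(articles, now, window_days=7):
    """Split (published_at, text) pairs into new and old period texts.

    new: (now - 7 days, now]
    old: (now - 14 days, now - 7 days]
    Anything older, or dated in the future, is ignored.
    """
    window = timedelta(days=window_days)
    new_start = now - window
    old_start = now - 2 * window

    new_texts, old_texts = [], []
    for published_at, text in articles:
        if new_start < published_at <= now:
            new_texts.append(text)
        elif old_start < published_at <= new_start:
            old_texts.append(text)
    return new_texts, old_texts

now = datetime(2010, 6, 15)
articles = [
    (datetime(2010, 6, 14), "oil spill cleanup continues"),
    (datetime(2010, 6, 5), "election results announced"),
    (datetime(2010, 5, 20), "older story outside both windows"),
]
new_texts, old_texts = split_periods(articles, now)
```

The new texts would then serve as the foreground and the old texts as the background when extracting SIPs, so that phrases rising in the current week stand out.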
Customer Experience
● Mindshare Technologies (MT) is a Voice of the Customer
company that helps companies make operational
improvements based on customer feedback. MT's client list
includes many of the world's largest restaurant chains, hotels,
car rental agencies, and telecommunications companies. Much
of the feedback we collect is from surveys that contain open-
ended questions where customers can leave comments. MT
has used the Key Phrase Extractor to unlock the value
contained in these comments. We are able to identify common
problems experienced by customers and are even able to
detect emerging topics that are starting to catch fire.
Mindshare's clients are able to leverage this information and
make operational changes that improve customer experiences.
Lessons
● GIGO (garbage in, garbage out)
● Language-awareness (POS: part-of-speech tagging)
● Filtering (England v)
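One plausible reading of the filtering lesson is that raw phrase extraction produces fragments such as "England v" (a truncated sports headline) that need to be dropped. The sketch below is an assumption about what such a filter could look like, not the presenter's implementation: it rejects phrases that start or end with a stopword or a single-character token. It is a cheap stand-in for real part-of-speech tagging, and the stopword list is a tiny hypothetical sample.

```python
# Hypothetical stopword sample; a real system would use a fuller list.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "on", "vs"}

def keep_phrase(phrase, stopwords=STOPWORDS, min_token_len=2):
    """Keep a phrase only if both of its edge tokens look like content words."""
    tokens = phrase.lower().split()
    if not tokens:
        return False
    for edge in (tokens[0], tokens[-1]):
        if edge in stopwords or len(edge) < min_token_len:
            return False
    return True

def filter_phrases(phrases):
    """Drop candidate phrases that fail the edge-token check."""
    return [p for p in phrases if keep_phrase(p)]

candidates = ["England v", "World Cup", "of the", "oil spill"]
kept = filter_phrases(candidates)
```

With these candidates, "England v" and "of the" are dropped and "World Cup" and "oil spill" are kept.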
Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• otis@sematext.com