IDC Analyst Connection Connotate Diving Deep Outside the Firewall for Market Research Insights

IDC ANALYST CONNECTION

David Schubmehl
Research Manager

Diving Deep Outside the Firewall for Market
Research Insights
October 2012

For many enterprises, Big Data is now a mainstream concern, as evidenced by changes in
organizational structure and budgets to focus on this area. However, most enterprises have yet to tap
into the vast resource of data outside the firewall to incorporate Web-based Big Data in real time. The
Web provides a lot of data that can be useful to market research efforts, particularly if organizations
go beyond analyzing quantitative data such as statistics or demographics and look at customer
sentiment as revealed in comments on product reviews as well as posts on social networks.

The following questions were posed by Connotate to David Schubmehl, research manager at IDC, on
behalf of Connotate's customers.

Q. How are enterprises missing out by failing to tap into the Web?

A. The Web has become a global repository that contains over 8 billion pages of unstructured
information ranging from news and social media to research and philosophical treatises. The
Web is a tremendous source of information about an enterprise's prospects, customers, and
competitors, which is why leading organizations are making heavy use of the Web as a
research tool. Survey research indicates that global CEOs are looking to Big Data on the
Web to understand their customers and build engagement models with their existing
customers and prospective customers.

Enterprises are missing out by failing to tap into the tremendous amount of social
media information on the Web. Many organizations are beginning to understand that their
customers are out there talking about them on the Web and on social media sites, yet they
don't have a very good handle on how to collect all of that information. As a result, many
companies are missing opportunities because they aren't aware of or don't understand the
conversations — both good and bad — that are going on about them, particularly in focused
blogs and online user group communities. By tapping into these specialized online sources
(not just Twitter and Facebook), companies can better understand what their customers are
saying, thinking, or looking for regarding specific products and services. Just think of all the
product reviews that are posted on the Web. Companies can gain a lot of insight about
customer sentiment by tapping into this information.

IDC 1390

On a similar note, organizations can make use of the wealth of competitive information on
the Web. Competitor product data, prices, reviews, and even comparisons can be found
on the Web. In the same manner that organizations can tap into the "voice of the customer,"
they can also tap into their competitors' data to understand and compete more effectively.

These are just a few examples of valuable data that is out there waiting for organizations that
are willing to go find it and collect it.

Q. What are the benefits — and challenges — of using Web-based data to fuel customer
sentiment analysis in market research?

A. The benefits of Web-based data revolve around three factors: timeliness, legitimacy, and
aggregation. Data collected from social media sites, product review sites, and other
sources can be very current and can even provide up-to-the-minute feedback. Still, it can be a
challenge to collect that information as close to real time as possible and to determine
which kinds of feedback will evolve over time and therefore be more valuable for trend
analysis. For many organizations, trend analysis is extremely valuable and can provide
long-term benefits.

Legitimacy is also a major factor. Are the review and the sentiment real? Is someone posting
something because he or she wants to share true feelings about a product, or is it a
competitor looking to sabotage reviews? Perhaps a reviewer is being paid to say something
positive, which could skew results, so how can an organization identify the unpaid reviews?
All of these factors can be challenging to quantify. Finally, a wide variety of customer reviews
and feelings need to be collected in order to accurately gauge customer sentiment, especially
if the collection is being done automatically. Small samples can skew results and analysis.

Most organizations would like to collect as many comments or as much information from as
many relevant sites as possible. The problem is that the number of sites that may have
valuable content is expanding at a tremendous rate. It's a challenge for an organization that's
trying to collect all this information and pull it together in a way that is useful. That's why
aggregation into a single structure is important. It's relatively easy to pull things from a Twitter
stream or a Facebook feed, but organizations often have to contend with all of the other sites
that are out there, and this is often where this type of data collection can become complicated.

The fragility and the rate of change of content on the Web pose an additional challenge. Web
sites change constantly, pages are moved or modified, and content is added or deleted on a
regular basis. Less robust approaches to collecting Web data will "break" and cease to return
valid output when a change is made to the target Web page. A fragile system delivers only a
fragment of the value when Web content changes and doesn't allow for time series analytics. A
more robust solution features resiliency to change and, in the long run, delivers higher value.
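
One common way to make extraction less brittle is to try several patterns for the same field and to return nothing rather than a wrong value when none of them match. The following is a minimal sketch of that idea, not a description of any particular product. The HTML snippets and the patterns in `PRICE_PATTERNS` are hypothetical examples of how a page layout might change over time.

```python
import re
from typing import Optional

# Hypothetical patterns for the same price field across successive page layouts.
# They are tried in order, from most specific to least specific.
PRICE_PATTERNS = [
    re.compile(r'<span class="price">\s*\$?([\d,]+\.\d{2})\s*</span>'),
    re.compile(r'itemprop="price"\s+content="([\d,]+\.\d{2})"'),
    re.compile(r'\$([\d,]+\.\d{2})'),  # last resort: first dollar amount found
]


def extract_price(html: str) -> Optional[float]:
    """Return the first price any pattern finds, or None if no pattern matches."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            return float(match.group(1).replace(",", ""))
    return None  # caller can log the miss instead of storing a bad value
```

Returning `None` on a miss lets the collector record a gap in the time series and flag the page for review, instead of silently breaking or storing invalid output.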

Q. What is "deep" Web data, and why is it more valuable than "surface" Web data?

A. Deep Web data, which builds on IDC's traditional definition of Web data, is typically data that
can't be crawled or accessed at all except through some kind of authentication process.
A typical place for such data is in a document management system that is available via the
Web, but only through authentication. However, many organizations now view the deep Web
as the layers below the surface of a typical Web site. For example, the comments section of a
Web-based ecommerce site might be buried 30 or 40 levels deep within the organization's Web
site; some types of crawlers and aggregators wouldn't easily be able to find this type of
information. Organizations often want to look at this information because there can be real
value in it. What is hidden deep within the system can often reveal more insights than data at
the surface level. The ability to ferret out all of the information contained in the deep Web will be
more valuable to organizations than just looking at what is easily crawled at a surface level.

Q. How can enterprises tap into deep Web data, and what are the stumbling blocks to
doing this? What complementary technologies should they consider, and/or how can
they simplify this process?

A. The barriers to accessing deep Web data typically involve the inability to obtain that data
through a standard RSS feed or a standard Twitter API feed. Organizations may
collect information at this surface level, but extra processing is required, such as in the case
of shortened URLs. There are different shortening techniques for compressing Web site
locators to fit within the 140-character maximum length of a Twitter message. One approach is
to use technology that can expand the shortened URLs and then follow them down 20, 30, or 40
levels, or however many levels it takes to get at the relevant information. Technologies are
available today that can help automate this process, and they are worthy of consideration for
extracting value from deep Web data.
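
As an illustration of the extra processing that shortened URLs require, the sketch below resolves a short link to its final destination by following a chain of redirects. It is a simplified example under stated assumptions: the `next_hop` callable is a hypothetical stand-in for whatever performs the lookup (in practice, typically an HTTP request that reads the redirect target), and the `short.example` addresses are invented.

```python
from typing import Callable, Optional


def expand_url(url: str,
               next_hop: Callable[[str], Optional[str]],
               max_hops: int = 10) -> str:
    """Follow redirects until next_hop returns None, then return the final URL.

    next_hop(url) is assumed to return the redirect target for url,
    or None when url does not redirect any further.
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None:
            return url
        if target in seen:
            raise ValueError(f"redirect loop detected at {target}")
        seen.add(target)
        url = target
    raise ValueError(f"gave up after {max_hops} redirects")


# Usage with an in-memory redirect table (hypothetical addresses).
redirects = {
    "https://short.example/abc": "https://short.example/xyz",
    "https://short.example/xyz": "https://shop.example/tv/reviews?page=3",
}
final_url = expand_url("https://short.example/abc", redirects.get)
```

Injecting the lookup as a function keeps the loop and hop limit testable without network access; the same function works unchanged when the lookup is backed by real HTTP requests.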

There are also technologies that include an authentication method for sites that require
a user ID and a password. The actual crawling is then automated in the system, as if an end
user were pulling up the information and manipulating and extracting it. Then the data can be
handed off in some fashion to another system for something like sentiment analysis or
content analytics to actually understand what's being said on that page or in that set of
comments.

Once you have identified relevant data sources and the technologies required to access
them, the next step is to identify the technologies needed to extract the valuable information
(such as product numbers, prices, descriptions, comments, and other fields), normalize
that information, and then place it into some kind of structured repository such
as a database or search system. These tools often have to be tailored to the kinds of Web
data that are being collected, but they are essential to the process of deep Web
data collection.
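
The extract, normalize, and store steps can be sketched roughly as follows. This is an illustrative example only: the field names (`sku`, `price`, `description`), the `products` table, and the use of SQLite as the structured repository are all assumptions made for the sketch, not details from any specific tool.

```python
import sqlite3
from decimal import Decimal
from typing import Iterable


def normalize(record: dict) -> tuple:
    """Turn one raw scraped record into a clean (sku, price_cents, description) row."""
    sku = record["sku"].strip().upper()
    # Strip currency symbols and thousands separators, then store whole cents.
    raw_price = record["price"].replace("$", "").replace(",", "").strip()
    price_cents = int(Decimal(raw_price) * 100)
    # Collapse runs of whitespace left over from the page markup.
    description = " ".join(record["description"].split())
    return sku, price_cents, description


def store(conn: sqlite3.Connection, records: Iterable[dict]) -> None:
    """Write normalized records into a hypothetical products table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(sku TEXT PRIMARY KEY, price_cents INTEGER, description TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
        [normalize(r) for r in records],
    )
    conn.commit()
```

Storing prices as integer cents avoids floating-point rounding when values from many sites are later compared or averaged.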

Q. What are some specific use cases and vertical market applications for deep Web data?

A. From a market research standpoint, there are many different applications where deep Web
data can be used to gain insights. Manufacturers of 35in. large-screen TVs, for example,
could use deep Web extraction technology to pull the pricing information from other Web
sites or from Web-based catalogs. This software can collect product and pricing information
from vendors such as Wal-Mart, Target, Best Buy, Amazon.com, and many others in an
automatic fashion. These types of applications collect all of the relevant information, extract
it, aggregate it, and then place the data in one or more relational database tables. A TV
manufacturer using this type of system could then find out what the current prices are for TVs
and could also go back to previous months or even years to determine pricing trends.
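
Once price observations are stored in a relational table, a pricing trend reduces to a grouped query. The sketch below assumes a hypothetical `price_observations` table with one row per price seen on a retailer site on a given day, with dates stored as `YYYY-MM-DD` text; the table and column names are illustrative.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per observed price per retailer per day.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS price_observations (
               sku TEXT, retailer TEXT, observed_on TEXT, price_cents INTEGER)"""
    )


def monthly_average_prices(conn: sqlite3.Connection, sku: str) -> list:
    """Return [(YYYY-MM, average price in dollars), ...] in chronological order."""
    rows = conn.execute(
        """SELECT substr(observed_on, 1, 7) AS month,
                  AVG(price_cents) / 100.0 AS avg_price
           FROM price_observations
           WHERE sku = ?
           GROUP BY substr(observed_on, 1, 7)
           ORDER BY month""",
        (sku,),
    )
    return rows.fetchall()
```

Because each observation keeps its date, the same table answers both questions raised above: the current price and how prices have moved over previous months or years.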

Another potentially interesting application is in the pharmaceuticals industry. A pharmaceutical
manufacturer can see what prices are charged for its products on targeted Web sites anywhere
in the world. If products are being sold below market value in one part of the world, this can
indicate a black market– or potentially white market–type sales activity. A manufacturer can
examine these sites and the aggregated data to try to understand why some locales are selling
products at prices that seem to be below market level.
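
A simple way to surface such outliers is to compare each locale's price with a reference value and flag anything well below it. The sketch below uses the median of all collected prices as the reference and a cutoff of 80 percent; both choices, along with the sample regions and prices, are assumptions for illustration, and prices are assumed to be already converted to a common currency.

```python
from statistics import median
from typing import List, Tuple


def flag_below_market(listings: List[Tuple[str, float]],
                      threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return the (region, price) pairs priced below threshold * median price."""
    reference = median(price for _, price in listings)
    return [(region, price) for region, price in listings
            if price < threshold * reference]


# Hypothetical prices for one product, collected from sites in four regions.
listings = [("Region A", 100.0), ("Region B", 98.0),
            ("Region C", 105.0), ("Region D", 60.0)]
suspicious = flag_below_market(listings)
```

Flagged entries are only candidates for investigation; a low price can also reflect local regulation, promotions, or currency effects.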

Appliances are another common use case where deep Web data can be very useful.
Perhaps consumers are looking at reviews for washing machines in an effort to weigh
reliability against price before they make a purchase. It would certainly be helpful
for the manufacturers to understand what the consumers are saying about their washing
machines and what potential buyers might see if they went to these sites. Manufacturers can
collect this deep Web information from all of these different sites — whether retail sites,
repair sites, review sites, or competitor sites — to find out what people are saying about
washers with regard to reliability, price, and even ease of use. Many similar use cases fall
into this category.

Another market research use is trying to understand future buying patterns by conducting
trend analysis. What's trending in terms of hot new smartphones, best-selling books, or video
games? What are people talking about on Twitter and on social media Web sites? Are they
talking about the latest weight loss medication approved by the FDA? Who is spending
money and where? IDC is seeing a lot of companies starting to think about trend analysis.
Data supporting trend analysis can be used to design future products. A manufacturer can
look to the Web to find out what features of a new phone are being discussed or what
features are being disparaged. This type of information is valuable to designers and
engineers because it provides a view into what customers are actually thinking about when
they use a product.
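
As a rough illustration of how collected comments can feed this kind of analysis, the sketch below counts how many comments mention each product feature. It is deliberately minimal: it handles single-word feature names only, does no sentiment scoring, and the comments and feature list are invented examples.

```python
import re
from collections import Counter
from typing import Iterable, List


def feature_mentions(comments: Iterable[str], features: List[str]) -> Counter:
    """Count the number of comments that mention each single-word feature."""
    counts: Counter = Counter()
    for text in comments:
        # Lowercase and split into words so "Battery" and "battery," both match.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(f for f in features if f in words)
    return counts


# Hypothetical comments collected from review sites.
comments = [
    "Battery life is great",
    "The camera is blurry and the battery drains fast",
    "Love the screen",
]
counts = feature_mentions(comments, ["battery", "camera", "screen", "keyboard"])
```

Running the same count over comments grouped by week or month shows which features are gaining or losing attention over time.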

Deep Web data has many uses in market research, and IDC expects that more and more
organizations will have a deep Web data collection and use strategy as part of their ongoing
market research efforts.

ABOUT THIS ANALYST

Dave Schubmehl is research manager for IDC's search, content analytics, and discovery research. His research covers
information access technologies including content analytics, search systems, unstructured information representation,
unified access to structured and unstructured information, Big Data, visualization, and rich media search. This research
analyzes the trends and dynamics of the content analytics, search, and discovery software markets and the costs, benefits,
and workflow impacts of solutions that use these technologies.

ABOUT THIS PUBLICATION

This publication was produced by IDC Go-to-Market Services. The opinion, analysis, and research results presented herein
are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor
sponsorship is noted. IDC Go-to-Market Services makes IDC content available in a wide range of formats for distribution by
various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.

COPYRIGHT AND RESTRICTIONS

Any IDC information or reference to IDC that is to be used in advertising, press releases, or promotional materials requires
prior written approval from IDC. For permission requests, contact the GMS information line at 508-988-7610 or gms@idc.com.
Translation and/or localization of this document requires an additional license from IDC.

For more information on IDC, visit www.idc.com. For more information on IDC GMS, visit www.idc.com/gms.

Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com

©2012 IDC
