Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011

Manichean Progress:
Positive and Negative
States of the Art
in Web-Scale Data
Lewis Shepherd
Microsoft Institute
for Advanced Technology in
Government

My cautionary personal note on Data
“If all others accepted the lie which the Party
imposed - if all records told the same tale -
then the lie passed into history and became
truth. ‘Who controls the past,’ ran the Party
slogan, ‘controls the future: who controls the
present controls the past.’”

George Orwell, Nineteen Eighty-Four

Murray Feshbach,
Demographer & Revolutionary Spark
• Following many years of continuous decline, infant mortality in the Soviet Union started
inexplicably to rise in the early 1970s from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974.
The TsSU continued to print the infant mortality series for a few years after the alarming reversal
of the long-term trend, but it stopped open publication of the data in 1975.
• Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980
depicting the deteriorating state of public health in the USSR and--with what later proved to be an
accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet
Union was continuing to rise.
• The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial
changes in public health policies.
• [Full publication of] infant mortality rates was not resumed until twelve years later, in Narodnoye
Khozyaystvo, 1987.
• The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant
mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of
bad news. In the case of infant mortality, as in many similar cases, the data on adverse
developments were simply deleted from the open literature.
• It took an alarming and well-publicized American report to alert higher authorities to the critical
situation and to introduce remedies.
Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007

Tim O’Reilly
Government as a Platform Evangelist
on “The World’s 7 Most Powerful Data Scientists”
• Elizabeth Warren: The banking system excesses that led to the economic
crash of 2008 are an example of big data gone wrong. As the provisional
head of the Consumer Finance Protection Bureau, Elizabeth Warren began
the job of building the algorithmic checks and balances needed to counter
the sorcerer’s apprentices of Wall Street. In her campaign for the US
Senate, she promises to continue that fight.

• …when she was working on the Consumer Finance Protection Board, she
was thinking hard about what role technology could play in building a
truly 21st century regulatory agency, and in my books, that will have to
mean what I've been calling "algorithmic regulation."
Forbes.com / G+ / Nov. 3, 2011 (emphasis added)
https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1

Tim O’Reilly
Government as a Platform Evangelist
on “The World’s 7 Most Powerful Data Scientists”
• My feeling is that someone who is likely to have a major
influence on regulating the data scientists on Wall Street is a
good person to put on a list like this. Yes, I do want them
regulated, and this was a way of giving Elizabeth Warren a
push. I do think that if anyone will help stand up for the rest
of us, she will. And I wanted a chance to plant a few ideas
about how that regulation ought to happen (algorithmically,
in the same way that Google manages search quality.)

Blog Comment / Nov. 4, 2011 (emphasis added)
http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604

Breaking down Data Barriers
Semantic Knowledge for Commodity
Computing

Evelyne Viegas, Microsoft Research, USA
Li Ding, Rensselaer Polytechnic Institute
Natasa Milic-Frayling, Microsoft Research, UK
Haixun Wang, Microsoft Research, Asia
Kuansan Wang, Microsoft Research, USA

Vision – Enable Next Generation Experiences by
working with academia, stakeholders from
industry, government, and
consumers/innovators to make sense of data

DATA > INFORMATION > KNOWLEDGE >
INTELLIGENCE

Data/Information
• To help explore the data value chain, Microsoft’s collaborations
provide access to data that enables:
– Innovation – By having access to real world data, researchers
can unveil new analysis or research directions based on shared
assets and explore new questions
– Science – By allowing wider use of data, experiments can be
repeated and data misrepresentations or faulty results
avoided
– Training – Real-world large-scale data is a powerful tool for
training the next generation of data analysts and researchers

• Cloud-based services: Web Language and Query Language Models
– Used to research topics such as human speech, spelling,
information extraction, learning, and machine translation.

It’s a data-driven world
– Spell Checking
– Machine Translation
– Search queries + click through
– Online games skill matching
– …

Data logs behaviours more reliably than demographic
studies or surveys, for studying and predicting trends

(Banko and Brill, 2001) – effectiveness of statistical NLP techniques depends
heavily on the size of the data used to develop them
(Norvig, 2008) – it is the size of the data, not the sophistication of the algorithms,
that ultimately plays the central role in modern NLP

Data has become a first class citizen

Data for Open Innovation - Challenges

With web users becoming producers of
information, leaving the footprint of their lives in
digital trails, it is becoming easier for “data
snoopers” to reconstruct the identity of an
individual or an organization by cross linking
information from different sources

A Face Is Exposed for Searcher No. 4417749

“Search query data can contain the sum total of
our work, interests, associations, desires, dreams,
fantasies, and even darkest fears,” said Lauren Weinstein,
a privacy advocate.

The New York Times, Aug 2006
Thelma Arnold's identity was betrayed by the records of her Web searches

Web N-gram Services
Access to up to petabytes of real world data

http://research.microsoft.com/web-ngram

Leading technology in Search, Machine Translation,
Speech, Learning, …

Web N-Gram in Public Beta
Web data has
structure…

…and that counts
(e.g. Body, Title, Anchor)

Exploring Web Scale Language models for
Search Query Processing, in WWW’2010

Applications Examples using Web
Ngram Services

Multi-word Tag Cloud from Government
Dataset Titles

Ref: Dr. Li Ding, Rensselaer Polytechnic Institute
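
One way to picture how a multi-word tag cloud could be derived from dataset titles is to count word pairs that recur across titles. The sketch below is only an illustration: it counts bigrams locally and does not call the Web N-gram service, and the titles and the `top_phrases` helper are hypothetical.

```python
from collections import Counter

def top_phrases(titles, n=2, min_count=2):
    """Count n-word phrases across titles; keep those seen at least min_count times."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

# Hypothetical dataset titles, for illustration only.
titles = [
    "Air Quality Index by County",
    "Air Quality Monitoring Sites",
    "Toxic Release Inventory by County",
]
phrases = top_phrases(titles)
```

The surviving phrases and their counts could then be used to size the entries of a tag cloud.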

Query Segmentation
Body:

Title:

Anchor:
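
Query segmentation with a language model can be sketched as a search for the split whose segments are jointly most probable. The example below is a minimal sketch under stated assumptions: the probabilities in `PHRASE_PROB` and the `UNSEEN` floor are made-up stand-ins for scores a web-scale model would supply, and a single toy table is used where the slide distinguishes body, title, and anchor models.

```python
import math

# Hypothetical phrase probabilities (assumption: not taken from any real model).
PHRASE_PROB = {
    "new": 0.02, "york": 0.005, "times": 0.01, "square": 0.004,
    "new york": 0.004, "times square": 0.002, "new york times": 0.0015,
}
UNSEEN = 1e-9  # assumed floor for phrases the toy table has never seen

def score(words):
    """Log-probability of treating the given words as one segment."""
    return math.log(PHRASE_PROB.get(" ".join(words), UNSEEN))

def segment(query):
    """Dynamic programming over split points, maximising the summed log-probability."""
    words = query.lower().split()
    # best[i] holds (score, segments) for the first i words.
    best = [(0.0, [])] + [(-math.inf, [])] * len(words)
    for end in range(1, len(words) + 1):
        for start in range(end):
            prev_score, prev_segments = best[start]
            candidate = prev_score + score(words[start:end])
            if candidate > best[end][0]:
                best[end] = (candidate, prev_segments + [" ".join(words[start:end])])
    return best[-1][1]

segments = segment("new york times square")
```

Swapping in a different probability table (for example, one per document part) would change which split wins, which is the point the Body, Title, and Anchor comparison makes.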

Big Data and Machine Learning
to the rescue of
• Machine Translation
• Audio/Speech
• Motion/Gestures
Text: Paraphrasing in English
http://labs.microsofttranslator.com/thesaurus/

Sentence:
“many are dismayed by his
behaviour”

Audio: Search Over Audio
http://www.msravs.com/audiosearch_demo/

Meaning of Utterances:
Search Over Audio
http://www.msravs.com/audiosearch_demo/

Gestures: Kinect SDK
http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk

It’s now a Knowledge World

From Patterns to Meanings

Semantics as the study of Meaning
• Data semantics – extract and map from structured and
semi-structured sources into ontologies
• Lexical semantics – identify/learn concepts, roles from
sentences (e.g. Powerset; MindNet)
• Statistical semantics – discover meaning from patterns of
use (e.g. concept similarity)
• Computational semantics – automate the process of
constructing and reasoning with meaning representations
• Semantic web – linked data via URI, common graph
structure with RDF, inferences via ontologies and OWL
• Formal semantics – in linguistics? in logic?

Probase : A Knowledge Base for Text
Understanding
http://research.microsoft.com/en-us/projects/probase/

WordNet / Wikipedia / Freebase / Probase

Cat
– WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
– Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758
– Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
– Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...

IBM
– WordNet: N/A
– Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
– Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
– Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...

Language
– WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter
– Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
– Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
– Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic; Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...

Probase has a big concept space

• Probase: 2.7 M concepts, automatically harnessed from 1.68 billion pages
• Freebase: 2 K concepts, built by community effort
• Cyc: 120 K concepts, 25 years human labor

Uncertainty: Probase vs. Freebase
• Probase: Correctness is a probability. Live with dirty data. Dirty data is very useful.
• Freebase: Knowledge is black and white. Clean up everything. Dirty data is unusable.

What’s in your mind when you see the
word ‘apple’
[Figure: chart with a y-axis running from 0 to 6000; axis label: concepts]

When the machine sees ‘apple’ and
‘pear’ together
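
The idea that a second word narrows the meaning of the first can be sketched with a naive Bayes style combination over term and concept co-occurrence counts. This is an assumption of the sketch, not a description of Probase internals: the counts in `COUNTS` are invented, and `concept_given_terms` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (term, concept) co-occurrence counts; not real Probase data.
COUNTS = {
    ("apple", "fruit"): 60,
    ("apple", "company"): 40,
    ("pear", "fruit"): 50,
    ("pear", "tree"): 10,
    ("microsoft", "company"): 80,
}

def concept_given_terms(terms):
    """Return normalised scores P(concept) * product of P(term | concept)."""
    concept_total = defaultdict(int)
    for (_, concept), n in COUNTS.items():
        concept_total[concept] += n
    grand_total = sum(concept_total.values())

    scores = {}
    for concept, total in concept_total.items():
        s = total / grand_total  # prior P(concept)
        for term in terms:
            s *= COUNTS.get((term, concept), 0) / total  # P(term | concept), unsmoothed
        if s > 0:
            scores[concept] = s
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()} if norm else {}

alone = concept_given_terms(["apple"])
together = concept_given_terms(["apple", "pear"])
```

With these toy counts, ‘apple’ alone stays split between fruit and company, while ‘apple’ and ‘pear’ together leave only fruit. A real system would need smoothing so that one unseen pair does not zero out a concept.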

Probase Internals
• artist / painter: Picasso (Born: 1881; Died: 1973; …; Movement: Cubism)
• art / painting: Guernica (Year: 1937; Type: Oil on Canvas; …)

Interim Product: Academic Search

http://academic.research.microsoft.com/

Zentity 2.0 – Research Output Platform
New Features:
• Default web UI with CSS support and custom ASP.NET controls
• Pivot Viewer (de facto browser)
• Open Data Protocol

Flexible data model enables
many scenarios and can be
easily extended over time

A semantic computing platform to store and
expose relationships between digital assets

http://research.microsoft.com/zentity/

Pattern Discovery and Semantic Interpretation:
Graph of Co-occurring Flickr Tags
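
A graph of co-occurring tags can be built by counting how often each pair of tags appears on the same photo. The following is a minimal sketch with made-up tag lists. It does not use the Flickr API or NodeXL, and `cooccurrence_edges` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(photos):
    """Map each unordered tag pair to the number of photos carrying both tags."""
    edges = Counter()
    for tags in photos:
        # Sort and de-duplicate so (a, b) and (b, a) count as the same edge.
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical tag lists, one per photo.
photos = [
    ["beach", "sunset", "sea"],
    ["beach", "sea"],
    ["sunset", "city"],
]
edges = cooccurrence_edges(photos)
```

The edge weights are the raw pattern. Deciding what a heavily weighted pair means is the interpretation step the slide title refers to.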

Pattern Discovery and Sociological Interpretation:
‘Commenting’ Activity on Flickr

Flickr users who commented on Marc_Smith’s photos (more than 4 times)
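
The "more than 4 times" cut-off amounts to filtering a comment network by edge weight. A small sketch, assuming a hypothetical log of (commenter, photo owner) pairs. The user names other than Marc_Smith are invented.

```python
from collections import Counter

def frequent_commenters(comment_log, owner, more_than=4):
    """Return users who commented on the owner's photos more than `more_than` times."""
    counts = Counter(
        commenter for commenter, photo_owner in comment_log if photo_owner == owner
    )
    return {user: n for user, n in counts.items() if n > more_than}

# Hypothetical comment log: (commenter, owner of the photo commented on).
comment_log = (
    [("alice", "Marc_Smith")] * 5
    + [("bob", "Marc_Smith")] * 4
    + [("carol", "someone_else")] * 6
)
kept = frequent_commenters(comment_log, "Marc_Smith")
```

Raising or lowering the threshold trades graph readability against coverage of occasional commenters.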

Semantics of Network Patterns:
NodeXL
http://nodexl.codeplex.com

TWITTER NodeXL Graph
“Bing” at 2:30 AM Monday, July 12, 2010

From Pattern to Meaning:
Email

• Validation of pattern analysis requires human input.
• Meaning can be considered globally accepted or strictly contextual, generally understood or individually constructed.

Summary
• The challenge is not so much in the standards for representations (isn’t this still just syntax?) and pattern discovery but really in the interpretation and validation of that interpretation.
• ‘Meaning’ has different connotations in different contexts.
• The challenge is in determining and addressing the right level of granularity.

Thank you
• Evelyne Viegas, Microsoft Research, USA
• Li Ding, Rensselaer Polytechnic Institute
• Natasa Milic-Frayling, Microsoft Research, UK
• Haixun Wang, Microsoft Research, Asia
• Kuansan Wang, Microsoft Research, USA

Lewis Shepherd lewiss@microsoft.com
@lewisshepherd
