The document discusses the challenges and opportunities of making sense of large-scale web data through semantic analysis and knowledge representation. It describes Microsoft Research's efforts to develop techniques and platforms like Probase, a large knowledge base of concepts extracted from web data. The document outlines applications of these techniques for natural language processing tasks like word segmentation and query understanding. It also discusses the challenges of moving from patterns in data to meaningful interpretations and validating the semantic meaning extracted.
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011
1. Manichean Progress:
Positive and Negative
States of the Art
in Web-Scale Data
Lewis Shepherd
Microsoft Institute
for Advanced Technology in
Government
2. My cautionary personal note on Data
“If all others accepted the lie which the Party
imposed - if all records told the same tale -
then the lie passed into history and became
truth. 'Who controls the past' ran the Party
slogan, 'controls the future: who controls the
present controls the past.’”
George Orwell, Nineteen Eighty-Four
3. Murray Feshbach,
Demographer & Revolutionary Spark
• Following many years of continuous decline, infant mortality in the Soviet Union started
inexplicably to rise in the early 1970s from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974.
The TsSU continued to print the infant mortality series for a few years after the alarming reversal
of the long-term trend, but it stopped open publication of the data in 1975.
• Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980
depicting the deteriorating state of public health in the USSR and--with what later proved to be an
accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet
Union was continuing to rise.
• The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial
changes in public health policies.
• [Full publication of] infant mortality rates was not resumed until twelve years later, in Narodnoye
Khozyaystvo, 1987
• The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant
mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of
bad news. In the case of infant mortality, as in many similar cases, the data on adverse
developments were simply deleted from the open literature.
• It took an alarming and well-publicized American report to alert higher authorities to the critical
situation and to introduce remedies.
Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007
4. Tim O’Reilly
Government as a Platform Evangelist
on “The World’s 7 Most Powerful Data Scientists”
• Elizabeth Warren: The banking system excesses that led to the economic
crash of 2008 are an example of big data gone wrong. As the provisional
head of the Consumer Finance Protection Bureau, Elizabeth Warren began
the job of building the algorithmic checks and balances needed to counter
the sorcerer’s apprentices of Wall Street. In her campaign for the US
Senate, she promises to continue that fight.
• …when she was working on the Consumer Finance Protection Board, she
was thinking hard about what role technology could play in building a
truly 21st century regulatory agency, and in my books, that will have to
mean what I've been calling "algorithmic regulation."
Forbes.com / G+ / Nov. 3, 2011 (emphasis added)
https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1
5. Tim O’Reilly
Government as a Platform Evangelist
on “The World’s 7 Most Powerful Data Scientists”
• My feeling is that someone who is likely to have a major
influence on regulating the data scientists on Wall Street is a
good person to put on a list like this. Yes, I do want them
regulated, and this was a way of giving Elizabeth Warren a
push. I do think that if anyone will help stand up for the rest
of us, she will. And I wanted a chance to plant a few ideas
about how that regulation ought to happen (algorithmically,
in the same way that Google manages search quality.)
Blog Comment / Nov. 4, 2011 (emphasis added)
http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604
6. Breaking down Data Barriers
Semantic Knowledge for Commodity
Computing
Evelyne Viegas, Microsoft Research, USA
Li Ding, Rensselaer Polytechnic Institute
Natasa Milic-Frayling, Microsoft Research, UK
Haixun Wang, Microsoft Research, Asia
Kuansan Wang, Microsoft Research, USA
7. Vision – Enable Next Generation Experiences by
working with academia, stakeholders from
industry, government, and
consumers/innovators to make sense of data
DATA > INFORMATION > KNOWLEDGE >
INTELLIGENCE
8. Data/Information
• To help explore the data value chain, Microsoft’s collaborations
provide access to data that enables:
– Innovation – By having access to real world data, researchers
can unveil new analysis or research directions based on shared
assets and explore new questions
– Science – Wider use of data makes experiments repeatable and
helps avoid data misrepresentations or faulty results
– Training – real-world large-scale data is a powerful tool for
training the next generation of data analysts and researchers
• Cloud-based services: Web Language and Query Language Models
– Used to research topics such as human speech, spelling,
information extraction, learning, and machine translation.
9. It’s a data-driven world
– Spell Checking
– Machine Translation
– Search queries + click through
– Online games skill matching
– …
Data logs capture behaviours more reliably than demographic
studies or surveys for studying and predicting trends
(Banko and Brill, 2001) – the effectiveness of statistical NLP techniques depends
heavily on the size of the data used to develop them
(Norvig, 2008) – it is the size of the data, not the sophistication of the algorithms,
that ultimately plays the central role in modern NLP
10. Data has become a first class citizen
IT’S A DATA-DRIVEN WORLD
11. Data for Open Innovation - Challenges
With web users becoming producers of
information, leaving the footprint of their lives in
digital trails, it is becoming easier for “data
snoopers” to reconstruct the identity of an
individual or an organization by cross linking
information from different sources
12. A Face Is Exposed for Searcher No. 4417749
“Search query data can contain the sum total of
our work, interests, associations, desires, dreams,
fantasies, and even darkest fears,” said Lauren Weinstein,
a privacy advocate.
The New York Times, Aug 2006
Thelma Arnold's identity was betrayed by the records of her Web searches
13. Web N-gram Services
Access to up to petabytes of real world data
http://research.microsoft.com/web-ngram
Leading technology in Search, Machine Translation,
Speech, Learning, …
14. Web N-Gram in Public Beta
Web data has
structure…
…and that counts
(e.g. Body, Title, Anchor)
Exploring Web Scale Language models for
Search Query Processing, in WWW’2010
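A minimal sketch of how n-gram counts can group adjacent words into multi-word tags such as critical-habitat. It assumes toy in-memory counts in place of web-scale statistics: the tables UNIGRAMS and BIGRAMS, the function multiword_tags, and the 0.5 threshold are all hypothetical and are not the Web N-gram service's API.

```python
from collections import Counter

# Toy counts standing in for web-scale n-gram statistics (hypothetical values).
UNIGRAMS = Counter({"critical": 50, "habitat": 40, "for": 500,
                    "endangered": 30, "species": 60})
BIGRAMS = Counter({("critical", "habitat"): 35,
                   ("habitat", "for"): 5,
                   ("endangered", "species"): 25})


def multiword_tags(tokens, threshold=0.5):
    """Greedily merge adjacent tokens when P(w2 | w1) >= threshold.

    P(w2 | w1) is estimated as count(w1, w2) / count(w1); unseen words
    and unseen pairs get probability 0 and are never merged.
    """
    tags = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            w1, w2 = tokens[i], tokens[i + 1]
            p = BIGRAMS[(w1, w2)] / UNIGRAMS[w1] if UNIGRAMS[w1] else 0.0
            if p >= threshold:
                tags.append(f"{w1}-{w2}")
                i += 2  # both tokens are consumed by the merged tag
                continue
        tags.append(tokens[i])
        i += 1
    return tags


if __name__ == "__main__":
    print(multiword_tags("critical habitat for endangered species".split()))
```

With these toy counts, "critical" and "habitat" are merged because 35 of the 50 occurrences of "critical" are followed by "habitat". A real system would likely use smoothed language-model probabilities rather than raw ratios.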
25. It’s now a Knowledge World
From Patterns to Meanings
26. Semantics as the study of Meaning
• Data semantics – extract and map from structured and
semi-structured sources into ontologies
• Lexical semantics – identify/learn concepts, roles from
sentences (e.g. Powerset; MindNet)
• Statistical semantics – discover meaning from patterns of
use (e.g. concept similarity)
• Computational semantics – automate the process of
constructing and reasoning with meaning representations
• Semantic web – linked data via URI, common graph
structure with RDF, inferences via ontologies and OWL
• Formal semantics – in linguistics? in logic?
27. Probase : A Knowledge Base for Text
Understanding
http://research.microsoft.com/en-us/projects/probase/
Cat
– WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
– Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758;
– Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
– Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...
IBM
– WordNet: N/A
– Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
– Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
– Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...
Language
– WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter;
– Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
– Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
– Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic; Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...
28. Probase has a big concept space
• Probase: 2.7 M concepts, automatically harnessed from 1.68 billion pages
• Freebase: 2 K concepts, built by community effort
• Cyc: 120 K concepts, 25 years human labor
29. Uncertainty
Probase vs. Freebase
• Probase: Correctness is a probability. Live with dirty data. Dirty data is very useful.
• Freebase: Knowledge is black and white. Clean up everything. Dirty data is unusable.
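A minimal sketch of what "correctness is a probability" can mean in practice: instead of storing a single yes/no fact, keep counts of how often a term was observed under each concept and turn them into a distribution. The counts in ISA_COUNTS and the function concept_distribution are hypothetical, and plain relative frequency is used here for illustration; it is not presented as Probase's actual scoring.

```python
def concept_distribution(isa_counts, term):
    """Return P(concept | term) as relative frequencies, most likely first.

    isa_counts maps a term to {concept: number of times the term was
    observed as an instance of that concept}. Unknown terms yield {}.
    """
    counts = isa_counts.get(term, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {concept: n / total for concept, n in ranked}


# Toy evidence counts (hypothetical): noisy extractions are kept, not deleted.
ISA_COUNTS = {
    "apple": {"fruit": 6, "company": 3, "tree": 1},
}

if __name__ == "__main__":
    for concept, p in concept_distribution(ISA_COUNTS, "apple").items():
        print(f"{concept}: {p:.2f}")
```

Rare or noisy observations simply receive low probability, which is one way to live with dirty data rather than cleaning everything up first.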
30. What’s in your mind when you see the
word ‘apple’
[Chart: number of concepts; vertical axis from 0 to 6000]
35. Zentity 2.0– Research Output Platform
New Features:
• Default web UI with CSS support and custom ASP.Net controls
• Pivot Viewer (de facto browser)
• Open Data Protocol
• Flexible data model enables many scenarios and can be easily extended over time
A semantic computing platform to store and
expose relationships between digital assets
http://research.microsoft.com/zentity/
38. Pattern Discovery and Sociological Interpretation:
‘Commenting’ Activity on Flickr
Flickr users who commented on Marc_Smith’s photos (more than 4 times)
39. Pattern Discovery and Sociological Interpretation:
‘Commenting’ Activity on Flickr
Flickr users who commented on Marc_Smith’s photos (more than 4 times)
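A minimal sketch of the filtering step behind such a view, assuming comment activity is available as (commenter, photo_owner) pairs. The sample data and the function frequent_commenters are hypothetical; the sketch does not use the Flickr API or NodeXL.

```python
from collections import Counter


def frequent_commenters(comments, owner, more_than=4):
    """Count comments left on `owner`'s photos and keep the frequent users.

    comments is an iterable of (commenter, photo_owner) pairs. Comments
    the owner leaves on their own photos are ignored. Only users with
    strictly more than `more_than` comments are returned.
    """
    counts = Counter(
        commenter
        for commenter, photo_owner in comments
        if photo_owner == owner and commenter != owner
    )
    return {user: n for user, n in counts.items() if n > more_than}


if __name__ == "__main__":
    # Toy activity log (hypothetical users).
    log = (
        [("alice", "Marc_Smith")] * 6
        + [("bob", "Marc_Smith")] * 4
        + [("carol", "someone_else")] * 9
        + [("Marc_Smith", "Marc_Smith")] * 7
    )
    print(frequent_commenters(log, "Marc_Smith"))
```

The threshold only discovers the pattern; what a frequent commenter means sociologically still has to be interpreted.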
40. Semantics of Network Patterns:
NodeXL
http://nodexl.codeplex.com
TWITTER NodeXL Graph
“Bing” at 2:30 AM Monday, July 12, 2010
41. From Pattern to Meaning:
Email
Validation of pattern analysis
requires human input.
Meaning can be considered
globally accepted or strictly
contextual, generally
understood or individually
constructed.
42. Summary
The challenge is not so much in the standards for
representations (isn’t this just still syntax?) and pattern
discovery but really in the interpretation and validation
of that interpretation.
‘Meaning’ has different connotations in different contexts.
The challenge is in determining and addressing the
right level of granularity.
43. Thank you
• Evelyne Viegas, Microsoft Research, USA
• Li Ding, Rensselaer Polytechnic Institute
• Natasa Milic-Frayling, Microsoft Research, UK
• Haixun Wang, Microsoft Research, Asia
• Kuansan Wang, Microsoft Research, USA
Lewis Shepherd lewiss@microsoft.com
@lewisshepherd
Editor's Notes
[Dumais, UMAP 2009]
Scaling to Very Very Large Corpora for Natural Language Disambiguation. Statistical learning as the ultimate agile development tool.
Here you can see why making content types such as title and anchor text available to the research community is better than providing body text alone, as they are more similar to users’ queries. Details can be found in the WWW paper.
The service is public and can be used for non-commercial purposes. It has been extended to researchers worldwide as part of its public beta launch, which took place at WWW in Raleigh, NC. What you see here is an application developed at WWW within 8 hours of the public launch: Dr. Li Ding from Rensselaer Polytechnic Institute used the Web N-gram service on a government dataset of titles to build a multi-word tag cloud, thus providing more relevant information. As an example, compare "critical" and "habitat" as separate tokens on the left with the multi-word tag "critical-habitat" on the right.
At MSR Asia, the Speech Group is working to make the "speech chain" smooth and robust when a machine is involved, developing spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communication. The group's current focus includes automatic speech recognition, so that computers can facilitate access to data, help create content, and perform tasks; speech synthesis, so that computers can speak with a human-sounding voice, respond and provide information, and read aloud; spoken-document retrieval and processing to enrich communication between people, such as converting voice-mail into text; and signal processing to improve the conditioning of signals and to change speech signal parameters such as pitch, speaking rate, and voice characteristics seamlessly. Extensions of statistical learning algorithms developed for speech to other pattern recognition applications, such as hand-written math equations and East-Asian character recognition, are being pursued jointly with other groups.