Manichean Progress:
Positive and Negative
States of the Art
in Web-Scale Data
Lewis Shepherd
Microsoft Institute
  for Advanced Technology in
Government
My cautionary personal note on Data
 “If all others accepted the lie which the Party
 imposed - if all records told the same tale -
 then the lie passed into history and became
 truth. 'Who controls the past' ran the Party
 slogan, 'controls the future: who controls the
 present controls the past.’”

                George Orwell, Nineteen Eighty-Four
Murray Feshbach,
      Demographer & Revolutionary Spark
•   Following many years of continuous decline, infant mortality in the Soviet Union started
    inexplicably to rise in the early 1970s from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974.
    The TsSU continued to print the infant mortality series for a few years after the alarming reversal
    of the long-term trend, but it stopped open publication of the data in 1975.
•   Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980
    depicting the deteriorating state of public health in the USSR and--with what later proved to be an
    accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet
    Union was continuing to rise.
•   The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial
    changes in public health policies.
•   [Full publication of ] Infant mortality rates were not resumed until twelve years later in Narodnoye
    Khozyaystvo, 1987
•   The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant
    mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of
    bad news. In the case of infant mortality, as in many similar cases, the data on adverse
    developments were simply deleted from the open literature.
•   It took an alarming and well-publicized American report to alert higher authorities to the critical
    situation and to introduce remedies.
                        Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007
Tim O’Reilly
       Government as a Platform Evangelist
   on “The World’s 7 Most Powerful Data Scientists”
• Elizabeth Warren: The banking system excesses that led to the economic
  crash of 2008 are an example of big data gone wrong. As the provisional
  head of the Consumer Finance Protection Bureau, Elizabeth Warren began
  the job of building the algorithmic checks and balances needed to counter
  the sorcerer’s apprentices of Wall Street. In her campaign for the US
  Senate, she promises to continue that fight.

• …when she was working on the Consumer Finance Protection Board, she
  was thinking hard about what role technology could play in building a
  truly 21st century regulatory agency, and in my books, that will have to
  mean what I've been calling "algorithmic regulation."
                          Forbes.com / G+ / Nov. 3, 2011 (emphasis added)
        https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1
Tim O’Reilly
      Government as a Platform Evangelist
  on “The World’s 7 Most Powerful Data Scientists”
• My feeling is that someone who is likely to have a major
  influence on regulating the data scientists on Wall Street is a
  good person to put on a list like this. Yes, I do want them
  regulated, and this was a way of giving Elizabeth Warren a
  push. I do think that if anyone will help stand up for the rest
  of us, she will. And I wanted a chance to plant a few ideas
  about how that regulation ought to happen (algorithmically,
  in the same way that Google manages search quality.)

                            Blog Comment / Nov. 4, 2011 (emphasis added)
          http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604
Breaking down Data Barriers
Semantic Knowledge for Commodity
Computing


 Evelyne Viegas, Microsoft Research, USA
 Li Ding, Rensselaer Polytechnic Institute
 Natasa Milic-Frayling, Microsoft Research, UK
 Haixun Wang, Microsoft Research, Asia
 Kuansan Wang, Microsoft Research, USA
Vision – Enable Next Generation Experiences by
working with academia, stakeholders from
industry, government, and
consumers/innovators to make sense of data


  DATA > INFORMATION > KNOWLEDGE >
             INTELLIGENCE
Data/Information
 • To help explore the data value chain, Microsoft’s collaborations
   provide access to data that enables:
    – Innovation – By having access to real world data, researchers
       can unveil new analysis or research directions based on shared
       assets and explore new questions
    – Science – By allowing wider use of data, experiments can be
       repeated and data misrepresentations or faulty results
       avoided
    – Training – real-world large-scale data is a powerful tool for
       training the next generation of data analysts and researchers


 • Cloud-based services: Web Language and Query Language Models
    – Used to research topics such as human speech, spelling,
      information extraction, learning, and machine translation.
It’s a data-driven world
    –   Spell Checking
    –   Machine Translation
    –   Search queries + click through
    –   Online games skill matching
    –   …

    Data logs behaviours in more reliable ways than demographic
      studies or surveys to study/predict trends

(Banko and Brill, 2001) – the effectiveness of statistical NLP techniques depends heavily
   on the size of the data used to develop them
(Norvig, 2008) – it is the size of the data, not the sophistication of the algorithms,
   that ultimately plays the central role in modern NLP
Data has become a first class citizen



Data for Open Innovation - Challenges


With web users becoming producers of
information, leaving the footprint of their lives in
digital trails, it is becoming easier for “data
snoopers” to reconstruct the identity of an
individual or an organization by cross linking
information from different sources
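A minimal sketch of the cross-linking risk described above. The two datasets, their field names, and every value are made up for illustration; the point is only that two sources that look harmless on their own can single out one person once they are joined on shared attributes.

```python
# Hypothetical data: an "anonymized" log keyed by a numeric ID, and a
# public directory that carries names. Neither comes from a real source.
search_log = [
    {"user_id": 1001, "town": "Springfield", "age_band": "60-69"},
    {"user_id": 1002, "town": "Shelbyville", "age_band": "30-39"},
]
public_directory = [
    {"name": "Jane Doe", "town": "Springfield", "age_band": "60-69"},
    {"name": "John Doe", "town": "Springfield", "age_band": "20-29"},
    {"name": "Richard Roe", "town": "Shelbyville", "age_band": "30-39"},
]


def link_records(log, directory, keys=("town", "age_band")):
    """Return {user_id: [candidate names]} by matching on shared attributes."""
    matches = {}
    for entry in log:
        candidates = [
            person["name"]
            for person in directory
            if all(person[k] == entry[k] for k in keys)
        ]
        matches[entry["user_id"]] = candidates
    return matches


linked = link_records(search_log, public_directory)
# A single remaining candidate means the "anonymous" ID is tied to a name.
reidentified = {uid: names[0] for uid, names in linked.items() if len(names) == 1}
```

The fewer people share a combination of attributes, the more likely a join like this leaves exactly one candidate.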
A Face Is Exposed for Searcher No. 4417749




“Search query data can contain the sum total of our work, interests, associations, desires, dreams,
fantasies, and even darkest fears,” said Lauren Weinstein, a privacy advocate.




                  The New York Times, Aug 2006
Thelma Arnold's identity was betrayed by the records of her Web searches
Web N-gram Services
Access to up to petabytes of real world data


           http://research.microsoft.com/web-ngram



Leading technology in Search, Machine Translation,
                Speech, Learning, …
Web N-Gram in Public Beta
  Web data has
  structure…

  …and that counts
  (e.g. Body, Title, Anchor)




Exploring Web Scale Language models for
Search Query Processing, in WWW’2010
Applications Examples using Web
         Ngram Services
Word Breaking
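One way word breaking can be done with n-gram statistics is sketched below. It does not call the Web N-gram service; the unigram counts, the unseen-word penalty, and the function names are assumptions chosen for illustration. The idea is to pick the split whose words are jointly most probable under the language model.

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for a web-scale language model.
# These numbers are invented for illustration only.
COUNTS = {"expert": 50, "experts": 40, "exchange": 60, "sex": 5, "change": 70, "ex": 3}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log-probability of a word; unseen words get a length-based penalty."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def word_break(text, max_word_len=20):
    """Split text into the most probable word sequence under the unigram model."""

    @lru_cache(maxsize=None)
    def best(i):
        # Returns (score, words) for the best segmentation of text[i:].
        if i == len(text):
            return 0.0, ()
        options = []
        for j in range(i + 1, min(len(text), i + max_word_len) + 1):
            score, rest = best(j)
            options.append((log_prob(text[i:j]) + score, (text[i:j],) + rest))
        return max(options)

    return list(best(0)[1])
```

With these toy counts, `word_break("expertsexchange")` prefers "experts exchange" over "expert sex change" because the two-word split has the higher joint probability.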




Multi-word Tag Cloud from Government
            Dataset Titles




      Ref: Dr. Li Ding, Rensselaer Polytechnic Institute
Query Segmentation
[Figure panels: Body, Title, Anchor]
Big Data and Machine Learning
        to the rescue of
    Machine Translation
      Audio/Speech
     Motion/Gestures
Text:   Paraphrasing in English
   http://labs.microsofttranslator.com/thesaurus/
Sentence:
“many are dismayed by his
behaviour”
Audio: Search Over Audio
http://www.msravs.com/audiosearch_demo/




Meaning of Utterances:
  Search Over Audio
http://www.msravs.com/audiosearch_demo/
Gestures:      Kinect SDK
http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk
It’s now a Knowledge World

From Patterns to Meanings
Semantics as the study of Meaning
• Data semantics – extract and map from structured and
    semi-structured sources into ontologies
•   Lexical semantics – identify/learn concepts, roles from
    sentences (e.g. Powerset; MindNet)
•   Statistical semantics – discover meaning from patterns of
    use (e.g. concept similarity)
•   Computational semantics – automate the process of
    constructing and reasoning with meaning representations
•   Semantic web – linked data via URI, common graph
    structure with RDF, inferences via ontologies and OWL
• Formal semantics – in linguistics? in logic?
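As a small illustration of the "statistical semantics" bullet, concept similarity can be estimated from patterns of use by comparing the contexts in which words appear. The sketch below is an assumption-laden toy: the corpus is made up, and the window size and cosine measure are one common choice, not a method named on the slide.

```python
import math
from collections import Counter

# Tiny made-up corpus; a real system would use web-scale text.
SENTENCES = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog ate the bone",
    "the stock price rose sharply",
    "the share price rose again",
]


def context_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


VECTORS = context_vectors(SENTENCES)
```

In this toy corpus "stock" and "share" occur in identical contexts, so `cosine(VECTORS["stock"], VECTORS["share"])` is higher than `cosine(VECTORS["stock"], VECTORS["cat"])`.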
Probase : A Knowledge Base for Text
                         Understanding
                                  http://research.microsoft.com/en-us/projects/probase/

Concepts returned for three terms by WordNet, Wikipedia, Freebase, and Probase:

Cat
  WordNet:   Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger;
             Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug;
             Excitant; Tracked vehicle; ...
  Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan
             species; Sequenced genomes; Animals described in 1758;
  Freebase:  TV episode; Creative work; Musical recording; Organism classification; Dated
             location; Musical release; Book; Musical album; Film character; Publication;
             Character species; Top level domain; Animal; Domesticated animal; ...
  Probase:   Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet;
             Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet;
             Vertebrate; ...

IBM
  WordNet:   N/A
  Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers;
             Companies based in Westchester County, New York; Multinational companies;
             Software companies of the United States; Top 100 US Federal Contractors; ...
  Freebase:  Business operation; Issuer; Literature subject; Venture investor; Competitor;
             Software developer; Architectural structure owner; Website owner; Programming
             language designer; Computer manufacturer/brand; Customer; Operating system
             developer; Processor manufacturer; ...
  Probase:   Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry
             leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology
             company; Supplier; Software vendor; Global company; Technology company; ...

Language
  WordNet:   Communication; Auditory communication; Word; Higher cognitive process; Faculty;
             Mental faculty; Module; Text; Textual matter;
  Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles
             with ASCII art
  Freebase:  Employer; Written work; Musical recording; Musical artist; Musical album;
             Literature subject; Query; Periodical; Type profile; Journal; Quotation subject;
             Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game
             content descriptor; ...
  Probase:   Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier;
             Cognitive process; Cognitive ability; Cultural difference; Ability;
             Characteristic;
             Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program;
             Media; City; ...
Probase has a big concept space

    Probase:   2.7 M concepts, automatically harnessed from 1.68 billion pages
    Freebase:  2 K concepts, built by community effort
    Cyc:       120 K concepts, 25 years human labor
Uncertainty: Probase vs. Freebase
  Probase:   Correctness is a probability. Live with dirty data. Dirty data is very useful.
  Freebase:  Knowledge is black and white. Clean up everything. Dirty data is unusable.
What’s in your mind when you see the
             word ‘apple’
[Chart: y-axis from 0 to 6000; series label: concepts]
When the machine sees ‘apple’ and
         ‘pear’ together
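A hedged sketch of how seeing two words together can narrow down the intended concept when correctness is treated as a probability. The isA counts below are invented, and the naive-Bayes style scoring is an assumption for illustration; it is not presented as Probase's actual model or data.

```python
from collections import defaultdict

# Invented isA evidence counts: (instance, concept) -> how often the pair
# was observed. Real Probase statistics are not reproduced here.
ISA_COUNTS = {
    ("apple", "fruit"): 600, ("apple", "company"): 400,
    ("pear", "fruit"): 300,
    ("microsoft", "company"): 900,
    ("banana", "fruit"): 500,
}


def conceptualize(instances, smoothing=1.0):
    """Rank concepts by P(concept) * prod P(instance | concept), normalized."""
    concept_totals = defaultdict(float)
    for (_, concept), n in ISA_COUNTS.items():
        concept_totals[concept] += n
    grand_total = sum(concept_totals.values())
    vocab = {inst for inst, _ in ISA_COUNTS}

    scores = {}
    for concept, total in concept_totals.items():
        score = total / grand_total  # prior P(concept)
        for inst in instances:
            n = ISA_COUNTS.get((inst, concept), 0)
            # Additive smoothing keeps unseen pairs from zeroing the score.
            score *= (n + smoothing) / (total + smoothing * len(vocab))
        scores[concept] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}
```

With these toy counts, `conceptualize(["apple"])` stays ambiguous between "fruit" and "company", while `conceptualize(["apple", "pear"])` puts nearly all of the probability on "fruit".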
Probase Internals
[Diagram: concept and instance graph]
  Concepts: artist, painter
    Instance: Picasso (Born: 1881; Died: 1973; …; Movement: Cubism)
  Concepts: art, painting
    Instance: Guernica (Year: 1937; Type: Oil on Canvas; …)
Probase search
Interim Product: Academic Search




http://academic.research.microsoft.com/
Zentity 2.0 – Research Output Platform
 Default web UI with CSS support and custom ASP.Net controls
 New Features:
   Pivot Viewer (de facto browser)
   Open Data Protocol




                                                               Flexible data model enables
                                                               many scenarios and can be
                                                               easily extended over time




A semantic computing platform to store and
expose relationships between digital assets

                                              http://research.microsoft.com/zentity/
Pattern Discovery and Semantic Interpretation:
Graph of Co-occurring Flickr Tags
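A co-occurrence graph of this kind can be built by counting how often two tags appear on the same photo. The sketch below uses made-up tag sets rather than Flickr data, and the `min_count` threshold is an assumed parameter for pruning weak edges.

```python
from collections import Counter
from itertools import combinations

# Made-up photo tag sets; not taken from Flickr.
PHOTOS = [
    {"beach", "sunset", "sea"},
    {"beach", "sea", "surf"},
    {"sunset", "sky"},
    {"beach", "sunset"},
]


def cooccurrence_edges(tag_sets, min_count=1):
    """Count how often each unordered tag pair appears on the same photo."""
    edges = Counter()
    for tags in tag_sets:
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(tags), 2):
            edges[(a, b)] += 1
    return {pair: n for pair, n in edges.items() if n >= min_count}


EDGES = cooccurrence_edges(PHOTOS, min_count=2)
```

Each key in `EDGES` is an edge between two tags and each value is its weight; pattern discovery then works on this weighted graph, while interpreting the clusters still needs a human reader.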
Pattern Discovery and Sociological Interpretation:
‘Commenting’ Activity on Flickr




  Flickr users who commented on Marc_Smith’s photos (more than 4 times)
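The "more than 4 times" cut-off in the caption amounts to a frequency threshold on the comment log. A minimal sketch, with a hypothetical log and user names:

```python
from collections import Counter

# Hypothetical comment log: one entry per comment, naming the commenter.
COMMENTS = ["ann", "bob", "ann", "ann", "cy", "ann", "ann",
            "bob", "cy", "cy", "cy", "cy", "cy"]


def frequent_commenters(comment_log, more_than=4):
    """Return {user: count} for users with strictly more than `more_than` comments."""
    counts = Counter(comment_log)
    return {user: n for user, n in counts.items() if n > more_than}


FREQUENT = frequent_commenters(COMMENTS)
```

Raising or lowering the threshold changes which users appear in the graph, so the sociological reading depends on that choice.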
Semantics of Network Patterns:
                          NodeXL
                          http://nodexl.codeplex.com

INTRODUCTION

TECHNIQUES AND
METRICS

USER RESEARCH

PRODUCT GROUP
ENGAGEMENT

FURTHER WORK




                                         TWITTER NodeXL Graph
                                         “Bing” at 2:30 AM Monday, July 12, 2010
From Pattern to Meaning:
Email

 • Validation of pattern analysis
   requires human input.
 • Meaning can be considered
   globally accepted or strictly
   contextual, generally
   understood or individually
   constructed.
Summary
 • The challenge is not so much in the standards for
   representations (isn’t this just still syntax?) and pattern
   discovery but really in the interpretation and validation
   of that interpretation.
 • ‘Meaning’ has different connotations in different contexts.
 • The challenge is in determining and addressing the
   right level of granularity.
Thank you
•   Evelyne Viegas, Microsoft Research, USA
•   Li Ding, Rensselaer Polytechnic Institute
•   Natasa Milic-Frayling, Microsoft Research, UK
•   Haixun Wang, Microsoft Research, Asia
•   Kuansan Wang, Microsoft Research, USA

Lewis Shepherd     lewiss@microsoft.com
                   @lewisshepherd
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011

More Related Content

What's hot

Data on Polarization, Peace, and Propaganda
Data on Polarization, Peace, and PropagandaData on Polarization, Peace, and Propaganda
Data on Polarization, Peace, and PropagandaIngmar Weber
 
Transforming instagram data into location intelligence
Transforming instagram data into location intelligenceTransforming instagram data into location intelligence
Transforming instagram data into location intelligencesuresh sood
 
Digital Demography - WWW'17 Tutorial - Part II
Digital Demography - WWW'17 Tutorial - Part IIDigital Demography - WWW'17 Tutorial - Part II
Digital Demography - WWW'17 Tutorial - Part IIIngmar Weber
 
GitHub as Transparency Device in Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in  Data Journalism, Open Data and Data ActivismGitHub as Transparency Device in  Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in Data Journalism, Open Data and Data ActivismLiliana Bounegru
 
Filth and lies: analysing social media
Filth and lies: analysing social mediaFilth and lies: analysing social media
Filth and lies: analysing social mediaDiana Maynard
 
Fake news and trust and distrust in fact checking sites
Fake news and trust and distrust in fact checking sitesFake news and trust and distrust in fact checking sites
Fake news and trust and distrust in fact checking sitesPetter Bae Brandtzæg
 
Social Media and Crisis Communication
Social Media and Crisis CommunicationSocial Media and Crisis Communication
Social Media and Crisis CommunicationAxel Bruns
 
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...Digital History
 
Doing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer SchoolsDoing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer SchoolsLiliana Bounegru
 
Sharing Data on the Web
Sharing Data on the WebSharing Data on the Web
Sharing Data on the Web3 Round Stones
 
National Academy of Sciences - Improving the quality of scientific research t...
National Academy of Sciences - Improving the quality of scientific research t...National Academy of Sciences - Improving the quality of scientific research t...
National Academy of Sciences - Improving the quality of scientific research t...gphelan
 
Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...
Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...
Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...cjbonk
 
Privacy isdeadgetoveritredux 10.12.2014
Privacy isdeadgetoveritredux 10.12.2014Privacy isdeadgetoveritredux 10.12.2014
Privacy isdeadgetoveritredux 10.12.2014protected7000
 
Mapping Online Publics: New Methods for Twitter Research
Mapping Online Publics: New Methods for Twitter ResearchMapping Online Publics: New Methods for Twitter Research
Mapping Online Publics: New Methods for Twitter ResearchAxel Bruns
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information RetrievalMatthew Lease
 
Librarians in the Intelligence Process
Librarians in the Intelligence ProcessLibrarians in the Intelligence Process
Librarians in the Intelligence Processdavidshumaker
 
HKU Data Curation MLIM7350 Class 7
HKU Data Curation MLIM7350 Class 7HKU Data Curation MLIM7350 Class 7
HKU Data Curation MLIM7350 Class 7Scott Edmunds
 
2013 Oxford Digital Humanities Summer School Workshop
2013 Oxford Digital Humanities Summer School Workshop2013 Oxford Digital Humanities Summer School Workshop
2013 Oxford Digital Humanities Summer School WorkshopEric Meyer
 
Press review dossier
Press review dossierPress review dossier
Press review dossierLucasnatacha
 

What's hot (20)

Data on Polarization, Peace, and Propaganda
Data on Polarization, Peace, and PropagandaData on Polarization, Peace, and Propaganda
Data on Polarization, Peace, and Propaganda
 
Transforming instagram data into location intelligence
Transforming instagram data into location intelligenceTransforming instagram data into location intelligence
Transforming instagram data into location intelligence
 
Digital Demography - WWW'17 Tutorial - Part II
Digital Demography - WWW'17 Tutorial - Part IIDigital Demography - WWW'17 Tutorial - Part II
Digital Demography - WWW'17 Tutorial - Part II
 
GitHub as Transparency Device in Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in  Data Journalism, Open Data and Data ActivismGitHub as Transparency Device in  Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in Data Journalism, Open Data and Data Activism
 
Filth and lies: analysing social media
Filth and lies: analysing social mediaFilth and lies: analysing social media
Filth and lies: analysing social media
 
Fake news and trust and distrust in fact checking sites
Fake news and trust and distrust in fact checking sitesFake news and trust and distrust in fact checking sites
Fake news and trust and distrust in fact checking sites
 
Social Media and Crisis Communication
Social Media and Crisis CommunicationSocial Media and Crisis Communication
Social Media and Crisis Communication
 
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
The Challenge of Digital Sources in the Web Age: Common Tensions Across Three...
 
Doing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer SchoolsDoing Digital Methods: Some Recent Highlights from Winter and Summer Schools
Doing Digital Methods: Some Recent Highlights from Winter and Summer Schools
 
Sharing Data on the Web
Sharing Data on the WebSharing Data on the Web
Sharing Data on the Web
 
National Academy of Sciences - Improving the quality of scientific research t...
National Academy of Sciences - Improving the quality of scientific research t...National Academy of Sciences - Improving the quality of scientific research t...
National Academy of Sciences - Improving the quality of scientific research t...
 
Cool Tools
Cool Tools Cool Tools
Cool Tools
 
Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...
Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...
Living in Tech City: 50+ Technology Trends and Innovations Transforming Workp...
 
Privacy isdeadgetoveritredux 10.12.2014
Privacy isdeadgetoveritredux 10.12.2014Privacy isdeadgetoveritredux 10.12.2014
Privacy isdeadgetoveritredux 10.12.2014
 
Mapping Online Publics: New Methods for Twitter Research
Mapping Online Publics: New Methods for Twitter ResearchMapping Online Publics: New Methods for Twitter Research
Mapping Online Publics: New Methods for Twitter Research
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
 
Librarians in the Intelligence Process
Librarians in the Intelligence ProcessLibrarians in the Intelligence Process
Librarians in the Intelligence Process
 
HKU Data Curation MLIM7350 Class 7
HKU Data Curation MLIM7350 Class 7HKU Data Curation MLIM7350 Class 7
HKU Data Curation MLIM7350 Class 7
 
2013 Oxford Digital Humanities Summer School Workshop
2013 Oxford Digital Humanities Summer School Workshop2013 Oxford Digital Humanities Summer School Workshop
2013 Oxford Digital Humanities Summer School Workshop
 
Press review dossier
Press review dossierPress review dossier
Press review dossier
 

Viewers also liked

Assignment 3 Jacqui Collins
Assignment 3   Jacqui CollinsAssignment 3   Jacqui Collins
Assignment 3 Jacqui Collinsjcollins006
 
顧客紀念日
顧客紀念日顧客紀念日
顧客紀念日shiehrm
 
How to become Certified Business Professional – CBP
How to become Certified Business Professional – CBPHow to become Certified Business Professional – CBP
How to become Certified Business Professional – CBPMustafa Abdulrazaq
 
Cipaganti [ whats next ] Focus or Diversify?
Cipaganti [ whats next ] Focus or Diversify?Cipaganti [ whats next ] Focus or Diversify?
Cipaganti [ whats next ] Focus or Diversify?Frisca Listyaningtyas
 
Creating a college support team
Creating a college support teamCreating a college support team
Creating a college support teamAtty Garfinkel
 
FDates Market Timing - Setups
FDates Market Timing - SetupsFDates Market Timing - Setups
FDates Market Timing - SetupsFDates
 
Krauthammer Corporate Presentation
Krauthammer Corporate PresentationKrauthammer Corporate Presentation
Krauthammer Corporate Presentationmboomen
 
Lpf.2014
Lpf.2014Lpf.2014
Lpf.2014jmu2m
 
Textual Analysis
Textual AnalysisTextual Analysis
Textual Analysisadamb33
 
Identifikasi senyawa diterpen
Identifikasi senyawa diterpenIdentifikasi senyawa diterpen
Identifikasi senyawa diterpenPharmacy
 
Google calendar
Google calendarGoogle calendar
Google calendarshiehrm
 

Viewers also liked (20)

Assignment 3 Jacqui Collins
Assignment 3   Jacqui CollinsAssignment 3   Jacqui Collins
Assignment 3 Jacqui Collins
 
What is history
What is historyWhat is history
What is history
 
Summit slide loop ny
Summit slide loop nySummit slide loop ny
Summit slide loop ny
 
顧客紀念日
顧客紀念日顧客紀念日
顧客紀念日
 
How to become Certified Business Professional – CBP
How to become Certified Business Professional – CBPHow to become Certified Business Professional – CBP
How to become Certified Business Professional – CBP
 
CT Presentation 21.03.2015
CT Presentation 21.03.2015CT Presentation 21.03.2015
CT Presentation 21.03.2015
 
Cipaganti [ whats next ] Focus or Diversify?
Cipaganti [ whats next ] Focus or Diversify?Cipaganti [ whats next ] Focus or Diversify?
Cipaganti [ whats next ] Focus or Diversify?
 
Creating a college support team
Creating a college support teamCreating a college support team
Creating a college support team
 
2010 Results Presentation
2010 Results Presentation2010 Results Presentation
2010 Results Presentation
 
Genes,brain & behavior1
Genes,brain & behavior1Genes,brain & behavior1
Genes,brain & behavior1
 
FDates Market Timing - Setups
FDates Market Timing - SetupsFDates Market Timing - Setups
FDates Market Timing - Setups
 
Students Practice Tracing Letters
Students Practice Tracing LettersStudents Practice Tracing Letters
Students Practice Tracing Letters
 
Fintech for students
Fintech for studentsFintech for students
Fintech for students
 
Krauthammer Corporate Presentation
Krauthammer Corporate PresentationKrauthammer Corporate Presentation
Krauthammer Corporate Presentation
 
Vaccines fast
Vaccines fastVaccines fast
Vaccines fast
 
Lpf.2014
Lpf.2014Lpf.2014
Lpf.2014
 
Brown act
Brown actBrown act
Brown act
 
Textual Analysis
Textual AnalysisTextual Analysis
Textual Analysis
 
Identifikasi senyawa diterpen
Identifikasi senyawa diterpenIdentifikasi senyawa diterpen
Identifikasi senyawa diterpen
 
Google calendar
Google calendarGoogle calendar
Google calendar
 

Similar to Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011

Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)James Hendler
 
Humanities in the Digital World
Humanities in the Digital WorldHumanities in the Digital World
Humanities in the Digital WorldDavid De Roure
 
AI and the Researcher: ChatGPT and DALL-E in Scholarly Writing and Publishing
AI and the Researcher: ChatGPT and DALL-E in Scholarly Writing and PublishingAI and the Researcher: ChatGPT and DALL-E in Scholarly Writing and Publishing
AI and the Researcher: ChatGPT and DALL-E in Scholarly Writing and PublishingErin Owens
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data. by Lewis Shepherd at AAAI OGK 2011

  • 1. Manichean Progress: Positive and Negative States of the Art in Web-Scale Data Lewis Shepherd Microsoft Institute for Advanced Technology in Government
  • 2. My cautionary personal note on Data “If all others accepted the lie which the Party imposed - if all records told the same tale - then the lie passed into history and became truth. 'Who controls the past' ran the Party slogan, 'controls the future: who controls the present controls the past.’” George Orwell, Nineteen Eighty-Four
  • 3. Murray Feshbach, Demographer & Revolutionary Spark • Following many years of continuous decline, infant mortality in the Soviet Union started inexplicably to rise in the early 1970s from 22.9 deaths per 1,000 live births in 1971 to 27.9 in 1974. The TsSU continued to print the infant mortality series for a few years after the alarming reversal of the long-term trend, but it stopped open publication of the data in 1975. • Christopher Davis and Murray Feshbach [Census Bureau] published a research report in 1980 depicting the deteriorating state of public health in the USSR and--with what later proved to be an accurate set of estimates for the missing years--suggesting that infant mortality in the Soviet Union was continuing to rise. • The Davis-Feshbach study was made available to high Soviet authorities who directed beneficial changes in public health policies. • [Full publication of ] Infant mortality rates were not resumed until twelve years later in Narodnoye Khozyaystvo, 1987 • The TsSU and the Ministry of Health of the USSR probably continued to collect statistics on infant mortality... The Soviet statistical system, however, was known for its reluctance to be the bearer of bad news. In the case of infant mortality, as in many similar cases, the data on adverse developments were simply deleted from the open literature. • It took an alarming and well-publicized American report to alert higher authorities to the critical situation and to introduce remedies. Vladimir G. Treml, Center for the Study of Intelligence, “Western Analysis and the Soviet Policymaking Process”, 2007
  • 4. Tim O’Reilly Government as a Platform Evangelist on “The World’s 7 Most Powerful Data Scientists” • Elizabeth Warren: The banking system excesses that led to the economic crash of 2008 are an example of big data gone wrong. As the provisional head of the Consumer Finance Protection Bureau, Elizabeth Warren began the job of building the algorithmic checks and balances needed to counter the sorcerer’s apprentices of Wall Street. In her campaign for the US Senate, she promises to continue that fight. • …when she was working on the Consumer Finance Protection Board, she was thinking hard about what role technology could play in building a truly 21st century regulatory agency, and in my books, that will have to mean what I've been calling "algorithmic regulation." Forbes.com / G+ / Nov. 3, 2011 (emphasis added) https://plus.google.com/u/0/107033731246200681024/posts/2NU9pZEZ5t1
  • 5. Tim O’Reilly Government as a Platform Evangelist on “The World’s 7 Most Powerful Data Scientists” • My feeling is that someone who is likely to have a major influence on regulating the data scientists on Wall Street is a good person to put on a list like this. Yes, I do want them regulated, and this was a way of giving Elizabeth Warren a push. I do think that if anyone will help stand up for the rest of us, she will. And I wanted a chance to plant a few ideas about how that regulation ought to happen (algorithmically, in the same way that Google manages search quality.) Blog Comment / Nov. 4, 2011 (emphasis added) http://ctovision.com/2011/11/the-worlds-7-most-powerful-data-scientists/#IDComment217149604
  • 6. Breaking down Data Barriers Semantic Knowledge for Commodity Computing Evelyne Viegas, Microsoft Research, USA Li Ding, Rensselaer Polytechnic Institute Natasa Milic-Frayling, Microsoft Research, UK Haixun Wang, Microsoft Research, Asia Kuansan Wang, Microsoft Research, USA
  • 7. Vision – Enable Next Generation Experiences by working with academia, stakeholders from industry, government, and consumers/innovators to make sense of data DATA > INFORMATION > KNOWLEDGE > INTELLIGENCE
  • 8. Data/Information • To help explore the data value chain, Microsoft’s collaborations provide access to data that enables: – Innovation – With access to real-world data, researchers can uncover new analyses or research directions based on shared assets and explore new questions – Science – Wider use of data allows experiments to be repeated and helps avoid data misrepresentations or faulty results – Training – Real-world large-scale data is a powerful tool for training the next generation of data analysts and researchers • Cloud-based services: Web Language and Query Language Models – Used to research topics such as human speech, spelling, information extraction, learning, and machine translation.
  • 9. It’s a data-driven world – Spell Checking – Machine Translation – Search queries + click through – Online games skill matching – … Data logs behaviours more reliably than demographic studies or surveys, which helps in studying and predicting trends. • (Banko and Brill, 2001): the effectiveness of statistical NLP techniques is highly susceptible to the data size used to develop them • (Norvig, 2008): it is the size of the data, not the sophistication of the algorithms, that ultimately plays the central role in modern NLP
  • 10. Data has become a first class citizen IT’S A DATA-DRIVEN WORLD
  • 11. Data for Open Innovation - Challenges With web users becoming producers of information, leaving the footprint of their lives in digital trails, it is becoming easier for “data snoopers” to reconstruct the identity of an individual or an organization by cross linking information from different sources
  • 12. A Face Is Exposed for Searcher No. 4417749 “Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears,” said Lauren Weinstein, a privacy advocate. The New York Times, Aug 2006 Thelma Arnold's identity was betrayed by the records of her Web searches
  • 13. Web N-gram Services Access to up to petabytes of real world data http://research.microsoft.com/web-ngram Leading technology in Search, Machine Translation, Speech, Learning, …
  • 14. Web N-Gram in Public Beta Web data has structure… …and that counts (e.g. Body, Title, Anchor) Exploring Web Scale Language models for Search Query Processing, in WWW’2010
  • 15. Applications Examples using Web Ngram Services
  • 17. Multi-word Tag Cloud from Government Dataset Titles Ref: Dr. Li Ding, Rensselaer Polytechnic Institute
  • 18. Query Segmentation Body: Title: Anchor:
  • 19. Big Data and Machine Learning at the rescue of Machine Translation Audio/Speech Motion/Gestures
  • 20. Text: Paraphrasing in English http://labs.microsofttranslator.com/thesaurus/
  • 21. Sentence: “many are dismayed by his behaviour”
  • 22. Audio: Search Over Audio http://www.msravs.com/audiosearch_demo/ http://labs.microsofttranslator.com/thesaurus/
  • 23. Meaning of Utterances: Search Over Audio http://www.msravs.com/audiosearch_demo/
  • 24. Gestures: Kinect SDK http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk
  • 25. It’s now a Knowledge World From Patterns to Meanings
  • 26. Semantics as the study of Meaning • Data semantics – extract and map from structured and semi-structured sources into ontologies • Lexical semantics – identify/learn concepts, roles from sentences (e.g. Powerset; MindNet) • Statistical semantics – discover meaning from patterns of use (e.g. concept similarity) • Computational semantics – automate the process of constructing and reasoning with meaning representations • Semantic web – linked data via URI, common graph structure with RDF, inferences via ontologies and OWL • Formal semantics – in linguistics? in logic?
  • 27. Probase: A Knowledge Base for Text Understanding http://research.microsoft.com/en-us/projects/probase/ Concepts listed for three terms in WordNet, Wikipedia, Freebase, and Probase:
      Cat
        WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
        Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758
        Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
        Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...
      IBM
        WordNet: N/A
        Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
        Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
        Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; Technology company; ...
      Language
        WordNet: Cognitive function; Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter
        Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
        Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
        Probase: Instance of: Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...
  • 28. Probase has a big concept space • Probase: 2.7 M concepts, automatically harnessed from 1.68 billion pages • Freebase: 2 K concepts, built by community effort • Cyc: 120 K concepts, 25 years human labor
  • 29. Uncertainty: Probase vs. Freebase • Probase: Correctness is a probability. Live with dirty data. Dirty data is very useful. • Freebase: Knowledge is black and white. Clean up everything. Dirty data is unusable.
  • 30. What’s in your mind when you see the word ‘apple’ [Chart: number of concepts, axis from 0 to 6000]
  • 31. When the machine sees ‘apple’ and ‘pear’ together
  • 32. Probase Internals • artist / painter: Picasso | Born: 1881 | Died: 1973 | … | Movement: Cubism • art / painting: Guernica | Year: 1937 | Type: Oil on Canvas | …
  • 34. Interim Product: Academic Search http://academic.research.microsoft.com/
  • 35. Zentity 2.0 – Research Output Platform • New Features: – Default web UI with CSS support – Pivot Viewer (de facto browser) and custom ASP.NET controls – Open Data Protocol – Flexible data model enables many scenarios and can be easily extended over time • A semantic computing platform to store and expose relationships between digital assets http://research.microsoft.com/zentity/
  • 36. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
  • 37. Pattern Discovery and Semantic Interpretation: Graph of Co-occurring Flickr Tags
  • 38. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr Flickr users who commented on Marc_Smith’s photos (more than 4 times)
  • 39. Pattern Discovery and Sociological Interpretation: ‘Commenting’ Activity on Flickr Flickr users who commented on Marc_Smith’s photos (more than 4 times)
  • 40. Semantics of Network Patterns: NodeXL http://nodexl.codeplex.com INTRODUCTION TECHNIQUES AND METRICS USER RESEARCH PRODUCT GROUP ENGAGEMENT FURTHER WORK TWITTER NodeXL Graph “Bing” at 2:30 AM Monday, July 12, 2010
  • 41. From Pattern to Meaning: Email • Validation of pattern analysis requires human input. • Meaning can be considered globally accepted or strictly contextual, generally understood or individually constructed.
  • 42. Summary • The challenge is not so much in the standards for representations (isn’t this just still syntax?) and pattern discovery but really in the interpretation and validation of that interpretation. • ‘Meaning’ has different connotations in different contexts. • The challenge is in determining and addressing the right level of granularity.
  • 43. Thank you • Evelyne Viegas, Microsoft Research, USA • Li Ding, Rensselaer Polytechnic Institute • Natasa Milic-Frayling, Microsoft Research, UK • Haixun Wang, Microsoft Research, Asia • Kuansan Wang, Microsoft Research, USA Lewis Shepherd lewiss@microsoft.com @lewisshepherd
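Slide 18 names query segmentation as an application of web-scale language models. As a rough illustration only, the sketch below picks the split of a query whose segments have the highest total log-probability, using a small dynamic program. It does not call the Web N-gram service and is not presented as the method from the WWW 2010 paper: the table `PHRASE_LOGPROB`, its values, and the `UNKNOWN_LOGPROB` penalty are hypothetical stand-ins for scores a real language model would supply.

```python
import math

# Hypothetical phrase log-probabilities; the values are invented for
# illustration and stand in for scores from a web-scale n-gram model.
PHRASE_LOGPROB = {
    "new": -4.0,
    "york": -5.0,
    "times": -4.5,
    "square": -5.0,
    "new york": -5.5,
    "times square": -6.0,
    "new york times": -7.0,
}
UNKNOWN_LOGPROB = -20.0  # assumed penalty for phrases missing from the table


def segment(query: str, max_len: int = 3) -> list[str]:
    """Return the segmentation of `query` with the highest total log-probability."""
    words = query.split()
    n = len(words)
    # best[i] holds (score, segments) for the best segmentation of words[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        # Try every segment of up to max_len words that ends at position `end`.
        for start in range(max(0, end - max_len), end):
            phrase = " ".join(words[start:end])
            score = best[start][0] + PHRASE_LOGPROB.get(phrase, UNKNOWN_LOGPROB)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [phrase])
    return best[n][1]


print(segment("new york times square"))
```

With these invented scores the query splits into "new york" and "times square" rather than "new york times" plus "square", because the two-phrase split has the higher combined score. Swapping in scores from title or anchor text instead of body text would change which split wins, which is the point slide 14 makes about structure.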

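Slides 29 to 31 treat concept membership as a probability and ask what a machine should infer when it sees ‘apple’ and ‘pear’ together. The sketch below is one simple way to express that idea, not Probase's documented algorithm: it estimates P(concept | instance) from co-occurrence counts and multiplies the estimates across instances. The `COUNTS` table is hypothetical and its numbers are invented, not taken from Probase.

```python
import math

# Hypothetical (instance, concept) co-occurrence counts; invented numbers.
COUNTS = {
    ("apple", "fruit"): 60,
    ("apple", "company"): 30,
    ("apple", "tree"): 10,
    ("pear", "fruit"): 45,
    ("pear", "tree"): 5,
}


def concept_given_instance(instance: str) -> dict[str, float]:
    """Estimate P(concept | instance) as a normalized count."""
    row = {c: n for (i, c), n in COUNTS.items() if i == instance}
    total = sum(row.values())
    return {c: n / total for c, n in row.items()} if total else {}


def conceptualize(instances: list[str]) -> dict[str, float]:
    """Score the concepts shared by all instances.

    Each shared concept gets the product of its per-instance probabilities,
    and the scores are renormalized to sum to 1.
    """
    dists = [concept_given_instance(i) for i in instances]
    if not dists:
        return {}
    # Keep only concepts that every instance has been observed with.
    shared = set(dists[0]).intersection(*dists[1:])
    scores = {c: math.prod(d[c] for d in dists) for c in shared}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else {}


print(conceptualize(["apple"]))
print(conceptualize(["apple", "pear"]))
```

On its own, ‘apple’ keeps a sizeable ‘company’ reading; adding ‘pear’ removes it, because ‘pear’ was never observed as a company in this toy table. Requiring a concept to be shared by every instance is a deliberately strict choice; a smoothed variant would tolerate noisy counts better.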
Editor's Notes

  1. [Dumais, UMAP 2009]
  2. Scaling to Very Very Large Corpora for Natural Language Disambiguation. Statistical learning as the ultimate agile development tool.
  3. Here you can see why making content types such as title and anchor text available to the research community is more useful than body text alone: they are more similar to users’ queries. Details can be seen in the WWW paper.
  4. The service is public and can be used for non-commercial purposes. It has now been extended to researchers worldwide as part of its public beta launch, which happened at WWW in Raleigh, NC. What you see here is an application developed at WWW within 8 hours of the public launch: Dr. Li Ding from Rensselaer Polytechnic Institute used the web n-gram service on a government dataset of titles to build a multi-word tag cloud, thus providing more relevant information. As an example, compare "critical" and "habitat" as separate tokens on the left with the multi-word tag "critical-habitat" on the right.
  5. At MSR Asia, the Speech Group is working to make the "speech chain" smooth and robust when there is a machine involved, working to develop spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communications. The group's current focus includes automatic speech recognition to enable computers to facilitate access to data, help create content, and perform tasks; speech synthesis to enable computers to speak with a human-sounding voice, to respond and provide information, and to read; spoken-document retrieval and processing to enrich communication between people like converting voice-mail into text; signal processing to improve the conditioning of signals, change speech signal parameters like pitch, speaking rates, and voice characteristics in a seamless way. Extension of statistical learning algorithms developed in speech-to-other pattern recognition applications like hand-written math equations and East-Asian character recognition are being pursued jointly with other groups.
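Note 4 describes joining words such as "critical" and "habitat" into a single multi-word tag. The sketch below approximates that effect under loud assumptions: it scores adjacent word pairs by pointwise mutual information (PMI) computed from the titles themselves, not from Web N-gram service counts, and it is not Dr. Ding's implementation. The sample titles, the `STOPWORDS` set, `min_count`, and `threshold` are all hypothetical.

```python
import math
from collections import Counter

# Assumed minimal stopword list; a real tag cloud would use a fuller one.
STOPWORDS = {"for", "of", "the", "and", "in"}


def multiword_tags(titles, min_count=2, threshold=1.0):
    """Return {"word1-word2": count} for adjacent pairs with high PMI."""
    unigrams = Counter()
    bigrams = Counter()
    for title in titles:
        # Naive tokenization: lowercase and split on whitespace only.
        words = title.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    tags = {}
    for (a, b), n in bigrams.items():
        if n < min_count or a in STOPWORDS or b in STOPWORDS:
            continue
        # PMI compares how often the pair occurs with what independence predicts.
        p_pair = n / total_pairs
        p_a = unigrams[a] / total_words
        p_b = unigrams[b] / total_words
        if math.log(p_pair / (p_a * p_b)) >= threshold:
            tags[f"{a}-{b}"] = n
    return tags


# Hypothetical dataset titles, invented for this example.
sample_titles = [
    "critical habitat for salmon",
    "critical habitat for owls",
    "water quality data",
    "air quality data",
]
print(multiword_tags(sample_titles))
```

On these four titles the pairs "critical habitat" and "quality data" each occur twice and clear the PMI threshold, so they become hyphenated tags, while one-off pairs and pairs containing a stopword are dropped. Counts from a web-scale corpus would give far more reliable scores than a handful of titles can.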