SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Machine Learning and
Information Retrieval
Zoubin Ghahramani
Department of Engineering
University of Cambridge
Machine Learning
• Machine learning is an interdisciplinary field focusing on both the
mathematical foundations and practical applications of systems that
learn, reason and act.
• Other related terms: Pattern Recognition, Neural Networks, Data
Mining, Adaptive Control, Decision Theory, Statistical Modelling...
Applications of Machine Learning
• Bioinformatics
• Robotics
• Computer vision
• Modelling Brain Imaging and Neural Data
• Financial prediction
• Collaborative filtering
• Information retrieval…
What is Information Retrieval?
• finding material from within a large unstructured collection
(e.g. the internet) that satisfies the user’s information need
(e.g. expressed via a query).
• well known examples…
• …but there are many specialist search systems as well:
Traditional approach to information retrieval
• user types a text query
• system returns an ordered list of items
A new approach to retrieval
• user inputs a small set of items
• system returns a larger set of items that belong
in the concept or category exemplified by the
query set
• a simple example:
– query = {Monday, Wednesday}
– return = {Monday, Tuesday, Wednesday,
Thursday, Friday, Saturday, Sunday}
(work with Katherine A. Heller, UCL)
Universe of items being searched…
Imagine a universe of items:
….
The items could be:
images, music, documents, websites, publications, proteins,
news stories, customer profiles, products, medical records, …
or any other type of item one might want to query.
Query set:
Result:
Query set:
Result:
Illustrative example
Ranking items
• Rank each item in the universe by how well it would “fit
into” a set which includes the query set
• Limit output to the top few items
Query:
Ranking:
Best Worst
Key Technical Step
• Information retrieval using Bayesian model comparison:
: item being scored
: query set of items
Key Advantages of Our Approach
• Novel search paradigm for retrieval
– queries are a small set of examples
• Based on:
– principled statistical methods (Bayesian machine learning)
– recent psychological research into models of human
categorization and generalization
• Extremely fast
– search >100,000 records per second on a laptop computer
– uses sparse matrix methods
– easy to parallelize and use inverted indices to search billions
of records/sec
Some Example Applications
• literature search: searching scientific literature, patent databases, or news articles
by giving a small set of example articles
• targeted advertising: finding similar customers as represented by their buying
patterns
• biomedical search: searching for sets of similar patients based on medical records
• drug discovery: searching for similar sets of proteins based on sequence,
annotations, structure, literature
• collaborative filtering: finding similar movies, music, books, based on matching
your preferences to other people’s preferences
• online shopping: searching for products by giving a few examples
• online dating services / social networks: searching for people based on profiles
• finance: finding similar companies / stocks based on patterns of transactions
Prototype systems that we have built
• movie search
• automatic thesaurus
• finding similar authors of scientific articles
• searching academic literature
• image retrieval
• protein search
Movie Search
• 1813 people rating 1532 movies
• query is a small set of movies
• system searches for other movies that would
fit into this set based on the ratings
Movie Search Example Results
• Query:
– Gone with the wind
– Casablanca
• Result (top hits):
– Gone with the wind (1939)
– Casablanca (1942)
– The African Queen (1951)
– The Philadelphia Story (1940)
– My Fair Lady (1964)
– The Adventures of Robin Hood (1938)
– The Maltese Falcon (1941)
– Rebecca (1940)
– Singing in the Rain (1952)
– It Happened One Night (1934)
Movie Search Example Results
• Query:
– Mary Poppins
– Toy Story
• Result (top hits):
– Mary Poppins
– Toy Story
– Winnie the Pooh
– Cinderella
– The Love Bug
– Bedknobs and Broomsticks
– Davy Crockett
– The Parent Trap
– Dumbo
– The Sound of Music
Searching Academic Literature
• Query:
– A. Smola
– B. Scholkopf
• Result (top hits):
– A. Smola
– B. Scholkopf
– S. Mika
– G. Ratsch
– R. Williamson
– K. Muller
– J. Weston
– J. Shawe-Taylor
– V. Vapnik
– T. Onoda
these two researchers published
conference papers in the area of
“support vector machines”
these are additional researchers who
published conference papers in the
area of “support vector machines”
Image Retrieval Results for
Query: “sunset”
These are the top 9 images returned. Our system finds images of sunsets
using only the color and texture features of these unlabelled images.
Results for Query: “sign”
These are the top 9 images returned. It finds images of signs using
only the color and texture features of these unlabelled images.
Results for Query: “fireworks”
These are the top 9 images returned.
Protein Search
• Proteins are the fundamental building blocks of life; our
genes code for proteins
• Understanding the functions of and relationships
between proteins is essential for bioscience,
biomedicine, and drug discovery (a multi-billion dollar
industry).
• We have built a protein retrieval system to search
UniProt, an annotated database of 200,000+ proteins
Summary
• We have a new approach to searching based on providing a
small set of examples.
• Our approach is based on sound statistical theory, machine
learning methods, and psychological research.
• Answering queries is extremely fast, the system is very easy to
implement, and search can be parallelized easily over a grid of
computers.
• We have built several prototype systems to illustrate the
applicability of this approach.
• There are many other very practical real-world applications.
Appendix
A Prototype Image Retrieval System
• A system for searching large collections of unlabelled images.
• You enter a word, e.g. “sunset”, and it retrieves images that match
this label, using only color and texture features of the images
• A database of 32,000 images
–Labelled Images: 10,000 images with about 3-10 text labels per image
–Unlabelled Images: 22,000 images
–Each image is represented by 240 binary color and texture features, no
other information is used
• A vocabulary of about 2000 keywords
• Goal: we want to search the unlabelled images using queries which are
subsets of the labelled images associated with keywords
The Image Retrieval Prototype System
The Algorithm:
1. Input query word: e.g. w=“sunset”
2. Find all labelled images with label w
3. Use those images as a query set
4. Return the unlabelled images with the highest probability
of belonging with the query set
The algorithm is very fast:
about 0.2 sec on a laptop to query 22,000 test images
code can be further optimized and parallelized
Example Labelled Images for “sunset”
These are 9 random images that were labelled “sunset” in the labelled training data.
Notice that these images are quite variable, and the labelling subjective and somewhat noisy.
Our retrieval system does very well and is quite robust to ambiguous categories and poor labelling.

Weitere ähnliche Inhalte

Andere mochten auch

Theory Generation for Security Protocols
Theory Generation for Security ProtocolsTheory Generation for Security Protocols
Theory Generation for Security Protocolsbutest
 
REQUEST FOR PROPOSAL PROCEDURES
REQUEST FOR PROPOSAL PROCEDURESREQUEST FOR PROPOSAL PROCEDURES
REQUEST FOR PROPOSAL PROCEDURESbutest
 
DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...
DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...
DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...butest
 
Lecture 16
Lecture 16Lecture 16
Lecture 16butest
 
2.1 Temporal RSCDF extension
2.1 Temporal RSCDF extension2.1 Temporal RSCDF extension
2.1 Temporal RSCDF extensionbutest
 
l2r.cs.uiuc.edu
l2r.cs.uiuc.edul2r.cs.uiuc.edu
l2r.cs.uiuc.edubutest
 
Relational Transfer in Reinforcement Learning
Relational Transfer in Reinforcement LearningRelational Transfer in Reinforcement Learning
Relational Transfer in Reinforcement Learningbutest
 

Andere mochten auch (7)

Theory Generation for Security Protocols
Theory Generation for Security ProtocolsTheory Generation for Security Protocols
Theory Generation for Security Protocols
 
REQUEST FOR PROPOSAL PROCEDURES
REQUEST FOR PROPOSAL PROCEDURESREQUEST FOR PROPOSAL PROCEDURES
REQUEST FOR PROPOSAL PROCEDURES
 
DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...
DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...
DATA-DEPENDENT MODELS OF SPECIES-HABITAT RELATIONSHIPS D. Todd ...
 
Lecture 16
Lecture 16Lecture 16
Lecture 16
 
2.1 Temporal RSCDF extension
2.1 Temporal RSCDF extension2.1 Temporal RSCDF extension
2.1 Temporal RSCDF extension
 
l2r.cs.uiuc.edu
l2r.cs.uiuc.edul2r.cs.uiuc.edu
l2r.cs.uiuc.edu
 
Relational Transfer in Reinforcement Learning
Relational Transfer in Reinforcement LearningRelational Transfer in Reinforcement Learning
Relational Transfer in Reinforcement Learning
 

Ähnlich wie MLforIR.pps

MLforIR.pps
MLforIR.ppsMLforIR.pps
MLforIR.ppsbutest
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiVijay Susheedran C G
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introductionNeeraj Tewari
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisStuart Wrigley
 
Ch14-Part4-ImageRetrieval.pdf
Ch14-Part4-ImageRetrieval.pdfCh14-Part4-ImageRetrieval.pdf
Ch14-Part4-ImageRetrieval.pdfAbdullah Azzeh
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalVikas Bhushan
 
[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning
[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning
[CVPR 2018] Visual Search (Image Retrieval) and Metric LearningNAVER Engineering
 
HyperMembrane Structures for Open Source Cognitive Computing
HyperMembrane Structures for Open Source Cognitive ComputingHyperMembrane Structures for Open Source Cognitive Computing
HyperMembrane Structures for Open Source Cognitive ComputingJack Park
 
Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptxAkhirulAminulloh2
 
Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Rohit Kumar
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptAravindReddy565690
 
Image Search: Then and Now
Image Search: Then and NowImage Search: Then and Now
Image Search: Then and NowSi Krishan
 
Data science for advanced dummies
Data science for advanced dummiesData science for advanced dummies
Data science for advanced dummiesSaurav Chakravorty
 
ICPSR Data Services at Kenyon College - David Thomas
ICPSR Data Services at Kenyon College - David ThomasICPSR Data Services at Kenyon College - David Thomas
ICPSR Data Services at Kenyon College - David Thomasjoemmurphy
 
Learning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog PostingsLearning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog PostingsSaltlux Inc.
 
Recommender Systems and Learning Analytics in TEL
Recommender Systems and Learning Analytics in TELRecommender Systems and Learning Analytics in TEL
Recommender Systems and Learning Analytics in TELHendrik Drachsler
 

Ähnlich wie MLforIR.pps (20)

MLforIR.pps
MLforIR.ppsMLforIR.pps
MLforIR.pps
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in Chennai
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
Improving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log AnalysisImproving Semantic Search Using Query Log Analysis
Improving Semantic Search Using Query Log Analysis
 
Ch14-Part4-ImageRetrieval.pdf
Ch14-Part4-ImageRetrieval.pdfCh14-Part4-ImageRetrieval.pdf
Ch14-Part4-ImageRetrieval.pdf
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
 
[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning
[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning
[CVPR 2018] Visual Search (Image Retrieval) and Metric Learning
 
HyperMembrane Structures for Open Source Cognitive Computing
HyperMembrane Structures for Open Source Cognitive ComputingHyperMembrane Structures for Open Source Cognitive Computing
HyperMembrane Structures for Open Source Cognitive Computing
 
Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptx
 
Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.Pattern recognition and Machine Learning.
Pattern recognition and Machine Learning.
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).ppt
 
Image Search: Then and Now
Image Search: Then and NowImage Search: Then and Now
Image Search: Then and Now
 
Data science for advanced dummies
Data science for advanced dummiesData science for advanced dummies
Data science for advanced dummies
 
eScience Resources for the Chemistry Community from the Royal Society of Chem...
eScience Resources for the Chemistry Community from the Royal Society of Chem...eScience Resources for the Chemistry Community from the Royal Society of Chem...
eScience Resources for the Chemistry Community from the Royal Society of Chem...
 
ICPSR Data Services at Kenyon College - David Thomas
ICPSR Data Services at Kenyon College - David ThomasICPSR Data Services at Kenyon College - David Thomas
ICPSR Data Services at Kenyon College - David Thomas
 
Learning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog PostingsLearning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog Postings
 
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
 
Utilizing Online Databases for the Purpose of Structure Identification – Appr...
Utilizing Online Databases for the Purpose of Structure Identification – Appr...Utilizing Online Databases for the Purpose of Structure Identification – Appr...
Utilizing Online Databases for the Purpose of Structure Identification – Appr...
 
Recommender Systems and Learning Analytics in TEL
Recommender Systems and Learning Analytics in TELRecommender Systems and Learning Analytics in TEL
Recommender Systems and Learning Analytics in TEL
 

Mehr von butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

MLforIR.pps

  • 1. Machine Learning and Information Retrieval Zoubin Ghahramani Department of Engineering University of Cambridge
  • 2. Machine Learning • Machine learning is an interdisciplinary field focusing on both the mathematical foundations and practical applications of systems that learn, reason and act. • Other related terms: Pattern Recognition, Neural Networks, Data Mining, Adaptive Control, Decision Theory, Statistical Modelling...
  • 3. Applications of Machine Learning • Bioinformatics • Robotics • Computer vision • Modelling Brain Imaging and Neural Data • Financial prediction • Collaborative filtering • Information retrieval…
  • 4. What is Information Retrieval? • finding material from within a large unstructured collection (e.g. the internet) that satisfies the user’s information need (e.g. expressed via a query). • well known examples… • …but there are many specialist search systems as well:
  • 5. Traditional approach to information retrieval • user types a text query • system returns an ordered list of items
  • 6. A new approach to retrieval • user inputs a small set of items • system returns a larger set of items that belong in the concept or category exemplified by the query set • a simple example: – query = {Monday, Wednesday} – return = {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} (work with Katherine A. Heller, UCL)
  • 7. Universe of items being searched… Imagine a universe of items: …. The items could be: images, music, documents, websites, publications, proteins, news stories, customer profiles, products, medical records, … or any other type of item one might want to query.
  • 9. Ranking items • Rank each item in the universe by how well it would “fit into” a set which includes the query set • Limit output to the top few items Query: Ranking: Best Worst
  • 10. Key Technical Step • Information retrieval using Bayesian model comparison: : item being scored : query set of items
  • 11. Key Advantages of Our Approach • Novel search paradigm for retrieval – queries are a small set of examples • Based on: – principled statistical methods (Bayesian machine learning) – recent psychological research into models of human categorization and generalization • Extremely fast – search >100,000 records per second on a laptop computer – uses sparse matrix methods – easy to parallelize and use inverted indices to search billions of records/sec
  • 12. Some Example Applications • literature search: searching scientific literature, patent databases, or news articles by giving a small set of example articles • targeted advertising: finding similar customers as represented by their buying patterns • biomedical search: searching for sets of similar patients based on medical records • drug discovery: searching for similar sets of proteins based on sequence, annotations, structure, literature • collaborative filtering: finding similar movies, music, books, based on matching your preferences to other people’s preferences • online shopping: searching for products by giving a few examples • online dating services / social networks: searching for people based on profiles • finance: finding similar companies / stocks based on patterns of transactions
  • 13. Prototype systems that we have built • movie search • automatic thesaurus • finding similar authors of scientific articles • searching academic literature • image retrieval • protein search
  • 14. Movie Search • 1813 people rating 1532 movies • query is a small set of movies • system searches for other movies that would fit into this set based on the ratings
  • 15. Movie Search Example Results • Query: – Gone with the wind – Casablanca • Result (top hits): – Gone with the wind (1939) – Casablanca (1942) – The African Queen (1951) – The Philadelphia Story (1940) – My Fair Lady (1964) – The Adventures of Robin Hood (1938) – The Maltese Falcon (1941) – Rebecca (1940) – Singing in the Rain (1952) – It Happened One Night (1934)
  • 16. Movie Search Example Results • Query: – Mary Poppins – Toy Story • Result (top hits): – Mary Poppins – Toy Story – Winnie the Pooh – Cinderella – The Love Bug – Bedknobs and Broomsticks – Davy Crockett – The Parent Trap – Dumbo – The Sound of Music
  • 17. Searching Academic Literature • Query: – A. Smola – B. Scholkopf • Result (top hits): – A. Smola – B. Scholkopf – S. Mika – G. Ratsch – R. Williamson – K. Muller – J. Weston – J. Shawe-Taylor – V. Vapnik – T. Onoda these two researchers published conference papers in the area of “support vector machines” these are additional researchers who published conference papers in the area of “support vector machines”
  • 18. Image Retrieval Results for Query: “sunset” These are the top 9 images returned. Our system finds images of sunsets using only the color and texture features of these unlabelled images.
  • 19. Results for Query: “sign” These are the top 9 images returned. It finds images of signs using only the color and texture features of these unlabelled images.
  • 20. Results for Query: “fireworks” These are the top 9 images returned.
  • 21. Protein Search • Proteins are the fundamental building blocks of life; our genes code for proteins • Understanding the functions of and relationships between proteins is essential for bioscience, biomedicine, and drug discovery (a multi-billion dollar industry). • We have built a protein retrieval system to search UniProt, an annotated database of 200,000+ proteins
  • 22. Summary • We have a new approach to searching based on providing a small set of examples. • Our approach is based on sound statistical theory, machine learning methods, and psychological research. • Answering queries is extremely fast, the system is very easy to implement, and search can be parallelized easily over a grid of computers. • We have built several prototype systems to illustrate the applicability of this approach. • There are many other very practical real-world applications.
  • 24. A Prototype Image Retrieval System • A system for searching large collections of unlabelled images. • You enter a word, e.g. “sunset”, and it retrieves images that match this label, using only color and texture features of the images • A database of 32,000 images –Labelled Images: 10,000 images with about 3-10 text labels per image –Unlabelled Images: 22,000 images –Each image is represented by 240 binary color and texture features, no other information is used • A vocabulary of about 2000 keywords • Goal: we want to search the unlabelled images using queries which are subsets of the labelled images associated with keywords
  • 25. The Image Retrieval Prototype System The Algorithm: 1. Input query word: e.g. w=“sunset” 2. Find all labelled images with label w 3. Use those images as a query set 4. Return the unlabelled images with the highest probability of belonging with the query set The algorithm is very fast: about 0.2 sec on a laptop to query 22,000 test images code can be further optimized and parallelized
  • 26. Example Labelled Images for “sunset” These are 9 random images that were labelled “sunset” in the labelled training data. Notice that these images are quite variable, and the labelling subjective and somewhat noisy. Our retrieval system does very well and is quite robust to ambiguous categories and poor labelling.

Hinweis der Redaktion

  1. Hi, I’ll be presenting our work on Bayesian information retrieval. This is joint work with Katherine Heller. Let me first describe what I mean by information retrieval…
  2. We could call this problem ‘clustering on demand’. Assume a universe of objects (D) which could be images, web pages, documents, movies, proteins, or any other object we wish to form queries on. For example here we have a potentially very large universe of objects, which includes a yellow train, a blue car, a London bus, a tomato, and a computational neuroscientist. (pause) We want an algorithm which does the following:
  3. Given a query, a very small set (Dc) of the objects in our universe, for example here we have a red car and a blue car. We want our algorithm to return all the objects which belong to the concept, exemplified by the query. So for the query red car, blue car, we want it to return all the objects which are cars. While for the query red car, tomato, we want it to return all the red objects.
  4. Our approach to this problem is to rank each object in our universe D by how well it would fit into a set which includes the query, D_C, or how relevant it is to the query. So for the query red car blue car we can rank all of the objects from best fit to worst fit. We use a Bayesian model-based probabilistic relevance criterion to do this and we limit the output to the top few items.
  5. We’ve created an image ret system which searches large collections of unlab images.