Learning from
Descriptive Text

     Tamara L Berg
 Stony Brook University
Vision, Language, Humans

Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.

It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long.
Visually Descriptive Text
                      “It was an arresting face, pointed of chin, square of jaw. Her eyes
                      were pale green without a touch of hazel, starred with bristly black
                      lashes and slightly tilted at the ends. Above them, her thick black
                      brows slanted upward, cutting a startling oblique line in her
                      magnolia-white skin–that skin so prized by Southern women and so
                      carefully guarded with bonnets, veils and mittens against hot
                      Georgia suns” – from Gone with the Wind by Margaret Mitchell

Visually descriptive language provides:
• information about how people construct natural language for imagery. (How do people describe the world?)
• information about the world, especially the visual world. (How does the world work?)
• guidance for computational visual recognition. (What should we recognize?)
What’s in a description?

What’s in this image?
man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, …

What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”
What’s in a description?

1) Given an image, predict what people will describe (e.g. Spain & Perona, 2010)
   “two women sitting brunette blonde on bench reading magazine”
   women ✔   bench ✔   magazine ✔   grass ✖   skirt ✖   …

2) Given a caption, predict what’s in the image
   “looking for castles in the clouds out my car window”
   clouds ✔   car ✖   window ✖   castle ?
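The ✔/✖ marks for task 1 can be approximated by checking which object labels a caption mentions. The sketch below is only an illustration: the function name `mentioned_objects` and the plural rule are assumptions, not part of the work cited on the slide.

```python
import re

def mentioned_objects(caption, labels):
    """Return {label: True/False} for whether the caption mentions each label."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    # Crude plural handling (an assumption): accept the label or the label + "s".
    return {label: label in words or label + "s" in words for label in labels}

caption_1 = "two women sitting brunette blonde on bench reading magazine"
marks_1 = mentioned_objects(caption_1, ["women", "bench", "magazine", "grass", "skirt"])

caption_2 = "looking for castles in the clouds out my car window"
marks_2 = mentioned_objects(caption_2, ["clouds", "car", "window", "castle"])

for label, hit in {**marks_1, **marks_2}.items():
    print(f"{label}: {'mentioned' if hit else 'not mentioned'}")
```

For the second caption every label comes back as mentioned, although the slide marks car and window as absent from the image. String matching alone therefore cannot solve task 2.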
Who’s in the picture?
                           T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth




President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Model                          Accuracy of labeling
Vision model, No Lang model    67%
Vision model + Lang model      78%
Visually Descriptive Text
                     “It was an arresting face, pointed of chin, square of jaw. Her eyes
                     were pale green without a touch of hazel, starred with bristly black
                     lashes and slightly tilted at the ends. Above them, her thick black
                     brows slanted upward, cutting a startling oblique line in her
                     magnolia-white skin–that skin so prized by Southern women and so
                     carefully guarded with bonnets, veils and mittens against hot
                     Georgia suns” –       from Gone with the Wind by Margaret Mitchell


Visually descriptive language provides:
• information about how people construct natural language for imagery. (How do people describe the world?)
• information about the world, especially the visual world. (How does the world work?)
• guidance for computational visual recognition. (What should we recognize?)
Vision is hard




                                          Green sheep




World knowledge (from descriptive text)
can be used to smooth noisy vision
predictions!
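A minimal sketch of how a text-derived prior could smooth a noisy vision prediction such as "green sheep". It multiplies the vision score by the prior and renormalizes, which is a simpler stand-in chosen for illustration and not the model used in the talk. All scores, the `weight` parameter, and the function name are hypothetical.

```python
import math

def smooth_with_text_prior(vision_scores, text_prior, weight=1.0, floor=1e-6):
    """Combine vision scores with a text prior in log space, then renormalize."""
    combined = {}
    for label, score in vision_scores.items():
        prior = text_prior.get(label, floor)
        combined[label] = math.log(max(score, floor)) + weight * math.log(max(prior, floor))
    # Softmax over the combined log scores (shifted by the max for stability).
    top = max(combined.values())
    exps = {label: math.exp(value - top) for label, value in combined.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Hypothetical colour scores from an attribute classifier on a detected sheep.
vision_scores = {"green": 0.40, "white": 0.35, "black": 0.25}
# Hypothetical relative frequencies of "<colour> sheep" in descriptive text.
text_prior = {"green": 0.01, "white": 0.80, "black": 0.19}

smoothed = smooth_with_text_prior(vision_scores, text_prior)
best = max(smoothed, key=smoothed.get)
print(best)
```

With these made-up numbers the vision classifier alone prefers "green", while the smoothed scores prefer "white".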
Learning World Knowledge
               BabyTalk: Understanding and Generating Simple Image Descriptions
               Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011




Attributes
“green green grass by the lake”
“a very shiny car in the car museum in my hometown of upstate NY.”

Relationships
“very little person in a big rocking chair”
“Our cat Tusik sleeping on the sofa near a hot radiator.”
System Flow

1) Input image
2) Extract objects/stuff: a) dog, b) person, c) sofa
3) Predict attributes (per-object potentials for brown, striped, furry, wooden, feathered, ...)
4) Predict prepositions (pairwise potentials for near, against, beside, ...)
5) Predict labeling: vision potentials smoothed with text potentials
     <<null,person_b>,against,<brown,sofa_c>>
     <<null,dog_a>,near,<null,person_b>>
     <<null,dog_a>,beside,<brown,sofa_c>>
6) Generate natural language description:
     This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.
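The final step, turning predicted triples into text, can be sketched with simple templates. This is an illustrative assumption about how such a generator might look, not the BabyTalk implementation. The tuple encoding of the triples and the names `noun_phrase` and `describe` are hypothetical.

```python
def noun_phrase(node, article="the"):
    """Render (attribute, "object_id") as e.g. "the brown sofa"."""
    attribute, obj = node
    name = obj.rsplit("_", 1)[0]  # "sofa_c" -> "sofa"
    words = [article] + ([attribute] if attribute else []) + [name]
    return " ".join(words)

def describe(triples):
    """Template-based description from (subject, preposition, object) triples."""
    # Collect distinct objects in first-seen order.
    nodes = []
    for subj, _, obj in triples:
        for node in (subj, obj):
            if node not in nodes:
                nodes.append(node)
    listing = " and ".join(noun_phrase(n, "one") for n in nodes)
    sentences = ["This is a photograph of " + listing + "."]
    for subj, prep, obj in triples:
        sentence = f"{noun_phrase(subj)} is {prep} {noun_phrase(obj)}."
        sentences.append(sentence[0].upper() + sentence[1:])
    return " ".join(sentences)

# The three triples from the slide; None stands for a null attribute.
triples = [
    ((None, "person_b"), "against", ("brown", "sofa_c")),
    ((None, "dog_a"), "near", (None, "person_b")),
    ((None, "dog_a"), "beside", ("brown", "sofa_c")),
]
text = describe(triples)
print(text)
```

Unlike the description on the slide, this sketch emits one sentence per triple and does not merge the two statements about the dog.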
BabyTalk results

Objects, Attributes, Prepositions

This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.

Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

This is a picture of two dogs. The first dog is near the second furry
Visually Descriptive Text
                      “It was an arresting face, pointed of chin, square of jaw. Her eyes
                      were pale green without a touch of hazel, starred with bristly black
                      lashes and slightly tilted at the ends. Above them, her thick black
                      brows slanted upward, cutting a startling oblique line in her
                      magnolia-white skin–that skin so prized by Southern women and so
                      carefully guarded with bonnets, veils and mittens against hot
                      Georgia suns” –       from Gone with the Wind by Margaret Mitchell


Visually descriptive language provides:
• information about how people construct natural language for imagery. (How do people describe the world?)
• information about the world, especially the visual world. (How does the world work?)
• guidance for computational visual recognition. (What should we recognize?)
What should we recognize?

• Recognition is beginning to work

• Open question – what should we recognize?

• Maybe objects aren’t (always) the right base
  level entities
Object Recognition




Parts, Poselets, Attributes
  For example:
  [Fergus, Perona, Zisserman 2003],
  [Bourdev, Malik 2009], …


                                        Slide Credit: Ali Farhadi
Automatically Discovering Attributes from Noisy Web Data
                  T.L. Berg, A.C. Berg, J. Shih ECCV 2010

                                    Fully beaded with megawatt
                                    crystals, this Christian Louboutin suede
                                    pump matches the gleam in your eye.

                                    Pump's linear heel plays up the alluring
                                    curves of its dipped sides.

                                    Round toe frames low-cut vamp.
                                    Tonally topstitched collar.

                                    4" straight, covered heel shows off
                                    signature red sole.

                                    Creamy leather lining with padded
                                    insole.
                                    "Fifi" is made in Italy.

Learn which attributes in descriptions are depictable
             terms
Given Web Images + Noisy Text Descriptions:
 1) Discover visual attribute terms in text descriptions - likely domain dependent
 2) Learn appearance models for attributes without labeled data
 3) Characterize attributes by: type, localizability
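One way to make "depictable" concrete is to ask how well a term's presence in a description can be predicted from image features on held-out data. The sketch below uses a nearest-centroid classifier on toy two-dimensional features. The scoring rule, the classifier, the data, and the name `visualness` are assumptions for illustration, since the slide does not specify the method.

```python
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def visualness(term, train, test):
    """Held-out accuracy of predicting whether `term` appears in a description."""
    pos = [f for f, d in train if term in d.split()]
    neg = [f for f, d in train if term not in d.split()]
    if not pos or not neg:
        return 0.5  # cannot train a classifier; fall back to chance
    c_pos, c_neg = centroid(pos), centroid(neg)
    correct = 0
    for features, description in test:
        predicted = sq_dist(features, c_pos) < sq_dist(features, c_neg)
        correct += predicted == (term in description.split())
    return correct / len(test)

# Toy data: hypothetical [redness, shininess] features with noisy descriptions.
train = [
    ([0.9, 0.2], "red suede pump made in italy"),
    ([0.8, 0.7], "red leather heel"),
    ([0.1, 0.3], "black suede pump"),
    ([0.2, 0.8], "black leather heel made in italy"),
]
test = [
    ([0.85, 0.5], "red leather pump made in italy"),
    ([0.15, 0.5], "black suede heel"),
    ([0.9, 0.6], "red suede heel"),
    ([0.1, 0.4], "black leather pump made in italy"),
]

scores = {term: visualness(term, train, test) for term in ("red", "italy")}
print(scores)
```

On this toy data "red" is predicted perfectly from appearance while "italy" stays at chance, which matches the intuition that "made in Italy" is not a depictable attribute.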
Object Recognition




                Scenes
                For example:
                [Oliva, Torralba 2001],
                [SUN 2010], …



                      Slide Credit: Ali Farhadi
What are the right quanta of
      Recognition?




              Farhadi & Sadeghi
              Recognition using Visual Phrases, CVPR 2011
Participating in Phrases Profoundly affects the
            appearance of objects




                       Farhadi & Sadeghi
                       Recognition using Visual Phrases, CVPR 2011
What should we recognize?




  “a sleeping dog in NTHU”     “the dog is sleeping”




     “A dog is sleeping in”    “sleeping dog in delhi”

Maybe descriptive text can inform entity hypotheses!
What should we recognize?




     “the cat is in the bag”    “cat in a bag”




            “cat in bag”       “cat in the bag”


Maybe descriptive text can inform entity hypotheses!
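A rough sketch of how captions could suggest entity hypotheses: keep the content words that every caption in a group shares. The stopword list and the function name `shared_content_words` are assumptions for illustration. The captions are the ones shown on the two slides above.

```python
STOPWORDS = {"a", "an", "the", "is"}  # small hypothetical stopword list

def shared_content_words(captions, stopwords=STOPWORDS):
    """Content words that occur in every caption of the group."""
    word_sets = [set(caption.lower().split()) - stopwords for caption in captions]
    return set.intersection(*word_sets)

dog_captions = [
    "a sleeping dog in NTHU",
    "the dog is sleeping",
    "A dog is sleeping in",
    "sleeping dog in delhi",
]
cat_captions = [
    "the cat is in the bag",
    "cat in a bag",
    "cat in bag",
    "cat in the bag",
]

dog_entity = shared_content_words(dog_captions)
cat_entity = shared_content_words(cat_captions)
print(sorted(dog_entity), sorted(cat_entity))
```

The shared words ("sleeping" and "dog"; "cat", "in" and "bag") point at composite entities such as "sleeping dog" or "cat in bag" rather than at single objects.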
Conclusion

   Use large pools of descriptive text to:

       Learn how people describe the visual world

       Learn how the world works

       Guide future efforts in recognition


   Apply this knowledge to multi-modal
    collections & applications
Acknowledgements

• Collaborators: Alex Berg, David Forsyth, Jaety
  Edwards, Jonathan Shih, Girish Kulkarni, Visruth
  Premraj, Sagnik Dhar, Vicente Ordonez, Siming
  Li, Yejin Choi, Kota Yamaguchi

• Funded by NSF Faculty Early Career
  Development (CAREER) Program: Award
  #1054133

Fcv hum mach_t_berg

  • 1. Learning from Descriptive Text Tamara L Berg Stony Brook University
  • 2. Tags: Vision canon, eos, macro, japan, vacation, f rog, animal, toad, amphibian, pet, ey e, feet, mouth, finger, hand, prince, p hoto, art, light, photo, flickr, blurry, fa vorite, nice. Language Humans It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long.
  • 3. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Gone with the Wind How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 4. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell How do people describe the Visually descriptive language provides: world? • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 5. What’s in a description? What’s in this image? man baby sling shirt glasses ladder fridge table watermelon chair What do people describe? boxes “A bearded man is holding a child in a sling.” cups “A bearded man stands while holding a small child in a green water bottle sheet.” wall “A bearded man with a baby in a sling poses.” pacifier “Man standing in kitchen with little girl in green sack.” beard “Man with beard and baby” …
  • 6. What’s in a description? women ✔ bench ✔ 1) “two women sitting brunette magazine ✔ blonde on bench reading magazine” grass ✖ Predict what people will skirt ✖ … Given an image describe e.g. Spain & Perona, 2010 clouds ✔ “looking for car ✖ 2) castles in the window ✖ clouds out my car window” castle ? Given a caption Predict what’s in the image
  • 7. Who’s in the picture? T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were Model Accuracy of labeling killed by American troops. Photo by Vision model, No Lang model 67% Larry Downing/Reuters Vision model + Lang model 78%
  • 8. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 9. Vision is hard: “Green sheep”. World knowledge (from descriptive text) can be used to smooth noisy vision predictions!
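The smoothing idea on slide 9 can be illustrated with a small sketch. This is not the method from the talk: the function name `smooth_with_text_prior`, the geometric-mean mixing rule, the `weight` parameter, and every number below are assumptions chosen only to show how a prior learned from descriptive text could override an unlikely vision prediction such as a green sheep.

```python
def smooth_with_text_prior(vision_scores, text_prior, weight=0.5):
    """Mix noisy vision scores with a text-derived prior.

    Hypothetical rule: a weighted geometric mean of the two scores,
    renormalized so the results sum to 1.
    """
    combined = {}
    for label, vision_score in vision_scores.items():
        # Labels never seen in text get a tiny floor instead of zero.
        prior = text_prior.get(label, 1e-6)
        combined[label] = (vision_score ** (1.0 - weight)) * (prior ** weight)
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}


# Made-up scores for the color attribute of a detected sheep.
vision_scores = {"green": 0.45, "white": 0.40, "brown": 0.15}
# Made-up frequencies of "<color> sheep" in descriptive text.
text_prior = {"green": 0.01, "white": 0.70, "brown": 0.29}

smoothed = smooth_with_text_prior(vision_scores, text_prior)
best = max(smoothed, key=smoothed.get)
print(best)
```

With these illustrative numbers the raw vision scores favor "green", while the smoothed scores favor "white", because text almost never describes sheep as green.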
  • 10. Learning World Knowledge. BabyTalk: Understanding and Generating Simple Image Descriptions, Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011. Attributes: “green green grass by the lake”; “a very shiny car in the car museum in my hometown of upstate NY.” Relationships: “very little person in a big rocking chair”; “Our cat Tusik sleeping on the sofa near a hot radiator.”
  • 11. System Flow [Figure: pipeline from input image to generated description.] Input image → extract objects/stuff (a: dog, b: person, c: sofa) → predict attributes (e.g. brown, striped, furry, wooden, feathered) → predict prepositions (e.g. near(a,b), against(a,b), beside(a,b)) → predict labeling, with vision potentials smoothed with text potentials → generate natural language description. Predicted labeling: <<null,person_b>,against,<brown,sofa_c>>, <<null,dog_a>,near,<null,person_b>>, <<null,dog_a>,beside,<brown,sofa_c>>. Generated description: “This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.”
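The last step of the flow on slide 11, going from a predicted labeling to a sentence, can be sketched with simple templates. This is only an illustration under stated assumptions: the helpers `noun_phrase` and `describe` are hypothetical, the triples are written as Python tuples of the form ((attribute, object), preposition, (attribute, object)) with the slide's instance suffixes (`_a`, `_b`, `_c`) dropped, and the templates are far simpler than whatever the actual system uses.

```python
def noun_phrase(attribute, obj):
    """Render an (attribute, object) pair; attribute may be None."""
    return f"{attribute} {obj}" if attribute else obj


def describe(triples):
    """Turn ((attr, obj), preposition, (attr, obj)) triples into text."""
    # Collect the distinct objects in first-seen order for the opening sentence.
    seen = []
    for subject, _, target in triples:
        for node in (subject, target):
            if node not in seen:
                seen.append(node)
    opening = (
        "This is a photograph of "
        + " and ".join("one " + noun_phrase(*node) for node in seen)
        + "."
    )
    # One template sentence per triple.
    sentences = [opening]
    for subject, preposition, target in triples:
        sentences.append(
            f"The {noun_phrase(*subject)} is {preposition} "
            f"the {noun_phrase(*target)}."
        )
    return " ".join(sentences)


# The labeling shown on the slide, without instance suffixes.
triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
text = describe(triples)
print(text)
```

Unlike the description on the slide, this sketch emits one sentence per triple and does not merge the two sentences about the dog.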
  • 12. BabyTalk results: Objects, Attributes, Prepositions. “This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.” “Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.” “This is a picture of two dogs. The first dog is near the second furry
  • 13. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell How do people describe the world? Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How does the world work? What should we recognize?
  • 14. What should we recognize? • Recognition is beginning to work • Open question – what should we recognize? • Maybe objects aren’t (always) the right base level entities
  • 15. Object Recognition: Parts, Poselets, Attributes. For example: [Fergus, Perona, Zisserman 2003], [Bourdev, Malik 2009], … Slide Credit: Ali Farhadi
  • 16. Automatically Discovering Attributes from Noisy Web Data T.L. Berg, A.C. Berg, J. Shih ECCV 2010 Fully beaded with megawatt crystals, this Christian Louboutin suede pump matches the gleam in your eye. Pump's linear heel plays up the alluring curves of its dipped sides. Round toe frames low-cut vamp. Tonally topstitched collar. 4" straight, covered heel shows off signature red sole. Creamy leather lining with padded insole. "Fifi" is made in Italy. Learn which attributes in descriptions are depictable terms
  • 17. Given Web Images + Noisy Text Descriptions: 1) Discover visual attribute terms in text descriptions - likely domain dependent 2) Learn appearance models for attributes without labeled data 3) Characterize attributes by: type, localizability
  • 18. Object Recognition: Scenes. For example: [Oliva, Torralba 2001], [SUN 2010], … Slide Credit: Ali Farhadi
  • 19. What are the right quanta of Recognition? Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011
  • 20. Participating in Phrases Profoundly affects the appearance of objects. Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011
  • 21. What should we recognize? “a sleeping dog in NTHU” “the dog is sleeping” “A dog is sleeping in” “sleeping dog in delhi” Maybe descriptive text can inform entity hypotheses!
  • 22. What should we recognize? “the cat is in the bag” “cat in a bag” “cat in bag” “cat in the bag” Maybe descriptive text can inform entity hypotheses!
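One way to picture how captions like those on slides 21 and 22 might suggest an entity hypothesis is to normalize them and count recurring phrases. The sketch below is an assumption for illustration only: the stopword list, the `normalize` helper, and the counting approach are not taken from the talk.

```python
from collections import Counter

# Hypothetical stopword list, chosen to collapse the slide's caption variants.
STOPWORDS = {"a", "an", "the", "is"}


def normalize(caption):
    """Lowercase a caption and drop stopwords so variants collapse together."""
    words = [word for word in caption.lower().split() if word not in STOPWORDS]
    return " ".join(words)


# The four captions from slide 22.
captions = [
    "the cat is in the bag",
    "cat in a bag",
    "cat in bag",
    "cat in the bag",
]
counts = Counter(normalize(caption) for caption in captions)
print(counts.most_common(1))
```

All four captions reduce to the same phrase, "cat in bag", which is the kind of recurring unit that could be proposed as a candidate entity to recognize.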
  • 23. Conclusion • Use large pools of descriptive text to: learn how people describe the visual world; learn how the world works; guide future efforts in recognition. • Apply this knowledge to multi-modal collections & applications.
  • 24. Acknowledgements • Collaborators: Alex Berg, David Forsyth, Jaety Edwards, Jonathan Shih, Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Vicente Ordonez, Siming Li, Yejin Choi, Kota Yamaguchi • Funded by NSF Faculty Early Career Development (CAREER) Program: Award #1054133