Improving Video Activity Recognition using
   Object Recognition and Text Mining

   Tanvi S. Motwani and Raymond J. Mooney
       The University of Texas at Austin




What is Video Activity Recognition?

[Figure: two example video clips as input, with the recognized activities TYPING and LAUGHING as output]

What has been done so far?

There has been a lot of recent work in activity recognition:

   • A predefined set of activities is used, and recognition is treated
     as a classification problem

   • Scene context and object context in the video are used, and the
     correlation between context and activities is generally
     predefined

   • Text associated with the video, in the form of scripts or
     captions, is used as a “bag of words” to improve performance

Our Work

• Automatically discover activities from video descriptions,
  because we use a real-world YouTube dataset with an unconstrained
  set of activities

• Integrate video features and object context in the video

• Use a large general text corpus to automatically find correlations
  between activities and objects

• Use deeper natural language processing techniques to improve
  results over the “bag of words” methodology

Data Set




Example clips with their descriptions:

Clip 1
• A girl is dancing.
• A young woman is dancing ritualistically.
• An indian woman dances.
• A traditional girl is dancing.
• A girl is dancing.

Clip 2
• A man is cutting a piece of paper in half lengthwise using scissors.
• A man cuts a piece of paper.
• A man cut the piece of paper.

Clip 3
• A woman is riding horse on a trail.
• A woman is riding on a horse.
• A woman rides a horse
• Horse is being ridden by a woman

Clip 4
• A group of young girls are dancing on stage.
• A group of girls perform a dance onstage.
• Kids are dancing.
• small girls are dancing.
• few girls are dancing.

 • Data collected through Mechanical Turk by Chen et al. (2011)
 • 1,970 YouTube video clips
 • 85k English-language descriptions
 • YouTube videos submitted by workers
       Short (usually less than 10 seconds)
       Single, unambiguous action/event
Overall Activity Recognizer

[Diagram: Video Feature Extractor → (training input) → Activity Recognizer
using Video Features → Predicted Activity ← Activity Recognizer using Object
Features ← (training input) ← Pre-Trained Object Detectors]

Activity Recognizer using Video Features




[Diagram: for each training video, STIP features are extracted from the
clip, and an activity label is discovered from its NL descriptions.]

NL descriptions:
• A woman is riding horse in a beach.
• A woman is riding on a horse.
• A woman is riding on a horse.

Discovered activity label: ride, walk, run, move, race

A classifier is trained with STIP features as input features and activity
cluster labels as classes.

Automatically Discovering Activities and Producing Labeled
                          Training Data


[Diagram: video clips, each paired with its NL descriptions; the verbs in
the descriptions become verb labels, which are then clustered.]

NL descriptions, clip 1:
• A puppy is playing in a tub of water.
• A dog is playing with water in a small tub.
• A dog is sitting in a basin of water and playing with the water.
• A dog sits and plays in a tub of water.

NL descriptions, clip 2:
• A girl is dancing.
• A young woman is dancing ritualistically.
• Indian women are dancing in traditional costumes.
• Indian women dancing for a crowd.
• The ladies are dancing outside.

NL descriptions, clip 3:
• A man is cutting a piece of paper in half lengthwise using scissors.
• A man cuts a piece of paper.
• A man is cutting a piece of paper.
• A man is cutting a paper by scissor.
• A guy cuts paper.
• A person doing something

Verb labels (265 in total): play, throw, hit, dance, jump, cut, chop, slice, …

Hierarchical clustering:
  play | throw | hit | dance | jump | cut, chop, slice
  play | throw, hit | dance, jump

Automatically Discovering Activities and Producing Labeled
                          Training Data



• Hierarchical Agglomerative Clustering
    • WordNet::Similarity (Pedersen et al.), 6 metrics:
         • Path-length-based measures: lch, wup, path
         • Information-content-based measures: res, lin, jcn
• Cut the resulting hierarchy at a level
• Use the clusters at that level as activity labels

28 discovered clusters in our dataset

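The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pairwise similarities in `SIM` are made-up numbers standing in for WordNet::Similarity scores, average linkage is an assumed linkage choice, and the `threshold` plays the role of the level at which the hierarchy is cut.

```python
from itertools import combinations

# Hypothetical pairwise verb similarities in [0, 1]. In the slides these would
# come from WordNet::Similarity metrics; the numbers here are invented.
SIM = {
    frozenset(("cut", "chop")): 0.9,
    frozenset(("cut", "slice")): 0.8,
    frozenset(("chop", "slice")): 0.85,
    frozenset(("throw", "hit")): 0.6,
    frozenset(("dance", "jump")): 0.5,
}


def similarity(a, b):
    """Similarity of two verbs; unlisted pairs get a low default (assumption)."""
    return SIM.get(frozenset((a, b)), 0.1)


def average_link(c1, c2):
    """Average pairwise similarity between two clusters (average linkage)."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def agglomerate(items, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average_link(*p))
        if average_link(c1, c2) < threshold:
            break  # this is the "cut" level of the hierarchy
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


verbs = ["play", "throw", "hit", "dance", "jump", "cut", "chop", "slice"]
clusters = agglomerate(verbs, threshold=0.4)
print(clusters)
```

With these toy similarities the cut yields four clusters: cut/chop/slice, dance/jump, throw/hit, and play on its own.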
Automatically Discovering Activities
                              and Producing Labeled Training Data


[Diagram: each clip's descriptions are mapped to one of the discovered
activity clusters, which serves as the clip's training label.]

Example clusters: climb, fly | ride, walk, run, move, race | cut, chop, slice |
dance, jump | throw, hit | play

Example descriptions:
• A girl is dancing. A young woman is dancing ritualistically.
• A man is cutting a piece of paper in half lengthwise using scissors. A man cuts a piece of paper.
• A woman is riding horse on a trail. A woman is riding on a horse.
• A group of young girls are dancing on stage. A group of girls perform a dance onstage.
• A woman is riding a horse on the beach. A woman is riding a horse.

Spatio-Temporal Video Features

• STIP:
A set of spatio-temporal interest points (STIP) is extracted using
the motion descriptors developed by Laptev et al.

• HOG + HOF:
At each point, a HOG (Histograms of Oriented Gradients) feature and
a HOF (Histograms of Optical Flow) feature are extracted

• Visual Vocabulary:
50,000 motion descriptors are randomly sampled and clustered
using K-means (k = 200) to form the visual vocabulary

• Bag of Visual Words:
Each video is finally converted into a vector of k values in which the ith
value is the number of motion descriptors assigned to the ith cluster.
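The bag-of-visual-words step can be sketched as below. It assumes the K-means centroids are already available and uses squared Euclidean distance for the nearest-cluster assignment. The 2-D descriptors and three centroids are toy values; real HOG/HOF descriptors are much higher-dimensional and the slides use k = 200.

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c):
        return sum((d - x) ** 2 for d, x in zip(descriptor, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def bag_of_visual_words(descriptors, centroids):
    """Histogram whose ith entry counts the descriptors assigned to cluster i."""
    histogram = [0] * len(centroids)
    for d in descriptors:
        histogram[nearest_centroid(d, centroids)] += 1
    return histogram


# Toy visual vocabulary (k = 3) and the descriptors of one video.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.3)]
histogram = bag_of_visual_words(descriptors, centroids)
print(histogram)  # [1, 2, 1]
```

The resulting vector always has length k, regardless of how many interest points a video produced, which is what lets a standard classifier consume it.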
Object Detection in Videos
• Discriminatively Trained Deformable Part Models (Felzenszwalb et
al.): pre-trained object detectors for 19 objects
• Extract one frame per second
• Run object detection on each frame, compute the maximum score
of each object over all frames, and use that score to compute the
probability of each object for each video

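A rough sketch of the per-video aggregation described above. Taking the maximum score over frames follows the slide. The mapping from score to probability is not given in the slides, so the logistic function used here is purely an assumption, and the frame scores are invented.

```python
import math


def object_probabilities(frame_scores):
    """frame_scores: one dict per sampled frame, mapping object name -> detector score.

    Returns object name -> probability for the whole video.
    """
    best = {}
    for scores in frame_scores:
        for obj, s in scores.items():
            # Keep the maximum score of each object over all frames.
            best[obj] = max(s, best.get(obj, float("-inf")))
    # Assumption: squash the max score with a logistic function. The slides
    # do not say which score-to-probability mapping was actually used.
    return {obj: 1.0 / (1.0 + math.exp(-s)) for obj, s in best.items()}


# Hypothetical detector scores for three frames sampled at one frame per second.
frames = [
    {"horse": -1.2, "dog": -0.4},
    {"horse": 0.8, "dog": -0.9},
    {"horse": 0.0, "dog": -2.0},
]
probs = object_probabilities(frames)
print(probs)
```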
Learning Correlations between Activities and Objects

• English Gigaword corpus 2005 (LDC), 15GB of raw text
• Occurrence counts:
   • of an activity Ai: occurrence of any of the verbs in the verb
   cluster
   • of an object Oj: occurrence of the object noun Oj or one of its synonyms
• Co-occurrence of an activity and an object:
   • Windowing
   Occurrence of the object within w or fewer words of an
   occurrence of the activity. Experimented with w of 3, 10 and
   the entire sentence.
   • POS Tagging
   The entire corpus is POS-tagged using the Stanford tagger. Occurrence
   of the object tagged as a noun within w or fewer words of an
   occurrence of the activity tagged as a verb.
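The windowing count can be illustrated with a short sketch. It assumes the sentence is already tokenized and lemmatized, and it reads "within w words" as an absolute difference in token positions of at most w; both are assumptions, as is the toy sentence. The POS-tagging variant would apply the same test only to tokens carrying the required noun and verb tags.

```python
def count_cooccurrences(tokens, activity_verbs, object_nouns, w):
    """Count (verb, noun) position pairs that are at most w tokens apart."""
    verb_pos = [i for i, t in enumerate(tokens) if t in activity_verbs]
    noun_pos = [i for i, t in enumerate(tokens) if t in object_nouns]
    return sum(1 for v in verb_pos for n in noun_pos if abs(v - n) <= w)


# An activity is a verb cluster; an object is a noun plus its synonyms.
RIDE_CLUSTER = {"ride", "walk", "run", "move", "race"}
HORSE = {"horse"}

# Toy lemmatized sentence: "A woman rides a horse on the beach."
tokens = ["a", "woman", "ride", "a", "horse", "on", "the", "beach"]

print(count_cooccurrences(tokens, RIDE_CLUSTER, HORSE, w=3))  # 1
print(count_cooccurrences(tokens, RIDE_CLUSTER, HORSE, w=1))  # 0
```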
Learning Correlations between Activities and Objects

• Parsing
Parse the corpus using the Stanford Statistical Syntactic
Dependency Parser
   • Parsing I
   The object is the direct object of the activity verb in the
   sentence.
   • Parsing II
   The object is syntactically attached to the activity by any
   grammatical relation (e.g., PP, NP, ADVP)

Example:
  “Sitting in café, Kaye thumps a table and wails white blues”
  Windowing: “sit” and “table” co-occur
  POS Tagging: “sit” and “table” co-occur
  Parsing I and II: No co-occurrence
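The example can be traced with a small sketch over dependency triples. The triples below are hand-written for illustration and are not actual output of the Stanford parser. Reading Parsing II as "any direct dependency edge from the verb to the noun" is also an assumption.

```python
# Hand-written (head, relation, dependent) triples over lemmas, for
# illustration only; a real pipeline would obtain them from a parser.
DEPS = [
    ("sit", "prep_in", "café"),
    ("thump", "nsubj", "kaye"),
    ("thump", "dobj", "table"),
    ("wail", "nsubj", "kaye"),
    ("wail", "dobj", "blues"),
]


def cooccur_parsing_1(deps, verbs, nouns):
    """Parsing I: the noun is the direct object of the activity verb."""
    return any(h in verbs and rel == "dobj" and d in nouns for h, rel, d in deps)


def cooccur_parsing_2(deps, verbs, nouns):
    """Parsing II (assumed reading): the noun is attached to the verb by any relation."""
    return any(h in verbs and d in nouns for h, _rel, d in deps)


print(cooccur_parsing_1(DEPS, {"sit"}, {"table"}))    # False
print(cooccur_parsing_2(DEPS, {"sit"}, {"table"}))    # False
print(cooccur_parsing_1(DEPS, {"thump"}, {"table"}))  # True
```

"table" attaches to "thump", not to "sit", so neither parsing variant counts a sit/table co-occurrence, while a window-based count would.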
Learning Correlations between Activities and Objects




Probability of each activity given each object using Laplace (add-one)
smoothing:
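One common way to write add-one smoothing, assumed here rather than taken from the slide, is P(Ai | Oj) = (count(Ai, Oj) + 1) / (Σk count(Ak, Oj) + |A|), where |A| is the number of activity clusters. A minimal sketch with invented counts:

```python
def activity_given_object(cooc, activities, obj):
    """Add-one smoothed P(activity | obj) from co-occurrence counts.

    cooc maps (activity, obj) -> count; missing pairs count as zero.
    """
    total = sum(cooc.get((a, obj), 0) for a in activities)
    denom = total + len(activities)  # one pseudo-count per activity
    return {a: (cooc.get((a, obj), 0) + 1) / denom for a in activities}


activities = ["ride", "cut", "dance"]
# Hypothetical co-occurrence counts mined from text.
cooc = {("ride", "horse"): 7, ("dance", "horse"): 1}
p = activity_given_object(cooc, activities, "horse")
print(p)  # ride: 8/11, cut: 1/11, dance: 2/11
```

Smoothing keeps an activity that never co-occurred with the object, here "cut", from receiving zero probability.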




Activity Recognizer using Object Features




Probability of an Activity Ai using object detection and co-occurrence
information:
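A natural way to combine the two ingredients, assumed here since the slide's formula is not reproduced in the text, is to weight each P(Ai | Oj) by the probability that object Oj appears in the video and sum over objects. All numbers below are hypothetical.

```python
def activity_scores(p_act_given_obj, p_obj_given_video):
    """Score each activity as the sum over objects of P(activity | obj) * P(obj | video)."""
    activities = {a for dist in p_act_given_obj.values() for a in dist}
    return {
        a: sum(p_act_given_obj[o].get(a, 0.0) * p for o, p in p_obj_given_video.items())
        for a in activities
    }


# Hypothetical correlations mined from text and detector outputs for one video.
p_act_given_obj = {
    "horse": {"ride": 0.7, "cut": 0.1, "dance": 0.2},
    "dog": {"ride": 0.2, "cut": 0.2, "dance": 0.6},
}
p_obj_given_video = {"horse": 0.9, "dog": 0.1}

scores = activity_scores(p_act_given_obj, p_obj_given_video)
best = max(scores, key=scores.get)
print(best, scores)
```

The scores form a proper distribution only if the object probabilities sum to one, as they do in this toy example; otherwise they would need normalizing.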




Integrated Activity Recognizer


Final recognized activity =

    • Videos on which the object detector detected at least one object
    (applying the Naïve Bayes independence assumption between features given
    the activity)

    • Videos on which there were no detected objects

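The two cases can be sketched as follows. Under the Naïve Bayes assumption that video and object features are independent given the activity, P(A | Fv, Fo) is proportional to P(A | Fv) · P(A | Fo) / P(A). The slide's own formulas are not in the text, so this form, the fallback to the video-only recognizer, and all the numbers are assumptions.

```python
def predict(p_video, p_object, prior):
    """Pick the final activity.

    p_video:  activity -> P(activity | video features)
    p_object: activity -> P(activity | object features), or None if no object was detected
    prior:    activity -> P(activity)
    """
    if not p_object:
        # No detected objects: rely on the video-feature recognizer alone.
        return max(p_video, key=p_video.get)
    # Naive Bayes combination, up to a constant that does not affect the argmax.
    score = {a: p_video[a] * p_object[a] / prior[a] for a in p_video}
    return max(score, key=score.get)


# Hypothetical recognizer outputs for one test video, with a uniform prior.
p_video = {"ride": 0.4, "cut": 0.5, "dance": 0.1}
p_object = {"ride": 0.65, "cut": 0.11, "dance": 0.24}
prior = {"ride": 1 / 3, "cut": 1 / 3, "dance": 1 / 3}

print(predict(p_video, p_object, prior))  # ride
print(predict(p_video, None, prior))      # cut
```

In this toy case the object evidence overturns the video-only prediction, which is the kind of correction the integrated system is meant to provide.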
Experimental Methodology
• Ideally we would have a trained detector for every object, but because we have only 19 object
  detectors, we included the videos containing at least one of the 19 objects in the test set
  (128 videos).

• From the rest, we discovered activity labels and found 28 clusters in a training set of 1,190
  videos.

• The training set is used to construct the activity classifier based on video features.

• We do not use the descriptions of the test videos as input; they are used only to obtain
  gold-standard labels for calculating accuracy. At test time only the video is given as input,
  and we obtain the activity as output.

• We run the object detectors on the test set.

• For activity-object correlation we compare all the methods: Windowing, POS tagging,
  and Parsing, along with their variants.

• All the pieces are then combined in the final activity recognizer to obtain the predicted
  label.
Experimental Evaluation


Final Results using Different Text Mining Methods

Method                            Accuracy
Windowing, w = 3                  0.47
Windowing, w = 10                 0.47
Windowing, w = full sentence      0.46
POS tagging, w = 3                0.46
POS tagging, w = 10               0.44
POS tagging, w = full sentence    0.4
Parsing I                         0.523
Parsing II                        0.48

Experimental Evaluation



Result of System Ablations

System                                  Accuracy
Video Features only                     0.39
Object Features only using Parsing I    0.38
Integrated System                       0.52

Conclusion


Three important contributions:

• Automatically discovering activity classes from natural
language descriptions of videos.

• Improving existing activity recognition systems using object
context together with the correlation between objects and activities.

• Showing that natural language processing techniques can be used to extract
knowledge about the correlation of objects and activities from
general text.

Questions?




Abstract




We present a novel combination of standard activity
classification, object recognition and text mining to learn
effective activity recognizers. The approach does not require any
manual labeling of training videos and uses
“world knowledge” to improve existing systems.

Related Work

• There has been a lot of recent work in video activity recognition: Malik et
al. (2003), Laptev et al. (2004)
      They all use a predefined set of activities; we automatically discover the set of
     activities from textual descriptions.

• Work on context information to aid activity recognition:
    Scene context: Laptev et al. (2009)
    Object context: Davis et al. (2007), Aggarwal et al. (2007), Rehg et al. (2007)
    Most have a constrained set of activities; we address a diverse set of activities in
   real-world YouTube videos.

• Work using text associated with video in the form of scripts or closed captions:
Everingham et al. (2006), Laptev et al. (2007), Gupta et al. (2010)
    We use a large text corpus to automatically extract correlations between
   activities and objects.
    We demonstrate the advantage of deeper natural language processing, specifically
   parsing, to mine general knowledge connecting activities and objects.

2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

European Conference on Artificial Intelligence talk, 2012: Activity Recognition

  • 1. Improving Video Activity Recognition using Object Recognition and Text Mining. Tanvi S. Motwani and Raymond J. Mooney, The University of Texas at Austin
  • 2. What is Video Activity Recognition? Input: a video clip. Output: an activity label (examples shown: TYPING, LAUGHING).
  • 3. What has been done so far? There has been a lot of recent work in activity recognition:
      • A predefined set of activities is used, and recognition is treated as a classification problem.
      • Scene context and object context in the video are used, and the correlations between context and activities are generally predefined.
      • Text associated with the video, in the form of scripts or captions, is used as a “bag of words” to improve performance.
  • 4. Our Work
      • Automatically discover activities from video descriptions, because we use a real-world YouTube dataset with an unconstrained set of activities.
      • Integrate video features and object context in the video.
      • Use a large, general text corpus to automatically find correlations between activities and objects.
      • Use deeper natural language processing techniques to improve results over the “bag of words” methodology.
  • 5. Data Set
      • Example descriptions, clip 1: “A girl is dancing.” / “A young woman is dancing ritualistically.” / “An indian woman dances.” / “A traditional girl is dancing.” / “A girl is dancing.”
      • Example descriptions, clip 2: “A man is cutting a piece of paper in half lengthwise using scissors.” / “A man cuts a piece of paper.” / “A man cut the piece of paper.”
      • Example descriptions, clip 3: “A woman is riding horse on a trail.” / “A woman is riding on a horse.” / “A woman rides a horse” / “Horse is being ridden by a woman”
      • Example descriptions, clip 4: “A group of young girls are dancing on stage.” / “A group of girls perform a dance onstage.” / “Kids are dancing.” / “small girls are dancing.” / “few girls are dancing.”
      • Data collected through Mechanical Turk by Chen et al. (2011)
      • 1,970 YouTube video clips
      • 85k English-language descriptions
      • YouTube videos submitted by workers: short (usually less than 10 seconds), showing a single, unambiguous action/event
  • 6. Overall Activity Recognizer (diagram). Components shown: Training Input; Video Feature Extractor; Activity Recognizer using Video Features; Pre-Trained Object Detectors; Activity Recognizer using Object Features; Predicted Activity.
  • 8. Activity Recognizer using Video Features
      • STIP features are extracted from each training video.
      • A classifier is trained with the STIP features as input features and the activity cluster labels as classes.
      • Example NL descriptions: “A woman is riding horse in a beach.” / “A woman is riding on a horse.” / “A woman is riding on a horse.”
      • Discovered activity label for this example: ride, walk, run, move, race
  • 9. Automatically Discovering Activities and Producing Labeled Training Data
      • Video clips and their NL descriptions (the verb phrase of each description is highlighted on the slide):
          • “A puppy is playing in a tub of water.” / “A dog is playing with water in a small tub.” / “A dog is sitting in a basin of water and playing with the water.” / “A dog sits and plays in a tub of water.”
          • “A girl is dancing.” / “A young woman is dancing ritualistically.” / “Indian women are dancing in traditional costumes.” / “Indian women dancing for a crowd.” / “The ladies are dancing outside.”
          • “A man is cutting a piece of paper in half lengthwise using scissors.” / “A man cuts a piece of paper.” / “A man is cutting a piece of paper.” / “A man is cutting a paper by scissor.” / “A guy cuts paper.” / “A person doing something”
      • 265 verb labels: play, throw, hit, dance, jump, cut, chop, slice, …
      • Hierarchical clustering groups the verbs, e.g. {play}, {throw, hit}, {dance, jump}, {cut, chop, slice}
  • 10. Automatically Discovering Activities and Producing Labeled Training Data
      • Hierarchical agglomerative clustering
      • WordNet::Similarity (Pedersen et al.), 6 metrics:
          • Path-length-based measures: lch, wup, path
          • Information-content-based measures: res, lin, jcn
      • Cut the resulting hierarchy at a level
      • Use the clusters at that level as activity labels
      • 28 discovered clusters in our dataset
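The clustering step on this slide can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `TOY_SIM` table stands in for a WordNet-based similarity measure, average linkage is one possible linkage choice, and the `threshold` parameter plays the role of "cutting the hierarchy at a level". It is not the authors' implementation.

```python
from itertools import combinations

# Hypothetical pairwise similarities in [0, 1]; a real system would obtain
# these from a WordNet-based measure. Unlisted pairs default to 0.0.
TOY_SIM = {
    frozenset(("cut", "chop")): 0.9,
    frozenset(("cut", "slice")): 0.8,
    frozenset(("chop", "slice")): 0.85,
    frozenset(("throw", "hit")): 0.7,
    frozenset(("dance", "jump")): 0.6,
}


def sim(a, b):
    """Look up the (assumed) similarity between two verbs."""
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def average_link(c1, c2):
    """Average pairwise similarity between two clusters (assumed linkage)."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster_verbs(verbs, threshold):
    """Merge the most similar clusters until no pair reaches the threshold."""
    clusters = [frozenset([v]) for v in verbs]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), average_link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break  # stopping here corresponds to cutting the hierarchy
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


verbs = ["play", "throw", "hit", "dance", "jump", "cut", "chop", "slice"]
clusters = cluster_verbs(verbs, threshold=0.5)
```

With these toy similarities, the resulting clusters match the grouping shown in the talk's example: {play}, {throw, hit}, {dance, jump}, {cut, chop, slice}.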
  • 11. Automatically Discovering Activities and Producing Labeled Training Data
      • Discovered clusters shown: {climb, fly}, {ride, walk, run, move, race}, {cut, chop, slice}, {dance, jump}, {throw, hit}, {play}
      • Each clip is labeled with the cluster of the verb in its descriptions:
          • “A girl is dancing.” / “A young woman is dancing ritualistically.”: dance, jump
          • “A man is cutting a piece of paper in half lengthwise using scissors.” / “A man cuts a piece of paper.”: cut, chop, slice
          • “A woman is riding horse on a trail.” / “A woman is riding on a horse.”: ride, walk, run, move, race
          • “A group of young girls are dancing on stage.” / “A group of girls perform a dance onstage.”: dance, jump
          • “A woman is riding a horse on the beach.” / “A woman is riding a horse.”: ride, walk, run, move, race
  • 13. Spatio-Temporal Video Features
      • STIP: a set of spatio-temporal interest points (STIP) is extracted using motion descriptors developed by Laptev et al.
      • HOG + HOF: at each point, a HOG (histograms of oriented gradients) feature and a HOF (histograms of optical flow) feature are extracted.
      • Visual vocabulary: 50,000 motion descriptors are randomly sampled and clustered using K-means (k = 200) to form a visual vocabulary.
      • Bag of visual words: each video is finally converted into a vector of k values in which the i-th value is the number of motion descriptors assigned to the i-th cluster.
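The bag-of-visual-words conversion can be sketched as follows. This is a minimal illustration, assuming the K-means centroids are already available and using squared Euclidean distance for the assignment; the function names and the tiny 2-D descriptors are hypothetical (real HOG/HOF descriptors are much higher-dimensional, and the talk uses k = 200).

```python
def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to the descriptor."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(range(len(centroids)),
               key=lambda i: sq_dist(descriptor, centroids[i]))


def bag_of_visual_words(descriptors, centroids):
    """Count how many descriptors of one video fall into each cluster."""
    histogram = [0] * len(centroids)
    for d in descriptors:
        histogram[nearest_centroid(d, centroids)] += 1
    return histogram


# Toy vocabulary with k = 3 and four descriptors from a single video.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
video_descriptors = [(0.1, 0.2), (0.9, 1.1), (4.8, 5.2), (1.2, 0.8)]
histogram = bag_of_visual_words(video_descriptors, centroids)
```

The resulting vector has one entry per visual word and is what the video-feature classifier would take as input.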
  • 15. Object Detection in Videos
      • Discriminatively Trained Deformable Part Models (Felzenszwalb et al.): pre-trained object detectors for 19 objects
      • Extract one frame per second
      • Run object detection on each frame, compute the maximum score of each object over all frames, and use that score to compute the probability of each object for each video
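The per-video aggregation can be sketched as below. The max-over-frames step follows the slide; the mapping from a maximum score to a probability is not given in the transcript, so the logistic (sigmoid) function used here is purely an assumption, as are the function name and the toy scores.

```python
import math


def video_object_probabilities(frame_scores):
    """Aggregate per-frame detector scores into one probability per object.

    frame_scores: list of dicts, one per sampled frame, mapping an object
    name to that frame's detector score.
    """
    best = {}
    for scores in frame_scores:
        for obj, score in scores.items():
            best[obj] = max(score, best.get(obj, float("-inf")))
    # ASSUMPTION: a sigmoid squashes the maximum score into (0, 1).
    return {obj: 1.0 / (1.0 + math.exp(-s)) for obj, s in best.items()}


# Two frames sampled from one video (hypothetical detector scores).
frames = [
    {"horse": -1.0, "person": 0.5},
    {"horse": 2.0, "person": 0.0},
]
probs = video_object_probabilities(frames)
```

Taking the maximum means a single confident detection in any frame is enough for the object to count as present in the video.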
  • 17. Learning Correlations between Activities and Objects
      • English Gigaword corpus 2005 (LDC), 15 GB of raw text
      • Occurrence counts:
          • of an activity Ai: an occurrence of any of the verbs in the verb cluster
          • of an object Oj: an occurrence of the object noun Oj or one of its synonyms
      • Co-occurrence of an activity and an object:
          • Windowing: an occurrence of the object within w or fewer words of an occurrence of the activity. We experimented with w of 3, 10, and the entire sentence.
          • POS tagging: the entire corpus is POS-tagged using the Stanford tagger. A co-occurrence is an occurrence of the object tagged as a noun within w or fewer words of an occurrence of the activity tagged as a verb.
  • 18. Learning Correlations between Activities and Objects
      • Parsing: the corpus is parsed using the Stanford statistical syntactic dependency parser.
          • Parsing I: the object is the direct object of the activity verb in the sentence.
          • Parsing II: the object is syntactically attached to the activity by any grammatical relation (e.g., PP, NP, ADVP).
      • Example: “Sitting in café, Kaye thumps a table and wails white blues”
          • Windowing: “sit” and “table” co-occur
          • POS tagging: “sit” and “table” co-occur
          • Parsing I and II: no co-occurrence
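The windowing method can be illustrated with a short sketch. It assumes the text is already tokenized and lemmatized, and it interprets "within w words" as a token-position difference of at most w; both are assumptions, as is the function name. The example uses a lemmatized version of the talk's café sentence.

```python
def window_cooccurrences(tokens, activity_verbs, object_nouns, w):
    """Count object occurrences that lie within w tokens of an activity verb."""
    verb_positions = [i for i, t in enumerate(tokens) if t in activity_verbs]
    count = 0
    for j, token in enumerate(tokens):
        if token in object_nouns and any(abs(i - j) <= w
                                         for i in verb_positions):
            count += 1
    return count


# Lemmatized, lower-cased tokens (preprocessing is assumed to be done).
tokens = "sit in cafe kaye thump a table and wail white blues".split()
wide = window_cooccurrences(tokens, {"sit"}, {"table"}, w=10)
narrow = window_cooccurrences(tokens, {"sit"}, {"table"}, w=3)
```

Here "sit" and "table" are six tokens apart, so they co-occur for w = 10 but not for w = 3, even though the table is not what is being sat on. This is the kind of spurious pairing that the parsing-based methods are meant to avoid.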
  • 19. Learning Correlations between Activities and Objects: the probability of each activity given each object is estimated using Laplace (add-one) smoothing.
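Add-one smoothing of the co-occurrence counts can be sketched as follows. The slide's equation is not present in the transcript, so the exact form here is an assumption: the standard add-one estimate, with the denominator chosen so that the probabilities over all activities sum to one for a given object. The counts and names are hypothetical.

```python
def smoothed_activity_given_object(cooc, activities, obj):
    """Estimate P(activity | obj) from co-occurrence counts with add-one smoothing.

    cooc maps (activity, object) pairs to co-occurrence counts; missing
    pairs are treated as zero.
    """
    total = sum(cooc.get((a, obj), 0) for a in activities)
    return {
        a: (cooc.get((a, obj), 0) + 1) / (total + len(activities))
        for a in activities
    }


# Hypothetical counts mined from a corpus.
cooc = {("ride", "horse"): 8, ("cut", "horse"): 0}
activities = ["ride", "cut", "dance"]
p_given_horse = smoothed_activity_given_object(cooc, activities, "horse")
```

Smoothing keeps activities that never co-occurred with an object in the corpus from receiving zero probability, which would otherwise rule them out entirely when the object is detected.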
  • 21. Activity Recognizer using Object Features: the probability of an activity Ai is computed using object detection and co-occurrence information.
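One plausible way to combine the two sources of information is to weight each object's activity distribution by the probability that the object appears in the video. The slide's formula is not in the transcript, so this weighted sum, the final normalization, and all names and numbers below are assumptions for illustration only.

```python
def activity_scores_from_objects(p_act_given_obj, p_obj_in_video):
    """Score each activity as sum_j P(activity | O_j) * P(O_j | video)."""
    scores = {}
    for obj, p_obj in p_obj_in_video.items():
        for act, p_act in p_act_given_obj[obj].items():
            scores[act] = scores.get(act, 0.0) + p_act * p_obj
    total = sum(scores.values())
    if total == 0:
        return scores
    return {act: s / total for act, s in scores.items()}


# Hypothetical text-mined correlations and per-video detection probabilities.
p_act_given_obj = {
    "horse": {"ride": 0.8, "cut": 0.2},
    "knife": {"ride": 0.1, "cut": 0.9},
}
p_obj_in_video = {"horse": 0.9, "knife": 0.1}
object_scores = activity_scores_from_objects(p_act_given_obj, p_obj_in_video)
```

In this toy case the confidently detected horse dominates, so "ride" receives most of the probability mass.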
  • 23. Integrated Activity Recognizer. The final recognized activity is defined separately for two cases:
      • Videos on which the object detector detected at least one object (applying the Naïve Bayes independence assumption between features given the activity)
      • Videos on which there were no detected objects
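The two-case decision rule can be sketched as below. The slide's equations are not in the transcript, so the details are assumptions: under a Naïve Bayes independence assumption the posterior is proportional to the product of the two per-feature posteriors divided by the activity prior, and this sketch uses that form with a uniform prior by default. Falling back to the video-only classifier when no objects are detected is likewise an assumption.

```python
def recognize(p_video, p_object=None, prior=None):
    """Pick the activity with the highest combined score.

    p_video:  P(activity | video features) from the video classifier.
    p_object: P(activity | object features), or None/empty when no object
              was detected in the video.
    prior:    optional P(activity); a uniform prior is assumed if omitted.
    """
    if not p_object:
        # ASSUMPTION: without detected objects, use video features alone.
        return max(p_video, key=p_video.get)

    def score(activity):
        p_a = prior[activity] if prior else 1.0
        return p_video[activity] * p_object.get(activity, 0.0) / p_a

    return max(p_video, key=score)


# Hypothetical classifier outputs for one test video.
p_video = {"ride": 0.4, "dance": 0.6}
p_object = {"ride": 0.9, "dance": 0.1}
with_objects = recognize(p_video, p_object)
without_objects = recognize(p_video)
```

In this example the video classifier alone prefers "dance", but strong object evidence flips the decision to "ride", which is the kind of correction the integrated system is designed to make.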
  • 24. Experimental Methodology
      • Ideally we would have trained detectors for all objects, but because we have only 19 object detectors, we placed the videos containing at least one of the 19 objects in the test set (128 videos).
      • From the remaining videos we discovered activity labels, finding 28 clusters in the 1,190-video training set.
      • The training set is used to construct the activity classifier based on video features.
      • We do not use the descriptions of the test videos as input; they are used only to obtain gold-standard labels for calculating accuracy. At test time only the video is given as input, and we obtain an activity as output.
      • We run the object detectors on the test set.
      • For activity-object correlation we compare all the methods: windowing, POS tagging, and parsing, along with their variants.
      • All the pieces are then combined in the final activity recognizer to obtain the predicted label.
  • 25. Experimental Evaluation: final results using different text mining methods (accuracy)
      • Parsing II: 0.48
      • Parsing I: 0.523
      • POS tagging, w = full sentence: 0.4
      • POS tagging, w = 10: 0.44
      • POS tagging, w = 3: 0.46
      • Windowing, w = full sentence: 0.46
      • Windowing, w = 10: 0.47
      • Windowing, w = 3: 0.47
  • 26. Experimental Evaluation: result of system ablations (accuracy)
      • Integrated system: 0.52
      • Object features only, using Parsing I: 0.38
      • Video features only: 0.39
  • 27. Conclusion. Three important contributions:
      • Automatically discovering activity classes from natural language descriptions of videos.
      • Improving existing activity recognition systems using object context together with correlations between objects and activities.
      • Showing that natural language processing techniques can be used to extract knowledge about the correlation of objects and activities from general text.
  • 29. Abstract. We present a novel combination of standard activity classification, object recognition, and text mining to learn effective activity recognizers that do not require any manual labeling of training videos and that use “world knowledge” to improve existing systems.
  • 30. Related Work
      • There has been a lot of recent work in video activity recognition: Malik et al. (2003), Laptev et al. (2004)
          • They all use a predefined set of activities; we automatically discover the set of activities from textual descriptions.
      • Work on context information to aid activity recognition:
          • Scene context: Laptev et al. (2009)
          • Object context: Davis et al. (2007), Aggarwal et al. (2007), Rehg et al. (2007)
          • Most use a constrained set of activities; we address a diverse set of activities in real-world YouTube videos.
      • Work using text associated with video in the form of scripts or closed captions: Everingham et al. (2006), Laptev et al. (2007), Gupta et al. (2010)
          • We use a large text corpus to automatically extract correlations between activities and objects.
          • We demonstrate the advantage of deeper natural language processing, specifically parsing, for mining general knowledge connecting activities and objects.