
Frontiers of Action Recognition: Methods, Tasks, and Datasets


精密工学会 画像応用技術専門委員会 (JSPE Technical Committee on Industrial Application of Image Processing), FY2022 4th regular research meeting, 2022/11/18

1. Frontiers of Action Recognition: Methods, Tasks, and Datasets
   Toru Tamaki (Nagoya Institute of Technology)
   JSPE Technical Committee on Industrial Application of Image Processing, 2022/11/18

2. What is action recognition?
   • Recognizing human actions: Action Recognition (AR)
   • Classification of video: not limited to "videos of humans acting"
     - Some methods do focus on humans, combining person detection or pose estimation
   • Input: video
     - One more dimension, time
     - Temporal and motion information must be modeled
   (Diagram: image recognition maps an image through a model to a category; action recognition maps a video, with a time axis, to a category.)

3. Major tasks
   • Survey: M. S. Hutchinson, V. N. Gadepally: Video Action Understanding [Hutchinson&Gadepally, IEEE Access, 2021]

4. Trimmed vs. untrimmed
   • Untrimmed video
     - Collected from YouTube etc.
     - Source lengths vary (minutes or longer)
     - Tasks: temporal action localization, etc.
   • Trimmed video (clip)
     - Action segments extracted from untrimmed videos
     - A few seconds to about 10 s
     - Tasks: action recognition, etc.
   • Caveat
     - YouTube videos were already cut by the uploader in the first place
     - The same video may be called "trimmed" when classified as a whole and "untrimmed" when segments are detected within it (UCF101-24 is such a case)
   (Figure 2 of [Hutchinson&Gadepally, IEEE Access, 2021]: overview of the main action understanding problems; video depicted as a 3D volume with a left-to-right temporal dimension.)

5. Related tasks
   • Action recognition
     - zero-shot (ZSAR)
     - low-res video, low-quality video
     - compressed video
   • Captioning
     - video captioning: captioning of trimmed video
     - dense captioning: temporal localization + captioning of untrimmed video
   • Video QA
   • Video object segmentation (VOS)
   • Tracking
     - Video Object Tracking (VOT)
     - Multiple Object Tracking (MOT)
   • Video summarization
   • More tasks & datasets: xiaobai1217/Awesome-Video-Datasets

6. Challenges, contests, competitions
   • ActivityNet challenges: 2016, 2017, 2018, 2019, 2020, 2021
   • LOVEU (LOng-form VidEo Understanding): 2021, 2022
   • DeeperAction (Localized and Detailed Understanding of Human Actions in Videos): 2021, 2022

7. ActivityNet challenges
   • 2016: untrimmed video classification; temporal activity localization
   • 2017: untrimmed video classification; trimmed video classification (Kinetics); temporal action proposals; temporal action localization; event dense captioning
   • 2018
     - ActivityNet tasks: temporal action proposals, temporal action localization, event dense captioning
     - Guest tasks: trimmed activity recognition (Kinetics), spatio-temporal action localization (AVA), trimmed event recognition (MiT)
   • 2019
     - ActivityNet tasks: temporal action proposals, temporal action localization, event dense captioning
     - Guest tasks: trimmed activity recognition (Kinetics), spatio-temporal action localization (AVA), EPIC-Kitchens, activity detection (ActEV-PC)
   • 2020
     - ActivityNet tasks: temporal action localization, event dense captioning
     - Guest tasks: object localization (ActivityNet Entities), trimmed activity recognition (Kinetics), spatio-temporal action localization (AVA), EPIC-Kitchens, activity detection (ActEV SDL), temporal action localization (HACS)
   • 2021
     - Action recognition: Kinetics-700, TinyActions
     - Temporal localization: ActivityNet, HACS, SoccerNet
     - Spatio-temporal localization: AVA-Kinetics & Active Speakers, ActEV SDL UF
     - Event understanding: event dense captioning (ActivityNet), object localization (ActivityNet Entities), Video Semantic Role Labeling (VidSitu)
     - Multi-view, cross-modal: MMAct, HOMAGE
   • 2022: temporal localization (ActivityNet); event dense captioning (ActivityNet); AVA-Kinetics & Active Speakers; ActEV SRL; SoccerNet; TinyActions; HOMAGE

8. Datasets

9. Major datasets
   • Table 2 of [Hutchinson&Gadepally, IEEE Access, 2021]: thirty historically influential, current state-of-the-art, and emerging video action benchmarks, listing dataset name, year, Google Scholar citations as of May 2021, number of classes and instances, actors (human/non-human), annotation types (class / temporal / spatio-temporal), and theme.

10. Action recognition datasets
   (Timeline figure, 2004-2022, grouping datasets by annotation type:
    • Action label(s) per video/clip: KTH, Weizmann, IXMAS, Hollywood/Hollywood2, UCF Sports, UCF11/50/101, Olympic Sports, HMDB51, Sports-1M, MPII Cooking/Composites/Cooking 2, Breakfast, Kinetics 400/600/700/700-2020 (and Kinetics 100 / Mini-Kinetics 200), SSv1/SSv2, Jester, Charades / Charades-Ego, MiT / Multi-MiT, HVU, Diving48, HAA500, FineGym, EPIC-KITCHENS-55/100, YouTube-8M, HowTo100M
    • Temporal annotation: Coffee and Cigarettes, Hollywood-Extended, 50Salads, YouCook/YouCook2, THUMOS14/15, MultiTHUMOS, ActivityNet v1.2/v1.3, ActivityNet Entities, HACS Segment/Clip, YouTube-8M Segments
    • Spatio-temporal annotation: UCF Sports, UCF101-24 (THUMOS13), JHMDB21, DALY, AVA Actions, AVA-Kinetics, Action Genome, Home Action Genome, Ego4D
    Reference vision milestones (HOG, SURF, FAST, ORB, ImageNet, AlexNet, ResNet, U-Net, GAN, Transformer, ViT) are shown for context; marks indicate machine-generated labels, multi-label, untrimmed YouTube videos, and worker/lab videos.)

11. KTH [Schuldt+, ICPR2004]
   • KTH Royal Institute of Technology (Kungliga Tekniska Högskolan)
   • 6 classes, 2391 videos, 25 fps, static camera, ~4 s on average, grayscale
   • 25 subjects, 4 scenarios: outdoors (with/without scale variation, different clothing) and indoors
   (Figure: Walking, Jogging, Running, Boxing, Hand waving, Hand clapping across scenarios s1-s4.)

12. Weizmann
   • Weizmann Institute of Science
   • Static camera
   • 2005: 9 classes, 81 videos (180x144, 25 fps) [Blank+, ICCV2005]
   • 2007: 10 classes, 90 videos (180x144, deinterlaced 50 fps) [Gorelick+, TPAMI2007]

13. IXMAS
   • Inria Xmas Motion Acquisition Sequences
   • IXMAS [Weinland+, CVIU2006]
     - 11 classes; 10 actors (5 male, 5 female), each performing every action 3 times; 330 videos in total
   • IXMAS Actions with Occlusions [Weinland+, ECCV2010]
     - 11 classes, multi-view, with occlusions; 1148 videos
   (Figure: 11 actions: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, pick up.)

14. Coffee and Cigarettes [Laptev&Pérez, ICCV2007]
   • The first in-the-wild dataset (KTH, Weizmann, etc. were captured under controlled conditions)
   • Footage: the film Coffee and Cigarettes (2003), 36,000 frames in total
   • Tasks
     - Classification, 2 classes: drinking (105 samples), smoking (141 samples)
     - Spatio-temporal action localization: 3D bbox annotations; segment length 30-200 frames, 70 on average
   • https://www.di.ens.fr/~laptev/download.html
   • https://www.di.ens.fr/~laptev/eventdetection/eventann01/
   • https://web.archive.org/web/20080219232702/http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

15. Hollywood [Laptev+, CVPR2008]
   • Hollywood Human Actions (HOHA)
   • Annotating video is costly, so use movies whose scripts and subtitles are available; scripts can be obtained on the web
   • Align time-stamped subtitles with the untimed script
   • Train a text classifier that maps script sentences to 8 action classes and use it for labeling
   • Training set: 12 movies
     - clean/manual: labeled by hand
     - automatic: labeled by the classifier, videos under 1000 frames
   • Test set: 20 movies, labeled by hand

16. Hollywood2 [Marszalek+, CVPR2009]
   • Extension of Hollywood, built the same way
   • Categories: actions expanded to 12 classes; 10 scene classes added
   • Automatic train set: 33 movies, labeled by a classifier; 810 action / 570 scene videos (actions and scenes use separate samples)
   • Clean test set: 36 movies, labeled by hand; 570 action / 582 scene videos

17. Hollywood-Extended [Bojanowski+, ECCV2014]
   • Ordering of actions
   • 69 movies, building on Hollywood2
   • 16 actions annotated on every frame
   • Clips extracted where different actions occur in sequence: 937 clips in total
   • Each clip further divided into 10-frame segments (84 segments per clip on average)

18. HMDB51 [Jhuang+, ICCV2011]
   • Human Motion DataBase
     - Sources: digitized movies, Prelinger Archive, YouTube and Google videos, etc.
     - Existing UCF-Sports and Olympic Sports drew only from YouTube, had ambiguous actions, or could be recognized from pose alone
   • 51 classes, 6766 videos
     - At least 101 videos per class
     - Roughly 1-5 s, 3.15 s on average
     - 3 splits: train 70% (3570), test 30% (1530), unused clips (1530)
     - Height 240 px, 30 fps, DivX 5.0 (ffmpeg), avi
     - Motion-compensated (stabilization)
   • Construction: main actor at least 60 px tall; at least 1 s long; one action per clip
   • Distribution: rar archives (unrar required); both stabilized and unstabilized versions

19. JHMDB21 [Jhuang+, ICCV2013]
   • Joint-annotated Human Motion DataBase
     - 21 classes / 928 videos selected from HMDB51
     - Trimmed to action start and end; 15-48 frames per clip
     - Per-frame annotations: scale, pose, person mask, flow, camera viewpoint
     - Uses the 2D puppet model of [Zuffi&Black, MPII-TR-2013][Zuffi+, ICCV2013]
   • Bbox annotations
     - Not in the original
     - [Li+, ECCV2020] distributes them on Google Drive via a GitHub repository, with clips trimmed to at most 40 frames

20. UCF11/50/101/Sports: University of Central Florida
   • UCF11 (YouTube Action Dataset) [Liu+, CVPR2009]
     - YouTube videos = "in the wild"
     - 11 classes, 1168 videos
     - 25 groups; clips in a group share environment and videographer, and groups are kept apart across splits to prevent recognition by background
   • UCF50 [Reddy&Shah, MVAP2012]
     - 50 classes, 6676 videos; UCF11 is a subset of UCF50
   • UCF101 [Soomro+, arXiv, 2012]
     - UCF50 is a subset of UCF101
   • UCF-Sports [Rodriguez+, CVPR2008][Soomro&Zamir, 2014]

21. UCF101 [Soomro+, arXiv, 2012]
   • University of Central Florida
     - Source: YouTube, manually cleaned
     - The ~50 classes of existing HMDB51 and UCF50 were too few
   • 101 classes, 13,320 videos
     - 51 classes added to UCF50; videos of the added classes have audio
     - 25 groups, for 25-fold cross-validation; each group has 4-7 videos per class (clips in a group share features such as background or actors)
     - 3 splits: train 9537, test 3783
     - 320x240 px, 25 fps, DivX, avi
   • Distribution: rar archive (unrar required)
   • Min 1.06 s, max 71.04 s, mean 7.21 s, total 1600 min
     - Treated as trimmed clips in action recognition and temporal localization tasks (as with Kinetics)
     - Treated as untrimmed videos in spatio-temporal localization tasks (as with JHMDB)

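Because clips within a group share background and actors, any custom split must keep whole groups on one side of the train/test boundary. A minimal sketch of a group-aware split, assuming the standard UCF101 filename pattern `v_<Class>_g<group>_c<clip>.avi`; the helper and the choice of test groups are illustrative, not the official split lists:

```python
import re

def group_split(filenames, test_groups=range(1, 8)):
    """Split UCF101-style filenames so that no group straddles train/test.

    Filenames are assumed to follow v_<Class>_g<group>_c<clip>.avi.
    """
    pat = re.compile(r"v_(?P<cls>\w+)_g(?P<grp>\d+)_c\d+\.avi")
    train, test = [], []
    for name in filenames:
        grp = int(pat.match(name).group("grp"))
        (test if grp in set(test_groups) else train).append(name)
    return train, test

files = ["v_Basketball_g01_c01.avi", "v_Basketball_g01_c02.avi",
         "v_Basketball_g08_c01.avi"]
train, test = group_split(files)
# both group-1 clips land in test together; the group-8 clip stays in train
```

The official 3 splits follow the same principle: group membership, not individual clips, decides the side of the split.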
22. Kinetics
   • The DeepMind Kinetics human action video dataset
     - Kinetics-400 [Kay+, arXiv2017][Carreira&Zisserman, CVPR2017]
     - Kinetics-600 [Carreira+, arXiv2018]
     - Kinetics-700 [Carreira+, arXiv2019]
     - Kinetics-700-2020 [Smaira+, arXiv2020]: "the 2020 edition of the DeepMind Kinetics human action video dataset"
   • Policy: one clip per video (HMDB and UCF take multiple clips per video)
   • Distribution
     - Officially, only YouTube links (plus timestamps): download with youtube-dl, trim with ffmpeg; but links disappear at roughly 5%/year
     - Videos were also distributed for the ActivityNet challenge
     - Or download yourself from the Amazon S3 links in the CVD Foundation GitHub repository
   • Clips per class as of 14-10-2020 [Smaira+, arXiv2020, Table 1]:
       Dataset            #classes  avg clips/class  min clips/class
       Kinetics-400       400       683              303
       Kinetics-600       600       762              519
       Kinetics-700       700       906              532
       Kinetics-700-2020  700       926              705

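The official link-based distribution boils down to "fetch the full video, then cut the annotated window". A hypothetical sketch that only builds the two command lines (nothing is executed here); it assumes an annotation row giving a YouTube id plus start/end seconds, and uses yt-dlp, the maintained fork of youtube-dl:

```python
# Build, but do not run, the download and trim commands for one Kinetics clip.
# youtube_id / time_start / time_end mirror the official annotation columns.

def kinetics_commands(youtube_id, time_start, time_end, out="clip.mp4"):
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    # yt-dlp: fetch the whole source video to a fixed filename
    download = ["yt-dlp", "-o", "full.mp4", url]
    # ffmpeg: seek to the start, copy the annotated duration without re-encoding
    trim = ["ffmpeg", "-ss", str(time_start), "-i", "full.mp4",
            "-t", str(time_end - time_start), "-c", "copy", out]
    return download, trim

dl, tr = kinetics_commands("abcdefghijk", 12, 22)
# dl is the yt-dlp command; tr cuts the 10-second annotated window
```

In practice the commands would be passed to `subprocess.run`, with retries, because a noticeable fraction of links has already disappeared.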
23. Kinetics-400 [Kay+, arXiv2017]
   • Kinetics400, K400
     - Source: YouTube
     - train 220k, val 18k, test 35k
     - Videos are at most 10 s; sizes and frame rates vary
     - Evaluated by top-1 and top-5, because a video may contain several actions but carries only one annotation (the same concept as ImageNet?)
   (Figure: example classes such as headbanging, shaking hands, robot dancing.)

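Top-1/top-5 accuracy falls out directly from per-class scores; a minimal reference implementation (ours, not from the paper):

```python
def topk_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores.

    scores: one list of per-class scores per sample; labels: true class indices.
    """
    hits = 0
    for s, y in zip(scores, labels):
        # indices of the k largest scores
        topk = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]
topk_accuracy(scores, labels, k=1)  # 0.5: only the first sample is a top-1 hit
```

Top-5 is forgiving precisely in the multiple-actions-one-label case the slide describes: a plausible but unannotated action in the top ranks does not count as an error.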
24. Kinetics-400: construction [Kay+, arXiv2017]
   • Action names: reused from existing HMDB, UCF101, ActivityNet, etc., plus worker suggestions
   • (automatic) Candidate videos: search YouTube titles for the action name
   • (automatic) Action extraction: build an image classifier from Google Image search results for the action name, score candidate videos frame by frame, and cut a 10-s clip around the highest-scoring time (shorter if it falls at the start or end of the video; in the experiments, short videos are looped in advance)
   • (manual) AMT workers judge each clip from the video alone (no audio)
     - Clips known to be genuine are mixed in to assess worker quality
     - A clip is accepted if 3 or more workers approve
   • (manual) Filtering
     - De-duplication: cluster by Inception-V1 scores, keep one clip per cluster
     - Merge and reorganize categories
     - Sort by Two-Stream model scores and filter further

25. Kinetics-600 [Carreira+, arXiv2018]
   • Kinetics600, K600: largely as K400
   • Two test sets
     - standard test set: labels released; papers should report numbers on this one
     - held-out test set: labels unreleased, for the ActivityNet challenge
     - K400 test-set labels were released at the same time
   • Changes from K400
     - Class names drawn from the Google Knowledge Graph and YouTube search completion
     - Candidate search in Portuguese as well as English (spoken in Brazil, so top-2 by speaker population; an author happens to be a native Portuguese speaker); weighted N-grams make multilingual search possible; search over titles and metadata
     - Classes: 368 K400 classes reused; the remaining 32 renamed, split, or removed
     - Some videos moved: K400 val to K600 test, K400 train to K600 val
   • Split sizes [Carreira+, arXiv2018]:
     - K400: 250-1000 train clips/class, 50 val, 100 test per class; total train 246,245; overall 306,245; 400 classes
     - K600: 450-1000 train clips/class, 50 val, 100 test, ~50 held-out per class; total train 392,622; overall 495,547; 600 classes

26. Kinetics-700 [Carreira+, arXiv2019]
   • Kinetics700, K700: largely as K600
     - Held-out test set abolished
     - The paper evaluates on the standard val set; the test set is for the ActivityNet challenge
   • Changes from K600
     - Classes: 597 K600 classes reused; some split (fruit into apples and blueberries); some adopted from the recent EPIC-Kitchens and AVA; creative classes added (making slime, zero gravity)
     - Candidate search: search keywords decoupled from class names, so search is not restricted to the class name; French and Spanish added to English and Portuguese (top-4 by speaker population); videos over 5 minutes excluded
     - Final judgment: left to workers (for K400 and K600 the authors made the final call themselves)
   • Notes on the collection method
     - Well suited to actions sustained through a clip (playing guitar, juggling)
     - Hard for actions with a temporal start/end structure (dropping a plate, getting out of a car)
   • Final goal: a 1000-class dataset
   • Split sizes [Carreira+, arXiv2019]: K700 has 450-1000 train clips/class, 50 val, 100 test per class; total train 545,317; overall 650,317; 700 classes

27. Kinetics-700-2020 [Smaira+, arXiv2020]
   • Kinetics700-2020: largely as K700
   • Changes from K700
     - Classes unchanged
     - The 123 K700 classes with few videos were topped up to at least 700 each
     - Videos were also added to the other K700 classes, because YouTube videos disappear at roughly 5% per year
   • Experiments: performance improves as the number of videos per class grows
   • Clips originally vs. still retrievable as of 14-10-2020 [Smaira+, arXiv2020, Table 2]:
     - K400: train 246,245 → 220,033 (89%); val 20,000 → 18,059 (90%); test 40,000 → 35,400 (89%)
     - K600: train 392,622 → 371,910 (95%); val 30,000 → 28,366 (95%); test 60,000 → 56,703 (95%)
     - K700: train 545,317 → 532,370 (98%); val 35,000 → 34,056 (97%); test 70,000 → 67,302 (96%)
     - K700-2020: train 545,793; val 34,256; test 67,858

28. Kinetics-200 / Mini-Kinetics
   • Proposed in S3D [Xie+, ECCV2018]
     - A 200-class subset
     - train: 400 random samples per class; val: 25 per class
     - train/val: 80k/5k

29. Kinetics-100
   • Proposed in VideoSSL [Jing+, WACV2021]
     - A 100-class subset in which every class has at least 700 training samples

30. Mimetics [Weinzaepfel&Rogez, IJCV2021][Weinzaepfel&Rogez, arXiv2019]
   • Action videos that defy their context
     - 713 videos; examples: surfing inside a room, mime, bowling on a soccer field
     - 50 classes (a subset of Kinetics-400)
   • Evaluation only; never used for training
     - Usage: train on the Kinetics-400 training set, evaluate on the 50 Mimetics classes

31. SSv2 [Goyal+, ICCV2017][Mahdisoltani+, arXiv2018]
   • something-something
     - v1 [Goyal+, ICCV2017]
     - v2: SSv2, sth-sth-v2 (v2 is the mainstream version)
     - Originally by 20BN (Twenty Billion Neurons Inc.), now Qualcomm, which acquired 20BN in July 2021
   • Videos
     - v1: 108,499 (4.03 s on average); train 86k, val 12k, test 11k
     - v2: 220,847; train 167k, val 25k, test 27k
     - 174 labels (the number of template sentences)
     - v2 ships as webm (v1 as jpeg frames?)
   (Figure: example captions such as "Putting a white remote into a cardboard box", "Pretending to put candy onto chair", "Pushing a green chilli so that it falls off the table", "Moving puncher closer to scissor"; object entries in italics.)

32. SSv2: concept and construction [Goyal+, ICCV2017][Mahdisoltani+, arXiv2018]
   • Concept
     - Models should understand noun/verb patterns, not just action labels
     - Template sentences are prepared: "Dropping [something] into [something]", "Stacking [number of] [something]"
     - "something" is a placeholder filled with the name of the manipulated object
     - 174 template sentences (this is the number of labels); object names are numerous (23,688 in v2 train)
   • Construction
     - As with Charades, workers film the videos themselves
     - Many nouns and verbs are prepared, but their combinations are enormous
     - Workers are shown a template sentence (with placeholders), shoot a video, and submit the object names along with it

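The label/caption relationship above can be made concrete: the class is the template itself, and the full caption substitutes the worker-supplied object names into the placeholders. A small illustrative sketch; the helper is ours, not part of the dataset tooling:

```python
def instantiate(template, objects):
    """Replace each bracketed placeholder with the next object name, in order."""
    caption = template
    for obj in objects:
        start = caption.index("[")        # first remaining placeholder
        end = caption.index("]") + 1
        caption = caption[:start] + obj + caption[end:]
    return caption

label = "Dropping [something] into [something]"   # one of the 174 templates
caption = instantiate(label, ["a coin", "a glass"])
# caption == "Dropping a coin into a glass"
```

Classification targets the 174 templates, while the instantiated captions carry the object-level information the concept slide argues models should learn.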
33. Jester [Materzynska+, ICCVW2019]
   • Gesture recognition dataset
     - Originally 20BN, now Qualcomm (same as SSv2)
   • Videos
     - 148,092 clips; train/val/test split 8:1:1
     - 1376 AMT workers participated
     - 27 gesture classes, including "No gesture" and "Doing Other Things"
     - 12 fps, height 100 px, variable width; 36 frames (3 s) on average
   (Figure: examples of 'Zooming Out With Two Fingers', 'Rolling Hand Backward', 'Rolling Hand Forward', varying in person, background, and lighting.)

34. MiT [Monfort+, TPAMI2019]
   • Moments in Time
     - "That moment, that instant"
     - 3-s clips
     - train 727k, val 30k (the paper states train 802k, val 34k, test 68k; its subtitle is "one million videos")
     - 305 classes (339 in the paper)
     - Both visual and audible videos; some clips are audio over a still image
     - Diverse sources: YouTube, Flickr, Vine, Metacafe, Peeks, Vimeo, VideoBlocks, Bing, Giphy, The Weather Channel, and Getty Images

35. MiT: construction and distribution [Monfort+, TPAMI2019]
   • Construction
     - Class names: clustering the 4500 most frequent verbs of VerbNet
     - Video search: query the metadata of the diverse sources for the class names
     - Cut a random 3-s clip (using a model to pick the 3 s would introduce bias)
     - AMT workers reject unsuitable clips
   • Distribution
     - Fill in a form, then download a zip (275 GB)
     - Link-based distribution is presumably impossible given the diversity of sources

36. Multi-MiT [Monfort+, TPAMI2021]
   • Multi-label extension of MiT
     - Over two million action labels for over one million 3-second videos
     - 553k videos with 2+ labels; 275k with 3+ labels
     - 1M videos: train 997k, val 9.8k; videos added on top of MiT
     - 292 classes: merged, added, and removed relative to MiT

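Multi-label datasets such as Multi-MiT call for set-valued metrics rather than top-1 accuracy; one common choice is micro-averaged precision and recall over predicted label sets. A minimal sketch, illustrative rather than the paper's evaluation code:

```python
def micro_prf(preds, gts):
    """Micro-averaged precision/recall over per-video label sets."""
    tp = fp = fn = 0
    for p, g in zip(preds, gts):
        tp += len(p & g)   # predicted and annotated
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but missed
    return tp / (tp + fp), tp / (tp + fn)

preds = [{"running", "jumping"}, {"eating"}]
gts = [{"running"}, {"eating", "drinking"}]
micro_prf(preds, gts)  # precision 2/3, recall 2/3
```

Micro-averaging pools counts over all videos before dividing, so frequent labels dominate; macro-averaging per class is the usual counterpart for long-tailed label distributions like Multi-MiT's.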
37. HVU [Diba+, ECCV2020]
   • Holistic Video Understanding
     - 3142 classes of 6 types; not just action recognition
     - Multi-label
     - Clips up to 10 s
   • HVU: train 481k, val 31k, test 65k
   • Mini-HVU: train 130k, val 10k; labels are machine-generated, though
   • Labels per type [Diba+, ECCV2020]:
       Type      Scene  Object  Action  Event  Attribute  Concept  Total
       #Labels   248    1678    739     69     117        291      3142
   (Figure: example multi-type tag sets for clips, e.g. playing_flute with forest, musician, flutist, wind_instrument, tree, ...)

38. HVU: construction and distribution [Diba+, ECCV2020]
   • Construction
     - Videos reused from existing datasets: YouTube-8M, Kinetics-600, HACS
     - Their test sets are excluded from the HVU train set
     - Annotation: rough tagging with a cloud API; humans remove incorrect tags; from the roughly 3500 remaining tags, 3142 labels are adopted via WordNet and sorted into the 6 types by hand
   • Distribution
     - YouTube links only; no video distribution
     - Many videos have since disappeared

39. HAA500 [Chung+, ICCV2021]
   • Human Atomic Action
     - 500 classes of atomic (smallest-unit) actions
     - MiT's "open" covers all kinds of opening; HAA has the specific "door open"
   • Variable clip length
     - Kinetics' fixed 10 s leaves much irrelevant footage and shot changes
   • 10,000 videos: train 8k, val 500, test 1500; 2.12 s on average
   • Class balance is controlled: 16 training videos per class
   • Only the acting person fills the frame; in 83% of videos just one person appears
   • Detectable joints (AlphaPose on the largest person, score > 0.5): Kinetics-400 41.0%, UCF101 37.8%, HMDB51 41.8%, FineGym 44.7%, HAA500 69.7%
   (Figure: fine-grained atomic annotations, e.g. Soccer-Throw In vs. Soccer-Shoot, in contrast to composite labels; AVA, HACS, and Kinetics frames often contain multiple people with different actions, while HAA500 clips are curated around one dominant figure.)

40. ActivityNet [Heilbron+, CVPR2015]
   • ActivityNet challenge versions
     - v1.2: 100 classes; train 4.8k, val 2.4k, test 2.5k
     - v1.3 (mainstream): 200 classes; train 10k, val 5k, test 5k
     - Classes defined hierarchically (4 levels); the CVPR2015 paper has 203 classes
     - Source: YouTube
   • Videos
     - untrimmed video: 5-10 min, max 20 min; half at 1280x720; mostly 30 fps
       Action intervals are annotated; several different actions can occur, so this is a multi-label problem
     - trimmed video: action segments extracted from each untrimmed video (1.41 per video on average)
       One action per clip, so a single-label problem

41. 41. Activity-Net • Construction procedure • category names: taken from the US Department of Labor's American Time Use Survey (ATUS), with 2000+ hierarchical categories (18 at the top level); 203 categories were selected manually (7 top-level) • candidate untrimmed videos retrieved from YouTube, also using WordNet synonyms, hypernyms, and hyponyms • AMT workers filtered them (expert workers screened by mixing in known-correct items) • action intervals annotated by multiple AMT expert workers, clustered to produce the final annotation, then cut out as trimmed videos • Distribution • only YouTube links (json) are provided • submit a missing-video list via the request form on the download page; perhaps to save effort, a download link for a zip of all videos is sent back, but • the Google Drive share hits its download quota and fails • the Baidu Drive (百度网盘) link is bandwidth-throttled (~50 kB/s) even after somehow creating an account from Japan; the only workaround is the premium tier (~400 JPY/month via the app store) at 50 MB/s • a GitHub issue shows everyone struggling with this [Heilbron+, CVPR2015]
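Once the annotation json is in hand, extracting the labeled intervals is straightforward. A minimal sketch, assuming the commonly distributed v1.3 layout (a "database" dict keyed by video id, each entry holding "subset", "duration", and "annotations" with "segment"/"label" fields; treat the field names as an assumption):

```python
# Minimal sketch of reading ActivityNet-style annotations.
# The schema ("database" -> {subset, duration, annotations:[{segment,label}]})
# is assumed from the commonly distributed v1.3 json; the sample is made up.
sample = {
    "database": {
        "abc123": {
            "subset": "training",
            "duration": 120.5,
            "annotations": [
                {"segment": [10.0, 35.2], "label": "Shoveling snow"},
                {"segment": [60.0, 90.0], "label": "Shoveling snow"},
            ],
        }
    }
}

def action_segments(db, subset="training"):
    """Yield (video_id, start, end, label) for every annotated interval."""
    for vid, info in db["database"].items():
        if info["subset"] != subset:
            continue
        for ann in info["annotations"]:
            start, end = ann["segment"]
            yield vid, start, end, ann["label"]

segs = list(action_segments(sample))
print(segs)  # two intervals for video abc123
```

Because one untrimmed video can contribute several intervals with different labels, any per-video label set derived from this is naturally multi-label.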
42. 42. Kinetics and Activity-Net • Kinetics • one-clip-per-video policy • used for trimmed-clip classification • not used for temporal action localization, because even if the source untrimmed video contains multiple actions, only one clip was extracted • the trimmed clips are what is distributed, so the untrimmed sources are unnecessary • Activity-Net • untrimmed videos plus trimmed clips • used for temporal action localization • rarely used for trimmed-clip classification (occasionally seen, but not mainstream) • not used for untrimmed-video classification: that problem setting is uncommon; attempting it leads to weakly supervised TAL or to a multi-label problem
43. 43. ActivityNet Captions • A caption for each event interval • 20k videos from ActivityNet, 100k sentences in total, with temporal interval annotations • 2.65 sentences per video on average • 13.48 words per sentence on average • Started the dense captioning task • video captioning: caption generation per trimmed video • dense captioning: TAL + captioning [Krishna+, ICCV2017]
44. 44. ActivityNet Entities • A bounding box attached to each word of the captions • built on ActivityNet Captions • 158k bboxes [Zhou+, CVPR2019]
45. 45. THUMOS • The THUMOS challenges [Idrees+, CVIU2016] • THUMOS 2013: spatio-temporal action localization • THUMOS 2014 and THUMOS 2015: temporal action localization • How to read "THUMOS" • from the Greek θυμος ("thumos") • meaning: spirited contest http://www.thumos.info/
46. 46. THUMOS13, aka UCF101-24 • Contest held with ICCV 2013 • uses only UCF101 videos • single label (101 classes) • 3-split cross validation: train on 2 splits, test on 1 • Task • 3207 untrimmed videos (a subset of UCF101) • spatio-temporal localization • Annotations • bboxes for 24 of the UCF101 classes, in xml format • UCF101-24 • for some reason never called THUMOS13; papers write UCF101-24 • the original xml is hard to use and contains errors • the corrected version by [Singh+, ICCV2017] (GitHub repository gurkirt/corrected-UCF101-Annots) seems to be the commonly used one, available as mat and pickle files [Singh+, ICCV2017]
47. 47. THUMOS14 & THUMOS15 • Changes from THUMOS 2013 • from spatio-temporal to temporal: interval detection is more useful than bbox detection (the annotations themselves are the same bboxes as THUMOS 2013) • introduction of background videos • a 20-class multi-label task: multiple actions per interval; intervals average 4.6 s (28% of video length) • introduction of untrimmed videos (up to ~250 s?) • training uses the trimmed UCF101 videos plus the untrimmed val and background videos [Idrees+, CVIU2016]
48. 48. THUMOS14 & THUMOS15 • THUMOS14 data • train: 13320 trimmed (UCF101), 101 classes • val: 1010 untrimmed, 20 classes, 200 with interval annotations • background: 2500 untrimmed • test: 1574 untrimmed, 20 classes, 213 with interval annotations • THUMOS15 data • train: 13320 trimmed (UCF101), 101 classes • val: 2140 untrimmed, 20 classes, 413 with interval annotations • background: 2980 untrimmed • test: 5613 untrimmed, 20 classes; the interval annotations and background status are unreleased • Contest • train, val, and background may be used; manual annotation by participants is not allowed • action recognition: multi-label, confidence over the 101 classes (though only ~15% of videos are actually multi-label) • TAL: each interval plus the confidence value of its detected class • Distribution • zip files; the extraction password is requested via a form
49. 49. THUMOS14 & THUMOS15 • Construction • positive videos (containing actions): YouTube search using FreeBase topics • a FreeBase topic was defined per action • keyword search used as a supplement • edited videos removed, e.g. by excluding "-awesome" and "-crazy" in queries • background videos: these were the harder part to collect • simply using videos of other categories does not work (they end up looking similar) • what is background? it shares the action's context (similar scene, objects, people) but the action itself never occurs • example: a piano is visible but nobody is playing it • videos containing actions of other categories are also disallowed
50. 50. THUMOS14 & THUMOS15 • Construction: 2 • How background videos were collected • search with background-suggesting keywords: X + "for sale", X + "review" • search by object names • search for object-action combinations not among the categories • Correction and annotation • workers judged whether each video contains any of the 101 classes • even when it does, slow motion, occlusion, animation, overly long videos (10+ min), edited footage, slideshows, and first-person views were excluded • co-occurring secondary actions were annotated when present • background videos: three people each covered 34 classes and verified that none of the classes occur
51. 51. THUMOS14 & THUMOS15 • Construction: 3 • Problem • action interval boundaries are ambiguous and subjective, varying from person to person • The 101 categories were split into two groups • instantaneous: clear intervals (BasketballDunk, GolfSwing); these 20 classes are used • cyclic: repetitive or sustained (Biking, HairCut, PlayingGuitar) • Measures to make the metric objective • intervals annotated strictly so as to match the UCF101 categories • ambiguous, incomplete, or atypical cases marked as ambiguous, giving 20 classes + an ambiguous class • a small IoU threshold of 10% was introduced • evaluation at multiple IoU thresholds
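Evaluation at multiple IoU thresholds boils down to the temporal IoU between 1-D intervals. A minimal sketch (the threshold values are THUMOS-style examples; the function itself is generic):

```python
# Temporal IoU between 1-D intervals, the core of TAL evaluation.
def t_iou(a, b):
    """IoU of two intervals a=(start, end), b=(start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

gt = (10.0, 20.0)
det = (12.0, 22.0)
print(t_iou(gt, det))  # overlap 8 s over union 12 s = 0.666...

# A detection counts as a hit at threshold t when IoU >= t, so a detection
# can be correct at 10% IoU yet wrong at 50%:
hits = {t: t_iou(gt, det) >= t for t in (0.1, 0.3, 0.5, 0.7)}
print(hits)
```

Reporting mAP over such a range of thresholds is what makes the benchmark tolerant of the boundary ambiguity described above.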
52. 52. MultiTHUMOS • An extension of THUMOS14 • the 200 val and 213 train videos are relabeled • every frame gets multiple labels • 45 new classes added to the original 20, for 65 classes in total • Label counts in THUMOS14 • per frame: mean 0.3, max 2 • per video: mean 1.1, max 3 • mean interval length 4.8 s • Label counts in MultiTHUMOS • per frame: mean 1.5, max 9 • per video: mean 10.5, max 25 • mean interval length 3.3 s, shortest 66 ms (2 frames) [Serana+, IJCV2018]
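Dense per-frame multi-label annotation of this kind is naturally represented as a multi-hot matrix, one row per frame. A small illustrative sketch (the fps, class indices, and intervals are made up):

```python
# MultiTHUMOS-style dense labels: each frame carries a multi-hot vector over
# all classes. Sketch: rasterize (start_sec, end_sec, class) intervals onto
# frames. Values below are illustrative, not from the dataset.
def rasterize(intervals, n_frames, n_classes, fps=30.0):
    labels = [[0] * n_classes for _ in range(n_frames)]
    for start_s, end_s, c in intervals:
        f0 = max(0, int(start_s * fps))
        f1 = min(n_frames, int(end_s * fps) + 1)
        for f in range(f0, f1):
            labels[f][c] = 1
    return labels

# Two overlapping intervals -> frames in the overlap carry two positive classes.
lab = rasterize([(0.0, 0.1, 0), (0.05, 0.2, 1)], n_frames=7, n_classes=2)
print([sum(row) for row in lab])  # [1, 2, 2, 2, 1, 1, 1]
```

Training against such targets typically uses a per-class binary loss rather than softmax, since classes are not mutually exclusive.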
53. 53. AVA • Atomic Visual Action • AVA-Kinetics [Li+, arXiv2020] • AVA Actions [Gu+, CVPR2018] • AVA Spoken Activity datasets • AVA Active Speaker [Roth+, arXiv2019] • AVA Speech [Chaudhuri, Interspeech2018] • Distribution • the official site provides csv only • the CVDF GitHub repository has links to download the videos from Amazon S3 https://research.google.com/ava/explore.html
54. 54. AVA Actions • The dataset usually meant by "AVA" • spatio-temporal action localization task • 80 categories of atomic (smallest-unit) actions (the paper reports 1.58M action labels in total) • 430 fifteen-minute untrimmed videos • train 235, val 64, test 131 • annotated at 1-second steps (1 Hz) • minutes 15 to 30 of videos longer than 30 minutes (900 frames) • labels attached to 1.5 s (3 s) segments • versions v1.0, v2.0, v2.1, v2.2; v2.2 is the current standard [Gu+, CVPR2018]
55. 55. AVA Actions • Construction • actions chosen to be common, atomic, and exhaustive, referring to past datasets • pose 14, person-object interaction 49, person-person interaction 17 • YouTube search for famous actors with "film" or "television" • videos 30+ min long, 1+ year old, 1000+ views • monochrome, low-resolution, animation, games, etc. excluded • minutes 15 to 30 used to skip trailers and intros • Bbox annotation • Faster R-CNN first, then manual addition and removal • cross-frame links matched automatically and fixed manually • per bbox: 1 pose class (required), plus up to 3 person-object and up to 3 person-person interactions when present, so up to 7 labels in total • note: pose is usually trained with softmax CE, person-object and person-person with BCE • each video has labels on 897 three-second segments at 1-second steps • Evaluation • frame-mAP
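For reference, a hedged sketch of reading AVA-style csv rows back into per-person boxes with multiple action labels. The column order (video_id, keyframe timestamp, normalized x1/y1/x2/y2, action_id, person_id) matches my understanding of the released v2.2 files, but treat it as an assumption; the rows themselves are invented:

```python
import csv, io

# Illustrative AVA-style rows: the same person box appears once per action
# label, so grouping by (video, timestamp, person) recovers the multi-label box.
rows = io.StringIO(
    "-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1\n"
    "-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,12,1\n"
)

boxes = {}
for video, ts, x1, y1, x2, y2, action, person in csv.reader(rows):
    key = (video, int(ts), int(person))
    box = tuple(map(float, (x1, y1, x2, y2)))  # coordinates normalized to [0, 1]
    boxes.setdefault(key, (box, []))[1].append(int(action))

print(boxes)  # one box carrying two action labels
```

This one-row-per-label layout is why naive row counts overestimate the number of boxes: counts of labels and of boxes differ whenever a person has more than one action.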
56. 56. AVA-Kinetics • Kinetics-700 annotated with bboxes by the same procedure as AVA Actions • only one frame (the key frame) in each 10-second clip receives bbox annotations • train • the 115 K700 classes corresponding to the 27 AVA categories with the lowest recognition accuracy were selected manually, and all of their videos were annotated • the remaining classes were sampled uniformly and annotated • val and test: all videos annotated [Li+, arXiv2020]
(From the AVA-Kinetics paper: the key frame is chosen as the frame with the highest Faster R-CNN person-detection confidence, at least 1 s from either end of the clip, after which human annotators add any missing boxes; the combined AVA-Kinetics dataset totals about 624k annotated frames from about 239k unique videos.)
57. 57. AVA Speech • Speech activity recognition • whether speech is present, and whether music or noise co-occurs • videos: AVA v1.0 (192 videos) • 4 categories • No speech • Clean speech • Speech + Music • Speech + Noise • labels at every time point (dense) • 185 videos, 15 min each [Chaudhuri+, Interspeech2018]
58. 58. AVA Active Speaker • Identifying the active speaker • whether each face detected in each frame is speaking • videos: AVA v1.0 (192 videos) • 3 categories • Not Speaking • Speaking and Audible • Speaking but not Audible • labels at every time point (dense) • 160 videos • a face tracker is used, with tracks between 1 and 10 s • 38.5k tracks, 3.65M faces [Roth+, ICASSP2020]
59. 59. HACS • HACS Clips • 1.5M trimmed videos • train 1.4M, val 50k, test 50k • from 492k, 6k, 6k untrimmed videos • 2-second clips • HACS Segments • 139k action segments • train 38k, val 6k, test 6k untrimmed videos • Categories • 200 categories (clips/segments), identical to ActivityNet 200 [Zhao+, ICCV2019]
60. 60. HACS • Construction • YouTube search with the 200 ActivityNet category names • the resulting 890k videos were filtered • deduplication; val and test videos of other datasets (Kinetics, ActivityNet, UCF101, HMDB51) also removed • 504k videos remain (max 4 min, mean 2.6 min) • 1.5M two-second clips sampled from them • uniform sampling fails because negatives (non-action) vastly outnumber positives (action) • multiple image classifiers split the clips into 0.6M positives and 0.9M negatives • these, split into train/val/test, form HACS Clips • a 50k untrimmed-video subset was then annotated with action intervals • this is HACS Segments • more segments per video and shorter segments than ActivityNet [Zhao+, ICCV2019]
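The classifier-based triage step can be caricatured as follows. This is a sketch of the idea of mining candidates from ensemble consensus and disagreement, not the paper's actual procedure; thresholds and category names are illustrative:

```python
# Sketch of HACS-style candidate mining: an ensemble of image classifiers
# scores each sampled clip for a class. Strong agreement suggests an easy
# positive/negative; strong disagreement suggests an informative hard example.
# All thresholds are illustrative assumptions.
def triage(scores, hi=0.8, lo=0.2):
    """scores: per-classifier probabilities for one clip and one class."""
    if min(scores) >= hi:
        return "likely-positive"   # consensus high: send to human verification
    if max(scores) <= lo:
        return "likely-negative"
    if max(scores) - min(scores) >= 0.5:
        return "disagreement"      # hard candidate, also worth annotating
    return "uncertain"

print(triage([0.90, 0.85, 0.92]))  # likely-positive
print(triage([0.95, 0.10, 0.40]))  # disagreement
```

Either way, human annotators make the final call; the triage only decides which clips are worth their time.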
61. 61. DALY • Daily Action Localization in YouTube • 10 categories, 8133 clips • high resolution: 1290x790 • long untrimmed videos (mean 3 min 45 s) • mean action length 8 s • an action is defined as the period during which the tool touches the body • interval annotation was done by the authors themselves rather than crowd workers • bboxes only on 5 frames sampled uniformly from each interval (at most 1 fps) [Weinzaepfel+, arXiv2016]
(Paper figures: DALY's spatial annotation includes the actor bounding box, the relevant objects, and the upper-body pose, i.e. a head box plus shoulder, elbow, and wrist joints.)
62. 62. Action Genome • Scene graphs for video • Visual Genome [Krishna+, IJCV2017]-style graphs annotated on the Charades videos • Visual Genome: relations among all objects in an image • Action Genome: only during action intervals, and only the relations between the person and the objects involved in the action • 234,253 frames, 35 object classes, 25 relation classes • 5 frames sampled uniformly from each action interval and annotated • 157 action categories (same as Charades) [Ji+, CVPR2020]
63. 63. Home Action Genome • Home Action Genome (HOMAGE) • all rooms of 2 houses, 27 participants • multimodal • cameras: egocentric, multi-view third-person, infrared • sensors: light, acceleration, gyro, human presence, magnetic, air pressure, humidity, room temperature • each sequence has one ego-view video plus at least one synchronized third-person view • 1752 untrimmed videos • train 1388, tests 198/166 • 5700 videos extracted from them • 75 activities, 453 atomic actions • one activity per video • each frame carries multi-label atomic actions (2-5 s) • Annotations • atomic actions: start, end, category • train 20k, tests 2.1k/2.5k • 583k bboxes on 3-5 uniformly sampled frames • 86 object classes, 29 relation classes [Rai+, CVPR2021]
64. 64. Charades • The English word for the guessing game • pronounced shuh-RAYDZ (US) or shuh-RAHDZ (UK) • Videos • 9848 thirty-second videos • train 7985, test 1863 • (intervals: train 49k, test 17k) • 157 categories • combinations of 46 objects, 30 actions (verbs), and 15 indoor scenes • Annotations • action intervals (mean 12.8 s, 6.8 actions per video on average) • descriptions • multi-label: intervals of different categories coexist in one video • the paper's point: YouTube videos are often atypical, whereas Charades shows typical everyday activities [Sigurdsson+, ECCV2016]
65. 65. Charades • Construction: the "Hollywood in Homes" approach • the Hollywood filming process is delegated to AMT workers • turn plausible situations in the 15 indoor scene types into scripts • given 5 random objects and 5 random actions, pick 2 of each and describe what one or two people would do at home • shoot a 30-second video of that script yourself at home • workers do not sign up at low pay, so pricing tricks were needed • all scripts were parsed to select the 157 action classes • annotation (writing the descriptions) • objects are extracted from the video automatically • a worker judges from those objects, roughly 5 related actions, and the script • another worker then annotates the action intervals [Sigurdsson+, ECCV2016]
66. 66. Charades-Ego • Understanding paired first-person and third-person videos • built with the same procedure as Charades • Recording • via AMT, 112 participants • each worker places a camera somewhere and records in third person, then repeats the same content in first person (camera mounted on the forehead) • same person and same environment guaranteed • scripts borrowed from Charades • 157 categories (same as Charades) • 4000 video pairs (364 pairs featuring 2+ people), mean 31.2 s [Sigurdsson+, CVPR2018]
67. 67. EPIC-KITCHENS-55 • Egocentric kitchen videos • 55 hours, 32 participants in 4 cities • recording starts on entering the kitchen and stops on leaving • mean video length 1.7 h, max 4.6 h • 13.6 videos per participant on average • no prescribed activities; recorded over 3 consecutive days with a GoPro • participants first annotate roughly by voice • then AMT labeling • at least 0.5 s, overlap with neighboring intervals allowed • 39.6k action intervals, 454.3k object bboxes • actions: 125 verb classes • objects: 331 noun classes [Damen+, ECCV2018] [Damen+, TPAMI2021]
68. 68. EPIC-KITCHENS-100 • An extension of EPIC-KITCHENS-55 • the EK55 videos were not reused; 100 new hours were recorded (45 participants) • narration by the participants • in EK55 it was dubbed over the video afterwards • in EK100 recording is paused for narration • 90.0k action intervals, 454.3k object bboxes • actions: 97 verb classes • objects: 300 noun classes • masks added by Mask R-CNN • 38M objects, 31M hands • hand-object interaction labels • [Shan+, CVPR2020] [Damen+, IJCV2021]
69. 69. MPII Cooking datasets • Max Planck Institute for Informatics (MPII) • MPII Cooking Activities [Rohrbach+, CVPR2012] • MPII Cooking Composite Activities [Rohrbach+, ECCV2012] • MPII Cooking 2 [Rohrbach+, IJCV2016]
70. 70. MPII Cooking Activities • MPII Cooking • small inter-class differences (small body motions) • the first dataset with interval annotations • 5609 intervals (3824 in later papers?) • a background label is assigned automatically when nothing happens for 1+ s • 65 cooking activities • 14 dishes, 12 participants • 44 videos (8 hours, 881k frames in total) • 3-41 min per video • 2.2k frames also have 2D pose annotations (train 1071, test 1277) • 1624x1224, 29.4 fps [Rohrbach+, CVPR2012]
71. 71. MPII Cooking Composite Activities • MPII Composites • a higher-level activity (a dish) is a combination of basic-level activities (steps) • the steps (scripts) were created first • reason: without given steps, behavior varies far too much • workers wrote the cooking procedures • 53 tasks, up to 15 steps each • 2124 scripts • participants then cooked following the steps • Categories • 41 dishes (composite) • 218 steps (attributes) • 212 videos, 22 participants • 1-23 min per video • 8818 intervals [Rohrbach+, ECCV2012]
72. 72. MPII Cooking 2 • An extension of MPII Composites • the Cooking and Composites categories were consolidated • 273 videos (27 hours in total) • train 201, val 17, test 42 • 1-41 min per video • 14k intervals • 8 synchronized cameras (though only 1 is used) • Categories • 59 dishes (composite) • 222 steps (attributes) [Rohrbach+, IJCV2016]
73. 73. YouCook • Descriptions at the video level • for video captioning • Source: YouTube videos • 88 videos, third-person view • train 49, test 39 • 6 cooking styles • Captions • at least 3 sentences per video, 8 on average • each description has at least 15 words in total; sentences average 10 words • per-video totals average 67 words [Das+, CVPR2013]
74. 74. YouCook2 • For analyzing procedural videos • interval annotations plus a description of each interval's content • complex procedures cannot be expressed by action labels alone • 2000 videos (176 hours), 89 recipes • train 1333, val 457, test 210 • source: YouTube videos • egocentric views excluded • mean video length 5.27 min, max 10 min • 3-16 intervals (steps) per video • interval length 1-264 s, mean 19.6 s • descriptions are at most 20 words [Zhou+, README] [Zhou+, AAAI2018]
75. 75. Breakfast [Kuehne+, CVPR2014] • Recognition of procedures (activities) • 10 recipes (coffee, etc.) • only the recipe is given; how to make it is left to the participants (unscripted) • 52 participants, 18 kitchens • multiple cameras: 3-5 depending on location • including stereo cameras • 320x240, 15 fps, 77 hours in total • each video is a few minutes (roughly 1-5 min) • Two kinds of annotation • coarse: 48 classes, e.g. "pour milk" • fine: e.g. "grab milk, twist cap, open cap"
76. 50 Salads n Recognizing procedures (activities) • only one recipe: salads • activities: 3 high-level, 17 low-level classes • each split into three phases • pre, core (about 60%), post • 27 participants, 50 videos (4–8 min each) • participants followed the presented steps (quantities unspecified, but one serving) • each participant recorded twice n Sensing • ceiling-mounted RGB-D camera (Kinect), 640x480, 30 Hz • accelerometer attached to each utensil (50 Hz) [Stein&McKenna, UbiComp13] [Stein&McKenna, CVIU2017] Figure 1. Snapshot from the dataset. Data from an RGB-D camera and from accelerometers attached to kitchen objects were recorded while people prepared mixed salads.
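Classification-level (late) fusion of the two modalities, one of the fusion stages studied with this dataset, can be as simple as averaging per-class scores. The scores and class names below are made up for illustration:

```python
# Late fusion: average per-class scores from a video-based classifier
# and an accelerometer-based classifier, then predict the argmax.
CLASSES = ["cut_tomato", "mix_dressing", "peel_cucumber"]

video_scores = [0.5, 0.2, 0.3]
accel_scores = [0.2, 0.1, 0.7]

def late_fusion(*score_lists):
    """Element-wise mean of several per-class score vectors."""
    n = len(score_lists)
    dim = len(score_lists[0])
    return [sum(s[i] for s in score_lists) / n for i in range(dim)]

fused = late_fusion(video_scores, accel_scores)
predicted = CLASSES[fused.index(max(fused))]
print(predicted)  # accelerometer evidence flips the decision
```

Here the video classifier alone would predict "cut_tomato", but combining it with the accelerometer scores changes the decision — the kind of gain the paper reports from sensor fusion.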
77. HowTo100M n Video–text pairs • the text is automatically generated subtitles n Construction • 136M clips from 1.22M narrated YouTube instructional videos • untrimmed videos: 6.5 min on average, 15 years in total • >40 TB to download even at 360p • about 110 clips generated per video • each clip averages 4 s and 4 words • 23k visual tasks were first defined from WikiHow, then videos were searched • abstract tasks excluded • narration obtained with the automatic captioning feature • each subtitle line is paired with a clip • the alignment is not exact, hence "weakly paired" [Miech+, ICCV2019]
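The subtitle-to-clip pairing described above can be sketched as follows; the timestamps, texts, and `min_duration` filter are illustrative, and the real pipeline involves additional filtering:

```python
# Turn auto-generated subtitles into weakly paired clips: each subtitle
# line with its display interval becomes one (start, end, text) clip.
subtitles = [
    (0.0, 3.5, "first we peel the garlic"),
    (3.5, 8.1, "then chop it finely"),
    (8.1, 12.0, "add it to the hot pan"),
]

def subtitles_to_clips(subs, min_duration=1.0):
    clips = []
    for start, end, text in subs:
        if end - start >= min_duration:  # drop degenerate intervals
            clips.append({"start": start, "end": end, "text": text})
    return clips

clips = subtitles_to_clips(subtitles)
print(len(clips), clips[0]["text"])
```

Because the narration may describe something slightly before or after it appears on screen, the resulting clip–text pairs are only weakly aligned — the property the slide calls "weakly paired".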
78. Olympic Sports n Specialized for sports • 800 YouTube videos • 16 categories • 50 videos per class • train 40, test 10 n Distribution format • seq video format (Piotr's Computer Vision Matlab Toolbox) [Niebles+, ECCV2010]
79. Sports-1M n 1M YouTube videos • 487 categories • 1000–3000 videos per class • train 70%, val 20%, test 10% • only about 0.18% of videos are duplicated • automatic annotation • based on YouTube metadata • 5% of videos carry multiple labels (multi-label) • untrimmed videos • only URLs are distributed [Karpathy+, CVPR2014] Figure 4: Predictions on Sports-1M test data.
80. FineGym n Fine-grained recognition with hierarchies in both time and labels • label hierarchy • event: name of the competition event • set: a group of elements • element: name of a skill (e.g., double salto) • temporal hierarchy • action: corresponds to an event • sub-action: corresponds to an element n Dataset • 10 events (6 men's, 4 women's) • 530 elements (in gym530) • versions: v1.0, v1.1 • class sets: gym99, gym288, gym530 [Shao+, CVPR2020] FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
81. Diving48 n Fine-grained recognition of competitive diving • 48 categories: combinations of • takeoff, movements in flight (somersaults / twists), entry • follows the rules of FINA (Fédération internationale de natation) • videos collected from the web • automatically segmented into 18k clips • train 16k, test 2k • annotated via AMT • dive type and difficulty • start and end frames n Motivation • eliminating representation bias • object, scene, person • diving has internationally standardized categories, small inter-class differences, and similar appearances • a task that cannot be solved from objects or scenes alone [Li+, ECCV2018]
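The compositional label structure can be illustrated by enumerating (takeoff, flight, entry) combinations. The vocabularies below are placeholders, not FINA's actual code table:

```python
# A Diving48-style label is one combination of takeoff, in-flight
# movement, and entry; enumerate all triples with itertools.product.
from itertools import product

TAKEOFFS = ["forward", "backward", "reverse", "inward"]
FLIGHTS = ["1.5 somersault", "2.5 somersault", "1 twist"]
ENTRIES = ["tuck", "pike"]

dive_classes = [" / ".join(c) for c in product(TAKEOFFS, FLIGHTS, ENTRIES)]
print(len(dive_classes))  # 4 * 3 * 2 = 24 combinations
```

Because every class is a triple of fine-grained motion attributes rather than a distinct object or scene, a model has to look at the motion itself — the representation-bias argument above.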
82. Ego-4D n First-person (egocentric) video (Ego4D-3K) • 3,670 hours in total, 931 participants, 74 cities in 9 countries • daily-life activities • recording continues until the camera battery runs out (1–10 hours per wearer) • 8 min per clip on average • multimodal • cameras: RGB, stereo, multi-view • images: faces, gaze direction, 3D scans • narration text, audio, IMU n 250k hours of annotation • narration • a procedure similar to EPIC-Kitchens; 13.2 sentences per minute on average, 3.85M sentences in total n Task-specific annotations • Episodic memory: 110 activities • Hands and Objects: bounding boxes of hands and objects with relation labels, timestamps, object states • Audio-Visual Diarization: face bounding boxes, person labels, utterance times and transcriptions • Social interaction: face bounding boxes, person IDs, per-frame speaker identification, whether the person looks at / talks to the camera wearer [Grauman+, CVPR2022] Figure 1. Ego4D is a massive-scale egocentric video dataset of daily life activity spanning 74 locations worldwide. Project page: https://ego4d-data.org/ See also: Ego4D project introduction (SSII2022, Prof. Sato, The University of Tokyo)
83. YouTube-8M / Segments n YouTube-8M • labels assigned by automatic tagging, multi-label • 2016: 8.2M videos, 4800 classes, 1.8 labels/video • 2017: 7.0M videos, 4716 classes, 3.4 labels/video • 2018: 6.1M videos, 3862 classes, 3.0 labels/video n YouTube-8M Segments • 2019: 230K segments (about 46k videos), 1000 classes, 5 segments per video • five 5-second segments somewhere in each video • humans annotated the start time, end time (= start time + 5 s), and class label n Kaggle competitions held • 2017, 2018, 2019 [Abu-El-Haija+, arXiv2016] Figure 1: YouTube-8M is a large-scale benchmark for general multi-label video classification.
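The Segments annotation scheme — a labeled 5-second window inside a video, whose end time is determined as start + 5 — can be sketched as follows (the field names are assumptions, not the released file format):

```python
# One Segments-style annotation: a fixed-length window with a class
# label; only the start time needs to be stored.
SEGMENT_LENGTH = 5.0  # seconds, fixed by the annotation protocol

def make_segment(video_id, start, label):
    return {
        "video_id": video_id,
        "start": start,
        "end": start + SEGMENT_LENGTH,
        "label": label,
    }

seg = make_segment("abc123", 42.0, "cooking")
print(seg["end"])  # 47.0
```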
84. YouTube-8M / Segments n Construction • classes are Google Knowledge Graph entities (formerly Freebase topics) • entities were selected that are recognizable from the visual content alone and have many videos • all of YouTube was searched; only videos with 1000+ views and a length of 120–500 s were kept • 10M videos sampled at random • classes with 200 or fewer videos removed, leaving 8M videos • labels are generated automatically • features • 500k hours (≥ 50 years) of video is impractical to handle, so only features are provided • for frames within the first 6 minutes • frame-level 2048-d Inception features compressed to 1024-d by PCA • 1024-d video-level aggregated features • from 2017: frame- and video-level 128-d audio features [Abu-El-Haija+, arXiv2016]
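Aggregating frame-level features into a single video-level vector can be sketched with a simple mean over frames; the released dataset uses its own aggregation, so this only illustrates the idea:

```python
# Collapse a sequence of per-frame feature vectors into one
# video-level vector by averaging each dimension over frames.
def mean_aggregate(frame_features):
    """frame_features: list of equally sized feature vectors (lists)."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n_frames for d in range(dim)]

frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # two toy 3-d frame features
print(mean_aggregate(frames))  # [2.0, 2.0, 2.0]
```

In the actual dataset the vectors are 1024-d (visual) and 128-d (audio), but the aggregation principle is the same: one fixed-size vector per video regardless of its length.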
85. YFCC-100M n Yahoo Flickr Creative Commons 100 Million Dataset • 99.2M images, 0.8M videos • a multimedia dataset n Annotation • human: 68M images, 0.4M videos • automatic: 3M images, 7k videos [Thomee+, Comm. ACM, 2016]
86. Action Recognition Models
87. Action Recognition models n Timeline (2012–2022) of methods by family • Non-Deep: DT, IDT • 2D CNN + 1D aggregation: Two Stream, TSN, TSM • Restricted 3D ((2+1)D CNN): P3D, S3D, R(2+1)D, X3D • Full 3D CNN: C3D, 3D ResNet, I3D, R3D, Non-Local, SlowFast • Vision Transformer: ViViT, TimeSformer, STAM, Video Transformer Network, VidTr, X-ViT, TokenShift, VideoSwin • Related milestones: ImageNet (2012), ResNet, U-Net, GAN, Kinetics, ViT
88. IDT n The state of the art before deep learning • an improvement of Dense Trajectories (DT) [Wang+, IJCV, 2013] • BoF of descriptors (HOG, HOF, etc.) around densely sampled and tracked feature points • Improved DT (IDT) • compensates for camera motion • excludes human regions from the motion estimation • Fisher vectors in addition to BoF • uses power normalization n Still used in the early deep era • adding IDT to CNN predictions was a common way to boost performance [Wang&Schmid, ICCV2013] [Wang+, IJCV, 2013] Fig. 2 Illustration of our approach to extract and characterize dense trajectories.
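Dense sampling, the first step of (I)DT, places feature points on a regular grid with spacing W = 5 pixels at each spatial scale. A sketch of the grid for one scale (the scale pyramid and the optical-flow tracking step are omitted):

```python
# Sample feature points on a regular grid, spaced W pixels apart,
# covering the whole frame at one spatial scale.
W = 5  # grid spacing in pixels, as in the DT paper

def dense_grid(width, height, step=W):
    """Return (x, y) positions of densely sampled points."""
    return [(x, y)
            for y in range(0, height, step)
            for x in range(0, width, step)]

points = dense_grid(320, 240)
print(len(points))  # 64 * 48 = 3072 points on a 320x240 frame
```

In the full method, each of these points is tracked for L frames through a dense optical-flow field (with median filtering), and descriptors are computed along the resulting trajectory.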
