[DL Reading Group] Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos

Deep Learning JP

2023/3/3 Deep Learning JP http://deeplearning.jp/seminar-2/

DEEP LEARNING JP
[DL Papers]
Bridge-Prompt: Toward Ordinal Action Understanding in
Instructional Videos (CVPR 2022)
Yoshifumi Seki
http://deeplearning.jp/

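Bibliographic information
● Venue
○ CVPR 2022
● Authors' affiliation
○ Tsinghua University
● Why this paper was chosen
○ I have recently been working on action analysis from video
https://github.com/ttlmh/Bridge-Prompt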

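Background and goal
● We want to analyze actions in video well
● Actions are sequential
○ e.g. drinking water
■ pick up a cup -> pour water -> drink the water
○ e.g. eating bread
■ spread butter -> spread jam -> eat the bread
● We want to build this sequential structure into the model
○ Several graph-based models have appeared recently, but they cannot handle unseen labels
● Use prompt engineering to exploit the strengths of large language models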

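What is prompt engineering?
● The idea of inserting a given input (e.g. label information) into a template so that it is fed to the model as a well-formed sentence, which lets us benefit from a large language model
● Used in GPT-3's few-shot learning mechanism
● Used by OpenAI's CLIP for text-image matching in image classification
● ActionCLIP applies it to video as well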
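
As a rough illustration of the template idea above, the sketch below fills class labels into a sentence template. The default template `a photo of a {label}.` is the one popularized by CLIP for image classification. The function name `labels_to_prompts` and the example labels are made up for this illustration and do not come from the slides.

```python
def labels_to_prompts(labels, template="a photo of a {label}."):
    """Turn bare class labels into natural-language sentences."""
    return [template.format(label=label) for label in labels]


if __name__ == "__main__":
    # Hypothetical labels, only to show the text that would reach a text encoder.
    for sentence in labels_to_prompts(["dog", "cat"]):
        print(sentence)
```

The same function covers video labels by passing a different template string, which is the only part that changes between tasks.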

ActionCLIP
● Generates sentences from labels via prompt engineering, then estimates the label by measuring the similarity between the outputs of a text encoder and a video encoder
https://arxiv.org/abs/2109.08472
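
The label estimation described above can be sketched as a nearest-neighbor lookup in embedding space. This is a minimal sketch, assuming cosine similarity as the similarity measure and assuming the encoders have already produced fixed-length vectors. The toy 3-dimensional embeddings and the names `cosine_similarity` and `predict_label` are hypothetical stand-ins, not the ActionCLIP code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_label(video_embedding, text_embeddings):
    """Return the label whose prompt embedding is most similar to the video.

    text_embeddings maps each label to the embedding of its prompt sentence.
    """
    return max(
        text_embeddings,
        key=lambda label: cosine_similarity(video_embedding, text_embeddings[label]),
    )


if __name__ == "__main__":
    # Toy vectors; real text and video encoders would produce these.
    text_embeddings = {
        "making coffee": [0.9, 0.1, 0.0],
        "making tea": [0.1, 0.9, 0.0],
    }
    video_embedding = [0.2, 0.8, 0.1]
    print(predict_label(video_embedding, text_embeddings))
```

Because the label set only enters through the text embeddings, a new label can be scored by encoding one more prompt sentence, without changing the classifier itself.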

Details of the prompts
● 1. Statistical Prompt
○ How many actions are in the video
○ The video has {num} actions.
● 2. Ordinal Prompt
○ Which action in the sequence this is
○ This is the {ord_i} action in the video.
● 3. Semantic Prompt
○ “{ord_i}, the person is performing the action step of {vp_i}”
● 3+1. Integrated Prompt
○ Everything
○ All the semantic prompts lined up as sentences
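
The four prompt types above can be sketched as plain string templates. The template wording is taken from the slide. Everything else is an assumption for illustration: the `ORDINALS` word list, rendering `{num}` as a digit, capitalizing the ordinal word in the semantic prompt, and joining the semantic prompts with periods to form the integrated prompt. It is not the authors' implementation.

```python
# Hypothetical ordinal word list; extend it for longer action sequences.
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def build_prompts(actions):
    """Build the four prompt types for an ordered list of action phrases."""
    if len(actions) > len(ORDINALS):
        raise ValueError("extend ORDINALS to cover longer action sequences")

    # 1. Statistical prompt: how many actions the video contains.
    statistical = f"The video has {len(actions)} actions."

    # 2. Ordinal prompts: the position of each action in the sequence.
    ordinal = [
        f"This is the {ORDINALS[i]} action in the video."
        for i in range(len(actions))
    ]

    # 3. Semantic prompts: position plus the action phrase itself.
    semantic = [
        f"{ORDINALS[i].capitalize()}, the person is performing the action step of {vp}"
        for i, vp in enumerate(actions)
    ]

    # 3+1. Integrated prompt: all semantic prompts lined up as sentences.
    integrated = " ".join(sentence + "." for sentence in semantic)

    return {
        "statistical": statistical,
        "ordinal": ordinal,
        "semantic": semantic,
        "integrated": integrated,
    }


if __name__ == "__main__":
    prompts = build_prompts(["take cup", "pour water", "drink water"])
    print(prompts["statistical"])
    print(prompts["ordinal"][0])
    print(prompts["integrated"])
```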

Evaluation datasets
● 50Salads: 50 top-view 30-fps instructional videos of salad preparation
○ 19 kinds of actions
● Georgia Tech Egocentric Activities (GTEA): 28 egocentric 15-fps instructional
videos of daily kitchen activities
○ 74 classes of actions
● Breakfast: 1,712 third-person 15-fps videos of breakfast preparation activities
○ 48 different types of actions

Implementation
● Videos are split into 16-frame segments
● Pretrained on Kinetics-400 using ActionCLIP
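
The 16-frame split can be sketched as simple chunking. The slide only states the segment length, so the non-overlapping windows and the shorter final segment are assumptions of this sketch, and `split_into_clips` is a hypothetical helper name.

```python
def split_into_clips(frames, clip_len=16):
    """Split a frame sequence into consecutive clips of clip_len frames.

    The last clip is shorter when len(frames) is not a multiple of clip_len.
    """
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


if __name__ == "__main__":
    frames = list(range(40))  # stand-in for 40 decoded video frames
    for clip in split_into_clips(frames):
        print(len(clip), clip[0], clip[-1])
```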

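Handling unseen IDs
● If only certain actions are learned during fine-tuning, can the model recognize similar actions?
○ coffee2tea: fine-tune only on making coffee, then check whether making tea is recognized
○ AKL is the overall accuracy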

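Summary and impressions
● I learned for the first time that prompt engineering is being used outside NLP, which was instructive
● It was disappointing that these experiments did not make clear what imposing an order actually contributes
● Handling unseen IDs is impressive, but I doubt whether this experimental setup is an appropriate way to measure it
● I would have liked to read more from the results about how it differs from existing models
○ Accuracy alone does not show where it improved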

Figure-only slides (slide number and title)
5. CLIP (ICML 2021), from the 2021/1/15 presentation
6. CLIP (ICML 2021), from the 2021/1/15 presentation
8. Proposed method
9. Overview of the proposed method
15. Comparison on long-term videos
17. Comparison of fusion modules