Summary slides for "Detecting Attended Visual Targets in Video", presented at the 3rd All Japan Computer Vision Study Group (part two). The paper tackles the task of estimating which target a person appearing in a video is paying attention to. If you are interested in computer vision or cognitive science, please take a look.
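To make the task concrete before the slides begin, here is a minimal sketch of the kind of input/output interface used for attended-target estimation: a full scene frame plus a head crop of one person go in, and a heatmap over attended locations plus an in-frame probability come out. The two-branch layout and all layer sizes below are illustrative assumptions, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class GazeTargetSketch(nn.Module):
    """Toy two-branch model: scene frame + head crop -> attention heatmap.

    Hypothetical simplification of the task interface; layer sizes are
    placeholders, not the network from the paper.
    """
    def __init__(self):
        super().__init__()
        self.scene_enc = nn.Sequential(             # encodes the full frame
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head_enc = nn.Sequential(              # encodes the cropped head image
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.decoder = nn.Conv2d(32 + 16, 1, 1)     # fuse branches, predict a heatmap
        self.in_frame = nn.Linear(16, 1)            # is the attended target inside the frame?

    def forward(self, frame, head_crop):
        s = self.scene_enc(frame)                        # (B, 32, H/4, W/4)
        h = self.head_enc(head_crop)                     # (B, 16, 1, 1)
        h_map = h.expand(-1, -1, s.size(2), s.size(3))   # broadcast the head feature spatially
        heatmap = self.decoder(torch.cat([s, h_map], 1))
        inside = torch.sigmoid(self.in_frame(h.flatten(1)))
        return heatmap, inside

# quick shape check
model = GazeTargetSketch()
hm, p = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 64, 64))
print(hm.shape, p.shape)   # torch.Size([1, 1, 56, 56]) torch.Size([1, 1])
```

Training would supervise the heatmap against an annotated gaze-target location and the in-frame score against whether the target is visible; those details are omitted here.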
cvpaper.challenge Meta Study Group presentation slides
cvpaper.challenge is a challenge to capture the current state of the computer vision field and to create its trends. We work on paper summaries, idea generation, discussion, implementation, and paper submission, and share all of the resulting knowledge. Goals for 2019: "submit 30+ papers to top conferences" and "comprehensive surveys of top conferences, twice or more".
http://xpaperchallenge.org/cv/
Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked prediction tasks that predict masked patches of the input image (a minimal sketch follows this list).
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image.
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
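To make item 1 above concrete, here is a toy masked-patch prediction objective in the spirit of masked-image-modeling methods such as SimMIM (masked patch embeddings are replaced by a learned mask token and the original pixels are reconstructed). The patch size, masking ratio, and tiny encoder are illustrative assumptions rather than the settings of any particular paper.

```python
import torch
import torch.nn as nn

patch, dim = 16, 64
img = torch.randn(2, 3, 224, 224)                           # a small batch of images

# 1. patchify the images: (B, N, patch*patch*3)
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(img.size(0), -1, 3 * patch * patch)

# 2. embed the patches and randomly mask 75% of them
embed = nn.Linear(3 * patch * patch, dim)
tokens = embed(patches)                                     # (B, N, dim)
mask = torch.rand(tokens.shape[:2]) < 0.75                  # True = masked
mask_token = nn.Parameter(torch.zeros(dim))
tokens = torch.where(mask.unsqueeze(-1), mask_token, tokens)

# 3. tiny transformer encoder plus a pixel-reconstruction head
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(dim, 3 * patch * patch)
pred = head(encoder(tokens))                                # (B, N, patch*patch*3)

# 4. the loss is computed only on the masked patches
loss = ((pred - patches) ** 2)[mask].mean()
print(loss.item())
```

Contrastive (MoCo-style) and self-distillation (DINO-style) objectives replace step 4 with a loss between embeddings of two augmented views of the image rather than a pixel reconstruction.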
This document summarizes several datasets for image captioning, video classification, action recognition, and temporal localization. It describes the purpose, collection process, annotation format, examples, and references for datasets including MS COCO, Visual Genome, Flickr8K/30K, Kinetics, Charades, AVA, STAIR Captions, and STAIR Actions. The datasets vary in scale from thousands to millions of images/videos and cover a wide range of tasks from image captioning to complex activity recognition.
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to an image divided into patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
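As a rough sketch of the Perceiver-style idea mentioned above, the snippet below lets a small learned latent array cross-attend to an arbitrarily long flattened input, so the attention cost grows with (number of latents x input length) rather than with the input length squared. The dimensions and the single attention layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

dim, n_latents = 64, 32
latents = nn.Parameter(torch.randn(n_latents, dim))       # learned latent array
to_kv = nn.Linear(3, dim)                                  # project raw input (here: RGB pixels)
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

x = torch.randn(2, 224 * 224, 3)                           # an image flattened to ~50k "tokens"
kv = to_kv(x)                                              # (B, 50176, dim)
q = latents.unsqueeze(0).expand(x.size(0), -1, -1)         # (B, 32, dim)
out, _ = cross_attn(q, kv, kv)                             # 32 latents attend to 50176 inputs
print(out.shape)                                           # torch.Size([2, 32, 64])
```

A full Perceiver stacks such cross-attention blocks with self-attention among the latents; the same recipe applies to audio or point clouds because nothing above depends on image structure.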
CVPR 2019 Paper Reading Group (Rethinking the Evaluation of Video Summaries), Yasunori Ozaki
Slides explaining "Rethinking the Evaluation of Video Summaries", presented at the CVPR 2019 paper reading group. The paper itself analyzes video summarization as a whole and was a substantial read. I am not entirely sure whether these explanation slides are accurate, so please ask the authors directly for details. Thank you.
Slides for the 6th All Japan Computer Vision Study Group: UniT (former title: Transformer is all you need), Yasunori Ozaki
These are slides for the 6th All Japan Computer Vision Study Group. This time I introduce "UniT: Multimodal Multitask Learning with a Unified Transformer". The proposed method, UniT, is a Transformer that can solve natural language, vision, and vision-and-language tasks in a unified way.
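A schematic of what such a unified multitask Transformer can look like, as a sketch only: per-modality encoders feed a shared decoder that is queried with task-specific embeddings, and small task heads read out the results. The two example tasks, layer sizes, and output dimensions are placeholders, not the configuration reported for UniT.

```python
import torch
import torch.nn as nn

dim = 64
img_enc = nn.Linear(2048, dim)          # stands in for a conv backbone + image encoder
txt_enc = nn.Embedding(1000, dim)       # stands in for a BERT-style text encoder
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
task_queries = nn.ParameterDict({
    "detection": nn.Parameter(torch.randn(100, dim)),    # one query per candidate box
    "vqa":       nn.Parameter(torch.randn(1, dim)),      # a single query for the answer
})
heads = nn.ModuleDict({
    "detection": nn.Linear(dim, 4 + 80),    # box coordinates + class logits (placeholder sizes)
    "vqa":       nn.Linear(dim, 3000),      # answer vocabulary (placeholder size)
})

def run_task(task, memory):
    q = task_queries[task].unsqueeze(0).expand(memory.size(0), -1, -1)
    return heads[task](decoder(q, memory))

img_feats = img_enc(torch.randn(2, 49, 2048))              # a 7x7 grid of image features
txt_feats = txt_enc(torch.randint(0, 1000, (2, 12)))       # a 12-token question
print(run_task("detection", img_feats).shape)                        # (2, 100, 84)
print(run_task("vqa", torch.cat([img_feats, txt_feats], 1)).shape)   # (2, 1, 3000)
```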
Learning from and teaching in communities
Can we bring "Software Carpentry" to Japan?
Presentation in English (with slides in English and Japanese)
#TokyoR 73rd Meeting, 2018-10-20
Tom Kelly (RIKEN IMS, Yokohama, Japan)
This document provides an overview of POMDP (Partially Observable Markov Decision Process) and its applications. It first defines the key concepts of POMDP such as states, actions, observations, and belief states. It then uses the classic Tiger problem as an example to illustrate these concepts. The document discusses different approaches to solve POMDP problems, including model-based methods that learn the environment model from data and model-free reinforcement learning methods. Finally, it provides examples of applying POMDP to games like ViZDoom and robot navigation problems.
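As a worked example of the belief-state idea, the snippet below runs the Bayesian belief update for the Tiger problem after each "listen" action, using the commonly quoted 0.85 observation accuracy; since listening does not move the tiger, the transition term reduces to the identity.

```python
def update_belief(belief_left, heard_left, p_correct=0.85):
    """Return P(tiger-left | observation) given the prior P(tiger-left)."""
    p_obs_given_left = p_correct if heard_left else 1 - p_correct
    p_obs_given_right = 1 - p_correct if heard_left else p_correct
    numer = p_obs_given_left * belief_left
    denom = numer + p_obs_given_right * (1 - belief_left)
    return numer / denom

b = 0.5                                    # start with no idea where the tiger is
for heard_left in [True, True, False]:     # hear-left, hear-left, hear-right
    b = update_belief(b, heard_left)
    print(round(b, 3))                     # 0.85, 0.97, 0.85
```

A POMDP policy then maps this belief, rather than the hidden state, to actions, which is exactly why the Tiger agent keeps listening until the belief is confident enough to open a door.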
This is a report on IROS 2019, which was held alongside ROSCon 2019, and is the content I plan to present at the ROSCon 2019 report meeting. It focuses on how ROS is positioned within IROS. Thank you.
This presentation reports on part of IROS 2019 from the perspective of ROS. If you want to read this presentation in English, please see the version posted in the comments.
These slides summarize, one slide per paper, the papers from the "Interact with AI" session at CHI, a top conference in the interaction field. They cover topics such as explainable AI (XAI) and AI-assisted work support. This is part of what was presented at the CHI 2019 study group. If you would like to know more about the CHI 2019 study group, please see the following link.
http://study.hci.one/event/chi2019/program
Study group material for "Detecting Attended Visual Targets in Video"
1. Detecting Attended Visual Targets in Video
Eunji Chong¹, Yongxin Wang², Nataniel Ruiz³, James M. Rehg¹
¹Georgia Institute of Technology  ²Carnegie Mellon University  ³Boston University
Slides prepared by: Yasunori Ozaki
CyberAgent, Inc., AI Lab