MediaEval 2018: Fine grained sport action recognition: Application to table tennis
1. MediaEval 2018:
Fine grained sport action recognition.
Application to Table Tennis.
Pierre-Etienne Martin
Univ. Bordeaux, LaBRI
Jenny Benois-Pineau
Univ. Bordeaux, LaBRI
Renaud Péteri
Univ. La Rochelle, MIA
Julien Morlier
Univ. Bordeaux, IMS
2. Goal of sport video analysis
Improve athletes' performance by providing tools for teachers and athletes
Sport video analysis today: improving the performance of athletes, efficient coaching, study of competitors
MediaEval 2018 EURECOM, Sophia Antipolis
3. 1- Goal
- Extract strokes in the temporal dimension
- Classify the strokes
[Figure: input video along time t; output: labelled stroke, e.g. "Offensive Forehand Loop"]
4. 2 - Related Work
- Use of Dynamic Images [1]
- Very deep 3D CNN [2]
- Long-term Temporal Convolutions [3]
Popular corpus: UCF-101
[1] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, “Action recognition with dynamic image networks,” CoRR, vol. abs/1612.00738, 2016.
[2] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, vol. abs/1705.07750, 2017.
[3] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2018.
5. Corpus TTStroke-21
129 videos at 120 fps
1 387 / 1 074 annotations before / after filtering for 20 classes
Total of 1 048 strokes extracted
Player handedness (number of videos):
Left: 28
Right: 101
Annotation length:
Min: 76 frames (0.63 s)
Max: 272 frames (2.27 s)
Mean: 173 ± 43.76 frames (1.45 ± 0.36 s)
Action length:
Min: 99 frames (0.82 s)
Max: 276 frames (2.30 s)
Mean: 174 ± 43.14 frames (1.46 ± 0.36 s)
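The durations in seconds follow from the frame counts and the 120 fps acquisition rate. A minimal sketch of that conversion is given here; the function name `frames_to_seconds` is illustrative and not part of any tool provided with the corpus.

```python
FPS = 120  # acquisition frame rate stated for the corpus


def frames_to_seconds(n_frames, fps=FPS):
    """Convert a length in frames to a duration in seconds."""
    return n_frames / fps


# Shortest and longest annotations listed above, in seconds
shortest = round(frames_to_seconds(76), 2)
longest = round(frames_to_seconds(272), 2)
```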
Acquisition
Annotation platform
Samples TTStroke-21
MIA/ULr, LABRI/UBx, IMS/UBx
CRISP project
Corpus under continuous extension: recording and annotation sessions once a month
6. Our Research [6] (1)
[Figure panels: original frame; motion estimation [4]; foreground estimation [5]; foreground motion]
[4] C. Liu, “Beyond pixels: Exploring new representations and applications for motion analysis,” Ph.D. dissertation, Massachusetts Institute of Technology, 2009.
[5] Z. Zivkovic and F. van der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern Recognition Letters, vol. 27, no. 7, pp. 773–780, 2006.
[6] P.-E. Martin, J. Benois-Pineau, R. Péteri, and J. Morlier, “Sport action recognition with Siamese spatio-temporal CNNs: Application to table tennis,” in Proceedings of the International Conference on Content-Based Multimedia Indexing (CBMI), La Rochelle, France, 2018.
Principle: extract the moving players first, then classify the action
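One plausible reading of "foreground motion" is the motion field kept only where the foreground estimate is active. The sketch below illustrates that reading with NumPy only; it assumes a dense flow field of shape (H, W, 2) and a binary foreground mask of shape (H, W) have already been computed by some motion estimator and background subtractor. The function name and array layout are assumptions, not the authors' implementation.

```python
import numpy as np


def foreground_motion(flow, fg_mask):
    """Keep motion vectors only at foreground pixels.

    flow:    array of shape (H, W, 2), per-pixel motion vectors (assumed layout)
    fg_mask: array of shape (H, W), non-zero where the pixel is foreground
    """
    flow = np.asarray(flow, dtype=float)
    mask = np.asarray(fg_mask).astype(bool)
    # Broadcast the mask over the two flow components
    return flow * mask[..., None]


# Toy example: uniform motion, foreground on the diagonal only
flow = np.ones((2, 2, 2))
mask = np.array([[1, 0], [0, 1]])
fg_flow = foreground_motion(flow, mask)
```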
7. Our Research (2)
Spatial segmentation using foreground motion
Final segmentation
Smoothing over the temporal dimension using a Gaussian kernel of size 40 and standard deviation 4.44.
[Figure labels: Xmax, Xg, Xroi]
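The temporal smoothing can be sketched as a 1D convolution of a per-frame coordinate (for instance the horizontal position of the region of interest) with a normalized Gaussian kernel of size 40 and standard deviation 4.44. The edge padding and the function names below are assumptions; the slides do not say how the sequence borders are handled.

```python
import numpy as np


def gaussian_kernel(size=40, sigma=4.44):
    """Normalized 1D Gaussian kernel centred on the window."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()


def smooth_track(xs, size=40, sigma=4.44):
    """Smooth a per-frame coordinate over time, keeping the same length."""
    xs = np.asarray(xs, dtype=float)
    k = gaussian_kernel(size, sigma)
    pad_left = size // 2
    pad_right = size - 1 - pad_left
    # Repeat the border values so the output has one value per input frame
    padded = np.pad(xs, (pad_left, pad_right), mode="edge")
    return np.convolve(padded, k, mode="valid")


# A noisy but stationary position is pulled towards its mean
rng = np.random.default_rng(0)
noisy = 50.0 + rng.normal(0.0, 5.0, size=200)
smoothed = smooth_track(noisy)
```

With an even kernel size the window cannot be perfectly centred on a frame, so the output is offset by half a frame; at 120 fps this is negligible for an illustration.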
8. Our Research (3)
Siamese Spatio-Temporal Convolutional Neural Network
(W, H, T) = (100, 120, 120)
Very deep 3D CNN [7]
[7] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, vol. abs/1705.07750, 2017.
9. Proposed tasks for MediaEval (1)
- Task n°1: Stroke recognition with known temporal boundaries
Given a set of clips with annotated stroke boundaries, classify each temporally segmented stroke according to the given taxonomy of 21 classes.
Find the class of each temporal segment.
[Figure: input: temporally segmented clip; output: stroke label, e.g. "Offensive Forehand Loop"]
Dataset split: Train 80%, Test 20%
Evaluation metric: Accuracy
Tool provided: XML reader for annotation extraction
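The 80% / 20% split and the accuracy metric can be sketched as follows. This is a minimal illustration, assuming a plain random split with a fixed seed; the slides do not specify whether the actual split is stratified by class or by player, and the function names are hypothetical.

```python
import random


def split_train_test(samples, train_ratio=0.8, seed=0):
    """Randomly split samples into train and test lists (assumed: no stratification)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, expected):
    """Fraction of segments whose predicted class equals the annotated class."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


train, test = split_train_test(range(10))
score = accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"])
```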
10. Proposed tasks for MediaEval (2)
- Task n°2: Stroke recognition with temporal decision, allowing a 10% error range on the temporal borders
Given a set of clips with annotated stroke boundaries, recognize strokes that are NOT temporally segmented, according to the given taxonomy.
Perform on a test set which is neither temporally segmented nor labelled.
[Figure: input: video along time t; output: stroke label, e.g. "Offensive Forehand Loop", with its temporal borders t1 and t2]
Dataset split: Train 80%, Test 20%
Evaluation metric: Accuracy
Tool provided: XML reader for annotation extraction
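The slides do not define how the 10% error range is measured. The sketch below assumes one possible interpretation: each predicted border may deviate from the annotated border by at most 10% of the annotated stroke duration. Both the interpretation and the function name are assumptions.

```python
def within_tolerance(predicted, annotated, tolerance=0.10):
    """Check predicted (t1, t2) borders against annotated ones.

    Assumption: the allowed deviation on each border is
    `tolerance` times the annotated stroke duration.
    """
    (p_start, p_end), (a_start, a_end) = predicted, annotated
    margin = tolerance * (a_end - a_start)
    return abs(p_start - a_start) <= margin and abs(p_end - a_end) <= margin


# Annotated stroke of 174 frames: each border may move by up to 17.4 frames
accepted = within_tolerance((105, 270), (100, 274))
rejected = within_tolerance((130, 274), (100, 274))
```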
11. Proposed tasks for MediaEval
➔ Data format tools supplied:
➔ - a .dtd file for the expected stroke output XML file
➔ - allows validation of XML files with simple tools such as Eclipse
➔ <!ELEMENT stroke (TimeStart, TimeEnd, Label)>
➔ ....
➔ <!ELEMENT Label (#PCDATA)>
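An output file using the element names shown in the DTD excerpt (stroke, TimeStart, TimeEnd, Label) could be written and read back with Python's standard library, as sketched below. The root element name `strokes`, the use of integer frame indices for the times, and the function names are assumptions, since the full DTD is not shown.

```python
import xml.etree.ElementTree as ET


def strokes_to_xml(strokes, root_tag="strokes"):
    """Serialize (start, end, label) tuples; root_tag is a hypothetical name."""
    root = ET.Element(root_tag)
    for start, end, label in strokes:
        stroke = ET.SubElement(root, "stroke")
        ET.SubElement(stroke, "TimeStart").text = str(start)
        ET.SubElement(stroke, "TimeEnd").text = str(end)
        ET.SubElement(stroke, "Label").text = label
    return ET.tostring(root, encoding="unicode")


def xml_to_strokes(xml_text):
    """Parse the XML back into (start, end, label) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (int(s.findtext("TimeStart")), int(s.findtext("TimeEnd")), s.findtext("Label"))
        for s in root.iter("stroke")
    ]


detections = [(100, 274, "Offensive Forehand Loop")]
xml_text = strokes_to_xml(detections)
round_trip = xml_to_strokes(xml_text)
```

Validation against the supplied .dtd file would still be done with an external tool, as `xml.etree.ElementTree` does not validate DTDs.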