TechnicalBackgroundOverview

Text, Image and Speech Understanding: It's All About Learning from Data
Motaz El Saban
Senior Applied Researcher, Microsoft T&R, Cairo Lab
Associate Professor, Faculty of Computers and Information, Cairo University

Technical Focus
• Analyzing, modeling, learning and predicting in various domains to make sense of
digital media content
• Early on: better analysis for video compression
• Ph.D.: automatic tracking and analysis of subcellular structures in time-lapse
videos
• Post Ph.D.
– Visual object recognition (NevenVision & Google)
– Textual web rankers (Arabic) (Microsoft)
– Analyzing content on social networks
– Mobile multimedia experiences
• Automatic panoramic video construction
• Automatic video tagging
– Object/scene recognition/detection (face and expression recognition as special cases)
– Activity recognition in videos (RGB/RGB+D)
– Speaker recognition

Application Scenarios
• Web search (text ranking, media tagging, e.g. object detection)
• Visual similarity search (clothes/products)
• Ranking user-generated content on forums
• Searching document archives (OCR-less)
• Real-time classifiers for mobile phones (document-like content)
• Access control (face unlocking)
• Enhanced visual communication (background replacement in Skype)
• Activity recognition for Kinect

Agenda
• Ph.D. work: Microtubules
• Real-time video stitching
– Basic framework
– Active Feedback for stitching
– Exploiting frame correlation
– Exploiting mobile sensors
• Annotating mobile generated videos
– Main concept
– Matching using frame information aggregation
– Propagating tags over frames
• Object recognition
– Detection
– Segmentation
– General object/specific
– Activity in videos
– Facial expression recognition using DNNs

Ph.D. thesis: Microtubule tracking and
analysis

MT analysis: HMMs
• One hidden Markov model (HMM) per microtubule (MT) experimental condition
• Pairwise distances between the HMMs are estimated
• Multidimensional scaling (MDS) embeds the distances for visualization

MT analysis: Association rules

Ranking User Content on Forums
• Forums are conversational social cyberspaces
constituting rich repositories of content and an
important source of collaborative knowledge
• However, most of this knowledge is buried inside
the forum infrastructure and its extraction is both
complex and difficult
• In this work, focus on automatic rating of
postings in online discussion forums for easier
content access

Features & Classifier
• Classifier: non-linear SVM assigning each post to one of three levels (high/medium/low)
• Feature categories:
– Relevance (OnSubForumTopic, OnThreadTopic, …)
– Originality (OverlapPrevious, …)
– Forum-specific features (Referencing, …)
– Surface features (Timeliness, Lengthiness, …)
– Posting-component features (Weblinks, Questioning, …)

Dataset
• Discussion threads from Slashdot
• 200 threads with a maximum of 200 posts from the 14 sub-
forums
• A total of 120,000 posts were scraped
• Posts on Slashdot are rated on a scale from -1 (irrelevant)
to 5 (high quality posts)
• Default rating for a registered user is 1 and for an
unregistered user is 0
• Posts rated as 0 were removed, unless they were from a
registered user
• Final dataset was composed of 20,008 rated posts, which
were clustered into three groups, namely low, medium, and
high, according to their value

Results
High Medium Low
F1-measure 0.61 0.42 0.46
Relative Accuracy and F1-measure for each metric category

Building an Arabic Web Ranker
• Applying English rankers to Arabic documents is
suboptimal
• Training an Arabic ranker from scratch
suffers from a lack of sufficient training data
• Solution: start from the English ranker and fine-tune it
with Arabic-specific data for the top returned
results

Agenda
• Initial thoughts/Ideas on Smarter Urban Dynamics

Real-time Stitching
Estimating geometric and photometric mapping
A novel research agenda, published over a number of articles and resulting in a
recent book chapter

Stitching Pipeline
• Extract interest points
• Match between frame pairs
• Solve geometric transform (with RANSAC)
• Photometric alignment
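
The "solve geometric transform (with RANSAC)" stage can be illustrated with a minimal sketch. It assumes a pure translation between frames, which is a deliberate simplification (a real stitcher would typically fit a richer transform such as a homography). The function name, threshold, and iteration count are hypothetical choices for illustration, not taken from the slides.

```python
import random

def ransac_translation(matches, threshold=2.0, iterations=200, seed=0):
    """Estimate a 2-D translation (dx, dy) from noisy point matches.

    matches: list of ((x1, y1), (x2, y2)) pairs between two frames.
    Assumption: a translation stands in for the geometric transform.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: a single match fully determines a translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Matches whose displacement agrees with the hypothesis are inliers.
        inliers = [
            m for m in matches
            if abs(m[1][0] - m[0][0] - dx) <= threshold
            and abs(m[1][1] - m[0][1] - dy) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the consensus set by averaging the inlier displacements.
    n = len(best_inliers)
    dx = sum(b[0] - a[0] for a, b in best_inliers) / n
    dy = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (dx, dy), best_inliers

# Eight consistent matches shifted by (10, -5), plus two outliers.
good = [((float(i), float(2 * i)), (i + 10.0, 2 * i - 5.0)) for i in range(8)]
bad = [((0.0, 0.0), (50.0, 60.0)), ((3.0, 3.0), (-40.0, 7.0))]
shift, inliers = ransac_translation(good + bad)
```

The two outlier matches are excluded from the consensus set, so the refit uses only the eight consistent matches.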

Active Feedback Results
               Precision  Recall  F1    Overlap
No feedback    0.95       0.49    0.65  73.48
With feedback  0.97       0.65    0.78  58.85
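
As a quick check on the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall values; the function name is a hypothetical helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table above.
no_feedback = f1_score(0.95, 0.49)
with_feedback = f1_score(0.97, 0.65)
print(round(no_feedback, 2), round(with_feedback, 2))  # prints: 0.65 0.78
```

Both values agree with the F1 column, so the gain with feedback comes mainly from the higher recall.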

Frame Correlation Results
Method                           Recall  Precision  Average time (ms)
SIFT + no time info usage        0.37    0.46       414
SURF + no time info usage        0.32    0.48       162
SURF + motion vector estimation  0.23    0.38       128
SURF + overlapping region        0.27    0.48       149
SURF + optical flow              0.32    0.47       103

Rotational angles from 3-D accelerometers
[Figure: mobile phone axes; device-coordinate accelerations (front view)]
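
A minimal sketch of recovering rotation angles from a 3-D accelerometer, assuming the device is at rest so the reading is dominated by gravity. The axis convention (z out of the screen, roll about x, pitch about y) and the function name are assumptions; the slides do not specify them.

```python
import math

def tilt_angles(ax, ay, az):
    """Return (roll, pitch) in degrees from a static accelerometer reading.

    Assumed convention: roll is rotation about the x axis, pitch about
    the y axis, and z points out of the screen.
    """
    roll = math.degrees(math.atan2(ay, az))
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    return roll, pitch

flat = tilt_angles(0.0, 0.0, 9.81)     # lying flat, screen up
on_side = tilt_angles(0.0, 9.81, 0.0)  # gravity along the y axis
```

Gravity alone gives no information about rotation around the vertical axis, so yaw cannot be recovered this way and would need another sensor.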


Key Contributions
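[Formula residue: Score(tags) and Conf(tags) defined through sums over k = 1..L of terms of the form Sim(f, f) * Conf(tags), with subscripts i, j, k and t; the exact subscript placement is not recoverable from the extracted text.]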

Results
Effect of the time window size N at sampling rates (a) S = 5 and (b) S = 25


Object Recognition
• Object recognition
– Object detection
   General object (with MSRC, ICIP 2013)
   Specific, such as face and expression recognition (ICCV 2011, ICIP 2015)
– Object segmentation (ICIP 2011, ICIP 2014)
– Instance recognition (with MSRA, ICME 2011; car recognition, ICPR 2012)
• Activity recognition
– RGB: with NU, WACV 2011, MWSCAS 2011
– RGB+D: IJCAI 2013

Activity Recognition in RGBD Videos
Skeleton joint locations and names
as captured by the Kinect sensor

Trajectory Descriptor
Pt is the position of the joint at time t.
This figure shows a general displacement between Pt and Pt+1.
In this example, the trajectory is described by a histogram of 8 bins. For each
displacement, the angle and the length of the displacement are calculated.
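
A minimal sketch of such a descriptor for one 2-D projection of a joint trajectory. It assumes each displacement votes for the bin containing its angle with a weight equal to its length, and that the histogram is normalized to sum to 1. The weighting, the normalization, and the function name are assumptions not stated on the slide.

```python
import math

def hod(points, bins=8):
    """Histogram of oriented displacements for a 2-D trajectory.

    points: list of (x, y) joint positions P_t in time order.
    Assumption: each displacement P_t -> P_{t+1} adds its length to
    the bin that contains its angle.
    """
    hist = [0.0] * bins
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0:
            continue  # no motion, so no direction to vote for
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        index = min(int(angle / (2 * math.pi) * bins), bins - 1)
        hist[index] += length
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Move right 2 units, then up 1 unit, then left 1 unit.
descriptor = hod([(0, 0), (2, 0), (2, 1), (1, 1)])
```

For a 3-D joint trajectory, the same function could be applied to each of the three 2-D projections and the resulting histograms concatenated.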

Sample Trajectory
From the Action3D dataset, this figure shows the three projections of the Right
Hand joint when a subject is performing the action of High Arm Wave. For each
projection, HOD is used to describe the movement.

Temporal localization: Pyramid

Classification Accuracy Comparison
Method Accuracy
Recurrent Neural Network 42.5%
Hidden Markov Model 78.97%
Action Graph on Bag of 3D Points 74.7%
Random Occupancy Patterns 86.5%
Actionlets Ensemble 88.2%
HOD 91.26%

Facial Expression Recognition in the
Wild using Rich Deep Features
• Traditionally, researchers used hand-crafted
features to represent images: texture, color,
gradients, and histograms of these (e.g. histogram of
gradients)
• Then Krizhevsky, Sutskever and Hinton produced
impressive results on ImageNet with deep nets
fed almost raw input

DeepNet Krizhevsky, Sutskever & Hinton
• 5 conv layers, 3 fully connected (the last is the 1000-way output layer)
• Uses raw image data as input (centered)
• Best results on ImageNet 2012 (~15% top 5 error)
• Trained by standard backprop, SGD with mini batches, dropout and ReLU
• Successful because:
– More training data 1.2 M images/1000 classes
– More computation power (GPUs)
After: ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al. NIPS 2012

Deep Net as Hierarchical Feature Extractors
• Shown in the literature: successive layers respond to increasingly complex patterns
After: Visualizing and Understanding Convolutional Networks,
Matthew D. Zeiler and Rob Fergus, ECCV 2014

Facial Expression Recognition in the Wild using Rich
Deep Features
Selected among the top 10% of papers at ICIP 2015
Idea: use deep features from the Krizhevsky net for facial expression recognition, but with added
domain knowledge

Results
Facial parts contribution
Comparative results on different datasets

AutoCaption: Automatic Caption Generation for
Personal Photos

Sample Results
(a) An example with two people interacting. (b) An example with a prominent landmark. (c)
An example where the caption is personalized to the user. (d) An example of a caption when
no people are present. (e) An example of a caption for a photo with no GPS. (f) An example
of a caption with a scene classification error.

Publications
• Book Chapter
– Motaz El Saban and Ayman Kaheel, “Panoramic
video construction from mobile video streams” in
Mobile and Cloud Visual Media Computing,
Springer (under preparation).

Publications
• Abbubakrelsedik Karali, Ahmad Bassiouny and Motaz El Saban, “Facial Expression Recognition in the Wild Using Rich Deep Features”, ICIP 2015 (selected as one of the
top 10% papers in ICIP 2015).
• Amr Sharaf, Mohammad Hussein, Marwan Torki and Motaz El Saban, “Real-time Multi-scale Action Detection From 3D Features”, WACV 2015.
• Ahmed Bassiouny and Motaz El-Saban, “Semantic segmentation as image representation for scene recognition”, ICIP 2014.
• Krishnan Ramnath, Simon Baker, Anitha Kannan, Lucy Vanderwende, Michel Galley, Yi Yang, Deva Ramanan, Motaz El-Saban, Noran Hasan, Lorenzo Torresani, Sudipta
Sinha, “Auto Caption”, WACV 2014.
• Mostafa Izz, Alaa Abd El Hakeem and Motaz El-Saban, “Graph-Based Superpixel Labeling for Enhancement of Online Video Segmentation”, ICIIP 2013.
• Mohammad Gowayyed, Marwan Torki, Mohamed Hussein, Motaz El-Saban, “Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for
Action Recognition”, IJCAI 2013.
• Mohamed Hussein, Marwan Torki, Mohammad Gowayyed, Motaz El-Saban, “Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D
Joint Locations”, IJCAI 2013
• Osama Khalil, Mohammad Fathy, Dina Khalil, Motaz El Saban, Pushmeet Kohli and Jamie Shotton, “Synthetic Training in Object Detection”, ICIP 2013.
• Noran Hasan, Ahmad Fathy, Tamer Deif, Ramy Shahin and Motaz El Saban, “Using Skin Segmentation to Improve Similar Product Recommendations in Online Clothing
Stores”, VISAPP 2013.
• Noureldin Laban, Motaz ElSaban, Ayman Nasr and Hoda Onsi, “Spatial cloud detection and retrieval system for satellite images”, International Journal of Advanced
Computer Science and Applications (IJACSA), December 2012.
• Ayman Kaheel, Motaz El-Saban, Mostafa Izz and Mahmoud Refaat, “Employing 3D Accelerometer Information for Fast and Reliable Image Features Matching on mobile
devices”, ICME 2012 workshop on hot topics in mobile multimedia.
• Alaa Abd El Hakeem and Motaz El-Saban, “FRPCA: Fast Robust Principal Component Analysis”, ICPR 2012.
• Meena Abd El Meseeh, Islam Badr El Deen, Mohammad Abd El Kader and Motaz El-Saban, “Combining global and local information for within-category object class
recognition”, ICPR 2012.
• NourElDin Laban, Motaz ElSaban, Ayman Nasr and Hoda Onsi, “System Refinement for Content Based Satellite Image Retrieval”, The Egyptian Journal of Remote Sensing
and Space Sciences, 2012.
• Alaa Abd El Hakeem and Motaz El-Saban, “Distortion Impact on Low-Dimensional Manifold Recovery of High-Dimensional Data”, Taibah University International
Conference on Computing and Information Technology (ICCIT 2012).
• Alaa Abd El Hakeem and Motaz El-Saban, “Face Authentication Using Graph-Based Low-Rank Representation of Face Components”, ICCV 2011 workshop on Mobile
Vision.
• Mahmoud Refaat, M. El-Saban and Ayman Kaheel, “Active Feedback for Enhancing the Construction of Panoramic Live Mobile Video Streams”, ICME 2011, full paper.
• Motaz El-Saban, Xin-Jing Wang, Noran Hasan, Mahmoud Bassiouny and Mahmoud Refaat, “Seamless annotation and enrichment of mobile captured video streams in real-time”, ICME
2011 application/industrial short paper.
• Mostafa S. Ibrahim, Motaz El Saban, "Higher order potentials with superpixel neighbourhood (HSN) for semantic image segmentation", ICIP 2011.
• Motaz El-Saban, Mostafa Izz, Ayman Kaheel and Mahmoud Refaat, "Improved optimal seam selection blending for fast video stitching of videos captured from freely moving devices",
ICIP 2011.
• Mohammad Nael, Moataz Abd El Wahab, Motaz El-Saban and Mikhail Wasfy, “Highly Efficient Human Action Recognition using compact 2DPCA-based descriptors in the Spatial and
Transform domains”, invited paper at the Midwest Symposium on Circuits and Systems (MWSCAS) 2011, special session.
• Mohammad Nael, Moataz Abd El Wahab and Motaz El-Saban, “Multi-view Human Action Recognition System Employing 2DPCA”, WACV 2011.
• Mahmoud Bassiouny and Motaz El-Saban, “Object Matching Using Feature Aggregation Over a Frame Sequence”, WACV 2011.
• Motaz El-Saban, Mostafa Izz and Ayman Kaheel , “Fast stitching of videos captured from freely moving devices by exploiting temporal redundancy”, ICIP 2010.
• Mohammad El Deeb and Motaz El-Saban, “Human age estimation using enhanced bio-inspired features (EBIF)”, ICIP 2010.
• A. Kaheel, M. El-Saban, M. Refaat and M. Izz, “Mobicast: A System for Collaborative Event Casting Using Mobile Phones”, ACM-MUM 09.
• M. El-Saban, M. Refaat, A. Kaheel and A. Abd El-Hameed, “Stitching videos streamed by mobile phones in real-time”, ACM-MM 09 (technical demonstration)
• M. Eldib, B. Abou Zaid, H. Zawbaa, M. El-Zahar, M. El-Saban, “Soccer video summarization using enhanced logo detection”, ICIP 2009.
• Waleed Magdy, Kareem Darwish and Motaz El-Saban, “Efficient Language-Independent Retrieval of Printed Documents without OCR”, in SPIRE 2009.
• Nayer Wanas, Motaz El-Saban, Heba Ashour, Waleed Ammar, "Automatic Scoring of Online Discussion Posts", CIKM Second Workshop on Information Credibility on the Web (WICOW
2008)
• M. El-Saban et al. Automated tracking and modelling of microtubule dynamics, IEEE International Symposium on Biomedical Imaging
(ISBI) 06.
• B. S. Manjunath, B. Sumengen, Z. Bi, J. Byun, M. El-Saban, D. Fedorov, N. Vu, Towards Automated Bioimage Analysis: from features to semantics, IEEE International Symposium on
Biomedical Imaging (ISBI), invited paper.
• A. Altinok, M. El-Saban et al. Activity Recognition in Microtubule Videos by Mixture of Hidden Markov Models, Proc. International Conference on Computer Vision and Pattern
Recognition (CVPR), 2006.
• S. Bhagavaty and M. A. El Saban. SketchIt: Basketball Video Retrieval Using Ball Motion Similarity, in Advances in Multimedia Information Processing - PCM 2004: 5th Pacific Rim
Conference on Multimedia, Tokyo, Japan.
• M. A. El Saban and B. S. Manjunath. Interactive Segmentation Using Curve Evolution and Relevance Feedback. ICIP 2004
• M. A. El Saban and B. S. Manjunath. Video Region Segmentation by Spatio-temporal watersheds. ICIP 2003.
• M. A. El Saban, S. Abd El-Azeem and M. Rashwan. A new video coding scheme based on the H.263 standard and entropy constrained vector quantization. Faculty of engineering journal,
Nov. 2000.
• M. A. El Saban, S. Abd El-Azeem and M. Rashwan. A new video coding scheme based on the H.263 standard and entropy constrained vector quantization. ICII 2000, Kuwait, Nov. 2000.

Patents
• Waleed Magdi and Motaz El-Saban, Personalized notification of live events (U.S. patent #8,881,191)
• Waleed Magdi and Motaz El-Saban, Using An Id Domain To Improve Searching (U.S. granted patent #8,131,720
and #8,538,964 (continuation))
• Heba Ashour, Nayer Wanas, Mostafa El Baradei and Motaz El-Saban, User Evaluation in a Collaborative Online
Forum (U.S. patent #8,893,024)
• Ayman Kaheel, Motaz El-Saban, Mohamed Shawky and Mahmoud Refaat, Sharing video data associated with the
same event (U.S. patent #8,767,081)
• Motaz El-Saban, Christopher Burges and Qiang Wu, Re-ranking top search results (U.S. patent #8,661,030)
• Motaz El-Saban, Ayman Kaheel, Mahmoud Refaat and Ahmad Abd El Hameed, Composite video generation (U.S.
patent #8,605,783)
• Ayman Kaheel, Motaz El-Saban, Mahmoud Refaat, Ahmad El Arabawy, Mostafa Baradei, Using accelerometer
information for determining orientation of pictures and video images (U.S. patent pending)
• Motaz El-Saban, Xin-Jing Wang and May Sayed, Real-Time Annotation And Enrichment Of Captured Video (U.S.
patent #8,903,798)
• James Lau, Ayman Kaheel, Motaz El-Saban, Mohammad Shawky, Monica Gonzales, Ahmed El Baz, Tamer Deif and
Alaa Abd El Hakeem, Using facial data for device authentication or subject identification (U.S. patent pending)
• Motaz El-Saban, Ayman Kaheel, Mohammad Shawky and James Lau, Modifying video regions using mobile device
input (U.S. patent pending)
• Pushmeet Kohli, Jamie Shotton and Motaz El Saban, Synthesizing Training Samples for Object Recognition (U.S.
patent #8,903,167)
• Alaa Abd El Hakeem and Motaz El-Saban, Dynamic update of recovered subspaces of high dimensional
• Motaz et al, Natural language search of images and navigation
