Rosinski: IBM AI Overview with Several Examples of Projects in the Media and Lessons Learned
1. FIAT/IFTA Media Management Seminar
“Game Changers? From Automation to Curation: Futureproofing AV Content”
IBM AI Overview with several Examples of
Projects in the Media and Lessons Learned
Jakob Rosinski | Lead Architect Video Solutions & Broadcast Industry Europe
Stockholm | 13.06.2018
2. Abstract
This talk gives an overview of client projects in the media archive space worldwide to which IBM has contributed with its own AI, named Watson, as well as with its knowledge and integration capabilities. Major topics are scope definition and use case identification, and the use of cognitive services of different kinds and from different vendors, including successes and open problems. In such a multi-modal approach, training the services is also key, and the talk shows how this can be managed from both a human and a machine perspective.
Jakob is the Lead Architect for Video Solutions & Broadcast Industry for IBM Services
in Europe. He is also the product owner of IBM AREMA, a workflow and essence
management solution which is widely used at different broadcasters for essence
archives and workflow automation.
Over the last decade Jakob was responsible for various projects in the media industry
at HBO, France24, ORF, SRF, RTL Mediengruppe and Deutsche Bundesliga/Sportcast.
He is an expert for multi-site & multi-tier essence management and workflow
automation for ingest, archive, production & distribution.
He is also known and valued as a subject matter expert on these topics in the
worldwide IBM M&E community, and he is skilled at translating business needs into
systems solutions.
3. Video Enrichment uses industry-leading AI capabilities to analyze textual, audio, and visual data
within multi-media content, and to build easily searchable metadata packages for every asset.
By understanding content in new ways, media companies can improve content discovery,
increase operational efficiency, deliver higher ad revenues, drive viewer engagement and offer
entirely new ways to meet the demands of their businesses.
Enriched content is inherently more searchable. Improved content discovery in your consumer
service leads to increased usage.
4. Cognitive base services used for content enrichment
Target: Deeply enriched content, second-to-second
Visual Recognition: Enhanced and automated understanding of personalities and objects present in the frame
Audiomining & Speech to Text: Activate decade-old material by running it through the STT API and then performing deeper analytics
NLU & Translation: Deeper understanding of concepts, recognized entities, keywords, and relationships
Video Detection / Speed / Movement
Pattern Detection & Similarity Search: Search image and video data for objects or contexts the services were not trained on
5. A lot of vendors are providing base cognitive services...
Visual Recognition
Audiomining & Speech to Text
NLU & Translation
Video Detection / Speed / Movement
Pattern Detection & Similarity Search
11. Watson Video Enrichment Workflow
Assets are acquired, ingested, processed and enriched using the Watson Media platform.
SEMANTIC SCENE CHAPTERING
Divides the media into meaningful chunks or chapters that can be more easily managed by people responsible for editing or producing.
SPEECH TO TEXT
Converts audio into text by leveraging machine intelligence to combine information about grammar and language structure with knowledge of the composition of the audio signal. Trainable.
NATURAL LANGUAGE UNDERSTANDING
Using the textual output of speech to text or a closed-caption file, NLU derives concepts, document-level emotions, sentiment, entities, keywords, language, and taxonomy. Trainable.
VISUAL RECOGNITION
Detects the contents of an image or video frame, answering the question: “What is in this image?” Returns class, class description, face detection, and text recognition. Trainable.
TONE ANALYZER & PERSONALITY INSIGHTS
Provide additional features that document the emotional tone, writing tone, and social tone of dialogue, as well as the overall personalities of characters based on their words.
Customer MAM or DAM
Enriched metadata is delivered as an open JSON bundle to be stored and used for search, compliance, recommendation and other vital use cases.
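A minimal sketch of how time-coded results from several services could be collected into one JSON bundle for a MAM or DAM. The field names (`asset_id`, `annotations`, `service`, `start_s`, `end_s`, `label`, `confidence`) and the sample values are assumptions for illustration only; they are not the actual Watson Media bundle schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Annotation:
    """One time-coded finding from a single enrichment service (hypothetical shape)."""
    service: str       # e.g. "speech_to_text"; the names are illustrative
    start_s: float     # start of the finding, in seconds from asset start
    end_s: float       # end of the finding, in seconds from asset start
    label: str         # transcript text, detected class, keyword, ...
    confidence: float  # score between 0 and 1 as reported by the service


@dataclass
class EnrichmentBundle:
    """All annotations for one asset, serializable as a single JSON document."""
    asset_id: str
    annotations: list = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        self.annotations.append(annotation)

    def to_json(self) -> str:
        # Sort by start time so consumers can read the bundle as a timeline.
        ordered = sorted(self.annotations, key=lambda a: (a.start_s, a.service))
        payload = {
            "asset_id": self.asset_id,
            "annotations": [asdict(a) for a in ordered],
        }
        return json.dumps(payload, indent=2)


bundle = EnrichmentBundle(asset_id="demo-asset-001")
bundle.add(Annotation("visual_recognition", 12.0, 15.5, "stadium", 0.91))
bundle.add(Annotation("speech_to_text", 3.2, 7.8, "welcome to the show", 0.84))
print(bundle.to_json())
```

Keeping every finding time-coded and tagged with its originating service is what lets a downstream search index treat results from different services uniformly.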
15. Analysis of successful trailers to automatically create a new one
Scene Detection
Deep Video Analysis
People, Object and Context Detection
Classification of actors based on 24 emotions
Classification of scenes based on 22,000 categories
Deep Audio Analysis
Background
Actor sentiment and tone
Analysis of scene composition
Classification of light and color
https://www.youtube.com/watch?v=gJEzuYynaiw
16. Content enrichment for Bundesliga archive
Concept and proof of an automatic content enrichment system for 40+ years of soccer history
Annotation using a portfolio of cognitive solutions
Audio: Speech-to-text / Transcript
Audio: Speaker Detection
Audio: Atmosphere (cheers, whistles, ...)
Video: Angle/Camera & Context Detection
Video: Face & Object Detection
Domain-trained services including a training portal
Sharpening of results through domain knowledge, creation of timelines, and identification of concepts
Link with game and player data
Optimize content analysis and search based on game and player statistics
Guided search
Persona-based User Experience
Personalized Discovery, Suggestions, Design & Projects
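The linking of detections with game and player data could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function `link_events`, the tuple layouts, the 10-second tolerance and the sample events are all hypothetical.

```python
from bisect import bisect_left


def link_events(detections, game_events, tolerance_s=10.0):
    """Attach the nearest official game event to each detection.

    detections:  list of (time_s, label) produced by cognitive services
    game_events: list of (time_s, description), sorted by time_s
    Returns a list of (time_s, label, description or None).
    """
    times = [t for t, _ in game_events]
    linked = []
    for time_s, label in detections:
        i = bisect_left(times, time_s)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - time_s), default=None)
        if best is not None and abs(times[best] - time_s) <= tolerance_s:
            linked.append((time_s, label, game_events[best][1]))
        else:
            linked.append((time_s, label, None))
    return linked


# Hypothetical sample data: official match events and audio detections.
game_events = [(754.0, "goal, home team"), (2310.0, "yellow card")]
detections = [(757.5, "cheers"), (1500.0, "whistle"), (2305.0, "whistle")]
timeline = link_events(detections, game_events)
for entry in timeline:
    print(entry)
```

A detection that matches an official event (cheers a few seconds after a goal) can be ranked higher in search, while unmatched detections remain in the timeline without a link.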
17. Content enrichment for Brazil's most famous TV show
Target: Automatic content enrichment of 30+ years of show content
Annotation using a portfolio of cognitive solutions (IBM, OpenCV)
Audio: Speech-to-text / Transcript / Phrase detection
Video: Angle/Camera & Context Detection
Video: Face & Object Detection
Domain-trained services including a training portal
Sharpening of results through domain knowledge, creation of timelines, and identification of concepts
18. Architecture of “Captain Caption” Demo
AREMA
Speech to Text
Deep Learning – Sound Recognition
Natural Language Understanding
Conform results into one closed-caption file
Translation into target language
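The conform step, in which speech and sound-recognition results end up in one caption file, might look like the sketch below. It assumes WebVTT as the target format and cues given as `(start_s, end_s, text)` tuples; the demo's actual caption format and data structures are not stated in the slides, and the function names are hypothetical.

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def conform_captions(speech_cues, sound_cues):
    """Merge speech and sound cues into one WebVTT document.

    Each cue is (start_s, end_s, text). Sound events are wrapped in
    square brackets, a common convention for non-speech captions.
    """
    cues = list(speech_cues) + [(s, e, f"[{t}]") for s, e, t in sound_cues]
    cues.sort(key=lambda cue: cue[0])  # order all cues by start time
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{to_timestamp(start)} --> {to_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


# Hypothetical service output.
speech = [(1.0, 3.5, "Good evening."), (6.0, 8.25, "Thank you.")]
sounds = [(3.5, 5.0, "applause")]
vtt = conform_captions(speech, sounds)
print(vtt)
```

Translation into the target language would then operate on the cue texts only, leaving the timings untouched.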
19. Use Case: “Implement a fully integrated, trained cognitive service to exchange ident in and out scenes”
Context / Solution
Frame-accurate detection of trained lead-in and lead-out frames marks those scenes in the content. The scenes are then exchanged automatically in the master format without transcoding (unwrap, cut, wrap) and with appropriate audio track handling, which enables a fast channel switch of content.
• Use of an in-house detection component built on OpenCV and Watson VR for frame-accurate detection of scenes
• Use of AREMA's Dalet Galaxy integration to pull content from and push content to the MAM system directly; no need to extend Galaxy for this purpose
• Automatically scalable by using the AREMA autoscaler in combination with Kubernetes & Docker
• Use of the AREMA MXF Package for
  • metadata extraction from the source file
  • rewrapping / preparation of the audio track schema of the new scene
  • partial cut of the source file
  • conforming of all parts to the target file
=> very fast, no transcoding or change of audio and video streams
Result:
• Fully automated exchange of scenes, deeply integrated with the existing environment
• Nearly endlessly scalable, as all components can run in a Kubernetes/Docker environment; this significantly reduces time and staff effort and speeds up the change of content between programs => from 3 months (2 full-time persons) to days
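The bookkeeping behind "partial cut" and "conforming of all parts" can be illustrated with a small cut-list builder. This sketch covers only the frame arithmetic; it does not use the AREMA MXF Package, OpenCV or Watson VR, and the function name, the tuple layout and the frame numbers are assumptions made for the example.

```python
def build_cut_list(total_frames, ident_ranges, replacement="new_ident"):
    """Return the ordered parts that make up the target file.

    ident_ranges: list of (first_frame, last_frame), inclusive, one per
    detected lead-in or lead-out scene in the source.
    Each part is ("source", first, last) or ("replacement", name).
    """
    parts = []
    cursor = 0  # next source frame that has not been handled yet
    for first, last in sorted(ident_ranges):
        if first < cursor or last >= total_frames or first > last:
            raise ValueError(f"invalid or overlapping range: {(first, last)}")
        if first > cursor:
            # Untouched source material: cut out and rewrapped, not transcoded.
            parts.append(("source", cursor, first - 1))
        parts.append(("replacement", replacement))
        cursor = last + 1
    if cursor < total_frames:
        parts.append(("source", cursor, total_frames - 1))
    return parts


# Hypothetical 1000-frame source with an ident at the start and at the end.
parts = build_cut_list(1000, [(0, 74), (900, 999)])
for part in parts:
    print(part)
```

Because the source segments are referenced by frame range only, they can be copied into the target wrapper unchanged, which is why frame-accurate detection matters for this use case.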
20. There is no One Size Fits All
Each use case of multimodal analysis has different requirements, so the workflows and the combination of AI services have to be adapted to these requirements.
This is where the following model provides the flexibility to adapt to each unique use case of multimodal analytics.
Vendor-independent usage of cognitive services
The whole is greater than the sum of its parts (Aristotle), but sometimes particular “tiny” use cases are also worth evaluating.
Flexible MULTIMODALITY is a must
21. Elemental parts of a content enrichment platform
Multi-Modality & Training & Vendor Independence
Data Consolidation & Monitoring
Integration & Workflow
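Vendor independence and per-use-case combination of services can be sketched with a common adapter interface. Everything named here (`CognitiveService`, `analyze`, the two fake adapters, `run_workflow`) is hypothetical and stands in for real vendor integrations, which the slides do not describe.

```python
from typing import Protocol


class CognitiveService(Protocol):
    """Common interface every vendor adapter is assumed to implement."""
    name: str

    def analyze(self, media_uri: str) -> list[dict]:
        ...


class FakeSpeechToText:
    """Stand-in adapter; a real one would call a vendor's speech API."""
    name = "speech_to_text"

    def analyze(self, media_uri: str) -> list[dict]:
        return [{"start_s": 0.0, "end_s": 2.0, "label": "hello"}]


class FakeVisualRecognition:
    """Stand-in adapter; a real one would call a vendor's vision API."""
    name = "visual_recognition"

    def analyze(self, media_uri: str) -> list[dict]:
        return [{"start_s": 1.0, "end_s": 4.0, "label": "studio"}]


def run_workflow(media_uri: str, services: list[CognitiveService]) -> list[dict]:
    """Run the services selected for one use case and tag results by origin."""
    results = []
    for service in services:
        for item in service.analyze(media_uri):
            results.append({"service": service.name, **item})
    return sorted(results, key=lambda r: r["start_s"])


# Each use case picks its own combination of services.
results = run_workflow("file:///demo.mxf", [FakeVisualRecognition(), FakeSpeechToText()])
for row in results:
    print(row)
```

With this shape, swapping one vendor for another means replacing a single adapter, and a "tiny" use case can run with just one service in the list.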
23. Why is training necessary?
- How do we tell Will Ferrell (famous actor) apart from
Chad Smith (famous rock musician)?
- Challenges include:
• Out-of-Plane Rotation: frontal, 45 degree, profile,
upside down
• Presence of beard, mustache, glasses.
• Facial Expressions
• Occlusions by long hair, hand
• In-Plane Rotation
• Image conditions: size, lighting condition, distortion,
noise, compression
Trust me, these are two different, unrelated people!
https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78
24. A lot of vendors are providing base cognitive services...but without
individual training they do not provide sufficient benefit
Model tiers: Base AI model, Industry/Domain AI model, Customized user AI model
[Chart: accuracy (40% to 90%) plotted against training data size, with learning curves for the base model, the domain-specific model and the customer-adapted model]
As the domain specializes, learning accelerates
Base AI model:
• Public models
• Pre-trained
• Limited accuracy for typical real-life use cases
Customized user AI model:
• Trained with proprietary data
• Data ownership critical for differentiation
Automated TRAINING is a must
27. Parts of a successful content enrichment
1. By combining trained cognitive services, new valuable metadata can be retrieved from content
2. Automatic creation and use of this metadata must be included in existing processes
3. Quality of cognitive services and processes must be supervised
[Architecture diagram]
Training: rule-based configuration, batch learning, manual labeling
Cognitive Workflow Orchestration: cognitive workflow builder, E2E broadcast integration (MAM, etc.)
Cognitive Workflow Operations: full integration into AREMA Operations Dashboards, …
Cognitive Content Media Services: General Domain Content Tagging; Domain-specific Content Tagging (3rd party); Domain-specific Content Tagging (proprietary); Domain-specific Content Tagging (shared)
Elementary AI Services: IBM Watson APIs (Speech-to-Text, NLC/NLU*, Visual Recogn., …); 3rd Party APIs (Speech, Language, Visual, …); Watson Media Knowledge Studio
Information Corpora: Essence Files, Meta Data, Public Data, Other Data sources, …
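Supervising the quality of a cognitive service can start with something as simple as comparing its tags against a manually labeled sample. The sketch below computes precision and recall for one asset; the function name and the sample tags are hypothetical, and a real operations dashboard would aggregate such figures over many assets and over time.

```python
def precision_recall(predicted, reference):
    """Compare service output with manual labels for one asset.

    precision: share of predicted tags that are correct
    recall:    share of reference tags that the service found
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical tags for one clip: service output versus manual labeling.
service_tags = ["goal", "crowd", "referee", "ball"]
manual_tags = ["goal", "referee", "corner"]
precision, recall = precision_recall(service_tags, manual_tags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both values per service and per domain shows where additional training data is likely to pay off.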
28. Summary
• Comparing single cognitive services with each other is not adequate; what matters is a reasonable combination of services
• The solution approach must start with the given use case, for which the solution is then defined and customized
• AI will not take over all human work, but it will support humans in the areas where automation is meaningful
• The process will be a mix of human and AI-based tasks and steps
• Adequate solutions are created through trial and optimization, not by waiting for the perfect technology
While AI can’t fully equate the human touch creatively, it can optimize workflows and media processes to gain more value from content.