Big Data Day LA 2016 / Use Case Driven track - Reliable Media Reporting in an Ever-changing Data Landscape
Rachel Kelley, Project Manager, and Josh Andrews, Data & Analytics Architect, OnPrem; Eric Avila, Senior Anti-Piracy Technologist, NBCUniversal
OnPrem Solution Partners worked with NBCU to profile in-house data, assess its quality, and recommend process and quality improvements. We present our data import process, the improvements we want to make, and lessons learned about the tools used, including MariaDB, ElasticSearch, Cassandra, and others.
4. NBCU CCP Overview
NBCU is one of the world's largest entertainment companies
Responsibilities of NBCU’s Creative Content Protection Group (CCP)
CCP creates & manages technological solutions to these needs
Business areas: Cable Television, Broadcast Television, Digital, Parks, Film
5. OnPrem Solution Partners
Media & Entertainment Technology Consulting Firm
Practice areas: Business Consulting, Technology Leadership, Applied Innovation
Services: Business Strategy, Product Roadmap, Process Improvement, Change Management, CRM, Data & Analytics, Digital Supply Chain, PMO & SI Services, Custom Solutions, Enterprise App Development, QA & Support, UX/UI
Offices: Los Angeles, New York, Austin
6. Problem Statement
Problem Statement:
• NBCU CCP wanted to obtain a better view of their data flow and process to manage asset identification and analytics
Scope:
• Data from streaming services regarding NBCU owned content
• Priority data solutions in place within CCP and other NBCU teams
Objectives:
• Develop a data strategy around streaming services metadata
• Investigate/define initial taxonomy, initiate data profiling, and develop data source list
8. Project Background
• Lightweight digital identifier, easily referenced against fingerprints generated from other assets of its kind
• Sent to vendors/partners & verified against uploaded content
• Example: titles
• Common data problem across industries:
– Duplicates
– Language
– Quality
• “Truth” changes over time and by business need
• Oh hey, that’s my content you’ve got there…
• Streaming services are triggered to associate content in video to ownership of reference asset
9. Systems in Place
• Time series data
• Analytic summaries
• Title metadata and fingerprinting
• Fingerprint, title and analytic data
Solutions that allow full proof-of-concept testing before full implementation, without licensing or contract constraints, have been easier to employ
10. Methodology
Data Profiling Methodology
• Identify relevant systems and tables from stakeholders & obtain access to databases
• Determine table purpose and population source
• Generate fundamental metrics for all columns, using proprietary data profiling methodology, e.g.: Datatype, Scale, Cardinality
• Review metrics for outstanding measures
• Generate further questions for investigation
Project Stats
• 18+ data systems encountered
• 9 stakeholder interviews
• 32 data profiling reports run
• 8 weeks
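The column-level metrics named in the methodology (datatype, cardinality, and similar) can be illustrated with a small sketch. The profiling methodology used on the project is proprietary and is not shown in the slides, so this is only one plausible way to compute such metrics; the function names `profile_column` and `profile_table` and the sample rows are hypothetical.

```python
from collections import Counter
from statistics import pstdev


def profile_column(values):
    """Return basic profiling metrics for one column (None stands for NULL)."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    # Most common Python type among non-null values, as a rough datatype guess.
    type_counts = Counter(type(v).__name__ for v in non_null)
    metrics = {
        "inferred_type": type_counts.most_common(1)[0][0] if non_null else None,
        "pct_non_null": round(100.0 * len(non_null) / total, 2) if total else 0.0,
        "cardinality": len(set(non_null)),
        "min": None,
        "max": None,
        "std_dev": None,
    }
    # Numeric summaries only make sense for numeric values (bool excluded).
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        metrics["min"] = min(numeric)
        metrics["max"] = max(numeric)
        metrics["std_dev"] = pstdev(numeric)  # population standard deviation
    return metrics


def profile_table(rows):
    """Profile every column found in a list of dict rows."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample rows, made up for illustration only.
sample_rows = [
    {"asset_type": "tv", "season": 1, "view_count": 100},
    {"asset_type": "film", "season": None, "view_count": 300},
    {"asset_type": "tv", "season": 2, "view_count": None},
    {"asset_type": "tv", "season": None, "view_count": 200},
]
report = profile_table(sample_rows)
```

Reviewing a report like this column by column is what surfaces "outstanding measures", for example a column that is mostly NULL or a maximum far outside the expected range.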
11. Data Flow Diagram
[Diagram: data flows among 1. CCP Internal Systems, 2. NBCU Systems, 3. External Data Sources, and 4. Vendors and Partners; labeled elements include CCP SQL Server, APIs, and Release Dates]
12. Analysis Performed: SQL Server
As the system takes external metadata and uses it to “patch” together title data received from various systems to create a more reliable dataset, our primary concerns were:
• Data Quality & Source Integrity
• Update Frequency
• Data Complexity
[Diagram: Upstream Metadata Sources, CCP SQL Server, Downstream Reporting Capabilities]
14. Analysis Results: Metadata Staging
General Observations:
• Examined data at the grain of title, country, language, category, season, episode, and others
• Records pulled from multiple sources lead to complexity…
– Duplicate release dates within titles
– Conflicting records within titles
[Chart: # of Records Ingested by External Source (Release Date). System 1: 51K; System 2: 48K; System 3: 40K; System 4: 14K]
[Chart: External Sources Per Title. X-axis: 1, 2, 3, 4; values shown: 2,584; 3,417; 352]
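The two problems noted above, duplicate release dates and conflicting records within a title, can be told apart with a simple grouping pass. The sketch below assumes a grain of (title, country) and record fields named `title`, `country`, `release_date`, and `source`; those field names, the function `find_release_date_issues`, and the sample records are all hypothetical and do not come from the CCP systems.

```python
from collections import defaultdict


def find_release_date_issues(records):
    """Group records by (title, country) and separate duplicates from conflicts.

    duplicates: several records for the grain, all with the same release date
    conflicts:  records for the grain that disagree on the release date
    """
    by_grain = defaultdict(list)
    for rec in records:
        by_grain[(rec["title"], rec["country"])].append(rec)

    duplicates, conflicts = [], []
    for grain, recs in by_grain.items():
        distinct_dates = {r["release_date"] for r in recs}
        if len(distinct_dates) > 1:
            conflicts.append(grain)
        elif len(recs) > 1:
            duplicates.append(grain)
    return sorted(duplicates), sorted(conflicts)


# Made-up sample records pulled from several hypothetical sources.
records = [
    {"title": "Show A", "country": "US", "release_date": "2015-01-10", "source": "System 1"},
    {"title": "Show A", "country": "US", "release_date": "2015-01-10", "source": "System 2"},
    {"title": "Film B", "country": "US", "release_date": "2014-06-01", "source": "System 1"},
    {"title": "Film B", "country": "US", "release_date": "2014-06-03", "source": "System 3"},
    {"title": "Film C", "country": "GB", "release_date": "2013-03-03", "source": "System 4"},
]
duplicates, conflicts = find_release_date_issues(records)
```

Duplicates can usually be collapsed safely, while conflicts need a rule for which source to trust, which is where the "truth changes by business need" problem shows up.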
15. Analysis Performed: MariaDB
CONSIDERATIONS
• Overall, this is an analysis of viewership and hits
• Account for matches against official, whitelisted, and licensed videos
• Outliers were not removed due to the large percentage of match data that would be expunged
• Summary statistics indicated a left leaning data set
[Diagram: Title Information and Summarized Copyright Match Information, with systems MariaDB, CCP SQL Server, and Cassandra]
[Chart: histogram. X: Viewers per Video; Y: Count of Cases in Bucket]
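A histogram of viewers per video, like the chart on this slide, can be approximated by bucketing view counts. The slides do not say how the buckets were defined, so the sketch below assumes power-of-ten buckets and non-negative integer counts; `bucket_label`, `histogram`, and the sample values are hypothetical. Comparing the mean with the median is one quick way to see how lopsided the distribution is when outliers are kept.

```python
from collections import Counter
from statistics import mean, median


def bucket_label(view_count):
    """Map a non-negative integer view count to a power-of-ten bucket label."""
    if view_count == 0:
        return "0"
    lower = 10 ** (len(str(view_count)) - 1)  # e.g. 45 -> 10, 1200 -> 1000
    return f"{lower}-{lower * 10 - 1}"


def histogram(view_counts):
    """Count how many videos fall into each bucket."""
    return Counter(bucket_label(v) for v in view_counts)


# Made-up view counts, including one large outlier that is kept on purpose.
view_counts = [0, 3, 12, 45, 80, 150, 999, 1200, 250000]
counts = histogram(view_counts)
average_views = mean(view_counts)
median_views = median(view_counts)
```

With outliers retained, a few very large values pull the mean far from the median, so the median and the bucket counts are often the more useful summaries.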
16. Analysis Results: Hits
Table: SMART_MATCH (copyright match data)
Column Name | Datatype | Nullable | % Non-Null | Standard Deviation | Min | Max
claim_type | varchar | YES | 76.77% | | |
asset_name | varchar | YES | 92.78% | | |
asset_type | varchar | YES | 100.00% | | |
video_title | varchar | YES | 76.08% | | |
reference_status | varchar | YES | 65.68% | | |
reference_length | int | YES | 65.68% | 3065.065455 | 18 | 18746
content_type | varchar | YES | 65.68% | | |
view_count | int | YES | 76.08% | 919084.1099 | 0 | 1.05E+09
duration | int | YES | 76.08% | 2228.763702 | 0 | 192887
video_total_match | int | YES | 74.66% | 1554.598552 | 0 | 38385
channel_title | varchar | YES | 76.08% | | |
claim_date | datetime | YES | 100.00% | | 12/7/2007 | 11/16/2015
video_upload_date | datetime | YES | 76.08% | | 8/9/2005 | 11/16/2015
licensed_content | tinyint | YES | 76.08% | 0.140196747 | 0 | 1
privacy | varchar | YES | 76.08% | | |
policy_name | varchar | YES | 95.33% | | |
match_percentage | int | YES | 56.95% | 48.10774725 | 0 | 32388
channel_comments | int | YES | 55.04% | 3625.070764 | 0 | 1213995
channel_videos | int | YES | 55.04% | 1518.373624 | 0 | 228113
season | int | YES | 10.49% | 72.64507792 | 1 | 2015
episode | int | YES | 10.70% | 78.35203659 | 1 | 4601
last_updated | timestamp | NO | 100.00% | | 11/16/2015 | 11/16/2015
Whitelisted | tinyint | YES | 100.00% | 0.099045453 | 0 | 1
official | tinyint | YES | 100.00% | 0.076134446 | 0 | 1
owner | varchar | YES | 100.00% | | |
Data Discrepancy:
• Reference length longer than actual video length
Data Limitation:
• Only most recent upload date is displayed, and the value may actually be the date of publishing or being made public
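The reference-length discrepancy can be checked row by row. The sketch below assumes that the `duration` column in SMART_MATCH holds the actual video length and that it uses the same unit as `reference_length`; the slides do not confirm either point. The function name `flag_length_discrepancies` and the sample rows are hypothetical.

```python
def flag_length_discrepancies(rows):
    """Return rows whose reference_length exceeds the video's duration.

    Rows where either value is NULL (None) are skipped, since both columns
    are nullable and a missing value cannot be compared.
    """
    flagged = []
    for row in rows:
        reference_length = row.get("reference_length")
        duration = row.get("duration")
        if reference_length is None or duration is None:
            continue
        if reference_length > duration:
            flagged.append(row)
    return flagged


# Made-up sample rows using column names from the SMART_MATCH table.
rows = [
    {"video_title": "clip 1", "reference_length": 1800, "duration": 2400},
    {"video_title": "clip 2", "reference_length": 2700, "duration": 300},
    {"video_title": "clip 3", "reference_length": None, "duration": 120},
    {"video_title": "clip 4", "reference_length": 600, "duration": 600},
]
flagged = flag_length_discrepancies(rows)
flagged_titles = [row["video_title"] for row in flagged]
```

A flagged row is not necessarily an error in either column; it is a prompt for the "further questions for investigation" step of the methodology.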
18. Key Findings & Recommendations
Challenge categories: Data Challenges, Tech Challenges, Organizational Challenges
• Finding: Gaps in metadata make it difficult to understand and utilize collected data effectively
– Recommendation: Streamline the metadata gathering and cleaning process, leveraging other metadata systems
• Finding: Daily quotas and thresholds limit and distort the data pulled
– Recommendation: Selectively pull data to circumvent daily quotas and potentially improve data integrity
• Finding: Data integrity from some sources is questionable, and the incentive to improve it varies
– Recommendation: Improve data processes, e.g., addition of data cleaning to certain data extract and aggregation processes (ETL)
• Finding: Brand-specific workflows and fringe use cases hinder the ability to acquire metadata & accurately map references
– Recommendation: Roadmap of brand and title match data cleanup for reporting needs, and a process to maintain data integrity
19. Data Project Principles & Pitfalls
Maintenance is the Monster
• Initial creation of data solutions is often easier than long term maintenance
Common issues
• Rapidly changing platforms, frameworks, and methodologies
• Need for continuous maintenance and verification of data quality
• Incentives and cultures vary across departments and companies
• Establishing and disseminating a “data stewardship” mentality
• Data “truth” changes over time and by business need
• Ongoing changes in individual consumer behavior, options for copyright owners
Issue types: Technical, Non-Technical
20. Go Forward Plan
NBCU Next Steps
• Increased focus on and automation of data matching & data clean up
• Enable better business unit segmentation of enterprise data
• Transition from organic to directed architecture
• Increased internal outreach