Reliable Media Reporting in an Ever-Changing Data Landscape
Presenters
Eric Avila, NBCU
• NBCU Senior Technologist, Creative Content Protection Team
Rachel Kelley, OnPrem
• Senior Project Manager, Data & Analytics Practice
Josh Andrews, OnPrem
• Data Technology Lead/Architect, Data & Analytics Practice
Agenda
Introduction
• NBCU and OnPrem
• Problem Statement
Approach
• Background
• Methodology
• Data Analysis
Outcome
• Recommendations
• Next Steps
Q&A
NBCU CCP Overview
NBCU is one of the world’s largest entertainment companies
• Cable Television
• Broadcast Television
• Digital
• Parks
• Film
Responsibilities of NBCU’s Creative Content Protection Group (CCP)
CCP creates & manages technological solutions to these needs
OnPrem Solution Partners
Media & Entertainment Technology Consulting Firm
Business Consulting
• Business Strategy
• Product Roadmap
• Process Improvement
• Change Management
Technology Leadership
• CRM
• Data & Analytics
• Digital Supply Chain
• PMO & SI Services
Applied Innovation
• Custom Solutions
• Enterprise App Development
• QA & Support
• UX/UI
Los Angeles, New York, Austin
Problem Statement
Problem Statement:
• NBCU CCP wanted to obtain a better view of their data flow and process to manage asset identification and analytics
Scope:
• Data from streaming services regarding NBCU-owned content
• Priority data solutions in place within CCP and other NBCU teams
Objectives:
• Develop a data strategy around streaming services metadata
• Investigate/define initial taxonomy, initiate data profiling, and develop data source list
Project Background
• Lightweight digital identifier, easily referenced against fingerprints generated from other assets of its kind
• Sent to vendors/partners & verified against uploaded content
• Example: titles
• Common data problem across industries:
  • Duplicates
  • Language
  • Quality
• “Truth” changes over time and by business need
• Oh hey, that’s my content you’ve got there…
• Streaming services are triggered to associate content in video to ownership of reference asset
Systems in Place
• Time series data
• Analytic summaries
• Title metadata and fingerprinting
• Fingerprint, title and analytic data
Solutions that allow full proof-of-concept testing before full implementation, without licensing or contract constraints, have been easier to employ
Methodology
Data Profiling Methodology
• Identify relevant systems and tables from stakeholders & obtain access to databases
• Determine table purpose and population source
• Generate fundamental metrics for all columns, using proprietary data profiling methodology, e.g., datatype, scale, cardinality
• Review metrics for outstanding measures
• Generate further questions for investigation
Project Stats
• 18+ data systems encountered
• 9 stakeholder interviews
• 32 data profiling reports run
• 8 weeks
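A minimal sketch, in Python, of the kind of per-column metrics the methodology lists. The slides describe the profiling methodology as proprietary and do not define the metrics, so the definitions below are assumptions: cardinality is taken as distinct non-null values over all rows, and effective cardinality as distinct non-null values over non-null rows. The function names and the list-of-dicts table layout are illustrative only.

```python
from collections import Counter


def profile_column(values):
    """Profile one column, given as a list where None stands for NULL.

    Assumes the non-null values in a column are mutually comparable.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    types = Counter(type(v).__name__ for v in non_null)
    return {
        # Most common Python type among non-null values (assumed proxy for datatype).
        "datatype": types.most_common(1)[0][0] if types else None,
        "pct_null": 100.0 * (total - len(non_null)) / total if total else 0.0,
        # Assumed definition: distinct values as a share of all rows.
        "cardinality": 100.0 * distinct / total if total else 0.0,
        # Assumed definition: distinct values as a share of non-null rows.
        "effective_cardinality": 100.0 * distinct / len(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts (one dict per row)."""
    columns = list(rows[0].keys()) if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}
```

Under these assumed definitions, a column that is entirely NULL reports 0% cardinality and no effective cardinality, which is how several columns appear in the profiling results.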
Data Flow Diagram
[Diagram: four groups of systems: 1. CCP Internal Systems, 2. NBCU Systems, 3. External Data Sources, 4. Vendors and Partners. Labeled elements: CCP SQL Server, APIs, Release Dates]
Analysis Performed: SQL Server
The system takes external metadata and uses it to “patch” together title data received from various systems, creating a more reliable dataset. Our primary concerns were:
• Data Quality & Source Integrity
• Update Frequency
• Data Complexity
[Diagram labels: Upstream Metadata Sources, CCP SQL Server, Downstream Reporting Capabilities]
Analysis Results: Metadata Staging
Table: RELEASE_DATES
Column Name               Is Nullable  Min       Max       Cardinality  Effective Cardinality  % NULL
Release_Date_ID           no           N/A       N/A       100%         100%                   0%
Prefix                    yes          N/A       N/A       0%           NULL                   100%
Title_ID                  no           N/A       N/A       4%           4%                     0%
Release_Date_Category_ID  no           N/A       N/A       0%           0%                     0%
Country_ID                yes          N/A       N/A       0%           0%                     29%
Language_ID               yes          N/A       N/A       0%           0%                     85%
Original_Network_Code     yes          N/A       N/A       0%           NULL                   100%
Licensee_ID               yes          N/A       N/A       0%           NULL                   100%
Season_Number             yes          1         2015      0%           0%                     73%
Episode_Name              yes          N/A       N/A       19%          69%                    72%
Episode_Number            yes          0         2210      0%           2%                     72%
Episode_Length            yes          N/A       N/A       0%           NULL                   100%
Comment                   yes          N/A       N/A       1%           5%                     86%
Date                      no           1/1/1900  1/1/3000  14%          14%                    0%
Is_Special                yes          N/A       N/A       0%           0%                     97%
Data Quality:
• Irregular Season/Episode naming conventions
• Improperly populated Release Dates
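As an illustration of how the two data quality issues could be flagged row by row, here is a hedged Python sketch. It assumes that the profile's minimum and maximum dates (1/1/1900 and 1/1/3000) are placeholder values rather than real release dates, and that a season number such as 2015 is a year entered in the season field; the slides do not confirm either reading. The cutoff `MAX_PLAUSIBLE_SEASON` and the function name are hypothetical.

```python
from datetime import date

# Assumption: the extreme dates in the RELEASE_DATES profile are placeholders.
SENTINEL_DATES = {date(1900, 1, 1), date(3000, 1, 1)}
# Hypothetical cutoff: season numbers above this are treated as suspicious.
MAX_PLAUSIBLE_SEASON = 100


def flag_release_record(record):
    """Return a list of issue labels for one RELEASE_DATES row (a dict)."""
    issues = []
    if record.get("Date") in SENTINEL_DATES:
        issues.append("placeholder date")
    season = record.get("Season_Number")
    if season is not None and season > MAX_PLAUSIBLE_SEASON:
        issues.append("season number looks like a year")
    return issues
```

A check like this only surfaces candidates for review; whether a flagged value is actually wrong depends on the source system's conventions.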
Analysis Results: Metadata Staging
General Observations:
• Looked at grain of title, country, language, category, season and episode, and others
• Records pulled from multiple sources lead to complexity…
– Duplicate release dates within titles
– Conflicting records within titles
[Chart: # of Records Ingested by External Source (Release Date): System 1: 51K, System 2: 48K, System 3: 40K, System 4: 14K]
[Chart: External Sources Per Title, buckets 1 to 4; labeled values: 2,584, 3,417, 352]
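The two complexity issues named above (duplicate and conflicting release dates within a title) can be sketched as a grouping check. This Python example is an assumption-laden illustration: the grain columns are borrowed from the RELEASE_DATES profile, a "duplicate" is taken to mean the same date repeated at the same grain, and a "conflict" is taken to mean more than one distinct date at the same grain. The slides do not state the rules actually used.

```python
from collections import defaultdict

# Assumed grain, using column names from the RELEASE_DATES profile.
GRAIN = (
    "Title_ID",
    "Country_ID",
    "Language_ID",
    "Release_Date_Category_ID",
    "Season_Number",
    "Episode_Number",
)


def find_duplicates_and_conflicts(records):
    """Group rows by grain and report keys with repeated or disagreeing dates.

    Each record is a dict with a "Date" entry; missing grain columns count as NULL.
    """
    dates_by_key = defaultdict(list)
    for r in records:
        key = tuple(r.get(c) for c in GRAIN)
        dates_by_key[key].append(r["Date"])
    # Duplicate: the same date appears more than once for one grain key.
    duplicates = [k for k, d in dates_by_key.items() if len(d) > len(set(d))]
    # Conflict: one grain key carries more than one distinct date.
    conflicts = [k for k, d in dates_by_key.items() if len(set(d)) > 1]
    return duplicates, conflicts
```

When several external sources feed the same title, a key can appear in both lists, for example two sources agreeing on one date and a third reporting another.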
Analysis Performed: MariaDB
CONSIDERATIONS
• Overall, this is an analysis of viewership and hits
• Accounts for matches against official, whitelisted, and licensed videos
• Outliers were not removed because doing so would expunge a large percentage of the match data
• Summary statistics indicated a left-leaning data set
[Diagram labels: Title Information, MariaDB, CCP SQL Server, Cassandra, Summarized Copyright Match Information]
[Histogram: X: Viewers per Video; Y: Count of Cases in Bucket]
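A small Python sketch of how such summary statistics and histogram buckets might be computed. If "left-leaning" means that most videos fall in the low-viewer buckets, that corresponds to a long right tail, where the mean sits well above the median; this reading is an assumption, as are the bucket edges and function names. Pearson's second skewness coefficient is used here only as a simple stand-in, not as the measure the presenters used.

```python
import statistics
from bisect import bisect_right


def skew_summary(values):
    """Return mean, median, population stdev and Pearson's second skewness coefficient."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.pstdev(values)
    # Pearson's second coefficient: 3 * (mean - median) / stdev.
    skew = 3 * (mean - median) / stdev if stdev else 0.0
    return {"mean": mean, "median": median, "stdev": stdev, "pearson_skew": skew}


def bucket_counts(values, edges):
    """Count values per bucket [edges[i], edges[i+1]); the last bucket is open-ended.

    `edges` must be sorted ascending; values below edges[0] are ignored.
    """
    counts = [0] * len(edges)
    for v in values:
        i = bisect_right(edges, v) - 1
        if i >= 0:
            counts[i] += 1
    return counts
```

Because outliers were kept, a handful of very large view counts can dominate the mean and standard deviation, so comparing the mean against the median is a quick way to see how far the tail pulls the summary.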
Analysis Results: Hits
Table: SMART_MATCH (copyright match data)
Column Name        Datatype   Nullable  % Non-Null  Standard Deviation  Min        Max
claim_type         varchar    YES       76.77%
asset_name         varchar    YES       92.78%
asset_type         varchar    YES       100.00%
video_title        varchar    YES       76.08%
reference_status   varchar    YES       65.68%
reference_length   int        YES       65.68%      3065.065455         18         18746
content_type       varchar    YES       65.68%
view_count         int        YES       76.08%      919084.1099         0          1.05E+09
duration           int        YES       76.08%      2228.763702         0          192887
video_total_match  int        YES       74.66%      1554.598552         0          38385
channel_title      varchar    YES       76.08%
claim_date         datetime   YES       100.00%                         12/7/2007  11/16/2015
video_upload_date  datetime   YES       76.08%                          8/9/2005   11/16/2015
licensed_content   tinyint    YES       76.08%      0.140196747         0          1
privacy            varchar    YES       76.08%
policy_name        varchar    YES       95.33%
match_percentage   int        YES       56.95%      48.10774725         0          32388
channel_comments   int        YES       55.04%      3625.070764         0          1213995
channel_videos     int        YES       55.04%      1518.373624         0          228113
season             int        YES       10.49%      72.64507792         1          2015
episode            int        YES       10.70%      78.35203659         1          4601
last_updated       timestamp  NO        100.00%                         11/16/2015 11/16/2015
Whitelisted        tinyint    YES       100.00%     0.099045453         0          1
official           tinyint    YES       100.00%     0.076134446         0          1
owner              varchar    YES       100.00%
Data Discrepancy:
• Reference length longer than actual video length
Data Limitation:
• Only the most recent upload date is displayed, and the value may actually be the date the video was published or made public
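The reference-length discrepancy can be checked with a simple row filter. This Python sketch assumes that `reference_length` and `duration` in SMART_MATCH are recorded in the same unit, which the slides do not state, and it skips rows where either value is NULL. The function name is hypothetical.

```python
def reference_longer_than_video(rows):
    """Return rows (dicts) whose reference_length exceeds the matched video's duration.

    Assumes both fields use the same unit; rows missing either value are skipped.
    """
    flagged = []
    for r in rows:
        ref = r.get("reference_length")
        dur = r.get("duration")
        if ref is not None and dur is not None and ref > dur:
            flagged.append(r)
    return flagged
```

Since roughly a third of rows have no reference length in the profile above, the share of flagged rows is best reported against rows where both values are present.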
Key Findings & Recommendations
Finding: Gaps in metadata make it difficult to understand and utilize collected data effectively
Recommendation: Streamline the metadata gathering and cleaning process, leveraging other metadata systems
Finding: Daily quotas and thresholds limit and distort the data pulled
Recommendation: Selectively pull data to circumvent daily quotas and potentially improve data integrity
Finding: Data integrity from some sources is questionable, and the incentive to improve it varies
Recommendation: Improve data processes, e.g., add data cleaning to certain data extract and aggregation (ETL) processes
Finding: Brand-specific workflows and fringe use cases hinder the ability to acquire metadata & accurately map references
Recommendation: Create a roadmap of brand and title match data cleanup for reporting needs, and a process to maintain data integrity
Challenge categories: Data Challenges, Tech Challenges, Organizational Challenges
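One way to read the "selectively pull data" recommendation is to rank what gets requested each day so the most important and stalest items fit within the quota. The Python sketch below is purely illustrative: the `priority` and `last_pulled` fields, the ranking rule, and the function name are all hypothetical and are not described in the slides.

```python
def plan_daily_pull(candidates, daily_quota):
    """Choose which items to request today without exceeding the quota.

    Each candidate is a dict with hypothetical fields:
      "id"          - identifier of the title or asset to pull
      "priority"    - larger means more important for reporting
      "last_pulled" - ISO date string (YYYY-MM-DD) of the previous pull
    Higher priority goes first; among equal priorities, the stalest goes first.
    """
    ranked = sorted(candidates, key=lambda c: (-c["priority"], c["last_pulled"]))
    return [c["id"] for c in ranked[:daily_quota]]
```

ISO date strings sort chronologically as plain text, which keeps the ranking key simple; a real scheduler would also need to account for whatever thresholds the upstream service enforces.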
Data Project Principles & Pitfalls
Maintenance is the Monster
• Initial creation of data solutions is often easier than long-term maintenance
Common issues
• Rapidly changing platforms, frameworks, and methodologies
• Need for continuous maintenance and verification of data quality
• Incentives and cultures vary across departments and companies
• Establishing and disseminating a “data stewardship” mentality
• Data “truth” changes over time and by business need
• Ongoing changes in individual consumer behavior, options for copyright owners
Technical / Non-Technical
Go Forward Plan
NBCU Next Steps
• Increased focus on and automation of data matching & data cleanup
• Enable better business unit segmentation of enterprise data
• Transition from organic to directed architecture
• Increased internal outreach

Big Data Day LA 2016, Use Case Driven track: Reliable Media Reporting in an Ever-Changing Data Landscape
Rachel Kelley, Project Manager, and Josh Andrews, Data & Analytics Architect, OnPrem; Eric Avila, Senior Anti-Piracy Technologist, NBCUniversal

  • 1. Reliable Media Reporting in an Ever-Changing Data Landscape
  • 2. Presenters Eric Avila, NBCU • NBCU Senior Technologist, Creative Content Protection Team Rachel Kelley, OnPrem • Senior Project Manager, Data & Analytics Practice Josh Andrews, OnPrem • Data Technology Lead/Architect, Data & Analytics Practice 2
  • 4. NBCU CCP Overview NBCU is one of the worlds largest entertainment companies Responsibilities of NBCU’s Creative Content Protection Group (CCP) CCP creates & manages technological solutions to these needs ♮ 4 Cable Television Broadcast Television Digital Parks Film
  • 5. OnPrem Solution Partners 5 Media & Entertainment Technology Consulting Firm Business Consulting Technology Leadership Applied Innovation Business Strategy Product Roadmap Process Improvement Change Management CRM Data & Analytics Digital Supply Chain PMO & SI Services Custom Solutions Enterprise App Development QA & Support UX/UI Los Angeles New York Austin ♬
  • 6. Problem Statement Problem Statement: • NBCU CCP wanted to obtain a better view of their data flow and process to manage asset identification and analytics Scope: • Data from streaming services regarding NBCU owned content • Priority data solutions in place within CCP and other NBCU teams Objectives: • Develop a data strategy around streaming services metadata • Investigate/define initial taxonomy, initiate data profiling, and develop data source list 6 ♬
  • 8. Project Background 8 • Lightweight digital identifier, easily referenced against fingerprints generated from other assets of its kind • Sent to vendors/partners & verified against uploaded content • Example: titles • Common data problem across industries: • Duplicates • Language • Quality • “Truth” changes over time and by business need • Oh hey, that’s my content you’ve got there… • Streaming services are triggered to associate content in video to ownership of reference asset ♮
  • 9. Time series data Analytic summaries Title metadata and fingerprinting Fingerprint, title and analytic data Systems in Place 9 Solutions which allow full Proof-of-Concept testing before full implementation, without licensing or contract constraints, have been easier to employ ♮
  • 10. Methodology 10 ♯ Identify relevant systems and tables from stakeholders & obtain access to databases Determine table purpose and population source Generate fundamental metrics for all columns, using proprietary data profiling methodology, e.g.: Datatype, Scale, Cardinality Review metrics for outstanding measures Generate further questions for investigation Data Profiling Methodology Project Stats • 18+ data systems encountered • 9 stakeholder interviews • 32 data profiling reports run • 8 weeks
  • 11. Data Flow Diagram
      Diagram regions: 1. CCP Internal Systems; 2. NBCU Systems; 3. External Data Sources; 4. Vendors and Partners
      Diagram labels: CCP SQL Server, APIs, Release Dates
  • 12. Analysis Performed: SQL Server
      Because the system uses external metadata to “patch” together title data received from various systems into a more reliable dataset, our primary concerns were:
      • Data Quality & Source Integrity
      • Update Frequency
      • Data Complexity
      Diagram labels: Upstream Metadata Sources, CCP SQL Server, Downstream Reporting Capabilities
  • 13. Analysis Results: Metadata Staging
      Table: RELEASE_DATES

      Column Name               Is Nullable  Min       Max       Cardinality  Effective Cardinality  % NULL
      Release_Date_ID           no           N/A       N/A       100%         100%                   0%
      Prefix                    yes          N/A       N/A       0%           NULL                   100%
      Title_ID                  no           N/A       N/A       4%           4%                     0%
      Release_Date_Category_ID  no           N/A       N/A       0%           0%                     0%
      Country_ID                yes          N/A       N/A       0%           0%                     29%
      Language_ID               yes          N/A       N/A       0%           0%                     85%
      Original_Network_Code     yes          N/A       N/A       0%           NULL                   100%
      Licensee_ID               yes          N/A       N/A       0%           NULL                   100%
      Season_Number             yes          1         2015      0%           0%                     73%
      Episode_Name              yes          N/A       N/A       19%          69%                    72%
      Episode_Number            yes          0         2210      0%           2%                     72%
      Episode_Length            yes          N/A       N/A       0%           NULL                   100%
      Comment                   yes          N/A       N/A       1%           5%                     86%
      Date                      no           1/1/1900  1/1/3000  14%          14%                    0%
      Is_Special                yes          N/A       N/A       0%           0%                     97%

      Data Quality:
      • Irregular Season/Episode naming conventions
      • Improperly populated Release Dates
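The two data quality issues above could be flagged row by row with a check like the following Python sketch. It is illustrative only: the placeholder dates are taken from the Min and Max of the Date column in the table, the threshold `MAX_PLAUSIBLE_SEASON` is an assumed value, and the function name `flag_release_date_row` is hypothetical.

```python
from datetime import date

# The table's Date column ranges from 1/1/1900 to 1/1/3000; treating both
# endpoints as placeholder values is an assumption, not a confirmed rule.
PLACEHOLDER_DATES = {date(1900, 1, 1), date(3000, 1, 1)}

# Assumed threshold: a season number such as 2015 is more likely a year.
MAX_PLAUSIBLE_SEASON = 100


def flag_release_date_row(row):
    """Return a list of data quality flags for one RELEASE_DATES row."""
    flags = []
    if row.get("Date") in PLACEHOLDER_DATES:
        flags.append("placeholder_date")
    season = row.get("Season_Number")
    if season is not None and season > MAX_PLAUSIBLE_SEASON:
        flags.append("implausible_season")
    return flags


if __name__ == "__main__":
    print(flag_release_date_row({"Date": date(1900, 1, 1), "Season_Number": 2015}))
    print(flag_release_date_row({"Date": date(2014, 9, 22), "Season_Number": 3}))
```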
  • 14. Analysis Results: Metadata Staging
      General Observations:
      • Looked at grain of title, country, language, category, season and episode, and others
      • Records pulled from multiple sources lead to complexity…
          – Duplicate release dates within titles
          – Conflicting records within titles
      Chart: # of Records Ingested by External Source (Release Date): System 1: 51K; System 2: 48K; System 3: 40K; System 4: 14K
      Chart: External Sources Per Title: axis categories 1, 2, 3, 4; values shown: 2,584, 3,417, 352
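A check for duplicate and conflicting release dates at the grain described above might look like the following Python sketch. The grain columns are taken from the RELEASE_DATES column names, but the exact grain used in the analysis is not stated, so `GRAIN` and the function name `find_duplicates_and_conflicts` are assumptions. A duplicate here means the same date appearing more than once for one grain key; a conflict means more than one distinct date for the same key.

```python
from collections import defaultdict

# Assumed grain, based on the slide's "title, country, language, category,
# season and episode"; the real analysis also looked at "others".
GRAIN = (
    "Title_ID",
    "Country_ID",
    "Language_ID",
    "Release_Date_Category_ID",
    "Season_Number",
    "Episode_Number",
)


def find_duplicates_and_conflicts(rows):
    """Group rows by grain and report duplicate and conflicting dates.

    Returns (duplicates, conflicts):
      duplicates maps a grain key to the number of redundant rows,
      conflicts maps a grain key to the sorted distinct dates seen.
    """
    dates_by_key = defaultdict(list)
    for row in rows:
        key = tuple(row.get(col) for col in GRAIN)
        dates_by_key[key].append(row["Date"])

    duplicates, conflicts = {}, {}
    for key, dates in dates_by_key.items():
        distinct = set(dates)
        if len(dates) > len(distinct):
            duplicates[key] = len(dates) - len(distinct)
        if len(distinct) > 1:
            conflicts[key] = sorted(distinct)
    return duplicates, conflicts


if __name__ == "__main__":
    sample = [
        {"Title_ID": 1, "Country_ID": 10, "Date": "2014-09-22"},
        {"Title_ID": 1, "Country_ID": 10, "Date": "2014-09-22"},
        {"Title_ID": 1, "Country_ID": 10, "Date": "2014-10-01"},
        {"Title_ID": 2, "Country_ID": 10, "Date": "2015-01-05"},
    ]
    dups, confs = find_duplicates_and_conflicts(sample)
    print("duplicates:", dups)
    print("conflicts:", confs)
```

Dates are kept as ISO strings in the example so that sorting is chronological; `datetime.date` values would work the same way.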
  • 15. Analysis Performed: MariaDB
      CONSIDERATIONS
      • Overall, this is an analysis of viewership and hits
      • Accounts for matches against official, whitelisted, and licensed videos
      • Outliers were not removed because doing so would expunge a large percentage of the match data
      • Summary statistics indicated a left-leaning data set
      Diagram labels: Title Information, MariaDB, CCP SQL Server, Cassandra, Summarized Copyright Match Information
      Chart axes: X: Viewers per Video; Y: Count of Cases in Bucket
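The considerations above can be illustrated with a short Python sketch that summarizes a view-count distribution without dropping outliers. It assumes that "left-leaning" means most videos sit at low view counts with a long tail of very high values, which would show up as a mean well above the median and a positive skewness coefficient. The power-of-ten bucketing and the function names `bucket_counts` and `sample_skewness` are illustrative choices, not the method used in the analysis.

```python
from collections import Counter
from statistics import mean, median


def bucket_counts(view_counts):
    """Count cases per power-of-ten bucket (0, 1-9, 10-99, ...).

    Assumes non-negative integers. Each bucket is keyed by its lower bound,
    so extreme values stay in the summary instead of being removed.
    """
    counts = Counter()
    for views in view_counts:
        lower_bound = 0 if views == 0 else 10 ** (len(str(views)) - 1)
        counts[lower_bound] += 1
    return counts


def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (0.0 if no variance)."""
    n = len(values)
    center = mean(values)
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


if __name__ == "__main__":
    views = [0, 1, 1, 2, 3, 5, 8, 1000]
    print("buckets:", sorted(bucket_counts(views).items()))
    print("mean:", mean(views), "median:", median(views))
    print("skewness:", sample_skewness(views))
```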
  • 16. Analysis Results: Hits
      Table: SMART_MATCH (copyright match data)

      Column Name        Datatype   Nullable  % Non-Null  Standard Deviation  Min         Max
      claim_type         varchar    YES       76.77%
      asset_name         varchar    YES       92.78%
      asset_type         varchar    YES       100.00%
      video_title        varchar    YES       76.08%
      reference_status   varchar    YES       65.68%
      reference_length   int        YES       65.68%      3065.065455         18          18746
      content_type       varchar    YES       65.68%
      view_count         int        YES       76.08%      919084.1099         0           1.05E+09
      duration           int        YES       76.08%      2228.763702         0           192887
      video_total_match  int        YES       74.66%      1554.598552         0           38385
      channel_title      varchar    YES       76.08%
      claim_date         datetime   YES       100.00%                         12/7/2007   11/16/2015
      video_upload_date  datetime   YES       76.08%                          8/9/2005    11/16/2015
      licensed_content   tinyint    YES       76.08%      0.140196747         0           1
      privacy            varchar    YES       76.08%
      policy_name        varchar    YES       95.33%
      match_percentage   int        YES       56.95%      48.10774725         0           32388
      channel_comments   int        YES       55.04%      3625.070764         0           1213995
      channel_videos     int        YES       55.04%      1518.373624         0           228113
      season             int        YES       10.49%      72.64507792         1           2015
      episode            int        YES       10.70%      78.35203659         1           4601
      last_updated       timestamp  NO        100.00%                         11/16/2015  11/16/2015
      Whitelisted        tinyint    YES       100.00%     0.099045453         0           1
      official           tinyint    YES       100.00%     0.076134446         0           1
      owner              varchar    YES       100.00%

      Data Discrepancy:
      • Reference length longer than actual video length
      Data Limitation:
      • Only the most recent upload date is displayed, and the value may actually be the date of publishing or of being made public
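The discrepancy noted above can be detected with a simple row-level check, sketched below in Python. It assumes `reference_length` and `duration` are stored in the same unit, which the slide does not state. The second check, for `match_percentage` values outside 0 to 100, is an addition prompted by the Max of 32388 in the table and is not a finding listed on the slide. The function name `find_match_discrepancies` is hypothetical.

```python
def find_match_discrepancies(rows):
    """Return (row_index, issue) pairs for suspicious SMART_MATCH rows.

    NULLs (None) are skipped, since most of these columns are nullable.
    """
    issues = []
    for index, row in enumerate(rows):
        reference_length = row.get("reference_length")
        duration = row.get("duration")
        match_percentage = row.get("match_percentage")

        # Assumes both lengths use the same unit (for example, seconds).
        if (
            reference_length is not None
            and duration is not None
            and reference_length > duration
        ):
            issues.append((index, "reference_longer_than_video"))

        # A percentage is expected to fall between 0 and 100 inclusive.
        if match_percentage is not None and not 0 <= match_percentage <= 100:
            issues.append((index, "match_percentage_out_of_range"))
    return issues


if __name__ == "__main__":
    sample = [
        {"reference_length": 2640, "duration": 300, "match_percentage": 95},
        {"reference_length": 1320, "duration": 1320, "match_percentage": 32388},
        {"reference_length": None, "duration": 600, "match_percentage": None},
    ]
    for index, issue in find_match_discrepancies(sample):
        print(index, issue)
```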
  • 18. Key Findings & Recommendations
      • Finding: Gaps in metadata make it difficult to understand and use collected data effectively
        Recommendation: Streamline the metadata gathering and cleaning process, leveraging other metadata systems
      • Finding: Daily quotas and thresholds limit and distort the data pulled
        Recommendation: Selectively pull data to circumvent daily quotas and potentially improve data integrity
      • Finding: Data integrity from some sources is questionable, and the incentive to improve it varies
        Recommendation: Improve data processes, e.g., add data cleaning to certain data extract and aggregation (ETL) processes
      • Finding: Brand-specific workflows and fringe use cases hinder the ability to acquire metadata and accurately map references
        Recommendation: Create a roadmap for cleaning up brand and title match data for reporting needs, plus a process to maintain data integrity
      Challenge categories shown on the slide: Data Challenges, Tech Challenges, Organizational Challenges
  • 19. Data Project Principles & Pitfalls
      Maintenance is the Monster
      • Initial creation of data solutions is often easier than long-term maintenance
      Common issues (labeled Technical and Non-Technical on the slide)
      • Rapidly changing platforms, frameworks, and methodologies
      • Need for continuous maintenance and verification of data quality
      • Incentives and cultures vary across departments and companies
      • Establishing and disseminating a “data stewardship” mentality
      • Data “truth” changes over time and by business need
      • Ongoing changes in individual consumer behavior, options for copyright owners
  • 20. Go Forward Plan
      NBCU Next Steps
      • Increased focus on and automation of data matching & data cleanup
      • Enable better business unit segmentation of enterprise data
      • Transition from organic to directed architecture
      • Increased internal outreach