Using a Hadoop Data Pipeline to Build a Graph of Users and Content
Hadoop Summit - June 29, 2011
Bill Graham, bill.graham@cbs.com
About me
- Principal Software Engineer
- Technology, Business & News BU (TBN)
- TBN Platform Infrastructure Team
- Background in SW Systems Engineering and Integration Architecture
- Contributor: Pig, Hive, HBase
- Committer: Chukwa
About CBSi – who are we?
- ENTERTAINMENT
- GAMES & MOVIES
- SPORTS
- TECH, BIZ & NEWS
- MUSIC
About CBSi - scale
- Top 10 global web property
- 235M worldwide monthly uniques [1]
- Hadoop ecosystem: CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading
- Cluster size:
  - Current workers: 35 DW + 6 TBN (150TB)
  - Next quarter: 100 nodes (500TB)
- DW peak processing: 400M events/day globally
[1] Source: comScore, March 2011
Abstract
At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.
The Problem
- Users are always voting on what they find interesting
  - Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc.
- Users have multiple identities
  - Anonymous
  - Registered (logged in)
  - Social
  - Multiple devices
- Connections between entities are in siloed sub-graphs
- Wealth of valuable user connectedness going unrealized
The Goal
- Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to:
  - Content
  - Authors
  - Each other
  - Themselves
- Better understand how our users connect to our content
- Improved content recommendations
- Improved user segmentation and content/ad targeting
Requirements
- Integrate with existing DW/BI Hadoop infrastructure
- Aggregate data from across CBSi and beyond
- Connect disjointed user identities
- Flexible data model
- Assemble graph of relationships
- Enable rapid experimentation, data mining and hypothesis testing
- Power new site features and advertising optimizations
The Approach
- Mirror data into HBase
- Use MapReduce to process data
- Export RDF data into a triple store
Data Flow
[Diagram. Labels: Social/UGC Systems, DW Systems, CMS Systems, Content Tagging Systems; Site Activity Stream a.k.a. Firehose (JMS); atomic writes; HDFS; bulk load; ImportTsv; HBase; MapReduce; transform & load; RDF; Triple Store; SPARQL; Site; CMS Publishing]
NOSQL Data Models
[Chart: key-value stores, ColumnFamily, document databases and graph databases plotted by data size against data complexity. Credit: Emil Eifrem, Neotechnology]
Conceptual Graph
[Diagram. Nodes: PageEvent, Brand, SessionId, regId, anonId, Asset, Author, Product, Story, tag. Edge labels: contains, is also, had session, follow, like, authored by, tagged with. Data sources: Activity firehose (real-time), CMS (batch + incr.), Tags (batch), DW (daily)]
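The "is also" edges in the conceptual graph are what connect a user's anonymous, registered and session identities. The deck does not say how those identities are merged, so the following is only a minimal sketch of one common technique (union-find), written in Python for illustration. The class name, method names and identifier strings are hypothetical and do not come from the original pipeline.

```python
from collections import defaultdict


class IdentityGraph:
    """Union-find over identifier strings; each 'is also' edge merges two sets."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Register unseen identifiers as their own root.
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[ident] != root:
            self._parent[ident], ident = root, self._parent[ident]
        return root

    def is_also(self, a, b):
        """Record that identifiers a and b belong to the same person."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_user(self, a, b):
        """True when a and b have been linked, directly or transitively."""
        return self._find(a) == self._find(b)

    def clusters(self):
        """Return one set of identifiers per resolved user."""
        groups = defaultdict(set)
        for ident in list(self._parent):
            groups[self._find(ident)].add(ident)
        return list(groups.values())


# Hypothetical identifiers, for illustration only.
graph = IdentityGraph()
graph.is_also("anonId:A1", "sessionId:S1")
graph.is_also("sessionId:S1", "regId:R1")
graph.is_also("anonId:A2", "regId:R2")
```

Here "anonId:A1" and "regId:R1" end up in the same cluster because both are linked through "sessionId:S1", while "anonId:A2" stays in a separate one.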
HBase Schema
- user_info table
HBase Loading
- Incremental
  - Consuming from a JMS queue == real-time
- Batch
  - Pig’s HBaseStorage == quick to develop & iterate
  - HBase’s ImportTsv == more efficient
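For the ImportTsv batch path, the input is tab-separated text whose column order matches the mapping passed through ImportTsv's importtsv.columns option, with HBASE_ROW_KEY marking the row-key position. A minimal Python sketch of preparing such lines follows. The helper names and the column names in the usage note are assumptions for illustration, not the schema used in the talk.

```python
def to_tsv_line(row_key, values):
    """Join a row key and its column values into one tab-separated line."""
    fields = [row_key, *values]
    for field in fields:
        # A raw tab or newline inside a value would shift or split columns.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


def importtsv_columns(qualified_columns):
    """Build the value for importtsv.columns, with the row key first."""
    return "HBASE_ROW_KEY," + ",".join(qualified_columns)
```

For example, to_tsv_line("ANON-1", ["LIKE", "www.facebook.com"]) pairs with importtsv_columns(["event:type", "event:ssite"]), where "event:type" and "event:ssite" are hypothetical column names.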
Generating RDF with Pig
- RDF [1] is a W3C standard for representing subject-predicate-object relationships, commonly serialized as XML
- Philosophy: store large amounts of data in Hadoop, be selective about what goes into the triple store
- For example:
  - “First class” graph citizens we plan to query on
  - Implicit to explicit (i.e., derived) connections:
    - Content recommendations
    - User segments
    - Related users
    - Content tags
- Easily join data to create new triples with Pig
- Run SPARQL [2] queries, examine, refine, reload
[1] http://www.w3.org/RDF
[2] http://www.w3.org/TR/rdf-sparql-query
Example Pig RDF Script
Create RDF triples of users to social events:

    -- mapToBag, jsonToMap and GenerateRDFTriple are UDFs; their definitions
    -- and REGISTER statements are not shown here.
    RAW = LOAD 'hbase://user_info'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
        AS (id:bytearray, event_map:map[]);

    -- Convert our maps to bags so we can flatten them out
    A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);

    -- Convert the JSON events into maps
    B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];

    -- Pull values from map
    C = FOREACH B GENERATE id,
        social_map#'levt.asid' AS asid,
        social_map#'levt.xastid' AS astid,
        social_map#'levt.event' AS event,
        social_map#'levt.eventt' AS eventt,
        social_map#'levt.ssite' AS ssite,
        social_map#'levt.ts' AS eventtimestamp;

    EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
        'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp);

    STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();
Example SPARQL query
Recommend content based on Facebook “liked” items:

    SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2 WHERE {
      # anon-user who Like'd a content asset (news item, blog post) on Facebook
      <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
      ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
      ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
      ?x <urn:com.cbs.trident:tasset> ?asset1 .
      ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .

      # a tag associated with the content asset
      ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
      ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .

      # other content assets with the same tag and their title
      ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 .
      FILTER (?asset2 != ?asset1)
      ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
      ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
      ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 .
      FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
    }
    ORDER BY DESC (?pubdt2)
    LIMIT 10
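To make the logic of the query easier to follow, here is a rough Python sketch of the same idea over in-memory dictionaries: start from a liked asset, find other assets sharing a tag name, keep those published on or after a cutoff, and return the newest first. It is an illustration under assumed data structures, not how the triple store evaluates the query. The function name, the asset IDs and the titles are hypothetical.

```python
from datetime import datetime


def recommend(liked_asset, asset_tags, asset_meta, since, limit=10):
    """Return (tag, other_asset, title, published) rows, newest first.

    asset_tags maps asset id -> set of tag names.
    asset_meta maps asset id -> (title, published datetime).
    """
    rows = []
    liked_tags = asset_tags.get(liked_asset, set())
    for other, tags in asset_tags.items():
        if other == liked_asset:
            continue  # mirrors FILTER (?asset2 != ?asset1)
        title, published = asset_meta[other]
        if published < since:
            continue  # mirrors the FILTER on ?pubdt2
        for tag in sorted(liked_tags & tags):
            rows.append((tag, other, title, published))
    # mirrors ORDER BY DESC (?pubdt2) LIMIT 10
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:limit]


# Hypothetical sample data, for illustration only.
asset_tags = {
    "a1": {"hadoop", "hbase"},
    "a2": {"hadoop"},
    "a3": {"hbase"},
    "a4": {"music"},
}
asset_meta = {
    "a1": ("Liked story", datetime(2011, 2, 1)),
    "a2": ("Pig tips", datetime(2011, 3, 1)),
    "a3": ("Old HBase post", datetime(2010, 6, 1)),
    "a4": ("Album review", datetime(2011, 4, 1)),
}
results = recommend("a1", asset_tags, asset_meta, since=datetime(2011, 1, 1))
```

With this sample data only "a2" is recommended: "a3" shares a tag but was published before the cutoff, and "a4" shares no tag with the liked asset.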
Conclusions I - Power and Flexibility
- Architecture is flexible with respect to:
  - Data modeling
  - Integration patterns
  - Data processing, querying techniques
- Multiple approaches for graph traversal:
  - SPARQL
  - Traverse HBase
  - MapReduce
Conclusions II – Match Tool with the Job
- Hadoop – scale and computing horsepower
- HBase – atomic r/w access, speed, flexibility
- RDF Triple Store – complex graph querying
- Pig – rapid MR prototyping and ad-hoc analysis
- Future:
  - HCatalog – schema & table management
  - Oozie or Azkaban – workflow engine
  - Mahout – machine learning
  - Hama – graph processing
Conclusions III – OSS, woot!
- If it doesn’t do what you want, submit a patch.

Using Hadoop to Build a Graph of Users and Content Connections

  • 1. Using a Hadoop Data Pipeline to Build a Graph of Users and Content Hadoop Summit - June 29, 2011 Bill Graham bill.graham@cbs.com
  • 2. About me Principal Software Engineer Technology, Business & News BU (TBN) TBN Platform Infrastructure Team Background in SW Systems Engineering and Integration Architecture Contributor: Pig, Hive, HBase Committer: Chukwa
  • 3. About CBSi – who are we? ENTERTAINMENT GAMES & MOVIES SPORTS TECH, BIZ & NEWS MUSIC
  • 4. About CBSi - scale Top 10 global web property 235M worldwide monthly uniques1 Hadoop Ecosystem CDH3, Pig, Hive, HBase, Chukwa, Oozie, Sqoop, Cascading Cluster size: Currently workers: 35 DW + 6 TBN (150TB) Next quarter: 100 nodes (500TB) DW peak processing: 400M events/day globally 1 - Source: comScore, March 2011
  • 5. Abstract At CBSi we’re developing a scalable, flexible platform to provide the ability to aggregate large volumes of data, to mine it for meaningful relationships and to produce a graph of connected users and content. This will enable us to better understand the connections between our users, our assets, and our authors.
  • 6. The Problem User always voting on what they find interesting Got-it, want-it, like, share, follow, comment, rate, review, helpful vote, etc. Users have multiple identities Anonymous Registered (logged in) Social Multiple devices Connections between entities are in silo-ized sub-graphs Wealth of valuable user connectedness going unrealized
  • 7. The Goal Create a back-end platform that enables us to assemble a holistic graph of our users and their connections to: Content Authors Each other Themselves Better understand how our users connect to our content Improved content recommendations Improved user segmentation and content/ad targeting
  • 8. Requirements Integrate with existing DW/BI Hadoop Infrastructure Aggregate data from across CBSi and beyond Connect disjointed user identities Flexible data model Assemble graph of relationships Enable rapid experimentation, data mining and hypothesis testing Power new site features and advertising optimizations
  • 9. The Approach Mirror data into HBase Use MapReduce to process data Export RDF data into a triple store
  • 10.
  • 11. [Data flow diagram. Labels: ImportTsv; atomic writes; transform & load; Social/UGC Systems; DW Systems; HDFS bulk load; CMS Systems; Content Tagging Systems]
  • 12. NOSQL Data Models [Plot of data size versus data complexity: key-value stores, ColumnFamily, document databases, graph databases] Credit: Emil Eifrem, Neotechnology
  • 13. Conceptual Graph [Diagram. Nodes: PageEvent, Brand, SessionId, regId, anonId, Asset, Author, Product, Story, tag. Edges: contains, is also, had session, follow, like, authored by, tagged with. Data sources: Activity firehose (real-time), CMS (batch + incr.), Tags (batch), DW (daily)]
  • 15. HBase Loading. Incremental: consuming from a JMS queue == real-time. Batch: Pig’s HBaseStorage == quick to develop & iterate; HBase’s ImportTsv == more efficient.
  • 16. Generating RDF with Pig. RDF [1] is a W3C standard, often serialized as XML, to represent subject-predicate-object relationships. Philosophy: store large amounts of data in Hadoop, be selective of what goes into the triple store. For example: “first class” graph citizens we plan to query on; implicit to explicit (i.e., derived) connections: content recommendations, user segments, related users, content tags. Easily join data to create new triples with Pig. Run SPARQL [2] queries, examine, refine, reload. [1] http://www.w3.org/RDF, [2] http://www.w3.org/TR/rdf-sparql-query
  • 17. Example Pig RDF Script. Create RDF triples of users to social events:

      RAW = LOAD 'hbase://user_info'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event:*', '-loadKey true')
          AS (id:bytearray, event_map:map[]);

      -- Convert our maps to bags so we can flatten them out
      A = FOREACH RAW GENERATE id, FLATTEN(mapToBag(event_map)) AS (social_k, social_v);

      -- Convert the JSON events into maps
      B = FOREACH A GENERATE id, social_k, jsonToMap(social_v) AS social_map:map[];

      -- Pull values from map
      C = FOREACH B GENERATE id,
          social_map#'levt.asid' AS asid,
          social_map#'levt.xastid' AS astid,
          social_map#'levt.event' AS event,
          social_map#'levt.eventt' AS eventt,
          social_map#'levt.ssite' AS ssite,
          social_map#'levt.ts' AS eventtimestamp;

      EVENT_TRIPLE = FOREACH C GENERATE GenerateRDFTriple(
          'USER-EVENT', id, astid, asid, event, eventt, ssite, eventtimestamp);

      STORE EVENT_TRIPLE INTO 'trident/rdf/out/user_event' USING PigStorage();
  • 18. Example SPARQL query. Recommend content based on Facebook “liked” items:

      SELECT ?asset1 ?tagname ?asset2 ?title2 ?pubdt2 WHERE {
        # anon-user who Like'd a content asset (news item, blog post) on Facebook
        <urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ> <urn:com.cbs.trident:event:LIKE> ?x .
        ?x <urn:com.cbs.trident:eventt> "SOCIAL_SITE" .
        ?x <urn:com.cbs.trident:ssite> "www.facebook.com" .
        ?x <urn:com.cbs.trident:tasset> ?asset1 .
        ?asset1 a <urn:com.cbs.rb.contentdb:content_asset> .
        # a tag associated with the content asset
        ?asset1 <urn:com.cbs.cnb.bttrax:tag> ?tag1 .
        ?tag1 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
        # other content assets with the same tag and their title
        ?asset2 <urn:com.cbs.cnb.bttrax:tag> ?tag2 .
        FILTER (?asset2 != ?asset1)
        ?tag2 <urn:com.cbs.cnb.bttrax:tagname> ?tagname .
        ?asset2 <http://www.w3.org/2005/Atom#title> ?title2 .
        ?asset2 <http://www.w3.org/2005/Atom#published> ?pubdt2 .
        FILTER (?pubdt2 >= "2011-01-01T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
      }
      ORDER BY DESC (?pubdt2) LIMIT 10
  • 19. Conclusions I - Power and Flexibility. Architecture is flexible with respect to: data modeling; integration patterns; data processing and querying techniques. Multiple approaches for graph traversal: SPARQL; traverse HBase; MapReduce.
  • 20. Conclusions II – Match Tool with the Job. Hadoop: scale and computing horsepower. HBase: atomic r/w access, speed, flexibility. RDF Triple Store: complex graph querying. Pig: rapid MR prototyping and ad-hoc analysis. Future: HCatalog (schema & table management); Oozie or Azkaban (workflow engine); Mahout (machine learning); Hama (graph processing).
  • 21. Conclusions III – OSS, woot! If it doesn’t do what you want, submit a patch.
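
As a rough illustration of the kind of RDF string construction that the GenerateRDFTriple UDF on slide 17 abstracts away, the Python sketch below formats one user event as N-Triples lines. It is not the actual UDF, whose output format is not shown in the slides: the function names, the choice of N-Triples, and the urn:example:... event and asset identifiers are all assumptions made for illustration. Only the urn:com.cbs.trident predicates and the anonymous-user URN are taken from the SPARQL query on slide 18.

```python
# Predicate namespace as it appears in the SPARQL query on slide 18.
TRIDENT = "urn:com.cbs.trident"


def n_triple(subject, predicate, obj, literal=False):
    """Format one N-Triples line; obj is either an IRI or a plain string literal."""
    if literal:
        # Escape backslashes and double quotes inside the literal.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        term = '"' + escaped + '"'
    else:
        term = "<" + obj + ">"
    return "<{}> <{}> {} .".format(subject, predicate, term)


def user_event_triples(user_urn, event_urn, event, event_type, site, asset_urn):
    """Hypothetical helper: link a user to an event node and describe that node."""
    return [
        n_triple(user_urn, TRIDENT + ":event:" + event, event_urn),
        n_triple(event_urn, TRIDENT + ":eventt", event_type, literal=True),
        n_triple(event_urn, TRIDENT + ":ssite", site, literal=True),
        n_triple(event_urn, TRIDENT + ":tasset", asset_urn),
    ]


triples = user_event_triples(
    "urn:com.cbs.dwh:ANON-Cg8JIU14kobSAAAAWyQ",  # user URN from slide 18
    "urn:example:event:1",                       # hypothetical event node
    "LIKE",
    "SOCIAL_SITE",
    "www.facebook.com",
    "urn:example:asset:42",                      # hypothetical asset node
)
for line in triples:
    print(line)
```

Triples shaped like these would match the first four patterns of the slide 18 query, which is the sense in which an implicit connection (a logged event) becomes an explicit, queryable edge.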

Editor's notes

  1. CBSi has a number of brands; this slide shows the biggest ones. I’m in the TBN group and the work I’ll present is being done for CNET, with the intent to be extended horizontally.
  2. We have a lot of traffic and data. We’ve been using Hadoop quite extensively for a few years now. 135/150TB currently, soon to be 500TB.
  3. Summarize what I’ll discuss
  4. We do a number of these items already, but in disparate systems.
  5. Simplified overview of the approach. Details to be discussed on the next data flow slide.
  6. Multiple data load options: bulk, real-time, incremental update. MapReduce to examine data. Export data to RDF in the triple store. Analysts and engineers can access HBase or MR to explore data. For now we’re using various triple stores for experimentation; we haven’t done a full evaluation yet. Technology for triple store or graph store still TBD.
  7. The slope of this plot is subjective, but conceptually this is the case. HBase would be in the upper left quadrant and a graph store would be in the lower right. Our solution leverages the strength of each and we use MR to go from one to the other.
  8. Just an example of a graph we can build. The graph can be adapted to meet use cases. Anonymous user has relationships to other identities, as well as assets that he/she interacts with. The graph is built from items from different data sources: blue=firehose, orange=CMS, green=tagging systems, red=DW
  9. Simple schema. 1..* for both aliases and events.
  10. The next few slides will walk through some specifics of the data flow. How do we get data into HBase? One of the nice things about HBase is that it supports a number of techniques to load data.
  11. Once data is in HBase, we selectively build RDF relationships to store in the triple store. Pig allows for easy iteration.
  12. One of our simpler scripts. It’s 6 Pig statements to generate this set of RDF. We have a UDF to abstract out the RDF string construction.
  13. Recommend the most recent blog content that is tagged with the same tags as the user’s FB like.
  14. We’re going to need to support a number of use cases and integration patterns. This approach allows us to have multiple options on the table for each.
  15. We want to be able to create a graph and effectively query it, but we also want to be able to do ad-hoc analytics and experimentation over the entire corpus of entities.
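
To make the logic of the recommendation query from slide 18 concrete (the most recent content sharing a tag with something the user liked), here is a minimal Python sketch over in-memory dictionaries. It is only an illustration of the join-filter-sort steps, not how the triple store evaluates SPARQL; the function name recommend, the sample asset ids, tags, and dates are all made up for the example.

```python
from datetime import datetime


def recommend(liked_asset, asset_tags, published, since, limit=10):
    """Return (asset, shared_tag, published) rows, most recent first.

    Mirrors the query's steps: find tags of the liked asset, find other
    assets carrying the same tag, keep those published on or after
    `since`, order by publication date descending, and cap at `limit`.
    """
    liked_tags = asset_tags.get(liked_asset, set())
    rows = []
    for asset, tags in asset_tags.items():
        if asset == liked_asset:  # FILTER (?asset2 != ?asset1)
            continue
        if published[asset] < since:  # FILTER (?pubdt2 >= ...)
            continue
        for tag in sorted(tags & liked_tags):
            rows.append((asset, tag, published[asset]))
    rows.sort(key=lambda row: row[2], reverse=True)  # ORDER BY DESC(?pubdt2)
    return rows[:limit]


# Hypothetical sample data.
asset_tags = {
    "a1": {"hadoop", "hbase"},
    "a2": {"hadoop"},
    "a3": {"hbase", "pig"},
    "a4": {"pig"},
    "a5": {"hadoop"},
}
published = {
    "a1": datetime(2011, 3, 1),
    "a2": datetime(2011, 5, 1),
    "a3": datetime(2011, 2, 1),
    "a4": datetime(2011, 6, 1),
    "a5": datetime(2010, 12, 1),
}

results = recommend("a1", asset_tags, published, since=datetime(2011, 1, 1))
for asset, tag, when in results:
    print(asset, tag, when.date())
```

In this sample, a4 is skipped because it shares no tag with a1, and a5 is skipped because it was published before the cutoff.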