Randomly Sampling YouTube Users:
An Introduction to the Random Prefix Sampling Method

Cheng-Jun Wang
Web Mining Lab
City University of Hong Kong
20121225
YouTube growth curve




http://singularityhub.com/2012/05/25/now-serving-the-latest-in-exponential-growth-youtube/



https://gdata.youtube.com/feeds/api/standardfeeds/most_recent
Contents
Plan A: Sampling Users

∗ Unfortunately, YouTube’s user identifiers do not follow a
  standard format; they are user-specified strings. We were
  therefore unable to create a random sample of YouTube users.

  Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
Plan B: Sampling Videos

∗ Using the YouTube search API, Zhou et al. developed a random
  prefix sampling method and estimated that there were roughly
  500 million YouTube videos by May 2011.
∗ Sample the videos first, and then find the respective users.




  Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Get proportional users?

∗ Limitation: selection bias towards users who upload more
  videos. Therefore, weighting each user by the inverse of their
  number of videos (scaled by the max value) is necessary to get
  a random sample of YouTube users.
∗ Is it possible?



[Figure: videos crawled and users detected]
Before weighting:

User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1

After Weight Cases:

User ID   Video Num   Active Days
1         10          20
2         5           15
2         5           15
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
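The weighting step in the table above can be sketched in Python. This is a minimal illustration, assuming the weight factor is the maximum video count divided by each user's video count, rounded to a whole number of replications; the names `users` and `weight_cases` are hypothetical and not part of the original slides.

```python
# Each tuple is (user_id, video_num, active_days), taken from the table above.
users = [(1, 10, 20), (2, 5, 15), (3, 1, 1)]

def weight_cases(rows):
    """Replicate each user's row by max(video_num) / video_num."""
    max_videos = max(video_num for _, video_num, _ in rows)
    weighted = []
    for user_id, video_num, active_days in rows:
        # Assumption: weights are rounded to whole replications.
        weight = round(max_videos / video_num)
        weighted.extend([(user_id, video_num, active_days)] * weight)
    return weighted

weighted = weight_cases(users)
counts = {uid: sum(1 for row in weighted if row[0] == uid) for uid, _, _ in users}
print(counts)  # {1: 1, 2: 2, 3: 10}
```

Users who upload many videos are more likely to be reached through a video sample, so they receive the smallest weight.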
Strategy




∗   64^10*16 = 2^64 ≈ 1.844674e+19 possible ids
∗   YouTube video ids are randomly generated from the id space
∗   The sampling space is far too large to probe ids directly!
∗   Any good idea?
∗   http://www.youtube.com/watch?v=1yo0zBFCMxo
∗   http://www.youtube.com/watch?v=_OBlgSz8sSM
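A few lines of Python show why guessing ids directly is hopeless. This sketch assumes the commonly described id layout (10 characters drawn from 64 symbols, plus a final character that takes only 16 values) and uses the roughly 500 million videos reported by Zhou et al. as the population size; the variable names are illustrative.

```python
ALPHABET_SIZE = 64      # assumption: A-Z, a-z, 0-9, "-" and "_"
LAST_CHAR_CHOICES = 16  # assumption: the 11th character takes only 16 values
ID_SPACE = ALPHABET_SIZE ** 10 * LAST_CHAR_CHOICES

VIDEOS = 500_000_000    # rough count reported by Zhou et al. (2011)
hit_probability = VIDEOS / ID_SPACE

print(ID_SPACE == 2 ** 64)       # True
print(f"{ID_SPACE:.6e}")         # 1.844674e+19
print(f"{hit_probability:.2e}")  # chance that one random id is a real video
```

Under these assumptions, tens of billions of random guesses would be needed per valid id, which is why the search API is used instead.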
YouTube Search API
∗ One unique property of the YouTube search API we find is that when searching
  using a keyword string of the format “watch?v=xy...z” (including the quotes),
  where “xy...z” is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video id
  that does not contain the literal “-”, YouTube returns a list
  of videos whose ids begin with this prefix followed by “-”, if they exist.
∗ YouTube limits the number of returned results for any query.


∗ When the prefix is short (e.g., L = 1 or 2), the returned search results are
  more likely to contain “noisy” video ids; a short prefix may also
  match a large number of videos.
∗ In contrast, if the prefix is too long (e.g., L = 6 or 7), the search engine
  may return no results.
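Drawing a random prefix and wrapping it in the quoted search term could look like the following. This is a sketch only: the 64-symbol alphabet is an assumption, the helper names `random_prefix` and `build_query` are hypothetical, and no request is sent to YouTube.

```python
import random
import string

# Assumption: video ids use A-Z, a-z, 0-9, "_" and "-"; the prefix itself
# must not contain "-", so it is left out of the prefix alphabet.
PREFIX_ALPHABET = string.ascii_letters + string.digits + "_"

def random_prefix(length, rng=random):
    """Draw a random id prefix of the given length without "-"."""
    return "".join(rng.choice(PREFIX_ALPHABET) for _ in range(length))

def build_query(prefix):
    """Wrap the prefix in the quoted search-term format."""
    return '"watch?v={}"'.format(prefix)

rng = random.Random(2012)  # fixed seed so the sketch is repeatable
prefix = random_prefix(4, rng)
print(build_query(prefix))
print(build_query("1yo0"))  # "watch?v=1yo0"
```

The double quotes are part of the query string itself, which is what restricts the search to the literal term.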
Practice

∗ In practice, a prefix of length L < 5 usually matches
  more than one hundred videos, while the YouTube API
  returns at most 30 ids for each prefix query.
∗ On the other hand, based on our experimental results, a prefix
  of length L = 5 always matches fewer than 10 valid ids.
∗ Therefore, a prefix length of 5 is a good choice in practice.
∗ They find that querying a prefix of length four
  returns ids having a “-” in the fifth place, which gives
  a result set big enough that each prefix returns some results
  and small enough never to reach the result limit set by the API.
∗ Zhou et al. estimated that there were about 500 million YouTube
  videos by May 2011!




        Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
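How prefix hit counts turn into a population estimate can be illustrated with a small simulation. This is a toy sketch, not Zhou et al.'s exact estimator: it assumes uniformly random ids, uses a made-up 8-symbol alphabet and 6-character ids so that it runs quickly, and replaces the search API with a local `search` function.

```python
import random

ALPHABET = "abcdefg-"   # toy alphabet (assumption); YouTube uses 64 symbols
ID_LENGTH = 6           # toy id length (assumption); YouTube uses 11
PREFIX_LENGTH = 2
TRUE_TOTAL = 50_000

rng = random.Random(42)

# Build a synthetic population of unique, uniformly random ids.
population = set()
while len(population) < TRUE_TOTAL:
    population.add("".join(rng.choice(ALPHABET) for _ in range(ID_LENGTH)))

def search(prefix):
    """Stand-in for the search API: ids that start with prefix + "-"."""
    return [vid for vid in population if vid.startswith(prefix + "-")]

# Query a random sample of dash-free prefixes.
letters = ALPHABET.replace("-", "")
prefixes = ["".join(rng.choice(letters) for _ in range(PREFIX_LENGTH))
            for _ in range(20)]
mean_hits = sum(len(search(p)) for p in prefixes) / len(prefixes)

# A uniformly random id starts with a given prefix followed by "-" with
# probability (1 / len(ALPHABET)) ** (PREFIX_LENGTH + 1).
estimate = mean_hits * len(ALPHABET) ** (PREFIX_LENGTH + 1)
print(round(estimate), TRUE_TOTAL)
```

With only 20 prefix queries the estimate should land close to the true total, because each query returns a complete, small bucket of ids.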
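Python and gdata

∗ gdata is a module for connecting to Google data services
  (including YouTube) via their APIs.

    import gdata.youtube.service

    def SearchAndPrint(search_terms):
        # Open a client for the YouTube Data API.
        yt_service = gdata.youtube.service.YouTubeService()
        # Build a video search query for the given terms.
        query = gdata.youtube.service.YouTubeVideoQuery()
        query.vq = search_terms
        query.orderby = 'viewCount'
        query.racy = 'include'
        feed = yt_service.YouTubeQuery(query)
        # PrintVideoFeed is a helper assumed to be defined elsewhere
        # (for example, one that prints each entry in feed.entry).
        PrintVideoFeed(feed)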
Test Validity

∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ The Secret State - The Biggest Mistake - Official Lyric Music Video
  Can't find the video!
∗ searchApi("watch?v=1yo0z")
Restricted query term

∗ searchApi('"watch?v=1yo0"')
Compare two random samples

∗   # summary(da$Freq)
∗   # Min.  1st Qu.  Median  Mean   3rd Qu.  Max.
∗   # 1.00  7.00     25.00   17.15  25.00    75.00

∗   # summary(db$Freq)
∗   # Min.  1st Qu.  Median  Mean   3rd Qu.  Max.
∗   # 1.00  8.00     25.00   17.57  25.00    50.00
There are about 604 million videos on YouTube by Dec 2012!
∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
∗ 34361/x = 125/34361
∗ X = (34361^2/125)*64 == 604507300
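The arithmetic above has the shape of a capture-recapture (Lincoln-Petersen) estimate, and a short sketch reproduces the number. The reading of the inputs is an assumption: two samples of 34361 ids each, 125 ids shared between them, and a factor of 64 because only ids with “-” in the fifth position are reachable. The function name `lincoln_petersen` is illustrative.

```python
def lincoln_petersen(n_first, n_second, n_overlap):
    """Estimate population size from two samples and their overlap."""
    return n_first * n_second / n_overlap

n_sample = 34361   # size used for both samples on the slide
overlap = 125      # ids assumed to appear in both samples
scale = 64         # assumption: one of 64 symbols in the fifth position

reachable = lincoln_petersen(n_sample, n_sample, overlap)
total = reachable * scale
print(round(total))  # 604507300
```

The estimate is sensitive to the overlap count: a smaller overlap between the two samples implies a larger population.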
Numeric simulation of random prefix sampling

# Use degreenet to simulate a discrete Pareto distribution
# of videos per user.
library(degreenet)
a <- simdp(n = 100000, v = 3.5, maxdeg = 10000)

# One row per video: repeat each user's row once per uploaded video.
b <- data.frame(cbind(c(1:length(a)), a))
c <- b[rep(1:nrow(b), b$a), ]
c$vid <- c(1:length(c$a))
names(c) <- c("uid", "count", "vid")

# Sample 2000 videos at random, then keep each sampled user once.
id <- sample(c(1:length(c$vid)), 2000, replace = F)
ds <- subset(c, c$vid %in% id)
dat <- subset(ds, !duplicated(ds$uid))

hist(dat$count)

# Compare the videos-per-user distribution of population and sample.
da <- as.data.frame(table(a))
ds <- as.data.frame(table(dat$count))

plot(log(da[,2]) ~ log(as.numeric(as.character(da[,1]))), xlab = "Number of Videos (Log)", ylab = "Frequency (Log)")
points(log(ds[,2]) ~ log(as.numeric(as.character(ds[,1]))), pch = 2, col = "red")
# Legend symbols match the plotted points (default pch = 1 for the population).
legend("topright", c("population", "sample"),
       col = c("black", "red"),
       cex = 0.9, pch = c(1, 2))
Reference

∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix
  Sampling. IMC
∗ Mislove et al. (2007) Measurement and Analysis of Online Social
  Networks. IMC
∗ YouTube developer's guide for Python
  https://developers.google.com/youtube/1.0/developers_guide_python
∗ Introduction to the gdata.youtube library
  http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry
