SlideShare ist ein Scribd-Unternehmen logo
1 von 13
Downloaden Sie, um offline zu lesen
last.fm crawler
RW vs RWRW
             Mário Almeida      malmeida@kth.se
                 Zafar Gilani     szuhgi@kth.se
             Arinto Murdopo        arinto@kth.se
Outline
●   Parameters
●   Methodology
●   Results
●   Challenges
●   Conclusion
Parameters
1.   Playcounts
2.   Playlists
3.   Ages
4.   IDs
5.   Number of friends (degrees)

Compare average using RW and RWRW!
Methodology
Utilized lastfm APIs to obtain
   ● user info
   ● number of friends (degree)
RW with UIS-WR
On-the-fly, we apply RW formula:
Methodology
For RWRW, we apply:




The weight Wv is set to number of friends (degree)
Results
Crawled for ~10 hours
Number of samples: 48000
Number of age samples: 36363, not all users
show their age
Results - Ages


           RW
        estimates
          lower      After about 25k
       average age    samples, the     There is a big
         values.     age stabilizes.    correlation
                                       between age
                                         and the
                                          degree
Results - Playlists

       Most users do
         not have
         playlists.




                          RW estimates higher
                       numbers of playlists. Users
                       with higher degrees tend to
                           have more playlists.
Results - Playcounts


           We found some
           users having
           playcounts in the
           order of millions.
                                RW estimates higher
                                playcounts. Users with
                                higher degree tend to
                                have higher playcounts
Results - IDs

       Not yet stable.




                         RW estimates a lower
                         average ID compared to
                         RWRW. An user with lower
                         ID has generally a higher
                         degree
Results - Degrees




                 RWRW reduces the bias of nodes
                with higher probability to be visited
                  due to the high degree. This is
               indeed close to the expected degree
                               value.
Conclusion
● A simple random walk in a social network
  generally results into biased averages.
  ○ A node with higher degree has a higher probability of
    being discovered.
● RWRW normalizes the averages.
  ○ High variations do not abruptly impact the
    estimation.
  ○ RWRW reduces the biases of RW.
● Low variance means lower difference
  between RW and RWRW.
● Crawling lastfm produces many challenges
  ○ e.g.: 0 degree, banned user, huge playcounts
Questions
Check the code in:
● http://code.google.com/p/lastfm-rwrw/

Weitere ähnliche Inhalte

Mehr von Arinto Murdopo

An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideArinto Murdopo
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksArinto Murdopo
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIArinto Murdopo
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 PresentationArinto Murdopo
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event ScalabilityArinto Murdopo
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideArinto Murdopo
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsArinto Murdopo
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network VirtualizationArinto Murdopo
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Arinto Murdopo
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArinto Murdopo
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingArinto Murdopo
 
Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Arinto Murdopo
 
Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Arinto Murdopo
 

Mehr von Arinto Murdopo (18)

An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slide
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPI
 
Dremel Paper Review
Dremel Paper ReviewDremel Paper Review
Dremel Paper Review
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 Presentation
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event Scalability
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network Virtualization
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity Fabric
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer Computing
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Why File Sharing is Dangerous?
Why File Sharing is Dangerous?
 
Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 

Kürzlich hochgeladen

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 

Kürzlich hochgeladen (20)

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

lastfm crawler

  • 1. last.fm crawler RW vs RWRW Mário Almeida malmeida@kth.se Zafar Gilani szuhgi@kth.se Arinto Murdopo arinto@kth.se
  • 2. Outline ● Parameters ● Methodology ● Results ● Challenges ● Conclusion
  • 3. Parameters 1. Playcounts 2. Playlists 3. Ages 4. IDs 5. Number of friends (degrees) Compare average using RW and RWRW!
  • 4. Methodology Utilized lastfm APIs to obtain ● user info ● number of friends (degree) RW with UIS-WR On-the-fly, we apply RW formula:
  • 5. Methodology For RWRW, we apply: The weight Wv is set to number of friends (degree)
  • 6. Results Crawled for ~10 hours Number of samples: 48000 Number of age samples: 36363, not all users show their age
  • 7. Results - Ages RW estimates lower After about 25k average age samples, the There is a big values. age stabilizes. correlation between age and the degree
  • 8. Results - Playlists Most users do not have playlists. RW estimates higher numbers of playlists. Users with higher degrees tend to have more playlists.
  • 9. Results - Playcounts We found some users having playcounts in the order of millions. RW estimates higher playcounts. Users with higher degree tend to have higher playcounts
  • 10. Results - IDs Not yet stable. RW estimates a lower average ID compared to RWRW. An user with lower ID has generally a higher degree
  • 11. Results - Degrees RWRW reduces the bias of nodes with higher probability to be visited due to the high degree. This is indeed close to the expected degree value.
  • 12. Conclusion ● A simple random walk in a social network generally results into biased averages. ○ A node with higher degree has a higher probability of being discovered. ● RWRW normalizes the averages. ○ High variations do not abruptly impact the estimation. ○ RWRW reduces the biases of RW. ● Low variance means lower difference between RW and RWRW. ● Crawling lastfm produces many challenges ○ e.g.: 0 degree, banned user, huge playcounts
  • 13. Questions Check the code in: ● http://code.google.com/p/lastfm-rwrw/