SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Computing PageRank  Using Hadoop (+Introduction to MapReduce) Alexander Behm, Ajey Shah University of California, Irvine Instructor: Prof. Chen Li
Outline ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Motivation for MapReduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Motivation for MapReduce ,[object Object],[object Object],[object Object],[object Object],Parallel Programming Models ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],MapReduce
MapReduce Goals ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
MapReduce is NOT… ,[object Object],[object Object],[object Object],[object Object],MapReduce is a programming paradigm!
MapReduce Flow Input Map Key, Value Key, Value … = Map Map Split Input into Key-Value pairs. For each K-V pair call Map. Each Map produces new set of K-V pairs. Reduce(K, V[ ]) Sort Output Key, Value Key, Value … = For each distinct key, call reduce. Produces one K-V pair for each distinct key.  Output as a set of Key Value Pairs. Key, Value Key, Value … Key, Value Key, Value … Key, Value Key, Value …
MapReduce WordCount Example Output: Number of occurrences  of each word Input: File containing words Hello World Bye World Hello Hadoop Bye Hadoop Bye Hadoop Hello Hadoop Bye 3 Hadoop 4 Hello 3 World 2 MapReduce How can we do this within the MapReduce framework? Basic idea: parallelize on lines in input file!
MapReduce WordCount Example Input 1, “Hello World Bye World” 2, “Hello Hadoop Bye Hadoop” 3, “Bye Hadoop Hello Hadoop” Map Output <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> Map(K, V) { For each word w in V Collect(w, 1); } Map Map Map
MapReduce WordCount Example Reduce(K, V[ ]) { Int count = 0; For each v in V count += v; Collect(K, count); } Map Output <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> Internal Grouping <Bye    1, 1, 1> <Hadoop    1, 1, 1, 1> <Hello    1, 1, 1> <World    1, 1> Reduce Output <Bye, 3> <Hadoop, 4> <Hello, 3> <World, 2> Reduce Reduce Reduce Reduce
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],@Yahoo! Some Webmap size data: Number of links between pages in the index:  roughly 1 trillion links   Size of output:  over 300 TB, compressed!   Number of cores used to run a single Map-Reduce job:  over 10,000   Raw disk used in the production cluster:  over 5 Petabytes   (source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)
Typical Hadoop Setup
Our Hadoop Setup MASTER peach Namenode JobTracker TaskTracker DataNode SLAVES watermelon DataNode TaskTracker cherry DataNode TaskTracker avocado DataNode TaskTracker blueberry DataNode TaskTracker Switch
Our Hadoop Setup Demo: Hadoop Admin Pages!
[object Object],[object Object],Storage: HDFS
Run Application Job Tracker Task Tracker Task Tracker Task Tracker … Task Task Task Task Task Task Hadoop Black Box Job Execution Diagram
[object Object],Processing: Hadoop MapReduce
Using Hadoop To Program Reduce(…) Mapper(…) extends extends implements implements
Sample Map Class ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Sample Reduce Class public static class Reduce extends MapReduceBase  implements Reducer<Text, IntWritable, Text, IntWritable> {        public void reduce(Text key, Iterator<IntWritable> values,  OutputCollector<Text, IntWritable> output, Reporter reporter)  throws IOException  {        int sum = 0;         while (values.hasNext())  {       sum += values.next().get();         }    output.collect(key, new IntWritable(sum));        }      }
Running a Job Demo: Show WordCount Example
Project5: PageRank on Hadoop
#|colNum| NumOfRows| <R,val>…..<R,val>| #..... Link Analysis Crawled Pages Output Link Extractor
PageRank on MapReduce Very Basic PageRank Algorithm Input: PageRankVector DistributionMatrix ComputePageRank { Until converged { PageRankVector = DistributionMatrix * PageRankVector; } } Output: PageRankVector ,[object Object],[object Object],[object Object],[object Object],[object Object]
PageRank on MapReduce Why is storage a challenge? UCI domain: 500000 pages Assuming 4 Bytes per entry Size of Vector: 500000 * 4 = 2000000 = 2MB Size of Matrix: 500000 * 500000 * 4 = 10 12  = 1TB Assumes a fully connected graph. Cleary this is very unrealistic for web pages! Solution: Sparse Matrix But: Row-Wise or Column-Wise? Depends on usage patterns!  (i.e. how we do parallel matrix multiplication, updating of matrix, etc.)
PageRank on MapReduce Parallel Matrix Multiplication Requirement: Make it work!    Simple but practical solution X = M V M x V Every Row of M is “combined” with V, yielding one element of M x V each Intuition:  - Parallelize on rows: each parallel task computes one final value - Use row-wise sparse matrix, so above can be done easily! (column-wise is actually better for PageRank)
PageRank on MapReduce 0 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 2 3 4 5 6 1 2 3 4 5 6 Stored As 1 2 3 4 5 6 5, 1 2, 1 4, 1 1, 1 2, 1 4, 1 5, 1 1, 1 2, 1 2, 1 6, 1 Original Matrix Row-Wise Sparse Matrix New Storage Requirements UCI domain: 500000 pages Assuming 4 Bytes per entry Assuming max 100 outgoing links per page Size of Matrix: 500000 * 100 * (4 + 4) = 400 * 10 6  = 400MB Notice: No more random access!
PageRank on MapReduce Map(Key, Row) { Vector v = getVector(); Int sum = 0; For each Element e in Row  sum += e.value * v.at(e.columnNumber); collect(Key, sum); } Reduce(Key, Value) { collect(Key, Value); } Map-Reduce procedures for parallel matrix*vector multiplication using row-wise sparse matrix
Matrix Vector Multiplication Demo: Show Matrix-Vector Multiplication
Hadoop: Implementing Own File Format HDFS File ,[object Object],[object Object],[object Object],[object Object],InputSplit - Filename - Start Offset - End Offset - Hosts in HDFS ,[object Object],[object Object],[object Object],[object Object],InputSplit InputSplit RecordReader RecordReader Map Map Map
[1] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters , Sixth Symposium on Operating System Design and Implementation  (OSDI'04), San Francisco, CA, December, 2004 [2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html [3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html [4] http://lucene.apache.org/hadoop/ References
Flow TextInputFormat implements InputFormat getSplits() getRecordReader() LineRecordReader implements RecordReader One for each Split Next(Key, Value) FileSplit implements InputSplit File Start Offset End Offset Hosts where chunks of File live FileInputFormat implements InputFormat getSplits() getRecordReader()

Weitere ähnliche Inhalte

Was ist angesagt?

key distribution in network security
key distribution in network securitykey distribution in network security
key distribution in network securitybabak danyal
 
13 DHCP Configuration in Linux
13 DHCP Configuration in Linux13 DHCP Configuration in Linux
13 DHCP Configuration in LinuxHameda Hurmat
 
ContikiMAC : Radio Duty Cycling Protocol
ContikiMAC : Radio Duty Cycling ProtocolContikiMAC : Radio Duty Cycling Protocol
ContikiMAC : Radio Duty Cycling ProtocolSalah Amean
 
Transport Layer Security
Transport Layer SecurityTransport Layer Security
Transport Layer SecurityHuda Seyam
 
Dc ch11 : routing in switched networks
Dc ch11 : routing in switched networksDc ch11 : routing in switched networks
Dc ch11 : routing in switched networksSyaiful Ahdan
 
Introduction for internet connectivity (IoT)
 Introduction for internet connectivity (IoT) Introduction for internet connectivity (IoT)
Introduction for internet connectivity (IoT)FabMinds
 
A very good introduction to IPv6
A very good introduction to IPv6A very good introduction to IPv6
A very good introduction to IPv6Syed Arshad
 
The constrained application protocol (CoAP)
The constrained application protocol (CoAP)The constrained application protocol (CoAP)
The constrained application protocol (CoAP)Hamdamboy (함담보이)
 
Socket programming or network programming
Socket programming or network programmingSocket programming or network programming
Socket programming or network programmingMmanan91
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Simple Mail Transfer Protocol
Simple Mail Transfer ProtocolSimple Mail Transfer Protocol
Simple Mail Transfer ProtocolUjjayanta Bhaumik
 
Secure Socket Layer (SSL)
Secure Socket Layer (SSL)Secure Socket Layer (SSL)
Secure Socket Layer (SSL)Samip jain
 
IBM MQ Disaster Recovery
IBM MQ Disaster RecoveryIBM MQ Disaster Recovery
IBM MQ Disaster RecoveryMarkTaylorIBM
 

Was ist angesagt? (20)

Dial up security
Dial up securityDial up security
Dial up security
 
key distribution in network security
key distribution in network securitykey distribution in network security
key distribution in network security
 
13 DHCP Configuration in Linux
13 DHCP Configuration in Linux13 DHCP Configuration in Linux
13 DHCP Configuration in Linux
 
ContikiMAC : Radio Duty Cycling Protocol
ContikiMAC : Radio Duty Cycling ProtocolContikiMAC : Radio Duty Cycling Protocol
ContikiMAC : Radio Duty Cycling Protocol
 
Transport Layer Security
Transport Layer SecurityTransport Layer Security
Transport Layer Security
 
Dc ch11 : routing in switched networks
Dc ch11 : routing in switched networksDc ch11 : routing in switched networks
Dc ch11 : routing in switched networks
 
Introduction for internet connectivity (IoT)
 Introduction for internet connectivity (IoT) Introduction for internet connectivity (IoT)
Introduction for internet connectivity (IoT)
 
A very good introduction to IPv6
A very good introduction to IPv6A very good introduction to IPv6
A very good introduction to IPv6
 
The constrained application protocol (CoAP)
The constrained application protocol (CoAP)The constrained application protocol (CoAP)
The constrained application protocol (CoAP)
 
WebRTC DataChannels demystified
WebRTC DataChannels demystifiedWebRTC DataChannels demystified
WebRTC DataChannels demystified
 
Socket System Calls
Socket System CallsSocket System Calls
Socket System Calls
 
Chapter13
Chapter13Chapter13
Chapter13
 
Socket programming or network programming
Socket programming or network programmingSocket programming or network programming
Socket programming or network programming
 
Sctp tutorial
Sctp tutorialSctp tutorial
Sctp tutorial
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
Simple Mail Transfer Protocol
Simple Mail Transfer ProtocolSimple Mail Transfer Protocol
Simple Mail Transfer Protocol
 
IP security
IP securityIP security
IP security
 
Secure Socket Layer (SSL)
Secure Socket Layer (SSL)Secure Socket Layer (SSL)
Secure Socket Layer (SSL)
 
IBM MQ Disaster Recovery
IBM MQ Disaster RecoveryIBM MQ Disaster Recovery
IBM MQ Disaster Recovery
 

Andere mochten auch

Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceFarzan Hajian
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 
Try It The Google Way .
Try It The Google Way .Try It The Google Way .
Try It The Google Way .abhinavbom
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoopTianwei Liu
 
The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?Kundan Bhaduri
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clusteringmobius.cn
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 

Andere mochten auch (20)

Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduce
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Chap14_Ecom
Chap14_EcomChap14_Ecom
Chap14_Ecom
 
Try It The Google Way .
Try It The Google Way .Try It The Google Way .
Try It The Google Way .
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoop
 
Seo and page rank algorithm
Seo and page rank algorithmSeo and page rank algorithm
Seo and page rank algorithm
 
The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?The Google Pagerank algorithm - How does it work?
The Google Pagerank algorithm - How does it work?
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Smart Crawler
Smart CrawlerSmart Crawler
Smart Crawler
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 

Ähnlich wie Behm Shah Pagerank

Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The CloudsJacky Chu
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueShay Sofer
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesKelly Technologies
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data Jay Nagar
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 

Ähnlich wie Behm Shah Pagerank (20)

Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Tutorial5
Tutorial5Tutorial5
Tutorial5
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 

Kürzlich hochgeladen

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Kürzlich hochgeladen (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

Behm Shah Pagerank

  • 1. Computing PageRank Using Hadoop (+Introduction to MapReduce) Alexander Behm, Ajey Shah University of California, Irvine Instructor: Prof. Chen Li
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7. MapReduce Flow Input Map Key, Value Key, Value … = Map Map Split Input into Key-Value pairs. For each K-V pair call Map. Each Map produces new set of K-V pairs. Reduce(K, V[ ]) Sort Output Key, Value Key, Value … = For each distinct key, call reduce. Produces one K-V pair for each distinct key. Output as a set of Key Value Pairs. Key, Value Key, Value … Key, Value Key, Value … Key, Value Key, Value …
  • 8. MapReduce WordCount Example Output: Number of occurrences of each word Input: File containing words Hello World Bye World Hello Hadoop Bye Hadoop Bye Hadoop Hello Hadoop Bye 3 Hadoop 4 Hello 3 World 2 MapReduce How can we do this within the MapReduce framework? Basic idea: parallelize on lines in input file!
  • 9. MapReduce WordCount Example Input 1, “Hello World Bye World” 2, “Hello Hadoop Bye Hadoop” 3, “Bye Hadoop Hello Hadoop” Map Output <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> Map(K, V) { For each word w in V Collect(w, 1); } Map Map Map
  • 10. MapReduce WordCount Example Reduce(K, V[ ]) { Int count = 0; For each v in V count += v; Collect(K, count); } Map Output <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> Internal Grouping <Bye  1, 1, 1> <Hadoop  1, 1, 1, 1> <Hello  1, 1, 1> <World  1, 1> Reduce Output <Bye, 3> <Hadoop, 4> <Hello, 3> <World, 2> Reduce Reduce Reduce Reduce
  • 11.
  • 13. Our Hadoop Setup MASTER peach Namenode JobTracker TaskTracker DataNode SLAVES watermelon DataNode TaskTracker cherry DataNode TaskTracker avocado DataNode TaskTracker blueberry DataNode TaskTracker Switch
  • 14. Our Hadoop Setup Demo: Hadoop Admin Pages!
  • 15.
  • 16. Run Application Job Tracker Task Tracker Task Tracker Task Tracker … Task Task Task Task Task Task Hadoop Black Box Job Execution Diagram
  • 17.
  • 18. Using Hadoop To Program Reduce(…) Mapper(…) extends extends implements implements
  • 19.
  • 20. Sample Reduce Class public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {       int sum = 0;        while (values.hasNext()) {      sum += values.next().get();        }   output.collect(key, new IntWritable(sum));      }    }
  • 21. Running a Job Demo: Show WordCount Example
  • 23. #|colNum| NumOfRows| <R,val>…..<R,val>| #..... Link Analysis Crawled Pages Output Link Extractor
  • 24.
  • 25. PageRank on MapReduce Why is storage a challenge? UCI domain: 500000 pages Assuming 4 Bytes per entry Size of Vector: 500000 * 4 = 2000000 = 2MB Size of Matrix: 500000 * 500000 * 4 = 10 12 = 1TB Assumes a fully connected graph. Cleary this is very unrealistic for web pages! Solution: Sparse Matrix But: Row-Wise or Column-Wise? Depends on usage patterns! (i.e. how we do parallel matrix multiplication, updating of matrix, etc.)
  • 26. PageRank on MapReduce Parallel Matrix Multiplication Requirement: Make it work!  Simple but practical solution X = M V M x V Every Row of M is “combined” with V, yielding one element of M x V each Intuition: - Parallelize on rows: each parallel task computes one final value - Use row-wise sparse matrix, so above can be done easily! (column-wise is actually better for PageRank)
  • 27. PageRank on MapReduce 0 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 2 3 4 5 6 1 2 3 4 5 6 Stored As 1 2 3 4 5 6 5, 1 2, 1 4, 1 1, 1 2, 1 4, 1 5, 1 1, 1 2, 1 2, 1 6, 1 Original Matrix Row-Wise Sparse Matrix New Storage Requirements UCI domain: 500000 pages Assuming 4 Bytes per entry Assuming max 100 outgoing links per page Size of Matrix: 500000 * 100 * (4 + 4) = 400 * 10 6 = 400MB Notice: No more random access!
  • 28. PageRank on MapReduce Map(Key, Row) { Vector v = getVector(); Int sum = 0; For each Element e in Row sum += e.value * v.at(e.columnNumber); collect(Key, sum); } Reduce(Key, Value) { collect(Key, Value); } Map-Reduce procedures for parallel matrix*vector multiplication using row-wise sparse matrix
  • 29. Matrix Vector Multiplication Demo: Show Matrix-Vector Multiplication
  • 30.
  • 31. [1] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters , Sixth Symposium on Operating System Design and Implementation (OSDI'04), San Francisco, CA, December, 2004 [2] http://www.cs.cmu.edu/~knigam/15-505/HW1.html [3] http://bnrg.cs.berkeley.edu/~adj/cs16x/Nachos/project2.html [4] http://lucene.apache.org/hadoop/ References
  • 32. Flow TextInputFormat implements InputFormat getSplits() getRecordReader() LineRecordReader implements RecordReader One for each Split Next(Key, Value) FileSplit implements InputSplit File Start Offset End Offset Hosts where chunks of File live FileInputFormat implements InputFormat getSplits() getRecordReader()