Meetup - Hadoop User Group - Munich : 2013-05-22

Abstract: A Hadoop cluster is the tool of choice for many large-scale analytics applications. A large variety of commercial tools is available for typical SQL-like or data-warehouse applications, but how to deal with networks and time series?
How to collect and store data for social media analysis, and what are good practices for working with libraries like Mahout and Giraph?
The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing, and time-dependent graph analysis.

Meetup - Hadoop User Group - Munich : 2013-05-22 Presentation Transcript

  • 1. Hadoop in Social Network Analysis - review on tools and some best practices - Hadoop User Group Munich | 2013-05-22 | Mirko Kämpf | mirko@cloudera.com
  • 2. OVERVIEW: About me (Java, Physics, Wikipedia, Hadoop & Cloudera). Topic: Complex Systems Analysis, from Time Series to Networks. Data, data, and even more data ... but how to handle it? Some results of our project. Lessons learned and some recommendations.
  • 3. Hadoop in Social Network Analysis (abstract as above). Social networks consist of nodes, which are the real-world objects, and edges, which are, e.g., relations, interactions or dependencies between nodes.
  • 4. The same slide, annotated with the technical building blocks: datatypes, I/O formats, storage system, random access, bulk processing; HBase, OpenTSDB, Hive, Pig; SequenceFile, VectorWritable; partitions, sampling rate. (Images from Google image search.)
  • 5. INTRODUCTION: Social online systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social online systems and their relations.
  • 6. INTRODUCTION (cont.): Webpages (the nodes of the WWW) are linked in different, but related ways: direct links pointing from one page to another (binary, directional); similar access activity (cross-correlated time series of download rates); similar edit activity (synchronized events of edits or changes). We extract the time evolution of these three networks from real data. Nodes are identical for all three studied networks, but links, network structure and dynamics are different. We quantify how the inter-relations and inter-dependencies between the three networks change in time and affect each other.
  • 7. INTRODUCTION (cont.): Example: Wikipedia → reconstruct co-evolving networks: 1. cross-link network between articles (pages, nodes); 2. access-behavior network (characterizing similarity in article reading behavior); 3. edit-activity network (characterizing similarities in edit activity for each article).
  • 8.-9. CHALLENGES: Data points (time series) collected at independent locations or obtained from individual objects do not show dependencies directly. It is a common task to calculate several types of correlations, but how are these results affected by special properties of the raw data? What meaning do different correlations have, and how can we eliminate artifacts of the calculation method? (Diagram: complex system, time evolution ???)
  • 10.-11. CHALLENGES (cont.): The diagram adds the element properties: measured data is “disconnected”.
  • 12.-13. CHALLENGES (cont.): The diagram adds the system properties, derived from relations between elements and the structure of the network.
  • 14. TYPICAL TASK: Time Series Analysis. If our data set is prepared and we have records with well-defined properties*, then UDFs in Hive and Pig work well (see the Pig UDF sketch after the transcript). How to organize the data in records? How to deal with sliding windows? How to handle intermediate data?
  • 15. TIME SERIES: WIKIPEDIA ACTIVITY. Node = article (specific topic). Available data: 1. hourly access frequency (number of article downloads for each hour in ≈ 300 days); 2. edit events (time stamps for all changes in the article pages). Figure: examples of Wikipedia access time series for three articles with (a,b) stationary access rates (Illuminati (book)), (c,d) an endogenous burst of activity (Heidelberg), and (e,f) an exogenous burst of activity (Amoklauf Erfurt). The left parts show the complete hourly access-rate time series (from January 1, 2009, till October 21, 2009, i.e. for 42 weeks = 294 days = 7056 hours); the right parts show edit-event data for the three representative articles.
  • 16. TIME SERIES: WIKIPEDIA ACTIVITY (cont.): Preprocessing with Map-Reduce / UDFs, i.e. calculation on single records: filtering, resampling; feature extraction (peak detection); creation of (non-)overlapping episodes or (sliding) windows; creation of time-series pairs for cross-correlation or event synchronisation. (A windowing Map-Reduce sketch follows after the transcript.)
  • 17. Graph Analysis: If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or HAMA work well - but only if one has the appropriate InputFormat reader.
  • 18. TYPES OF NETWORKS: a) unipartite network, one type of nodes and links; b) bipartite network, two types of nodes and one type of connections; c) hypergraph, one link relates more than two nodes.
  • 19. TYPES OF NETWORKS (cont.): multiplex networks; hierarchical network.
  • 20. Graph Analysis (cont.): How to organize node/edge properties? How to deal with time-dependent properties? How to calculate link properties on the fly? The adjacency matrix is not trivial to create and is an inefficient format, and files in HDFS cannot be changed. Therefore: store dynamic edge/node properties in HBase; aggregate relevant data into network snapshots; store intermediate results back to HBase and preprocess this data in a following step. (See the HBase edge-property sketch after the transcript.)
  • 21. RECAP: WHAT IS HADOOP? A distributed platform for processing massive amounts of data in parallel; it implements the Map-Reduce paradigm on top of the Hadoop Distributed File System (HDFS). Map-Reduce is a Java API to implement a map and a reduce phase: the map phase works on key/value pairs, the reduce phase on a key with a list of values. HDFS files consist of one or more blocks (distributed chunks of data); chunks are distributed transparently (in the background) and processed in parallel, using data locality when possible by assigning the map task to a node that holds the chunk locally. (A minimal Map-Reduce example follows after the transcript.)
  • 22. TYPICAL APPLICATIONS: Filter, group and join operations on large data sets; the data set (or, if partitioning is used, a part of it) is streamed and processed in parallel, but not in real time. Algorithms like k-means clustering (in Apache Mahout) or Map-Reduce based implementations of SSSP (single-source shortest path) work in multiple iterations; data is loaded from disk to CPU in each iteration.
  • 23. OVERVIEW - DATA FLOW (diagram).
  • 24. OVERVIEW - DATA FLOW (annotated diagram): Sqoop; MR and UDFs; Hive, Pig, Impala; Giraph, Mahout; HBase (cache); HDFS (static group). Note: network clustering != k-means clustering.
  • 25. RESULTS: CROSS-CORRELATION. Figure: distribution of cross-correlation coefficients for pairs of access-rate time series of Wikipedia pages (top) compared to surrogate data (bottom); 100 shuffled configurations are considered. (A surrogate-test sketch follows after the transcript.)
  • 26.-27. EVOLUTION OF CORRELATION NETWORKS (figures: network snapshots over time).
  • 28.-29. EVOLUTION OF CORRELATION NETWORKS (cont.): Obviously, the distribution of link-strength values changes in time, but only a calculation of structural properties of the underlying network allows a detailed view on the dynamics of the system.
  • 30. EXAMPLES OF RECONSTRUCTED NETWORKS: Spatially embedded correlation networks for the Wikipedia pages of all German cities (figures: access network, edit network).
  • 31. RECOMMENDATIONS (1): Create algorithms based on reusable components! Select or create stable and standardized I/O formats! Use preprocessing to re-organize unstructured data if you have to process the data many times. Store intermediate data (e.g. time-dependent properties) in HBase, near the raw data. Collect time-series data in HBase and create time-series buckets for advanced procedures on a subset of the data. (A time-series bucket sketch follows after the transcript.)
  • 32. RECOMMENDATIONS (2): Consider design patterns: partitioning vs. binning, map-side vs. reduce-side joins. Use Bulk Synchronous Parallel (BSP) processing for graph processing instead of Map-Reduce, or even a combination of both (a BSP sketch follows after the transcript). As in classical programming (and also in Hadoop!): find a good data representation and find good algorithms.
  • 33. REFERENCES
  • 34. (closing slide)
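
Code sketches (referenced from the transcript above)

Re slide 14: the slide relies on UDFs in Hive and Pig for records with well-defined properties. Below is a minimal sketch of a Pig EvalFunc that flags a peak in an hourly access-rate record; the record layout (article id followed by hourly counts) and the peak threshold are assumptions for illustration, not the project's actual code.

    // Hypothetical Pig UDF: flag records whose maximum hourly count exceeds
    // a multiple of the record's mean. The field layout is an assumed convention.
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class HasPeak extends EvalFunc<Boolean> {
        private static final double THRESHOLD = 3.0; // peak = value > 3x the mean (assumed)

        @Override
        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) {
                return null;
            }
            double sum = 0.0;
            double max = Double.NEGATIVE_INFINITY;
            // field 0 is assumed to be the article id; fields 1..n are hourly counts
            for (int i = 1; i < input.size(); i++) {
                double v = ((Number) input.get(i)).doubleValue();
                sum += v;
                max = Math.max(max, v);
            }
            double mean = sum / (input.size() - 1);
            return max > THRESHOLD * mean;
        }
    }

In Pig Latin the jar would be REGISTERed and the function used inside a FOREACH ... GENERATE; the analogous Hive route is a custom UDF class with an evaluate() method.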
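
Re slide 16: one way to build (non-)overlapping episodes is to re-key the hourly records by (article, window) in the map phase and assemble each episode in the reducer. A sketch of such a mapper follows; the tab-separated line format "articleId, hourIndex, count" and the one-week window length are assumptions.

    // Hypothetical preprocessing mapper: re-key hourly access counts into fixed-size,
    // non-overlapping windows so that a reducer sees one group per (article, window).
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WindowMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final int WINDOW_HOURS = 24 * 7; // one-week episodes (assumed)

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length != 3) {
                return; // skip malformed records
            }
            long hour = Long.parseLong(fields[1]);
            long window = hour / WINDOW_HOURS;
            // composite key: one reduce group per article and window
            context.write(new Text(fields[0] + ":" + window),
                          new Text(fields[1] + ":" + fields[2]));
        }
    }

Overlapping (sliding) windows can be produced the same way by emitting each record under every window key it belongs to.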
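
Re slide 20: a sketch of what "store dynamic edge/node properties in HBase" can look like with the HBase client API of that era (HTable/Put). The table name "wiki_edges", the column family "p" and the snapshot-indexed column layout are assumptions.

    // Hypothetical sketch: keep a time-dependent edge property (e.g. a correlation
    // coefficient per snapshot) in HBase instead of rewriting files in HDFS.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EdgeStore {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "wiki_edges");
            // row key: source node + target node; one column per snapshot (week index)
            byte[] rowKey = Bytes.toBytes("Heidelberg|Amoklauf_Erfurt");
            int week = 17;         // snapshot index (made up for illustration)
            double weight = 0.42;  // e.g. a cross-correlation coefficient
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("p"), Bytes.toBytes("w" + week), Bytes.toBytes(weight));
            table.put(put);
            table.close();
        }
    }

Aggregating a network snapshot then becomes a scan over the edge table that reads only the column of the desired week.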
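
Re slide 21: a minimal, self-contained Map-Reduce job in the sense of the recap: the map phase emits (articleId, count) key/value pairs, and the reduce phase receives a key together with a list of values and sums them. The input line format is the same assumed "articleId, hourIndex, count" as above.

    // Minimal Map-Reduce example: sum the hourly access counts per article.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AccessSum {

        public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split("\t");
                if (f.length == 3) {
                    // map phase: emit (articleId, count) key/value pairs
                    context.write(new Text(f[0]), new LongWritable(Long.parseLong(f[2])));
                }
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text article, Iterable<LongWritable> counts, Context context)
                    throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) { // reduce phase: one key, a list of values
                    total += c.get();
                }
                context.write(article, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "access-sum");
            job.setJarByClass(AccessSum.class);
            job.setMapperClass(SumMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }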
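
Re slide 25: a plain-Java sketch of the surrogate test behind the figure: compute the cross-correlation coefficient of two access-rate series and compare it with the coefficients obtained after shuffling one of the series (100 shuffled configurations, as on the slide). The toy series with a shared daily cycle are made up for illustration.

    // Compare the observed Pearson correlation of two series with correlations of
    // shuffled (surrogate) data, which destroys the temporal alignment.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class SurrogateTest {

        static double correlation(double[] x, double[] y) {
            int n = x.length;
            double mx = 0, my = 0;
            for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
            mx /= n; my /= n;
            double sxy = 0, sxx = 0, syy = 0;
            for (int i = 0; i < n; i++) {
                sxy += (x[i] - mx) * (y[i] - my);
                sxx += (x[i] - mx) * (x[i] - mx);
                syy += (y[i] - my) * (y[i] - my);
            }
            return sxy / Math.sqrt(sxx * syy);
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            // toy series standing in for two hourly access-rate series (7056 hours on the slide)
            int n = 7056;
            double[] a = new double[n], b = new double[n];
            for (int i = 0; i < n; i++) {
                double daily = Math.sin(2 * Math.PI * i / 24.0); // shared daily cycle (made up)
                a[i] = daily + 0.5 * rnd.nextGaussian();
                b[i] = daily + 0.5 * rnd.nextGaussian();
            }
            System.out.println("observed c = " + correlation(a, b));

            // surrogate data: shuffle one series and recompute the coefficient
            List<Double> shuffled = new ArrayList<Double>();
            for (double v : b) shuffled.add(v);
            for (int run = 0; run < 100; run++) {
                Collections.shuffle(shuffled, rnd);
                double[] s = new double[n];
                for (int i = 0; i < n; i++) s[i] = shuffled.get(i);
                System.out.println("surrogate c = " + correlation(a, s));
            }
        }
    }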
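
Re slide 31: one possible row-key design for time-series buckets in HBase, loosely following the OpenTSDB idea mentioned on slide 4: one row per article and day, one column per hour, so that a day of measurements can be read with a single Get. Table name, column family and key layout are assumptions.

    // Hypothetical time-series bucket: one HBase row per (article, day),
    // one column per hour of that day.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeSeriesBucket {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "wiki_access_ts");
            String article = "Heidelberg";
            int day = 120;                       // day index within the 294-day period
            Put row = new Put(Bytes.toBytes(article + ":" + day));
            for (int hour = 0; hour < 24; hour++) {
                long count = 0L;                 // hourly access count goes here
                row.add(Bytes.toBytes("t"), Bytes.toBytes("h" + hour), Bytes.toBytes(count));
            }
            table.put(row);
            table.close();
        }
    }

Bucketing a whole day into one row keeps rows compact for random access while still allowing efficient scans over longer episodes.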
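
Re slide 32: a toy, single-process illustration of the Bulk Synchronous Parallel model that Giraph and HAMA implement at scale: in every superstep each vertex consumes its incoming messages, updates its value and sends new messages, and a barrier separates the supersteps. This is deliberately plain Java (connected-component labeling on a made-up graph), not Giraph or HAMA API code.

    // Toy BSP loop: propagate the minimum vertex id through each connected component.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BspToy {
        public static void main(String[] args) {
            // adjacency list of a small undirected graph (made up for illustration)
            Map<Integer, int[]> adj = new HashMap<Integer, int[]>();
            adj.put(1, new int[]{2});     adj.put(2, new int[]{1, 3});
            adj.put(3, new int[]{2});     adj.put(4, new int[]{5});
            adj.put(5, new int[]{4});

            Map<Integer, Integer> label = new HashMap<Integer, Integer>();
            Map<Integer, List<Integer>> inbox = new HashMap<Integer, List<Integer>>();
            for (int v : adj.keySet()) {
                label.put(v, v);                              // initial label = own id
                inbox.put(v, new ArrayList<Integer>());
            }

            boolean active = true;
            for (int superstep = 0; active; superstep++) {    // barrier between supersteps
                Map<Integer, List<Integer>> nextInbox = new HashMap<Integer, List<Integer>>();
                for (int v : adj.keySet()) nextInbox.put(v, new ArrayList<Integer>());
                active = false;
                for (int v : adj.keySet()) {
                    int best = label.get(v);
                    for (int msg : inbox.get(v)) best = Math.min(best, msg);
                    boolean changed = best < label.get(v);
                    if (changed) label.put(v, best);
                    if (superstep == 0 || changed) {          // send only when something changed
                        for (int nb : adj.get(v)) nextInbox.get(nb).add(best);
                        active = true;
                    }
                }
                inbox = nextInbox;
            }
            System.out.println("component labels: " + label);
        }
    }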