SlideShare a Scribd company logo
1 of 34
Download to read offline
Headline Goes Here
Speaker Name or Subhead Goes Here
Hadoop in Social Network Analysis
- review on tools and some best practices -
Hadoop User Group Munich | 2013-05-22
Mirko Kämpf | mirko@cloudera.com
Montag, 27. Mai 13
OVERVIEW ...
About me ... (Java, Physics,Wikipedia, Hadoop & Cloudera)
Topic : Complex Systems Analysis, from Time Series to Networks
Data, data, and even more data ... but how to handle it?
Some results of our project ...
Lessons learned ... some recommendations.
Montag, 27. Mai 13
urlurl
Hadoop in Social Network Analysis
Abstract: A Hadoop cluster is the tool of choice for many
large scale analytics applications. A large variety of
commercial tools is available for typical SQL like or data
warehouse applications, but how to deal with networks
and time series?
How to collect and store data for social media analysis
and what are good practices for working with libraries like
Mahout and Giraph?
The sample use case deals with a data set from Wikipedia
to illustrate how to combine multiple public data sources
with personal data collections, e.g. from Twitter or even
personal mailboxes. We discuss efficient approaches for
data organisation, data preprocessing and for time
dependent graph analysis.
Social networks consist of nodes,
which are the real world objects
and edges, which are e.g. relations,
interactions or dependencies
between nodes.
Montag, 27. Mai 13
urlurl
Hadoop in Social Network Analysis
Abstract: A Hadoop cluster is the tool of choice for many
large scale analytics applications. A large variety of
commercial tools is available for typical SQL like or data
warehouse applications, but how to deal with networks
and time series?
How to collect and store data for social media analysis
and what are good practices for working with libraries like
Mahout and Giraph?
The sample use case deals with a data set from Wikipedia
to illustrate how to combine multiple public data sources
with personal data collections, e.g. from Twitter or even
personal mailboxes. We discuss efficient approaches for
data organisation, data preprocessing and for time
dependent graph analysis.
Social networks consist of nodes,
which are the real world objects
and edges, which are e.g. relations,
interactions or dependencies
between nodes.
Datatypes I/O-Formats
Storage System Random-Access
Bulk Processing
HBaseOpenTSDB Hive
Pig
SequenceFile
VectorWriteable
Partitions
Sampling Rate
images from Google image search ...
Montag, 27. Mai 13
INTRODUCTION
Social online-systems are complex systems used for, e.g., information spread.
We develop and apply tools from time series analysis and network analysis
to study the static and dynamic properties of social on-line systems and their relations.
5
Montag, 27. Mai 13
INTRODUCTION
Webpages (the nodes of the WWW) are linked in different, but related ways:
direct links pointing from one page to another (binary, directional)
similar access activity (cross-correlated time series of download rates)
similar edit activity (synchronized events of edits or changes)
We extract the time-evolution of these three networks from real data.
Nodes are identical for all three studied networks, but links and network
structure as well as dynamics are different.We quantify how the inter-
relations and inter-dependencies between the three networks change in time and affect each other.
Social online-systems are complex systems used for, e.g., information spread.
We develop and apply tools from time series analysis and network analysis
to study the static and dynamic properties of social on-line systems and their relations.
5
Montag, 27. Mai 13
INTRODUCTION
Webpages (the nodes of the WWW) are linked in different, but related ways:
direct links pointing from one page to another (binary, directional)
similar access activity (cross-correlated time series of download rates)
similar edit activity (synchronized events of edits or changes)
We extract the time-evolution of these three networks from real data.
Nodes are identical for all three studied networks, but links and network
structure as well as dynamics are different.We quantify how the inter-
relations and inter-dependencies between the three networks change in time and affect each other.
Example: Wikipedia → reconstruct co-evolving networks
1. Cross-link network between articles (pages, nodes)
2. Access behavior network (characterizing similarity in article reading behavior)
3. Edit activity network (characterizing similarities in edit activity for each article)
Social online-systems are complex systems used for, e.g., information spread.
We develop and apply tools from time series analysis and network analysis
to study the static and dynamic properties of social on-line systems and their relations.
5
Montag, 27. Mai 13
CHALLENGES ...
6
• Data points (time series) collected at independent locations or obtained form
individual objects do not show dependencies directly.
• It is a common task, to calculate several types of correlations, but how are these
results affected by special properties of the raw data?
• What meaning do different correlations have and how can we eliminate artifacts
of the calculation method?
Complex System Time evolution ???
Montag, 27. Mai 13
CHALLENGES ...
6
• Data points (time series) collected at independent locations or obtained form
individual objects do not show dependencies directly.
• It is a common task, to calculate several types of correlations, but how are these
results affected by special properties of the raw data?
• What meaning do different correlations have and how can we eliminate artifacts
of the calculation method?
Complex System Time evolution ???
Montag, 27. Mai 13
CHALLENGES ...
7
• Data points (time series) collected at independent locations or obtained form
individual objects do not show dependencies directly.
• It is a common task, to calculate several types of correlations, but how are these
results affected by special properties of the raw data?
• What meaning do different correlations have and how can we eliminate artifacts
of the calculation method?
Complex System
Element properties
Measured data is
“disconnected”
Time evolution ???
Montag, 27. Mai 13
CHALLENGES ...
7
• Data points (time series) collected at independent locations or obtained form
individual objects do not show dependencies directly.
• It is a common task, to calculate several types of correlations, but how are these
results affected by special properties of the raw data?
• What meaning do different correlations have and how can we eliminate artifacts
of the calculation method?
Complex System
Element properties
Measured data is
“disconnected”
Time evolution ???
Montag, 27. Mai 13
CHALLENGES ...
8
• Data points (time series) collected at independent locations or obtained form
individual objects do not show dependencies directly.
• It is a common task, to calculate several types of correlations, but how are these
results affected by special properties of the raw data?
• What meaning do different correlations have and how can we eliminate artifacts
of the calculation method?
Complex System
Element properties
Measured data is
“disconnected”
System properties
Derived from relations
between elements and
structure of the network
Time evolution ???
Montag, 27. Mai 13
CHALLENGES ...
8
• Data points (time series) collected at independent locations or obtained form
individual objects do not show dependencies directly.
• It is a common task, to calculate several types of correlations, but how are these
results affected by special properties of the raw data?
• What meaning do different correlations have and how can we eliminate artifacts
of the calculation method?
Complex System
Element properties
Measured data is
“disconnected”
System properties
Derived from relations
between elements and
structure of the network
Time evolution ???
Montag, 27. Mai 13
TYPICALTASK :
• Time Series Analysis
• if our data set is prepared and we have records
with well defined properties*,
than UDF in Hive and Pig work well ( )
• How to organize the data in records?
• How to deal with sliding windows?
• How to handle intermediate data?
Montag, 27. Mai 13
TIME SERIES :WIKIPEDIA ACTIVITY
Examples of Wikipedia access time series for three articles with (a,b) stationary access rates ('Illuminati (book)'), (c,d) an
endogenous burst of activity ('Heidelberg'), and (e,f) an exogenous burst of activity ('Amoklauf Erfurt'). The left parts show the
complete hourly access rate time series (from January 1, 2009, till October 21, 2009; i.e. for 42 weeks = 294 days = 7056 hours).
The right parts show edit-event data for the three representative articles.
Node = article
(specific topic)
Available data:
1. Hourly access
frequency
(number of article
downloads for each
hour in ≈ 300 days)
2. Edit events
(time stamps for all changes
in the article pages)
10
Montag, 27. Mai 13
TIME SERIES :WIKIPEDIA ACTIVITY
Node = article
(specific topic)
Available data:
1. Hourly access
frequency
(number of article
downloads for each
hour in ≈ 300 days)
2. Edit events
(time stamps for all changes
in the article pages)
11
•filtering, resampling
•feature extraction (peak detection)
•creation of (non)-overlapping episodes or (sliding) windows
•creation of time series pairs for cross-correlation
or event synchronisation
Map-Reduce / UDF
preprocessing calculation on single records
Montag, 27. Mai 13
• Graph Analysis
• If network data is prepared as an adjacency list or an
adjacency matrix, tools like Giraph or HAMA work well.
But : only if one has the appro-
priate InputFormat Reader.
Montag, 27. Mai 13
TYPES OF NETWORKS
13
a) unipartite network, one type of nodes and links
b) bipartite network, one type of connections
c) hypergraph, one link relates more than two nodes
Montag, 27. Mai 13
MULTIPLEX NETWORKS
14
HIERARCHICAL
NETWORK
TYPES OF NETWORKS
Montag, 27. Mai 13
• Graph Analysis
• If network data is prepared as an adjacency list or an
adjacency matrix, tools like Giraph or HAMA work well.
• How to organize node/edge properties?
• How to deal with time dependent properties?
• How to calculate link properties on the fly?
•creation of adjacency matrix is not trivial
•adjacency matrix is an ineffiecient format
•in HDFS files can not be changed
•store dynamic edge / node properties in HBase
•aggregate relevant data to network snapshots
•store intermediate results back to HBase ...
and preprocess this data in following step
Montag, 27. Mai 13
RECAP : WHAT IS HADOOP ?
• distributed platform for processing massive amounts of data in parallel
• implements Map-Reduce paradigm on top of Hadoop Distributed File System.
• Map-Reduce : JAVA API to implement a map and a reduce phase
• Map phase uses key/value pairs,
• Reduce phase uses a key/value list
• HDFS files consist of one or more blocks
(distributed chunks of data)
• Chunks are distributed transparently
(in background) and processed in parallel
• using data locality when possible by
assigning the map task to a node that
contains the chunk locally.
Montag, 27. Mai 13
TYPICAL APPLICATIONS
• Filter, group, and join operations of large data sets ...
• the data set (or a part of it)* is streamed
and processed in parallel, but not in real time
• Algorithms like k-Means Clustering (in Apache Mahout)
or Map-Reduce based implementations of SSSP work
in multiple iterations
• data is loaded from disk to CPU in each iteration
* if partitio-ning is used
Montag, 27. Mai 13
OVERVIEW - DATA FLOW
18
Text
Montag, 27. Mai 13
OVERVIEW - DATA FLOW
19
Text
MR, UDF
Hive, Pig,
Impala
Giraph, Mahout
Sqoop
MR
Hive, Pig,
Impala
HBase
(cache)
HDFS
(static group)
HBase
(cache)
Network Clustering != k-Means Clustering
Montag, 27. Mai 13
RESULTS : CROSS-CORRELATION
20
 
 
Distribution of cross-correlation coefficients for pairs of
access-rate time series of Wikipedia pages (top) compared
to surrogat data (bottom) - 100 shuffled configurations are considered
Montag, 27. Mai 13
EVOLUTION OF CORRELATION NETWORKS
21
time
time
Montag, 27. Mai 13
EVOLUTION OF CORRELATION NETWORKS
21
time
time
Montag, 27. Mai 13
EVOLUTION OF CORRELATION NETWORKS
22
time
time
Obviously, the distribution of link strength
values changes in time, but a only a calcultion of
structural properties of the underlying network
allows a detailed view on the dynamics if the
system.
Montag, 27. Mai 13
EVOLUTION OF CORRELATION NETWORKS
22
time
time
Obviously, the distribution of link strength
values changes in time, but a only a calcultion of
structural properties of the underlying network
allows a detailed view on the dynamics if the
system.
Montag, 27. Mai 13
EXAMPLES OF RECONSTRUCTED NETWORKS
Spatially embedded correlation networks,
for wikipedia pages of all German cities.
Access network Edit network
23
Montag, 27. Mai 13
RECOMMENDATIONS (1.)
• Create algorithms based on reusable components!
• Select or create stable and standardized I/O-Formats!
• Use preprocessing for re-organization of unstructured data,
if you have to process the data many times.
• Store intermediate data (e.g. time dependent properties)
in HBase, near the raw data.
• CollectTime-Series data in HBase and create Time-Series
Buckets for advanced procedures on a subset of the data.
Montag, 27. Mai 13
RECOMMENDATIONS (2.)
• Consider Design Patterns :
• Partitioning vs. Binning
• Map-Side vs. Reduce-Side Joins
• Use Bulk Synchronuos Processing for graph processing
instead of Map-Reduce, or even a combination of both.
• In classical programming: (and also in Hadoop!!!)
find good data representation and find good algorithms
Montag, 27. Mai 13
REFERENCES
26
Montag, 27. Mai 13
27
Montag, 27. Mai 13

More Related Content

Viewers also liked

Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Cloudera, Inc.
 
Debugging (Docker) containers in production
Debugging (Docker) containers in productionDebugging (Docker) containers in production
Debugging (Docker) containers in productionbcantrill
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in ImpalaCloudera, Inc.
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationAlex Moundalexis
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Cloudera, Inc.
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 

Viewers also liked (10)

Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
Data Modeling for Data Science: Simplify Your Workload with Complex Types in ...
 
Debugging (Docker) containers in production
Debugging (Docker) containers in productionDebugging (Docker) containers in production
Debugging (Docker) containers in production
 
Nested Types in Impala
Nested Types in ImpalaNested Types in Impala
Nested Types in Impala
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 

More from Dr. Mirko Kämpf

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the CloudsDr. Mirko Kämpf
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)Dr. Mirko Kämpf
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentationDr. Mirko Kämpf
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata IntegrationDr. Mirko Kämpf
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningDr. Mirko Kämpf
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4Dr. Mirko Kämpf
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationDr. Mirko Kämpf
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems ResearchDr. Mirko Kämpf
 

More from Dr. Mirko Kämpf (13)

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the Clouds
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentation
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata Integration
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on Scale
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation Optimization
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
 

Recently uploaded

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Recently uploaded (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Meetup - Hadoop User Group - Munich : 2013-05-22

  • 1. Headline Goes Here Speaker Name or Subhead Goes Here Hadoop in Social Network Analysis - review on tools and some best practices - Hadoop User Group Munich | 2013-05-22 Mirko Kämpf | mirko@cloudera.com Montag, 27. Mai 13
  • 2. OVERVIEW ... About me ... (Java, Physics,Wikipedia, Hadoop & Cloudera) Topic : Complex Systems Analysis, from Time Series to Networks Data, data, and even more data ... but how to handle it? Some results of our project ... Lessons learned ... some recommendations. Montag, 27. Mai 13
  • 3. urlurl Hadoop in Social Network Analysis Abstract: A Hadoop cluster is the tool of choice for many large scale analytics applications. A large variety of commercial tools is available for typical SQL like or data warehouse applications, but how to deal with networks and time series? How to collect and store data for social media analysis and what are good practices for working with libraries like Mahout and Giraph? The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing and for time dependent graph analysis. Social networks consist of nodes, which are the real world objects and edges, which are e.g. relations, interactions or dependencies between nodes. Montag, 27. Mai 13
  • 4. urlurl Hadoop in Social Network Analysis Abstract: A Hadoop cluster is the tool of choice for many large scale analytics applications. A large variety of commercial tools is available for typical SQL like or data warehouse applications, but how to deal with networks and time series? How to collect and store data for social media analysis and what are good practices for working with libraries like Mahout and Giraph? The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing and for time dependent graph analysis. Social networks consist of nodes, which are the real world objects and edges, which are e.g. relations, interactions or dependencies between nodes. Datatypes I/O-Formats Storage System Random-Access Bulk Processing HBaseOpenTSDB Hive Pig SequenceFile VectorWriteable Partitions Sampling Rate images from Google image search ... Montag, 27. Mai 13
  • 5. INTRODUCTION Social online-systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social on-line systems and their relations. 5 Montag, 27. Mai 13
  • 6. INTRODUCTION Webpages (the nodes of the WWW) are linked in different, but related ways: direct links pointing from one page to another (binary, directional) similar access activity (cross-correlated time series of download rates) similar edit activity (synchronized events of edits or changes) We extract the time-evolution of these three networks from real data. Nodes are identical for all three studied networks, but links and network structure as well as dynamics are different.We quantify how the inter- relations and inter-dependencies between the three networks change in time and affect each other. Social online-systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social on-line systems and their relations. 5 Montag, 27. Mai 13
  • 7. INTRODUCTION Webpages (the nodes of the WWW) are linked in different, but related ways: direct links pointing from one page to another (binary, directional) similar access activity (cross-correlated time series of download rates) similar edit activity (synchronized events of edits or changes) We extract the time-evolution of these three networks from real data. Nodes are identical for all three studied networks, but links and network structure as well as dynamics are different.We quantify how the inter- relations and inter-dependencies between the three networks change in time and affect each other. Example: Wikipedia → reconstruct co-evolving networks 1. Cross-link network between articles (pages, nodes) 2. Access behavior network (characterizing similarity in article reading behavior) 3. Edit activity network (characterizing similarities in edit activity for each article) Social online-systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social on-line systems and their relations. 5 Montag, 27. Mai 13
  • 8. CHALLENGES ... 6 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Time evolution ??? Montag, 27. Mai 13
  • 9. CHALLENGES ... 6 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Time evolution ??? Montag, 27. Mai 13
  • 10. CHALLENGES ... 7 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” Time evolution ??? Montag, 27. Mai 13
  • 11. CHALLENGES ... 7 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” Time evolution ??? Montag, 27. Mai 13
  • 12. CHALLENGES ... 8 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” System properties Derived from relations between elements and structure of the network Time evolution ??? Montag, 27. Mai 13
  • 13. CHALLENGES ... 8 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” System properties Derived from relations between elements and structure of the network Time evolution ??? Montag, 27. Mai 13
  • 14. TYPICALTASK : • Time Series Analysis • if our data set is prepared and we have records with well defined properties*, than UDF in Hive and Pig work well ( ) • How to organize the data in records? • How to deal with sliding windows? • How to handle intermediate data? Montag, 27. Mai 13
  • 15. TIME SERIES :WIKIPEDIA ACTIVITY Examples of Wikipedia access time series for three articles with (a,b) stationary access rates ('Illuminati (book)'), (c,d) an endogenous burst of activity ('Heidelberg'), and (e,f) an exogenous burst of activity ('Amoklauf Erfurt'). The left parts show the complete hourly access rate time series (from January 1, 2009, till October 21, 2009; i.e. for 42 weeks = 294 days = 7056 hours). The right parts show edit-event data for the three representative articles. Node = article (specific topic) Available data: 1. Hourly access frequency (number of article downloads for each hour in ≈ 300 days) 2. Edit events (time stamps for all changes in the article pages) 10 Montag, 27. Mai 13
  • 16. TIME SERIES :WIKIPEDIA ACTIVITY Node = article (specific topic) Available data: 1. Hourly access frequency (number of article downloads for each hour in ≈ 300 days) 2. Edit events (time stamps for all changes in the article pages) 11 •filtering, resampling •feature extraction (peak detection) •creation of (non)-overlapping episodes or (sliding) windows •creation of time series pairs for cross-correlation or event synchronisation Map-Reduce / UDF preprocessing calculation on single records Montag, 27. Mai 13
  • 17. • Graph Analysis • If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or HAMA work well. But : only if one has the appro- priate InputFormat Reader. Montag, 27. Mai 13
  • 18. TYPES OF NETWORKS 13 a) unipartite network, one type of nodes and links b) bipartite network, one type of connections c) hypergraph, one link relates more than two nodes Montag, 27. Mai 13
  • 20. • Graph Analysis • If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or HAMA work well. • How to organize node/edge properties? • How to deal with time dependent properties? • How to calculate link properties on the fly? •creation of adjacency matrix is not trivial •adjacency matrix is an ineffiecient format •in HDFS files can not be changed •store dynamic edge / node properties in HBase •aggregate relevant data to network snapshots •store intermediate results back to HBase ... and preprocess this data in following step Montag, 27. Mai 13
  • 21. RECAP : WHAT IS HADOOP ? • distributed platform for processing massive amounts of data in parallel • implements Map-Reduce paradigm on top of Hadoop Distributed File System. • Map-Reduce : JAVA API to implement a map and a reduce phase • Map phase uses key/value pairs, • Reduce phase uses a key/value list • HDFS files consist of one or more blocks (distributed chunks of data) • Chunks are distributed transparently (in background) and processed in parallel • using data locality when possible by assigning the map task to a node that contains the chunk locally. Montag, 27. Mai 13
  • 22. TYPICAL APPLICATIONS • Filter, group, and join operations of large data sets ... • the data set (or a part of it)* is streamed and processed in parallel, but not in real time • Algorithms like k-Means Clustering (in Apache Mahout) or Map-Reduce based implementations of SSSP work in multiple iterations • data is loaded from disk to CPU in each iteration * if partitio-ning is used Montag, 27. Mai 13
  • 23. OVERVIEW - DATA FLOW 18 Text Montag, 27. Mai 13
  • 24. OVERVIEW - DATA FLOW 19 Text MR, UDF Hive, Pig, Impala Giraph, Mahout Sqoop MR Hive, Pig, Impala HBase (cache) HDFS (static group) HBase (cache) Network Clustering != k-Means Clustering Montag, 27. Mai 13
  • 25. RESULTS : CROSS-CORRELATION 20     Distribution of cross-correlation coefficients for pairs of access-rate time series of Wikipedia pages (top) compared to surrogat data (bottom) - 100 shuffled configurations are considered Montag, 27. Mai 13
  • 26. EVOLUTION OF CORRELATION NETWORKS 21 time time Montag, 27. Mai 13
  • 27. EVOLUTION OF CORRELATION NETWORKS 21 time time Montag, 27. Mai 13
  • 28. EVOLUTION OF CORRELATION NETWORKS 22 time time Obviously, the distribution of link strength values changes in time, but a only a calcultion of structural properties of the underlying network allows a detailed view on the dynamics if the system. Montag, 27. Mai 13
  • 29. EVOLUTION OF CORRELATION NETWORKS 22 time time Obviously, the distribution of link strength values changes in time, but a only a calcultion of structural properties of the underlying network allows a detailed view on the dynamics if the system. Montag, 27. Mai 13
  • 30. EXAMPLES OF RECONSTRUCTED NETWORKS Spatially embedded correlation networks, for wikipedia pages of all German cities. Access network Edit network 23 Montag, 27. Mai 13
  • 31. RECOMMENDATIONS (1.) • Create algorithms based on reusable components! • Select or create stable and standardized I/O-Formats! • Use preprocessing for re-organization of unstructured data, if you have to process the data many times. • Store intermediate data (e.g. time dependent properties) in HBase, near the raw data. • CollectTime-Series data in HBase and create Time-Series Buckets for advanced procedures on a subset of the data. Montag, 27. Mai 13
  • 32. RECOMMENDATIONS (2.) • Consider Design Patterns : • Partitioning vs. Binning • Map-Side vs. Reduce-Side Joins • Use Bulk Synchronuos Processing for graph processing instead of Map-Reduce, or even a combination of both. • In classical programming: (and also in Hadoop!!!) find good data representation and find good algorithms Montag, 27. Mai 13