MULTIMEDIA BIG DATA COMPUTING FOR TREND DETECTION
MASTER THESIS
Director: Ruben Tous Liesa, PhD
Codirector: Jordi Torres Viñals, PhD
Presented by: Omar Iván Sulca Correa
FACULTAT D’INFORMÀTICA DE BARCELONA
Agenda
1. Introduction
2. Background
3. Multimedia Big Data Computing for Trend Detection
4. Results and Conclusions
Introduction
• The relevance of social media data has grown explosively in the last few years, because users' interactions and communications on social networks provide key information to both government and non-government organizations.
• Social media data is vast, noisy, distributed, unstructured, and dynamic in nature, so traditional analysis methods prove inefficient and expensive when applied to it. New alternatives are needed.
• Photo-sharing social networks such as Instagram hold considerable potential, especially for digital marketing.
This work is a proof of concept of the streaming and machine learning functionality of the Big Data platform Apache Spark, using the Spark subprojects MLlib and Spark Streaming. It seeks to implement an application that finds trending topics (using the LDA model) in data collected from the social network Instagram.
II. Background
Apache Kafka
• Apache Kafka is an open source, distributed, partitioned, and replicated commit-log-based publish-subscribe messaging system.
• Streams
• Batch
[Figure: a Kafka topic drawn as a log running from older to newer messages, with a producer writing to it and a consumer reading from it]
[Figure: a Kafka cluster (Broker 1, Broker 2, Broker 3) with ZooKeeper; front-end producers on one side and Hadoop, real-time, and security consumers on the other]
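The commit-log idea behind a Kafka topic can be sketched in a few lines of plain Scala. This is a toy in-memory model, not the Kafka client API; `CommitLog`, `append`, `readFrom`, and `CommitLogDemo` are hypothetical names introduced only for illustration, and partitioning and replication are left out.

```scala
import scala.collection.mutable

// Toy in-memory model of a single-partition commit log; NOT the Kafka client API.
final class CommitLog[A] {
  private val entries = mutable.ArrayBuffer.empty[A]

  // A producer appends at the end; the returned offset is the record's position.
  def append(record: A): Long = {
    entries += record
    (entries.length - 1).toLong
  }

  // A consumer reads forward from an offset that it tracks itself.
  def readFrom(offset: Long, maxRecords: Int): Vector[A] =
    entries.slice(offset.toInt, offset.toInt + maxRecords).toVector
}

object CommitLogDemo {
  def run(): (Vector[String], Vector[String]) = {
    val log = new CommitLog[String]
    Seq("post-1", "post-2", "post-3").foreach(log.append)
    // Two independent readers of the same log: one that only wants the newest
    // record and one that replays everything from the oldest record.
    val newestOnly = log.readFrom(offset = 2L, maxRecords = 10)
    val fullReplay = log.readFrom(offset = 0L, maxRecords = 10)
    (newestOnly, fullReplay)
  }
}
```

Because records stay in the log and each reader keeps its own offset, the same topic can serve a reader that follows new messages and a reader that replays older ones.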
Apache Spark (I)
• Apache Spark is an open source cluster-computing platform designed to be fast and general-purpose.
• It combines different types of computation on a single platform: batch applications, iterative algorithms, interactive queries, and streaming.
• Spark supports in-memory processing, which the project reports as up to 100x faster than Hadoop MapReduce.
• An RDD (Resilient Distributed Dataset) represents a collection of elements that can be manipulated through transformations and actions.
[Figure: the Apache Spark stack. Spark SQL (DataFrame), Spark Streaming (DStream), MLlib (Vector & Matrix), and GraphX (Vertex & Edge) sit on top of the core RDD API]
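The split between RDD transformations and actions can be illustrated by analogy with lazy views in the Scala standard library. The sketch below does not use Spark; `LazyPipelineDemo` and its counter are hypothetical and only show that describing a pipeline does no work until a result is requested.

```scala
// Plain Scala collections used as an analogy; this does not call the Spark API.
object LazyPipelineDemo {
  def run(): (Int, Int, Int) = {
    var evaluations = 0
    val numbers = List(1, 2, 3, 4, 5)

    // "Transformations": building the view only describes the computation.
    val squaresOfEvens = numbers.view
      .filter(_ % 2 == 0)
      .map { n => evaluations += 1; n * n }
    val evaluationsBeforeAction = evaluations

    // "Action": asking for a result forces the pipeline to run.
    val total = squaresOfEvens.sum
    (evaluationsBeforeAction, total, evaluations)
  }
}
```

In Spark the same pattern applies at cluster scale: calls such as `map` and `filter` build up the computation, and an action such as `count` or `collect` triggers it.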
Apache Spark (II)
• Spark Streaming is a Spark component that enables the processing of live data streams. It provides an API for manipulating data streams that closely matches Spark Core's RDD API.
• Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream.
• A DStream is a sequence of RDDs that represents data arriving over time.
• Input sources: sockets, file streams, actors (Akka), queues of RDDs
• Operations: transformations, window operations, output operations
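The discretization that gives the DStream its name can be sketched without Spark. The code below is a toy model, assuming events carry a millisecond timestamp; `MicroBatchDemo`, `Event`, and `discretize` are hypothetical names, and unlike a real DStream the sketch emits no batch for an interval in which nothing arrived.

```scala
// Toy model of discretization; not the Spark Streaming API.
object MicroBatchDemo {
  final case class Event(timestampMs: Long, text: String)

  // Assign every event to the batch interval that contains its timestamp and
  // return the non-empty batches in time order, as (batch start, texts) pairs.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy { case (batchIndex, _) => batchIndex }
      .map { case (batchIndex, batch) =>
        (batchIndex * batchIntervalMs, batch.sortBy(_.timestampMs).map(_.text))
      }
}
```

With a 5000 ms interval, events at 1000 ms and 4999 ms land in the batch starting at 0, and an event at 5000 ms opens the next batch. Each batch plays the role of one RDD in the sequence.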
Topic Modeling
• Topic modeling is a suite of statistical algorithms that aim to discover the thematic information in large archives of documents and to annotate the documents with it.
• Topic models do not require any prior annotation or labeling of the documents; the topics emerge from the analysis of the original texts.
Latent Dirichlet Allocation (LDA)
• $\beta_{1:K}$: the topics, where each $\beta_k$ is a distribution over the vocabulary
• $\theta_d$: the topic proportions for the $d$-th document, where $\theta_{d,k}$ is the proportion of topic $k$ in document $d$
• $z_d$: the topic assignments for the $d$-th document, where $z_{d,n}$ is the topic assignment for the $n$-th word in document $d$
• $w_d$: the observed words of document $d$, where $w_{d,n}$ is the $n$-th word in document $d$, an element of the fixed vocabulary
Joint distribution of the hidden and observed variables:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
[Figure: LDA illustration with documents, their topic proportions and assignments, and four example topics (word, probability): gene 0.04, dna 0.02, genetic 0.01; life 0.02, evolve 0.01, organism 0.01; brain 0.04, neuron 0.02, nerve 0.01; data 0.02, number 0.02, computer 0.01]
III. Multimedia Big Data
Computing for Trend Detection
Overview
Components: Instagram's API, JSON files, Kafka, Apache Spark (Spark SQL, Spark Streaming, MLlib)
Stages: ingest, read data, pre-processing & filtering, topic modeling, visualization
Stage 1: Data Ingest
Components: Instagram's API, JSON files, Kafka
• Steps: registration, authentication (OAuth2), request
• Credentials: client ID, access token, callback URL
• Basic objects: users, tags, locations, geographies
• Collected field: caption text
• Tags: #desigual and #lavidaeschula (marketing campaign), #worldcup2014 (event)
Stage 2: Read Data
Components: JSON files, Kafka, Spark SQL
• Kafka cluster: single node, single broker, with a producer, a consumer, and ZooKeeper
• Avro schema:
{
  "type": "record",
  "name": "JSON",
  "namespace": "avro",
  "fields": [ {
    "name": "text",
    "type": "string",
    "doc": "The content of the user's message"
  } ],
  "doc": "A basic schema for storing Instagram metadata"
}
val batchInterval = Seconds(5)
"testing-input"
Stage 3: Filtering & Preprocessing
Components: Kafka (consumer), Spark SQL, Spark Streaming
• Removed because they provide no benefit to the analysis: unnecessary characters and stop words
• LDA and the other ML algorithms in MLlib do not work on streams, which is why the pre-processed results must be stored in files.
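The two filters named on this slide can be sketched in plain Scala. This is a minimal illustration, not the thesis code: `PreprocessDemo`, `tokenize`, and the `minLength` cutoff are hypothetical, and the stop-word list is a tiny English sample, whereas the collected captions are multilingual.

```scala
object PreprocessDemo {
  // Small illustrative stop-word list; a real run would need fuller, per-language lists.
  val stopWords: Set[String] = Set("the", "a", "is", "and", "of", "to", "in")

  // Lower-case the caption, replace every character that is neither a letter
  // nor whitespace with a space, split on whitespace, and drop stop words and
  // tokens shorter than minLength.
  def tokenize(caption: String, minLength: Int = 3): Seq[String] =
    caption.toLowerCase
      .replaceAll("[^\\p{L}\\s]", " ")
      .split("\\s+")
      .toSeq
      .filter(token => token.length >= minLength && !stopWords.contains(token))
}
```

The `\p{L}` character class keeps letters from any script, so Arabic or Portuguese words survive the cleaning step while digits, emoticons, and the `#` sign are dropped.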
Stage 4: Topic Modeling
Components: Spark Streaming, MLlib
• With Spark 1.3, MLlib supports Latent Dirichlet Allocation (LDA), the first MLlib algorithm built upon GraphX.
• Inference uses Expectation-Maximization (EM).
• Input: documents represented as vectors of word counts, that is, (word, frequency) pairs
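How tokenized captions become vectors of word counts can be sketched in plain Scala. The helper names `CountVectorDemo`, `buildVocabulary`, and `toCountVector` are hypothetical, and the alphabetical ordering of the vocabulary is an assumption made only to keep the example deterministic.

```scala
object CountVectorDemo {
  // Build a vocabulary (word -> column index) from all tokenized documents.
  def buildVocabulary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Turn one document into a dense vector of word counts over the vocabulary;
  // words missing from the vocabulary are ignored.
  def toCountVector(doc: Seq[String], vocabulary: Map[String, Int]): Array[Double] = {
    val counts = Array.fill(vocabulary.size)(0.0)
    doc.foreach(word => vocabulary.get(word).foreach(index => counts(index) += 1.0))
    counts
  }
}
```

In MLlib each array would typically be wrapped with `Vectors.dense` and paired with a document ID, since `LDA.run` expects an RDD of (ID, vector) pairs. The inverse of the vocabulary map is what later turns term indices back into words.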
Stage 5: Visualization
Components: MLlib
// Print the ten highest-weighted terms of each topic.
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
topicIndices.foreach { case (terms, termWeights) =>
  println("TOPIC:")
  // vocabArray maps each term index back to its word.
  terms.zip(termWeights).foreach { case (term, weight) =>
    println(s"${vocabArray(term.toInt)}\t$weight")
  }
  println()
}
Sample output:
TOPIC:
forza 0.005192139298507153
after 0.004892488884851435
side 0.004873250064472125
first 0.004837548137749428
partido 0.004762856710768013
netherlands 0.004658646778099761
going 0.004562908515970665
IV. Results and conclusions
Experiments
Hardware: Core i5-3337U at 1.80 GHz, 6 GB RAM, Ubuntu 14.04
Software: Apache Spark 1.3, Scala 2.11, JDK 7
Spark cluster: standalone mode, 1 master and 5 workers
Datasets (2014; the slide's timeline marks August and November):
                   #desigual    #Lavidaeschula    #worldcup14
Processed posts    3258         12471             501
Size               8.4 MB       33.7 MB           995.1 KB
Number of words    43763        185390            6594
Results (I)
#worldcup2014 dataset. Iterations: 3, 5, 7. Topics: 5. Processing time: 145 s.
Topic 1                Topic 2                Topic 3                Topic 4                Topic 5
forza 0.0063970        forza 0.0051921        partido 0.0053369      your 0.0055476         netherlands 0.0061293
partido 0.0063084      after 0.0048924        number 0.0052725       still 0.0048858        mexico 0.0049672
after 0.0061240        side 0.0048732         love 0.0050586         going 0.0047801        fifa 0.0047969
going 0.0060892        first 0.0048375        fifa 0.0048055         good 0.0045066         brazil 0.0046036
more 0.0056626         partido 0.0047628      after 0.0047837        italy 0.0043234        viva 0.0045627
love 0.0051733         netherlands 0.0046586  mundial 0.0046185      first 0.0043203        number 0.0045547
netherlands 0.0049420  going 0.0045629        forza 0.0045615        before 0.0042764       about 0.0045023
fifa 0.0049102         more 0.0045417         italy 0.0044705        partido 0.0042220      going 0.0043588
watching 0.0047937     viva 0.0043852         viva 0.0043077         after 0.0041471        forza 0.0042632
your 0.0047628         watching 0.0042613     life 0.0042818         netherlands 0.0041207  mundo 0.0042261
Annotations on the slide: Netherlands vs Mexico; Italy
Results (II)
Topic 1               Topic 2               Topic 3               Topic 4               Topic 5
ماشاءالله 0.0044680   just 0.0038730        ماشاءالله 0.0030372   ماشاءالله 0.0034158   ماشاءالله 0.0035831
just 0.0041317        ماشاءالله 0.0035151   that 0.0028397        just 0.0028399        يسعدلي 0.0027622
like 0.0030537        that 0.0031203        just 0.0026653        like 0.0028013        like 0.0026387
that 0.0027004        like 0.0026153        like 0.0024058        that 0.0024385        just 0.0025732
happy 0.0024657       االتي 0.0025304       سعاده 0.0018991       التي 0.0023819        الأعزاء 0.0024545
يسعدلي 0.0024515      يسعدلي 0.0024014      todos 0.0018570       يسعدلي 0.0021320      that 0.0024106
best 0.0023213        التي 0.0023665        يسعدلي 0.0018346      best 0.0020799        minha 0.0023892
minha 0.0020576       الحنين 0.0019909      الحنين 0.0018248      happy 0.0020573       متابعيني 0.0021237
todos 0.0020428       best 0.0019291        الله 0.0017642        سعاده 0.0018726       سعاده 0.0020168
الله 0.0019363        apenas 0.0019265      from 0.0017207        last 0.0018066        from 0.0019959
#Desigual dataset. Iterations: 2. Topics: 5. Processing time: 53 min.
• Bangkok (Thailand)
• Bhaucha Dhakka (India)
• Maldives (Mumbai)
#Lavidaeschula dataset (?). Iterations: 1. Topics: 5. Processing time: 129 min.
Conclusions
• Spark fulfills its purpose efficiently; however, Spark Streaming is not yet entirely stable.
• The algorithms available in MLlib are basic and do not work on streams (as of Spark 1.3.0).
• The factors that influence the performance of the application are the size of the dataset, the number of iterations, and the number of topics searched for.
• The dataset must contain a certain minimum number of words for the results to be consistent.
Thanks