Tfm slides

MULTIMEDIA BIG DATA COMPUTING FOR TREND DETECTION
MASTER THESIS
Director: Ruben Tous Liesa, PhD
Codirector: Jordi Torres Viñals, PhD
Presented by: Omar Iván Sulca Correa
FACULTAT D’INFORMÀTICA DE BARCELONA

Agenda
1. Introduction
2. Background
3. Multimedia Big Data Computing for Trend Detection
4. Results and Conclusions

Introduction
• The relevance of social media data has grown explosively in the last few years, because
users' interactions and communications in social networks provide key information for
government and non-government organizations.
• Social media data is vast, noisy, distributed, unstructured, and dynamic in nature; traditional
analysis methods therefore prove inefficient and expensive, and new alternatives are needed.
• Photo-sharing social networks such as Instagram hold great potential, especially for digital
marketing.
This work is a proof of concept of the streaming and machine learning functionality of the Big
Data platform Apache Spark, using its subprojects Spark Streaming and MLlib. It implements an
application that finds trending topics (using the LDA model) in data collected from the social
network Instagram.

II. Background

Apache Kafka
• Apache Kafka is an open source, distributed, partitioned, and replicated commit-log-based
publish-subscribe messaging system.
• It feeds both stream and batch processing.
[Figure: a Kafka topic drawn as a log running from older to newer messages, with a producer and a consumer]
[Figure: a Kafka cluster (Broker 1, Broker 2, Broker 3) with Zookeeper, between producers and consumers; labels: Front End (x3), Hadoop, Real Time, Security]
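The commit-log idea behind a Kafka topic can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: TopicLog and its methods are hypothetical names, not the Kafka client API, and a single in-memory partition stands in for a replicated cluster.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical in-memory stand-in for one partition of a Kafka topic.
// It is NOT the Kafka client API; it only illustrates the commit-log idea.
final class TopicLog {
  private val messages = ArrayBuffer.empty[String]

  // A producer appends at the "newer" end of the log.
  // The returned offset is the position of the message in the log.
  def append(message: String): Long = {
    messages += message
    messages.length - 1L
  }

  // A consumer reads from an offset it tracks itself.
  // Reading does not remove messages from the log.
  def read(fromOffset: Long, maxMessages: Int): Seq[String] =
    messages.slice(fromOffset.toInt, fromOffset.toInt + maxMessages).toList
}

object TopicLogDemo {
  def main(args: Array[String]): Unit = {
    val log = new TopicLog
    Seq("post-1", "post-2", "post-3").foreach(log.append)

    var offset = 0L                        // consumer-side position in the log
    val firstBatch = log.read(offset, 2)   // the two oldest messages
    offset += firstBatch.size
    val secondBatch = log.read(offset, 2)  // whatever is left after them
    println(firstBatch.mkString(", "))
    println(secondBatch.mkString(", "))
  }
}
```

Because each consumer keeps its own offset, several consumers can read the same log at different speeds, which is what lets one topic feed both stream and batch processing.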

Apache Spark (I)
• Apache Spark is an open source cluster-computing platform designed to be fast and
general-purpose.
• It allows different types of computation to be combined on a single platform: batch
applications, iterative algorithms, interactive queries, and streaming.
• Spark supports in-memory processing, which allows performance up to 100x faster than
Hadoop MapReduce.
• An RDD (Resilient Distributed Dataset) represents a collection of elements that can be
manipulated in parallel through transformations and actions.
[Figure: the Apache Spark stack. Spark SQL (DataFrame), Spark Streaming (DStream), MLlib (Vector & Matrix), and GraphX (Vertex & Edge), built on RDDs with actions and transformations]

Apache Spark (II)
• Spark Streaming is a Spark component that enables processing of live streams of
data. It provides an API for manipulating data streams that closely matches the Spark
Core RDD API.
• Spark Streaming provides a high-level abstraction called Discretized Stream, or DStream.
• A DStream is a sequence of data arriving over time, represented as a series of RDDs.
[Figure: Spark Streaming. Input sources: Sockets, File Stream, Actors (Akka), Queue of RDDs. Operations: Transformations, Window Operations, Output Operations]
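How a live stream becomes a series of batches, and how a window operation combines them, can be sketched without Spark. The following is a minimal plain-Scala illustration, not the Spark Streaming API: MicroBatchSketch, toBatches, and windowed are hypothetical names, and each inner Seq plays the role of one RDD of a DStream.

```scala
object MicroBatchSketch {
  // An event is (arrival time in milliseconds, payload).
  type Event = (Long, String)

  // Assigns each event to the batch whose interval contains its arrival time
  // and returns the batches in time order.
  // Simplification: intervals without events produce no batch at all.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[Seq[String]] =
    events
      .groupBy { case (timeMs, _) => timeMs / batchIntervalMs }
      .toSeq
      .sortBy { case (batchIndex, _) => batchIndex }
      .map { case (_, batch) => batch.sortBy(_._1).map(_._2) }

  // A window operation: concatenates `windowLength` consecutive batches,
  // sliding forward one batch at a time.
  def windowed(batches: Seq[Seq[String]], windowLength: Int): Seq[Seq[String]] =
    batches.sliding(windowLength).map(_.flatten).toList

  def main(args: Array[String]): Unit = {
    val events = Seq((1000L, "a"), (4000L, "b"), (6000L, "c"), (12000L, "d"))
    val batches = toBatches(events, batchIntervalMs = 5000L)
    batches.foreach(batch => println(batch.mkString("batch: ", ", ", "")))
    windowed(batches, windowLength = 2)
      .foreach(window => println(window.mkString("window: ", ", ", "")))
  }
}
```

With a 5-second interval the four events fall into three batches, and a window of two batches yields two overlapping windows.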

Topic Modeling
• Topic modeling is a suite of statistical algorithms that aim to discover and annotate
large archives of documents with thematic information.
• Topic models do not require any prior annotations or labeling of the documents; the
topics emerge from the analysis of the original texts.
Latent Dirichlet Allocation (LDA)
• $\beta_{1:K}$: the topics
• $\beta_k$: a distribution over the vocabulary
• $\theta_d$: the topic proportions for the $d$th document
• $\theta_{d,k}$: the topic proportion of topic $k$ in document $d$
• $z_d$: the topic assignments for the $d$th document
• $z_{d,n}$: the topic assignment for the $n$th word in document $d$
• $w_d$: the observed words in document $d$
• $w_{d,n}$: the $n$th word in document $d$, which is an element of the fixed vocabulary
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
[Figure: documents, topic proportions and assignments, and topics]
Topics (word, probability):
• gene 0.04, dna 0.02, genetic 0.01, …
• life 0.02, evolve 0.01, organism 0.01
• brain 0.04, neuron 0.02, nerve 0.01
• data 0.02, number 0.02, computer 0.01
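As a worked example, the per-document part of the LDA joint distribution, the product over words of p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}), can be evaluated directly for a tiny document. This is a sketch under assumptions: LdaJointSketch and logInnerProduct are hypothetical names, the topic proportions (0.5, 0.5) are made up for illustration, the word probabilities are taken from the topics figure, and the Dirichlet priors p(β_i) and p(θ_d) are left out.

```scala
object LdaJointSketch {
  // theta: topic proportions of one document (theta_d), indexed by topic.
  // beta:  one word distribution per topic (beta_k), as word -> probability.
  // z:     topic assignment of each word (z_{d,n}).
  // w:     the observed words (w_{d,n}).
  // Returns log prod_n p(z_{d,n} | theta_d) * p(w_{d,n} | beta, z_{d,n});
  // the prior terms p(beta_i) and p(theta_d) are NOT included.
  def logInnerProduct(
      theta: Vector[Double],
      beta: Vector[Map[String, Double]],
      z: Seq[Int],
      w: Seq[String]): Double = {
    require(z.length == w.length, "one topic assignment per word")
    z.zip(w).map { case (topic, word) =>
      math.log(theta(topic)) + math.log(beta(topic)(word))
    }.sum
  }

  def main(args: Array[String]): Unit = {
    val theta = Vector(0.5, 0.5) // assumed proportions, for illustration only
    val beta = Vector(
      Map("gene" -> 0.04, "dna" -> 0.02),      // topic 0
      Map("data" -> 0.02, "computer" -> 0.01)  // topic 1
    )
    // Two-word document: "gene" assigned to topic 0, "data" to topic 1.
    val logP = logInnerProduct(theta, beta, Seq(0, 1), Seq("gene", "data"))
    println(math.exp(logP)) // 0.5 * 0.04 * 0.5 * 0.02, up to rounding
  }
}
```

Working in log space avoids underflow once documents have more than a handful of words, since each factor is a small probability.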

III. Multimedia Big Data Computing for Trend Detection

Overview
Stages: Ingest → Read Data → Pre-processing & filtering → Topic Modeling → Visualization
Components: Instagram's API, JSON files, Kafka, Apache Spark (Spark SQL, Spark Streaming, MLlib)

Stage 1: Data Ingest
[Pipeline: Instagram's API, JSON files, Kafka (Ingest)]
Instagram's API
• OAuth2: Registration, Authentication, Request
• Credentials: Client ID, Access Token, Callback URL
• Basic objects: Users, Tags, Locations, Geographies
• Collected field: Caption Text
Tags collected (marketing campaigns and an event): #desigual, #lavidaeschula, #worldcup2014

Stage 2: Read Data
[Pipeline: Instagram's API, JSON files, Kafka (Ingest); Spark SQL (Read Data)]
• Kafka cluster: single node, single broker, with Zookeeper, a producer, and a consumer
• Topic: "testing-input"
• Batch interval: val batchInterval = Seconds(5)
• Avro schema:
{
  "type": "record",
  "name": "JSON",
  "namespace": "avro",
  "fields": [ {
    "name": "text",
    "type": "string",
    "doc": "The content of the user's message"
  } ],
  "doc": "A basic schema for storing Instagram metadata"
}
Stage 3: Filtering & Preprocessing
[Pipeline: Kafka consumer, Spark SQL (Read Data), Spark Streaming (Pre-processing & filtering)]
Removed because they do not provide any benefit:
• Unnecessary characters
• Stop words
LDA and the other ML algorithms in MLlib do not work on streaming data, which is why the
results must be stored in files.
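The filtering step described above can be sketched in plain Scala. This is an assumption-laden illustration, not the thesis code: CaptionCleaner and clean are hypothetical names, the stop-word list is a tiny made-up sample, and "unnecessary characters" is interpreted here as everything that is not a letter.

```scala
object CaptionCleaner {
  // Illustrative stop-word list; the list used in the thesis is not given in the slides.
  val stopWords: Set[String] =
    Set("the", "a", "an", "and", "of", "to", "in", "is", "el", "la", "de", "y")

  // Lowercases a caption, replaces every non-letter character with a space,
  // splits on whitespace, and drops empty tokens and stop words.
  // Char.isLetter is Unicode-aware, so non-Latin scripts are kept.
  def clean(caption: String): Seq[String] =
    caption.toLowerCase
      .map(c => if (c.isLetter) c else ' ')
      .split("\\s+")
      .filter(token => token.nonEmpty && !stopWords.contains(token))
      .toList

  def main(args: Array[String]): Unit = {
    println(clean("Viva la #WorldCup2014!! Forza Italy :)").mkString(" "))
  }
}
```

Note that this interpretation also strips digits and the leading "#", so a hashtag such as #WorldCup2014 is reduced to the token worldcup.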

Stage 4: Topic Modeling
[Pipeline: Spark Streaming (Pre-processing & filtering), MLlib (Topic Modeling)]
• With Spark 1.3, MLlib supports Latent Dirichlet Allocation (LDA), the first MLlib algorithm
built upon GraphX.
• Inference uses Expectation-Maximization (EM).
• Input: vectors of word counts, (word, frequency).
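Turning cleaned tokens into vectors of word counts can be sketched as follows. This is a minimal plain-Scala illustration with hypothetical names (WordCountVectors, buildVocabulary, toCountVector); it assumes an alphabetically ordered vocabulary and a sparse (index, count) representation, neither of which is stated in the slides.

```scala
object WordCountVectors {
  // Maps every distinct word to a fixed index (the vocabulary), in alphabetical order.
  def buildVocabulary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Represents one document as sparse (vocabulary index, count) pairs, sorted by index.
  // Words that are not in the vocabulary are ignored.
  def toCountVector(doc: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    doc.filter(vocab.contains)
      .groupBy(identity)
      .map { case (word, occurrences) => (vocab(word), occurrences.size.toDouble) }
      .toList
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("forza", "italy", "forza"), Seq("viva", "mexico"))
    val vocab = buildVocabulary(docs)
    docs.foreach(doc => println(toCountVector(doc, vocab)))
  }
}
```

In a Spark job, pairs of this shape could presumably be passed to MLlib's Vectors.sparse(vocab.size, pairs) and keyed by a document ID to form the corpus that LDA is run on; that wiring is left out here to keep the sketch free of dependencies.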

Stage 5: Visualization
[Pipeline: MLlib (Topic Modeling), Visualization]
// Print the 10 highest-weighted terms of each topic, one "term<TAB>weight" pair per line.
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
topicIndices.foreach { case (terms, termWeights) =>
  println("TOPIC:")
  terms.zip(termWeights).foreach { case (term, weight) =>
    // vocabArray maps a term index back to the word it stands for
    println(s"${vocabArray(term.toInt)}\t$weight")
  }
  println()
}
Output:
TOPIC:
forza 0.005192139298507153
after 0.004892488884851435
side 0.004873250064472125
first 0.004837548137749428
partido 0.004762856710768013
netherlands 0.004658646778099761
going 0.004562908515970665
…

IV. Results and conclusions

Experiments
Hardware
• Core i5-3337U at 1.80 GHz
• 6 GB RAM
• OS: Ubuntu 14.04
Software
• Apache Spark 1.3
• Scala 2.11
• JDK 7
Spark Cluster
• Standalone mode
• 1 master and 5 workers
Datasets
Dataset           #desigual   #Lavidaeschula   #worldcup2014
Processed posts   3258        12471            501
Size              8.4 MB      33.7 MB          995.1 KB
Number of words   43763       185390           6594
[Timeline: 2014, August to November]

Results (I)
#worldcup2014 dataset
Iterations: 3, 5, 7
Topics: 5
Processing time: 145 s
Highlighted on the slide: Netherlands vs Mexico; Italy
Topic 1
forza 0.0063970
partido 0.0063084
after 0.0061240
going 0.0060892
more 0.0056626
love 0.0051733
netherlands 0.0049420
fifa 0.0049102
watching 0.0047937
your 0.0047628
Topic 2
forza 0.0051921
after 0.0048924
side 0.0048732
first 0.0048375
partido 0.0047628
going 0.0045629
more 0.0045417
viva 0.0043852
watching 0.0042613
Topic 3
partido 0.0053369
number 0.0052725
love 0.0050586
fifa 0.0048055
after 0.0047837
mundial 0.0046185
forza 0.0045615
italy 0.0044705
viva 0.0043077
life 0.0042818
Topic 4
your 0.0055476
still 0.0048858
going 0.0047801
good 0.0045066
italy 0.0043234
first 0.0043203
before 0.0042764
partido 0.0042220
after 0.0041471
Topic 5
mexico 0.0049672
fifa 0.0047969
brazil 0.0046036
viva 0.0045627
number 0.0045547
about 0.0045023
going 0.0043588
forza 0.0042632
mundo 0.0042261

Results (II)
Topic 1
‫ﻣﺎﺷﺎءﷲ‬ 0.0044680
just 0.0041317
like 0.0030537
that 0.0027004
happy 0.0024657
‫يسعدلي‬ 0.0024515
best 0.0023213
minha 0.0020576
todos 0.0020428
‫ﷲ‬ 0.0019363
Topic 2
just 0.0038730
‫ﻣﺎﺷﺎءﷲ‬ 0.0035151
that 0.0031203
like 0.0026153
‫ا‬‫التي‬ 0.0025304
‫يسعدلي‬ 0.0024014
‫التي‬ 0.0023665
‫الحنين‬ 0.0019909
best 0.0019291
apenas 0.0019265
Topic 3
‫ﻣﺎﺷﺎءﷲ‬ 0.0030372
that 0.0028397
just 0.0026653
like 0.0024058
‫سعﺎده‬ 0.0018991
todos 0.0018570
‫يسعدلي‬ 0.0018346
‫الحنين‬ 0.0018248
‫ﷲ‬ 0.0017642
from 0.0017207
Topic 4
‫ﻣﺎﺷﺎءﷲ‬ 0.0034158
just 0.0028399
like 0.0028013
that 0.0024385
‫التي‬ 0.0023819
‫يسعدلي‬ 0.0021320
best 0.0020799
happy 0.0020573
‫سعﺎده‬ 0.0018726
last 0.0018066
Topic 5
‫ﻣﺎﺷﺎءﷲ‬ 0.0035831
‫يسعدلي‬ 0.0027622
like 0.0026387
just 0.0025732
‫اﻷعزاء‬ 0.0024545
that 0.0024106
minha 0.0023892
‫ﻣتﺎبعيني‬ 0.0021237
‫سعﺎده‬ 0.0020168
from 0.0019959
#Desigual dataset
Iterations: 2
Topics: 5
Processing time: 53 min
• Bangkok (Thailand)
• Bhaucha Dhakka (India)
• Maldives (Mumbai)
#Lavidaeschula dataset (?)
Iterations: 1
Topics: 5
Processing time: 129 min

Conclusions
• Spark fulfills its purpose efficiently; however, Spark Streaming is not yet entirely
stable.
• The algorithms available in MLlib are basic and do not work on streaming data (as of
Spark 1.3.0).
• Factors that influence the performance of the application: the size of the dataset, the
number of iterations, and the number of topics searched for.
• The dataset needs to contain a sufficient number of words for the result to be
consistent.

Thanks
