Tfm slides
1. MULTIMEDIA BIG DATA COMPUTING FOR TREND DETECTION
MASTER THESIS
Presented by: Omar Iván Sulca Correa
Director: PhD Ruben Tous Liesa
Codirector: PhD Jordi Torres Viñals
FACULTAT D'INFORMÀTICA DE BARCELONA
2. Agenda
1. Introduction
2. Background
3. Multimedia Big Data Computing for Trend Detection
4. Results and Conclusions
3. Introduction
• The relevance of social media data has grown explosively in the last few years, because users' interactions and communications in social networks provide key information for organizations (governmental and non-governmental).
• Social media data is vast, noisy, distributed, unstructured, and dynamic in nature; traditional analysis methods therefore prove inefficient and expensive on it, and new alternatives are needed.
• Photo-sharing social networks such as Instagram hold a lot of potential, especially for digital marketing.
This work is a proof of concept of the streaming and machine learning functionality of the Big Data platform Apache Spark, using its subprojects MLlib and Spark Streaming. It implements an application that finds trending topics (using the LDA model) in data collected from the social network Instagram.
5. Apache Kafka
• Apache Kafka is an open source, distributed, partitioned, and replicated commit-log-based publish-subscribe messaging system.
[Figure: a Kafka topic drawn as a log running from older to newer messages, with a producer and a consumer; labels: Streams, Batch.]
[Figure: a Kafka cluster (Broker 1, Broker 2, Broker 3) coordinated by ZooKeeper, with front ends, Hadoop, real-time, and security systems attached as producers and consumers.]
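The commit-log idea behind a Kafka topic can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: `ToyTopicLog`, `append`, and `readFrom` are hypothetical names, not part of the Kafka client API, and a single partition without replication is assumed.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy, in-memory stand-in for one partition of a Kafka topic.
// This is NOT the Kafka client API; all names here are hypothetical.
final class ToyTopicLog {
  private val messages = ArrayBuffer.empty[String]

  // A producer appends at the end of the log. The returned offset is the
  // message's position, so older messages have smaller offsets.
  def append(message: String): Long = {
    messages += message
    (messages.length - 1).toLong
  }

  // A consumer reads everything from a given offset onward. Reads never
  // modify the log, so several consumers can read independently.
  def readFrom(offset: Long): Seq[String] =
    messages.drop(offset.toInt).toList
}

object ToyTopicLogDemo {
  def main(args: Array[String]): Unit = {
    val log = new ToyTopicLog
    log.append("post about #worldcup2014")
    val second = log.append("post about #desigual")
    // A consumer that has already processed offset 0 resumes at offset 1.
    println(log.readFrom(second))
  }
}
```

Keeping the read position on the consumer side, as an offset into an append-only log, is what lets the same messages feed both stream and batch readers.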
6. Apache Spark (I)
• Apache Spark is an open source cluster-computing platform designed to be fast and general-purpose.
• It allows different types of computation (batch applications, iterative algorithms, interactive queries, and streaming) to be combined on a single platform.
• Spark supports in-memory processing, which allows speedups of up to 100x (the figure the Spark project quotes against disk-based Hadoop MapReduce).
• An RDD (Resilient Distributed Dataset) represents a collection of elements that can be manipulated through actions and transformations.
[Figure: the Spark stack. Spark SQL, Spark Streaming, MLlib, and GraphX sit on top of Apache Spark; their abstractions are DataFrame, DStream, Vector & Matrix, and Vertex & Edge, respectively.]
7. Apache Spark (II)
• Spark Streaming is a Spark component that enables the processing of live data streams. It provides an API for manipulating data streams that closely matches Spark Core's RDD API.
• Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream.
• A DStream is a sequence of data arriving over time, represented as a series of RDDs.
[Figure: DStream inputs (sockets, file streams, Akka actors, queues of RDDs) and operations (transformations, window operations, output operations).]
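The idea of a DStream as a series of small batches can be sketched without Spark. In the toy example below, plain Scala collections stand in for RDDs; `Event`, `MicroBatching`, and `discretize` are hypothetical names, and the 5-second interval is only an illustrative choice.

```scala
// Toy illustration of "discretizing" a stream: events are grouped into
// fixed-length batches by arrival time. Plain collections stand in for RDDs.
final case class Event(timestampMs: Long, text: String)

object MicroBatching {
  // Returns the non-empty batches in time order. Each batch holds the events
  // whose timestamps fall into the same interval of length batchIntervalMs.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2.sortBy(_.timestampMs))

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(1000L, "forza italia"),
      Event(4000L, "viva mexico"),
      Event(6000L, "netherlands vs mexico")
    )
    // With a 5-second interval the first two events share a batch.
    discretize(events, batchIntervalMs = 5000L).foreach(println)
  }
}
```

Because every batch is an ordinary collection, the same transformations that work on a static dataset can be applied batch by batch, which is the sense in which the streaming API mirrors the RDD API.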
8. Topic Modeling
• Topic modeling is a suite of statistical algorithms that aim to discover and annotate large archives of documents with thematic information.
• Topic models do not require any prior annotation or labeling of the documents; the topics emerge from the analysis of the original texts.
Latent Dirichlet Allocation (LDA)
Notation:
• $\beta_{1:K}$: the topics; each $\beta_k$ is a distribution over the vocabulary.
• $\theta_d$: the topic proportions for the $d$-th document; $\theta_{d,k}$ is the proportion of topic $k$ in document $d$.
• $z_d$: the topic assignments for the $d$-th document; $z_{d,n}$ is the topic assignment for the $n$-th word in document $d$.
• $w_d$: the observed words of document $d$; $w_{d,n}$ is the $n$-th word in document $d$, an element of the fixed vocabulary.
Joint distribution:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
[Figure: documents, their topic proportions and assignments, and example topics as word distributions: (gene 0.04, dna 0.02, genetic 0.01, …), (life 0.02, evolve 0.01, organism 0.01), (brain 0.04, neuron 0.02, nerve 0.01), (data 0.02, number 0.02, computer 0.01).]
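For reference, the factors of the LDA joint distribution correspond to the standard generative process sketched below. Here $\beta_k$ is topic $k$'s distribution over the vocabulary, $\theta_d$ is the vector of topic proportions of document $d$, $z_{d,n}$ is the topic assignment of the $n$-th word of document $d$, and $w_{d,n}$ is that observed word. The Dirichlet hyperparameters $\eta$ and $\alpha$ do not appear on the slide and are introduced here as assumptions.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1, \dots, N \\
w_{d,n} \mid z_{d,n}, \beta_{1:K} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}
```

Only the words $w_{d,n}$ are observed; the topics, proportions, and assignments are latent and have to be inferred from the documents.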
9. III. Multimedia Big Data Computing for Trend Detection
10. Overview
[Figure: pipeline overview. Stages, in order: Ingest, Read Data, Pre-processing & filtering, Topic Modeling, Visualization. Components: Instagram's API, JSON files, Kafka, and Spark SQL, Spark Streaming, and MLlib on Apache Spark.]
11. Stage 1: Data Ingest
[Figure: pipeline detail showing Instagram's API, Ingest, JSON files, and Kafka.]
• OAuth2 steps: registration, authentication, request.
• Credentials involved: client ID, access token, callback URL.
• Basic objects of the API: users, tags, locations, geographies.
• Caption text.
• Tags, covering marketing campaigns and an event: #desigual, #lavidaeschula, #worldcup2014.
12. Stage 2: Read Data
[Figure: pipeline detail showing Instagram's API, Ingest, JSON files, Kafka, Read Data, and Spark SQL; labels: producer, consumer, ZooKeeper, "testing-input".]
• Kafka cluster: single node, single broker.
• Avro schema:
{
  "type": "record",
  "name": "JSON",
  "namespace": "avro",
  "fields": [ {
    "name": "text",
    "type": "string",
    "doc": "The content of the user's message"
  } ],
  "doc": "A basic schema for storing Instagram metadata"
}
• Batch interval:
val batchInterval = Seconds(5)
13. Stage 3: Filtering & Preprocessing
[Figure: pipeline detail showing Kafka (consumer), Read Data, Spark SQL, Pre-processing & filtering, and Spark Streaming.]
• Unnecessary characters and stop words are removed because they do not provide any benefit.
• LDA and the other ML algorithms in MLlib do not work on streams, which is why the results must be stored in files.
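The cleaning step (dropping unnecessary characters and stop words) can be sketched in plain Scala. This is a minimal illustration and not the thesis code: `CaptionCleaner`, `clean`, and the tiny stop-word list are assumptions, and treating every non-letter as an unnecessary character is one possible reading of the slide.

```scala
object CaptionCleaner {
  // Tiny illustrative stop-word list (an assumption); a real run would use
  // a much fuller list, ideally one per language.
  val stopWords: Set[String] = Set("the", "a", "an", "and", "of", "to", "in", "is")

  // Lower-cases the caption, replaces every character that is neither a
  // letter nor whitespace with a space (so digits, '#', and punctuation
  // disappear), splits on whitespace, and drops stop words.
  def clean(caption: String): Seq[String] =
    caption
      .toLowerCase
      .replaceAll("[^\\p{L}\\s]", " ")
      .split("\\s+")
      .filter(token => token.nonEmpty && !stopWords.contains(token))
      .toList

  def main(args: Array[String]): Unit =
    println(clean("Watching the match!!! #WorldCup2014 in Brazil :)"))
}
```

The `\p{L}` class matches letters of any script, so captions in Arabic or Portuguese pass through the same filter; only the stop-word list is language-specific.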
14. Stage 4: Topic Modeling
[Figure: pipeline detail showing Pre-processing & filtering, Spark Streaming, Topic Modeling, and MLlib.]
• As of Spark 1.3, MLlib supports Latent Dirichlet Allocation (LDA), the first MLlib algorithm built upon GraphX.
• Inference uses Expectation-Maximization (EM).
• Input: vectors of word counts, that is, (word, frequency) pairs.
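The "vectors of word counts" that LDA takes as input can be built as sketched below. The example uses plain Scala arrays so that it runs without Spark; `WordCountVectors`, `buildVocabulary`, and `toCountVector` are hypothetical names. In a Spark application, each array would presumably be wrapped in an MLlib vector (for example with `Vectors.dense`) before training.

```scala
object WordCountVectors {
  // Assigns each distinct word an index. Words are sorted first so that
  // the mapping is deterministic.
  def buildVocabulary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Turns one tokenized document into a dense vector of word counts:
  // position i holds how often vocabulary word i occurs in the document.
  // Words missing from the vocabulary are ignored.
  def toCountVector(doc: Seq[String], vocab: Map[String, Int]): Array[Double] = {
    val counts = Array.fill(vocab.size)(0.0)
    doc.foreach { word =>
      vocab.get(word).foreach(index => counts(index) += 1.0)
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("forza", "italy", "forza"), Seq("viva", "mexico"))
    val vocab = buildVocabulary(docs)
    docs.foreach(d => println(toCountVector(d, vocab).mkString("[", ", ", "]")))
  }
}
```

The inverse of the vocabulary map (index to word) is what later turns the term indices reported by the model back into readable words.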
15. Stage 5: Visualization
[Figure: pipeline detail showing Topic Modeling, MLlib, and Visualization.]
// ldaModel is the trained LDA model and vocabArray maps term indices back
// to words; both are defined earlier in the application.
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
topicIndices.foreach { case (terms, termWeights) =>
  println("TOPIC:")
  // Print each term of the topic next to its weight, separated by a tab.
  terms.zip(termWeights).foreach { case (term, weight) =>
    println(s"${vocabArray(term.toInt)}\t$weight")
  }
  println()
}
Sample output:
TOPIC:
forza 0.005192139298507153
after 0.004892488884851435
side 0.004873250064472125
first 0.004837548137749428
partido 0.004762856710768013
netherlands 0.004658646778099761
going 0.004562908515970665
…
16. IV. Results and conclusions
17. Experiments
Hardware: Core i5-3337U at 1.80 GHz, 6 GB RAM, Ubuntu 14.04.
Software: Apache Spark 1.3, Scala 2.11, JDK 7.
Spark cluster: standalone mode, 1 master and 5 workers.
Datasets (timeline labels on the slide: 2014, August, November):
• #desigual: 3258 processed posts, 8.4 MB, 43763 words
• #Lavidaeschula: 12471 processed posts, 33.7 MB, 185390 words
• #worldcup2014: 501 processed posts, 995.1 KB, 6594 words
18. Results (I)
#worldcup2014 dataset. Iterations: 3, 5, 7. Topics: 5. Processing time: 145 s.
Top 10 terms per topic, with their weights:
Topic 1: Forza 0.0063970, partido 0.0063084, after 0.0061240, going 0.0060892, more 0.0056626, love 0.0051733, netherlands 0.0049420, fifa 0.0049102, watching 0.0047937, your 0.0047628
Topic 2: forza 0.0051921, after 0.0048924, side 0.0048732, first 0.0048375, partido 0.0047628, netherlands 0.0046586, going 0.0045629, more 0.0045417, viva 0.0043852, watching 0.0042613
Topic 3: partido 0.0053369, number 0.0052725, love 0.0050586, fifa 0.0048055, after 0.0047837, mundial 0.0046185, forza 0.0045615, italy 0.0044705, viva 0.0043077, life 0.0042818
Topic 4: your 0.0055476, still 0.0048858, going 0.0047801, good 0.0045066, italy 0.0043234, first 0.0043203, before 0.0042764, partido 0.0042220, after 0.0041471, netherlands 0.0041207
Topic 5: netherlands 0.0061293, mexico 0.0049672, fifa 0.0047969, brazil 0.0046036, viva 0.0045627, number 0.0045547, about 0.0045023, going 0.0043588, forza 0.0042632, mundo 0.0042261
Annotations on the slide: Netherlands vs Mexico; Italy.
19. Results (II)
Top 10 terms per topic, with their weights. Arabic terms whose letters did not survive extraction are shown as numbered placeholders; the same number denotes the same term.
Topic 1: ماشاءالله 0.0044680, just 0.0041317, like 0.0030537, that 0.0027004, happy 0.0024657, [Arabic term 1] 0.0024515, best 0.0023213, minha 0.0020576, todos 0.0020428, الله 0.0019363
Topic 2: just 0.0038730, ماشاءالله 0.0035151, that 0.0031203, like 0.0026153, ا [Arabic term 2] 0.0025304, [Arabic term 1] 0.0024014, [Arabic term 2] 0.0023665, [Arabic term 3] 0.0019909, best 0.0019291, apenas 0.0019265
Topic 3: ماشاءالله 0.0030372, that 0.0028397, just 0.0026653, like 0.0024058, [Arabic term 4] 0.0018991, todos 0.0018570, [Arabic term 1] 0.0018346, [Arabic term 3] 0.0018248, الله 0.0017642, from 0.0017207
Topic 4: ماشاءالله 0.0034158, just 0.0028399, like 0.0028013, that 0.0024385, [Arabic term 2] 0.0023819, [Arabic term 1] 0.0021320, best 0.0020799, happy 0.0020573, [Arabic term 4] 0.0018726, last 0.0018066
Topic 5: ماشاءالله 0.0035831, [Arabic term 1] 0.0027622, like 0.0026387, just 0.0025732, الأعزاء 0.0024545, that 0.0024106, minha 0.0023892, [Arabic term 5] 0.0021237, [Arabic term 4] 0.0020168, from 0.0019959
#Desigual dataset. Iterations: 2. Topics: 5. Processing time: 53 min.
• Bangkok (Thailand)
• Bhaucha Dhakka (India)
• Maldives (Mumbai)
#Lavidaeschula dataset ¿? Iterations: 1. Topics: 5. Processing time: 129 min.
20. Conclusions
• Spark fulfills its purpose efficiently; however, Spark Streaming is not yet entirely stable.
• The algorithms available in MLlib are basic and do not work on streams (as of Spark 1.3.0).
• The factors that influence the application's performance are the size of the dataset, the number of iterations, and the number of topics searched for.
• The dataset needs to contain a sufficient number of words for the results to be consistent.