TweetMogaz - The Arabic Tweets Platform: Presented by Ahmed Adel, BADR

October 13-16, 2015 • Austin, TX

TweetMogaz: The Arabic Tweets Platform
Ahmed Adel
Team Lead, BADR

Who Am I?
• B.Sc. in Engineering from Alexandria University
• BADR Co-Founder 
• Now: Part-time Team Lead @ BADR
• 8+ years experience in software development 
• Mainly Java, JavaScript 
• Solr, Hadoop, Hive, ...

BADR
• Established software house in Egypt
• Founded in 2006
• Provides big data consulting services and solutions
• Machine Learning, NLP, Data Science, ... 
• Hadoop, Solr, Spark, Hive, Flume, Incorta, ...

Agenda
• What is TweetMogaz
• System Modules
• Tweets processing
• Indexing
• Event detection
• Archivers
• …
• System Architecture
• Tricks and Challenges
• What’s Next

What Is TweetMogaz?
• Innovation and applied research project @ BADR
• Portal for browsing, filtering and searching Arabic tweets
• ... and event detection
• Based on several research papers
• Magdy W., A. Ali, and K. Darwish. A Summarization Tool for Time-Sensitive Social Media. CIKM 2012
• Magdy W. TweetMogaz: A News Portal of Tweets. SIGIR 2013
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Magdy W. and T. Elsayed. Adaptive Method for Following Dynamic Topics on Twitter. ICWSM 2014

Why Arabic?
• 230 million speakers
• 6th largest language in the world (native and second-language speakers)
• One of the 6 official UN languages
[Chart: speakers in millions, native and second-language, on a 0 to 1,200 scale, for Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, German, Bengali, Portuguese, and Japanese]

Main Features
• Classifying
• Browsing
• Searching
• Event detection
• Time machine

System Modules
• Tweets processing module
• Indexing module
• Event detection module
• Events
• Active Hashtags
• WordCloud generator
• Archivers
• Short-term
• Long-term
• Analytics

Tweets Processing Module
• Retrieves tweets (streams and search queries)
• Filters out inappropriate tweets
• Text pre-processing
• Normalization
• ى ، ي
• آ ، إ ، ا ، أ
• ة ، ه
• Kashida and diacritics: ـ ، ْ
• Removing stop-words
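
The normalization and stop-word steps above could be sketched roughly as follows. This is only an illustration: the slide lists the letter groups but not which letter each group collapses to, so the mapping directions, the diacritics range, and the tiny stop-word list are all assumptions.

```python
import re

# Assumed targets: alef variants -> bare alef, alef maqsura -> ya,
# ta marbuta -> ha. The slide only names the groups.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")            # alef with madda / hamza
DIACRITICS_AND_KASHIDA = re.compile("[\u064B-\u0652\u0640]")  # tanween..sukun, tatweel

def normalize(text):
    """Collapse common Arabic spelling variants into one form."""
    text = DIACRITICS_AND_KASHIDA.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text

# Illustrative stop words ("fi", "min", "ala"), stored in normalized form
# so that they still match after normalization.
STOP_WORDS = {normalize(w) for w in (
    "\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649")}

def preprocess(text):
    """Normalize, split on whitespace, and drop stop words."""
    return [t for t in normalize(text).split() if t not in STOP_WORDS]
```

Keeping the stop-word list in normalized form matters here, since a word ending in alef maqsura would otherwise no longer match once the text has been normalized.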

Tweets Processing Module
• Classification at indexing time
• Multiple classes map to a multi-valued field (politics, sport, religion, etc.)
• Boolean classifier
• Adaptive classifier (Naïve Bayes or SVM, experimental)
• Scoring at indexing time
• Recent (date): latest tweets in a specific category
• Top (score field): trending tweets (high retweet rate in the past 48 hours)
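
A minimal sketch of index-time classification and scoring, under stated assumptions: the keyword lists, the field names `classes` and `score`, and the formula (retweets divided by age in seconds, zero outside the 48-hour window) are hypothetical. The slides do not give the actual score function.

```python
# Hypothetical keyword lists for a Boolean (keyword-match) classifier.
CLASS_KEYWORDS = {
    "sport": {"match", "goal", "league"},
    "politics": {"election", "parliament", "minister"},
}

def classify(tokens):
    """Return every class whose keyword set intersects the tweet tokens."""
    token_set = set(tokens)
    return sorted(label for label, keywords in CLASS_KEYWORDS.items()
                  if token_set & keywords)

def retweet_rate(retweets, age_seconds, window_seconds=48 * 3600):
    """Assumed score: retweets per second, zero outside the window."""
    if age_seconds <= 0 or age_seconds > window_seconds:
        return 0.0
    return retweets / age_seconds

def to_index_doc(tweet_id, tokens, retweets, age_seconds):
    """Build a document with a multi-valued class field and a score field."""
    return {
        "id": tweet_id,
        "classes": classify(tokens),          # multi-valued field
        "score": retweet_rate(retweets, age_seconds),
    }
```

A tweet that matches several keyword sets simply ends up with several values in `classes`, which is what makes the multi-valued field a natural fit.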

Score
[Chart: score (0 to 0.018) against tweet age in seconds (0 to 39k)]

Indexing Module
• Responsible for indexing tweets to the corresponding Solr cores
• Realtime core (< 10 mins)
• Up-to-48-hours cores
• Media: photos, videos
• Text only and text that contains links
• All tweets
• Short-term archive cores (> 48 hours and < 30 days)
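
The routing of a tweet to cores by age can be sketched as below. The core names are hypothetical, and treating the age limits as hard cut-offs is an assumption; tweets older than 30 days return no core here because the slides place them on a separate long-term Solr instance.

```python
from datetime import timedelta

def target_cores(age, has_media):
    """Pick the (hypothetically named) cores a tweet of this age belongs in."""
    if age > timedelta(days=30):
        return []                      # long-term archive, separate instance
    if age > timedelta(hours=48):
        return ["short_term_archive"]
    cores = ["all_48h", "media_48h" if has_media else "text_48h"]
    if age < timedelta(minutes=10):
        cores.insert(0, "realtime")    # freshest tweets also go to realtime
    return cores
```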

Event Detection Module
• Responsible for detecting events
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Feature-pivot (term) approach

• Clusters are created based on a distance threshold (fuzzy clusters)
• Distance threshold 0.4 (experimental)
• In an 8-hour window
• Faceting on processed text, using min_count
• Builds facets for stems
• For each facet, calculate the distance to all other facets: O(n²)
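
The pairwise step could look roughly like the sketch below. The slides do not name the distance measure, so Jaccard distance over the sets of tweets containing each stem is an assumption, as is the neighbourhood-based way of forming overlapping ("fuzzy") clusters.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 minus the overlap ratio of two sets of tweet ids."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

def fuzzy_clusters(term_docs, threshold=0.4, min_count=2):
    """Group stems whose tweet sets are within `threshold` of each other.

    term_docs maps each stem (facet value) to the ids of tweets containing it.
    A stem may appear in more than one cluster.
    """
    # Mimic facet min_count: ignore stems seen in too few tweets.
    terms = {t: d for t, d in term_docs.items() if len(d) >= min_count}
    neighbours = {t: {t} for t in terms}
    for a, b in combinations(terms, 2):          # O(n^2) comparisons
        if jaccard_distance(terms[a], terms[b]) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters = {frozenset(m) for m in neighbours.values() if len(m) > 1}
    return sorted(sorted(c) for c in clusters)
```

The `min_count` filter is what keeps the quadratic comparison affordable, since it shrinks n before any distances are computed.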

• Cluster enrichment
• Enhancing clusters with fewer than 6 terms
• Running a Solr AND query with all keywords and selecting the terms with the highest TF-IDF to enrich the cluster
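
A hedged sketch of the enrichment step: the query builder produces a standard Lucene-syntax AND query, while the TF-IDF ranking runs in memory over documents assumed to have come back from that query. The field name, the `doc_freq` table, and the exact TF-IDF weighting are illustrative assumptions.

```python
import math
from collections import Counter

def and_query(field, keywords):
    """Build an AND query over all cluster keywords."""
    return f"{field}:(" + " AND ".join(keywords) + ")"

def enrich(cluster, matching_docs, corpus_size, doc_freq, min_terms=6):
    """Pad a small cluster with the highest TF-IDF terms of matching tweets.

    matching_docs: lists of stems from tweets matching the AND query.
    doc_freq: hypothetical corpus-wide document frequency per stem.
    """
    if len(cluster) >= min_terms:
        return list(cluster)
    tf = Counter(t for doc in matching_docs for t in doc if t not in cluster)

    def tfidf(term):
        return tf[term] * math.log(corpus_size / max(doc_freq.get(term, 1), 1))

    ranked = sorted(tf, key=lambda t: (-tfidf(t), t))
    return list(cluster) + ranked[:min_terms - len(cluster)]
```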

• Cluster de-duplication over time
• Search using the keywords of each detected cluster
• For each response result, build a stem frequency vector
• Compare the two vectors for similarity (cosine threshold 0.5, experimental)
• Old clusters are updated to maintain the chronological order of events
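
The vector comparison can be sketched as follows. Treating a cosine similarity of at least 0.5 as "same event" is an assumption about how the threshold is applied; the input format (each search result as a list of stems) is also illustrative.

```python
import math
from collections import Counter

def stem_vector(results):
    """Stem frequency vector over all tweets returned for one cluster."""
    return Counter(stem for doc in results for stem in doc)

def cosine(a, b):
    """Cosine similarity of two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def is_duplicate(new_results, old_results, threshold=0.5):
    """True when the new cluster looks like an already known one."""
    return cosine(stem_vector(new_results), stem_vector(old_results)) >= threshold
```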

• Relevant tweets retrieval
• Query against 48 hours cores

• Active hashtag detection
• Separate field added at index time
• Stored in the events core with type hashtag
• Build normalized top hashtag facets every 24 hours for the past week
• Query Solr for hashtags older than 1 week and eliminate them
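
As a rough illustration, the two Solr requests involved might be parameterized as below. The request parameters (`facet.field`, `facet.limit`, `facet.mincount`, date math such as `NOW-7DAYS`) are standard Solr; the field names `hashtags`, `created_at`, `type` and `detected_at`, and the limits, are hypothetical.

```python
from urllib.parse import urlencode

def top_hashtags_params(limit=20, min_count=5):
    """Facet request for the most frequent hashtags of the past week."""
    return {
        "q": "*:*",
        "fq": "created_at:[NOW-7DAYS TO NOW]",  # assumed date field
        "rows": 0,                               # facets only, no documents
        "facet": "true",
        "facet.field": "hashtags",               # assumed hashtag field
        "facet.limit": limit,
        "facet.mincount": min_count,
    }

def expired_hashtags_query():
    """Delete-by-query string for hashtag events older than one week."""
    return "type:hashtag AND detected_at:[* TO NOW-7DAYS]"

query_string = urlencode(top_hashtags_params())
```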

Word Cloud: Bi-gram detection
• Facet for a specific class
• Facets next to each other, with a specific threshold, tend to be a bi-gram
• For example: ريال مدريد (Real Madrid), كأس العالم (World Cup)
• min_count applies
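
One possible reading of this heuristic, sketched in memory rather than with Solr facets: two terms form a bi-gram when they occur next to each other often enough relative to how often they occur at all. The ratio used and both threshold values are assumptions, not the method from the slides.

```python
from collections import Counter

def detect_bigrams(token_lists, min_count=2, threshold=0.5):
    """Return adjacent term pairs that co-occur often enough.

    A pair qualifies when it appears at least `min_count` times and its
    count is at least `threshold` of the rarer term's own count.
    """
    unigrams = Counter(t for tokens in token_lists for t in tokens)
    pairs = Counter(p for tokens in token_lists
                    for p in zip(tokens, tokens[1:]))
    found = []
    for (first, second), n in pairs.items():
        rarer = min(unigrams[first], unigrams[second])
        if n >= min_count and n / rarer >= threshold:
            found.append((first, second))
    return sorted(found)
```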

Archiving Module
• Why?
• Space is finite!
• Faster searches on recent cores
• Short-term archiving
• Archive tweets that are older than 48 hours
• Same Solr instance 
• Long-term archiving
• Archive tweets that are older than 30 days
• Separate Solr instance

System Architecture
• SolrCloud
• 2 Shards
• Replication factor of 2
• ZooKeeper ensemble for distribution management
• SolrJ API 
• Front-end
• Node.js
• AngularJS (Web and mobile web) 
• Long-term archive
• Separate Solr Instance
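
For illustration, a collection with the layout above could be requested through Solr's Collections API. `action=CREATE`, `numShards` and `replicationFactor` are real API parameters; the base URL and collection name are placeholders.

```python
from urllib.parse import urlencode

def create_collection_url(base_url, name, shards=2, replication_factor=2):
    """Build a Collections API CREATE request for a sharded collection."""
    params = {
        "action": "CREATE",
        "name": name,                          # placeholder collection name
        "numShards": shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"

url = create_collection_url("http://localhost:8983/solr", "tweets")
```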

Analytics and Visualization
• Banana Dashboards
• Deployed on both realtime and archive
• Insights on the tweets distribution per class, trends over time of specific search queries
• Realtime on production with 'Auto-refresh' feature
• Users with highest retweets

Tricks
• Archiving
• Initially developed on Solr 4.4
• Upgraded to 4.7+ for deep paging
• Archiver syncing
• Short-term is writing and long-term is reading
• Have to sync in case of deep paging
[Diagram: short-term cores, long-term cores, short-term archiver (W), long-term archiver (R)]
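
Deep paging in Solr 4.7+ is done with cursors: each request passes `cursorMark` and the response carries `nextCursorMark`, and paging ends when the two are equal. The loop below sketches that protocol only; `fetch_page` is a hypothetical callable standing in for the actual Solr request, and the synchronization between the two archivers is not shown.

```python
def iterate_with_cursor(fetch_page):
    """Yield every document by following Solr-style cursor marks.

    fetch_page(cursor) must return a dict with "docs" and "nextCursorMark".
    """
    cursor = "*"                       # Solr's starting cursor value
    while True:
        page = fetch_page(cursor)
        yield from page["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:      # unchanged cursor means no more results
            break
        cursor = next_cursor
```

Because a cursor reflects the index as it is at request time, a reader that pages while another process is still writing can miss or repeat documents, which is one plausible reason the two archivers have to be kept in step.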

Challenges
• Tweets (micro-blogs) are very short texts
• Arabic has many dialects: colloquial, formal, regional variations

Next Steps
• Integrating an adaptive classifier that can handle the characteristics of micro-blogs
• Search query trend over time
• Engage system users
• Integrate R for statistical processing (classification, detection, …)

Thank you!
Ahmed Adel
email: me@aadel.io
twitter: @ahmadadel
website: badrit.com
