3.
Who Am I?
• B.Sc. in Engineering from Alexandria University
• BADR Co-Founder
• Now: Part-time Team Lead @ BADR
• 8+ years of experience in software development
• Mainly Java, JavaScript
• Solr, Hadoop, Hive, ...
4.
BADR
• Established software house in Egypt
• Founded in 2006
• Provides Big Data consulting services and solutions
• Machine Learning, NLP, Data Science, ...
• Hadoop, Solr, Spark, Hive, Flume, Incorta, ...
5.
Agenda
• What is TweetMogaz
• System Modules
• Tweets processing
• Indexing
• Event detection
• Archivers
• …
• System Architecture
• Tricks and Challenges
• What’s Next
6.
What Is TweetMogaz?
• Innovation and applied research project @ BADR
• Portal for browsing, filtering and searching Arabic Tweets
• ... and event detection
• Based on several research papers
• Magdy W., A. Ali, and K. Darwish. A Summarization Tool for Time-Sensitive Social Media. CIKM 2012
• Magdy W. TweetMogaz: A News Portal of Tweets. SIGIR 2013
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Magdy W. and T. Elsayed. Adaptive Method for Following Dynamic Topics on Twitter. ICWSM 2014
7.
Why Arabic?
• 230 million speakers
• 6th largest language in the world (native + second-language speakers)
• One of the 6 official UN languages
[Bar chart: speakers in millions (native and second-language) for Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, German, Bengali, Portuguese and Japanese]
15.
Indexing Module
• Responsible for indexing tweets to the corresponding Solr cores
• Realtime core (< 10 mins)
• Up-to-48-hours cores
• Media: photos, videos
• Text only and text that contains links
• All tweets
• Short-term archive cores (> 48 hours and < 30 days)
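As a rough illustration of the age windows listed on this slide, the sketch below maps a tweet's age to the tiers whose window contains it. The tier names and the handling of the exact boundaries (for example, whether a tweet that is exactly 48 hours old still counts as recent) are assumptions made for the example, not details taken from the system.

```python
from datetime import timedelta
from typing import List

# Age windows as stated on the slide.
REALTIME_WINDOW = timedelta(minutes=10)
RECENT_WINDOW = timedelta(hours=48)
SHORT_TERM_WINDOW = timedelta(days=30)


def tiers_for_age(age: timedelta) -> List[str]:
    """Return the (hypothetical) tier names whose age window contains `age`."""
    tiers = []
    if age < REALTIME_WINDOW:
        tiers.append("realtime")
    if age <= RECENT_WINDOW:
        # Assumption: the up-to-48-hours cores also hold the newest tweets.
        tiers.append("recent_48h")
    elif age < SHORT_TERM_WINDOW:
        tiers.append("short_term_archive")
    # Older tweets fall outside the windows listed on this slide.
    return tiers
```

For instance, a tweet that is three days old would map only to the short-term archive tier.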
18.
Event Detection Module
• Responsible for detecting events
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Feature-pivot (term) approach
19.
Event Detection Module
• Clusters are created based on a distance threshold (fuzzy clusters)
• Distance threshold: 0.4 (determined experimentally)
• Over an 8-hour window
• Faceting on the processed text, using min_count
• Builds facets for stems
• For each facet, calculate the distance to all other facets: O(n²)
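The pairwise step above can be sketched as follows. The slide does not say which distance is used, so this example assumes a Jaccard distance over the sets of tweet IDs in which each stem occurs; `term_docs` and both function names are hypothetical. Each stem seeds a group of its close neighbours, so a stem can end up in more than one group, which is one way to read "fuzzy clusters".

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| (assumed metric, not confirmed by the slides)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def fuzzy_clusters(term_docs, threshold=0.4):
    """Group stems whose pairwise distance is within `threshold`.

    term_docs maps each stem (facet) to the set of tweet IDs containing it.
    """
    terms = sorted(term_docs)
    neighbours = {t: {t} for t in terms}
    # Compare every facet with every other facet: O(n^2) pairs.
    for a, b in combinations(terms, 2):
        if jaccard_distance(term_docs[a], term_docs[b]) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Keep distinct groups that contain more than the seed term itself.
    clusters = {frozenset(c) for c in neighbours.values() if len(c) > 1}
    return sorted(sorted(c) for c in clusters)
```

With an 8-hour window and min_count filtering, the number of facets n stays small enough for the quadratic comparison to remain practical.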
20.
Event Detection Module
• Cluster enrichment
• Enhancing clusters with fewer than 6 terms
• Running a Solr AND query with all keywords and selecting the terms with the highest TF-IDF to enrich the cluster
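A minimal sketch of the enrichment step, under stated assumptions: the field name `stems`, the helper names, and the choice to top the cluster up to exactly 6 terms are all hypothetical, and the TF-IDF here is computed in plain Python from already-fetched results rather than by Solr. Keywords are assumed to be plain stems that need no query escaping.

```python
import math
from collections import Counter

MIN_TERMS = 6  # clusters with fewer terms get enriched (from the slide)


def and_query(field, keywords):
    """Build a Solr query string requiring every keyword, e.g. f:"a" AND f:"b"."""
    return " AND ".join(f'{field}:"{k}"' for k in keywords)


def enrich(cluster, matched_docs, doc_freq, n_docs, limit=MIN_TERMS):
    """Top up `cluster` with the highest TF-IDF terms from the matched tweets.

    matched_docs: list of stem lists returned by the AND query.
    doc_freq: stem -> number of indexed tweets containing it (assumed available).
    """
    if len(cluster) >= limit:
        return list(cluster)
    tf = Counter(t for doc in matched_docs for t in doc)
    scores = {
        t: tf[t] * math.log(n_docs / doc_freq.get(t, 1))
        for t in tf
        if t not in cluster
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return list(cluster) + ranked[: limit - len(cluster)]
```

Requiring all keywords with AND keeps the candidate terms tied to tweets that are clearly about the same story.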
21.
Event Detection Module
• Cluster de-duplication over time
• Search using the keywords of each detected cluster
• For each response result, build a stem frequency vector
• Compare the two vectors for similarity (cosine = 0.5, determined experimentally)
• Old clusters are updated to maintain the chronological order of events
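The vector comparison can be illustrated with a short cosine-similarity sketch. The function names are hypothetical, and treating a similarity of 0.5 or more as a duplicate is an assumption about how the experimental threshold is applied.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two stem-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(new_stems, old_stems, threshold=0.5):
    """True when two clusters' search results look like the same event."""
    return cosine(Counter(new_stems), Counter(old_stems)) >= threshold
```

When a new cluster is judged a duplicate, the older cluster would be the one that gets updated, in line with the last bullet above.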
23.
Event Detection Module
• Active hashtag detection
• Separate field added at index time
• Stored in events core with type hashtag
• Build normalized top hashtag facets every 24 hours for the past week
• Query Solr for hashtags older than 1 week and eliminate them
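A small sketch of the weekly hashtag ranking, assuming hashtag occurrences are available as (hashtag, timestamp) pairs. The slide does not say how the facets are normalized; dividing each count by the total number of hashtag occurrences in the window is an assumption, as are the function and parameter names.

```python
from collections import Counter
from datetime import datetime, timedelta

WEEK = timedelta(days=7)


def top_hashtags(occurrences, now, top_n=10):
    """Rank hashtags seen in the past week by their share of all occurrences.

    occurrences: iterable of (hashtag, timestamp) pairs.
    Occurrences older than one week are ignored, mirroring the elimination step.
    """
    recent = Counter(tag for tag, ts in occurrences if now - ts <= WEEK)
    total = sum(recent.values())
    if total == 0:
        return []
    return [(tag, count / total) for tag, count in recent.most_common(top_n)]
```

In the described system this would run every 24 hours, with the results stored in the events core under the hashtag type.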
25.
Word Cloud: Bi-gram detection
• Facet for specific class
• Facets next to each other, with a specific threshold, tend to be a bi-gram
• For example: ريال مدريد - كأس العالم (Real Madrid - World Cup)
• min_count applies
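One possible reading of the heuristic above: when facets are sorted by count, two neighbouring terms with nearly equal counts probably occur together. The sketch below implements that reading only. The relative-difference threshold, its default value, and the function name are assumptions, and the heuristic does not determine the word order inside the bi-gram.

```python
def bigram_candidates(facets, max_rel_diff=0.1, min_count=5):
    """Pair neighbouring facets whose counts are within `max_rel_diff`.

    facets: list of (term, count) pairs for one class.
    """
    kept = [(t, c) for t, c in facets if c >= min_count]  # min_count applies
    kept.sort(key=lambda tc: -tc[1])                      # facet order by count
    pairs = []
    for (t1, c1), (t2, c2) in zip(kept, kept[1:]):
        if c1 > 0 and (c1 - c2) / c1 <= max_rel_diff:
            pairs.append((t1, t2))
    return pairs
```

A check against the indexed text (for example, a phrase query for both orders) would still be needed to confirm a candidate and pick the right word order.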
27.
Archiving Module
• Why?
• Space is finite!
• Faster searching of recent cores
• Short-term archiving
• Archive tweets that are older than 48 hours
• Same Solr instance
• Long-term archiving
• Archive tweets that are older than 30 days
• Separate Solr instance
31.
Analytics and Visualization
• Banana Dashboards
• Deployed on both realtime and archive
• Insights into the tweet distribution per class and trends over time for specific search queries
• Realtime on production with the ‘Auto-refresh’ feature
• Users with highest retweets
33.
Tricks
• Archiving
• Initially developed on Solr 4.4
• Upgrade to 4.7+ for deep paging
• Archiver syncing
• The short-term archiver is writing while the long-term archiver is reading
• They have to sync in case of deep paging
[Diagram: short-term cores, long-term cores, short-term archiver (W), long-term archiver (R)]
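Deep paging in Solr 4.7+ is done with cursors: the first request sends `cursorMark=*`, each response returns a `nextCursorMark`, and paging ends when that value stops changing. The sketch below shows the loop with the HTTP call abstracted behind a hypothetical `fetch_page` callable. The `id asc` sort assumes the uniqueKey field is named `id`; cursors require the sort to include the uniqueKey.

```python
def iter_all_docs(fetch_page, query, rows=500):
    """Yield every document matching `query` using Solr cursor paging.

    fetch_page: hypothetical callable that sends the given params to a Solr
    /select handler and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # initial cursor value defined by Solr
    while True:
        page = fetch_page({
            "q": query,
            "rows": rows,
            "sort": "id asc",      # assumption: uniqueKey field is `id`
            "cursorMark": cursor,
        })
        for doc in page["response"]["docs"]:
            yield doc
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
```

Because the short-term archiver may write to the same cores while the long-term archiver is paging through them, the two still need to be coordinated, as noted above.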
36.
Next Steps
• Integrating an adaptive classifier that can handle the characteristics of micro-blogs
• Search query trend over time
• Engage system users
• Integrate R for statistical processing (classification, detection, …)