2. Search for a keyword, within a time window, inside a social network
• User ID
• Keyword
• Time window (start + end)
• Social network (follower relationships)
Motivation
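The four search inputs listed above can be combined into one query. The following is a minimal sketch, not the presenter's implementation: the `Tweet` class, the `search` function, and the sample rows are hypothetical, and it assumes "inside the social network" means tweets posted by followers of the given user.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tweet:
    user_id: int
    timestamp: date
    text: str


def search(tweets, followers, user_id, keyword, start, end):
    """Return tweets by followers of `user_id` that mention `keyword`
    and were posted between `start` and `end` (inclusive)."""
    # Assumption: `followers` is a list of (user_id, follower_id) pairs.
    network = {f for (u, f) in followers if u == user_id}
    needle = keyword.lower()
    return [
        t for t in tweets
        if t.user_id in network
        and start <= t.timestamp <= end
        and needle in t.text.lower()
    ]


# Hypothetical sample data in the shape of the input tables.
tweets = [
    Tweet(987654321, date(2015, 5, 10), "Happy Mother's Day to the best mom"),
    Tweet(987654321, date(2015, 6, 1), "my mom is great"),   # outside the window
    Tweet(555555555, date(2015, 5, 10), "mom"),              # not a follower
]
followers = [(123456789, 987654321)]

hits = search(tweets, followers, 123456789, "mom",
              date(2015, 5, 1), date(2015, 5, 31))
```

Only the first sample tweet satisfies all three conditions (follower, time window, keyword).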
5. Input Data
Table 1. Tweets Table
User ID     Timestamp    Tweets
123456789   2015-01-01   Hello world!

Table 2. User and Follower Relationship
User ID     Follower ID   Earliest Retweet Time
123456789   987654321     2015-01-01
1.4 TB of JSON tweets
6. 1. Track the reach impact inside the user and follower network
Tweets date: 05/10/2015
Mother's Day is hard when your mom deserves
an island but you can only afford a candle
Earliest retweet date: 05/15/2015
My mom taught me not to break people's heart.
Earliest retweet date: 05/09/2015
@PerfectAmeezy mOM
Earliest retweet date: 05/11/2015
my mom is either my best friend or satan there is no in between
Earliest retweet date: 05/01/2015
Happy Mother's Day to the best mom
out there ❤️ http://t.co/TSoesf2vuw
Challenges
8. Current version:
Find the earliest retweet time for each user and follower pair
Increase Spark memory to 6 GB
Benchmark the MapReduce job
Future improvement:
A better way: Airflow
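The "earliest retweet time for each user and follower pair" step is a minimum-per-key reduction. The sketch below shows that logic in plain Python; the function name `earliest_retweets` and the sample rows are hypothetical, and the slides do not show how the actual Spark job is written.

```python
from datetime import date


def earliest_retweets(retweets):
    """Map each (user_id, follower_id) pair to its earliest retweet date."""
    earliest = {}
    for user_id, follower_id, when in retweets:
        key = (user_id, follower_id)
        # Keep the smallest date seen so far for this pair.
        if key not in earliest or when < earliest[key]:
            earliest[key] = when
    return earliest


# Hypothetical retweet events: (user_id, follower_id, retweet date).
retweets = [
    (123456789, 987654321, date(2015, 5, 15)),
    (123456789, 987654321, date(2015, 5, 9)),
    (123456789, 555555555, date(2015, 5, 11)),
]
result = earliest_retweets(retweets)
```

Because taking a minimum is associative and commutative, the same reduction can be distributed, for example by keying records on the pair and applying `reduceByKey(min)` on a Spark RDD.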
9. 2. How to improve search efficiency and scalability?
• Multi-step filters (sequential query vs. big table)
• I/O optimization
Challenges
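One possible reading of "sequential query vs. big table" is sketched below: either join tweets with the whole follower table first and filter afterwards, or narrow the data step by step so each stage handles fewer rows. This interpretation, the function names, and the toy data are assumptions made for illustration and are not taken from the slides.

```python
def big_table_search(tweets, followers, user_id, keyword):
    """Join every follower edge with that follower's tweets, then filter."""
    joined = [
        (u, f, text)
        for (u, f) in followers
        for (author, text) in tweets
        if author == f
    ]
    hits = [text for (u, f, text) in joined
            if u == user_id and keyword in text.lower()]
    return hits, len(joined)


def sequential_search(tweets, followers, user_id, keyword):
    """Filter in steps: followers first, then their tweets, then the keyword."""
    network = {f for (u, f) in followers if u == user_id}
    candidates = [text for (author, text) in tweets if author in network]
    hits = [text for text in candidates if keyword in text.lower()]
    return hits, len(candidates)


# Hypothetical data: (user_id, follower_id) edges and (author_id, text) tweets.
followers = [(1, 10), (1, 11), (2, 10), (2, 12), (3, 12)]
tweets = [
    (10, "love my mom"),
    (11, "hello world"),
    (12, "mom knows best"),
    (12, "good morning"),
]

big_hits, big_rows = big_table_search(tweets, followers, 1, "mom")
seq_hits, seq_rows = sequential_search(tweets, followers, 1, "mom")
```

Both functions return the same hits for user 1, but the sequential version materializes fewer intermediate rows than the joined table on this toy input.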
10. About Yan
Data Analyst, US EPA, 2016
• design & auto-process metric
• geospatial/statistical modeling
M.S., Purdue University, 2013
• environmental informatics
• mathematical modeling
I LOVE nature,
yoga,
meditation...
Editor's notes
Business use cases
Keep in touch with friends and stay engaged with the topics they are interested in
Measure "reach" impact