This project aims to build a network of locations and to examine its properties through complex network analysis.
Here, we document the implementation of the Twitter crawler and of the network builder.
CSE5656 Complex Networks - Location Correlation in Human Mobility, Implementation
1. Complex Networks Class Project
Location Correlation in Human Mobility
Marcello Tomasini
Bio-Complex Lab
Department of Computer Sciences
Florida Tech
2. Twitter Miner Implementation
The application that mines Twitter is developed in Python and uses the
following libraries:
• twitter (Python Twitter Tools): data is collected through the Twitter streaming
API and appended to a local buffer
• pymongo (MongoDB): data is stored on the BioComplex Lab MongoDB
instance
• logging: the Python logging facility keeps track of code exceptions and of
non-standard Twitter messages in the stream (warning, limit, disconnect),
mostly for debugging. Exceptions (mostly) do not stop program execution;
the code tries to recover instead, in order to avoid manual intervention
• collections: collections.deque is used as a thread-safe, high-performance
local buffer in order to reduce network I/O and overhead on the BioComplex
Lab MongoDB server
• threading: data is pushed to the BioComplex Lab MongoDB instance by a
separate thread. The thread pops a fixed number of elements from the
deque and attempts the insert operation; if the insert fails, it pushes the
batch back, so no tweets are lost. The Python GIL is not an issue here
since the thread is I/O-bound
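The buffering and retry logic described above can be sketched as follows. This is a minimal sketch, not the project's actual code: the names `flush_batch` and `BATCH_SIZE` are illustrative, and in the real application `insert_many` would be a pymongo `collection.insert_many` call.

```python
import collections

# Local buffer shared between the stream reader and the flush thread.
# deque.append/popleft are thread-safe in CPython.
buffer = collections.deque()
BATCH_SIZE = 100  # tweets per bulk insert; an illustrative value

def flush_batch(insert_many):
    """Pop up to BATCH_SIZE tweets and try a bulk insert.

    On failure, push the batch back to the front of the deque in its
    original order, so no tweets are lost. Returns the number of tweets
    successfully stored.
    """
    batch = []
    while buffer and len(batch) < BATCH_SIZE:
        batch.append(buffer.popleft())
    if not batch:
        return 0
    try:
        insert_many(batch)  # e.g. pymongo's collection.insert_many
    except Exception:
        # Revert the transaction: restore the batch at the front,
        # preserving order, and retry on the next flush.
        buffer.extendleft(reversed(batch))
        return 0
    return len(batch)
```

In the real application this function would run in a loop on the separate thread; since each call blocks on network I/O, the GIL is released while the insert is in flight.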
The code runs on an Amazon EC2 t2.micro instance for maximum reliability
(SLA 99.95%).
Performance: it easily handles the ~8 Mbps Twitter stream (the worldwide
stream of geotagged tweets), corresponding to ~2000 tweets/s.
3. Network Builder Implementation
The application that builds the network is developed in Python and uses the
following libraries:
• pymongo (MongoDB): filters tweets with a bounding box (to work around a
Twitter bug) and retrieves data from the BioComplex Lab MongoDB instance.
Query projections help reduce the data transferred over the network
• scikit-learn: provides functions to compute a k-means clustering of the
coordinate points; the clusters represent locations
• numpy: provides fast array and matrix data structures
• matplotlib.pyplot: plots graphs
• igraph: creates and exports the network structure
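The filtered, projected query could look like the sketch below. The field names assume Twitter's standard GeoJSON layout for geotagged tweets (`coordinates.coordinates` holding `[lon, lat]`); the actual bounding box, schema, and projected fields in the project may differ.

```python
# Hypothetical bounding box; the real one is chosen to work around the
# Twitter bug mentioned above. Format: [[min_lon, min_lat], [max_lon, max_lat]].
BBOX = [[-180.0, -90.0], [180.0, 90.0]]

# $geoWithin/$box are standard MongoDB geospatial query operators.
bbox_filter = {"coordinates.coordinates": {"$geoWithin": {"$box": BBOX}}}

# Projection: fetch only the fields we need, reducing network transfer.
projection = {"coordinates.coordinates": 1, "_id": 0}

def geotagged_tweets(collection):
    """Return a cursor over bounding-box-filtered, projected tweets.

    collection.find(filter, projection) is the standard pymongo call.
    """
    return collection.find(bbox_filter, projection)
```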
Clustering needs a distance metric. Coordinates live not in a Euclidean space
but on a sphere, so to compute the great-circle distance [1]
between two points we could use the haversine formula [2].
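A standard haversine implementation is sketched below; the mean Earth radius of 6371 km is an assumption, since the document does not fix a radius.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```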
!
However, most implementations build a full distance matrix when supplied with a
non-standard metric, which requires O(n²) space. Given the size of the
dataset this is impractical, so we use the Mercator projection [3] to map
coordinates into a Euclidean plane and then apply the standard k-means algorithm.
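The projection step can be sketched as follows, assuming a spherical Mercator with the WGS84 equatorial radius (as in Web Mercator); the constants actually used in the project may differ.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius (assumed)

def mercator(lat, lon):
    """Project (lat, lon) in degrees onto the Mercator plane, in meters."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# The projected points can then be fed to a standard Euclidean k-means, e.g.:
#   from sklearn.cluster import KMeans
#   labels = KMeans(n_clusters=k).fit_predict([mercator(la, lo) for la, lo in pts])
```

Since Mercator preserves angles but distorts distances away from the equator, this is a trade-off: cluster shapes stretch at high latitudes, but the O(n²) distance matrix is avoided.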
[1] http://en.wikipedia.org/wiki/Great-circle_distance
[2] http://en.wikipedia.org/wiki/Haversine_formula
[3] http://en.wikipedia.org/wiki/Mercator_projection