Presentation at State of the Map, Brussels, 24.9.2016, about a data engineering project with a Twitter bot. Its goal is to find significant viewing activity worldwide on the main web map ("slippy map") of OpenStreetMap (OSM.org). See http://2016.stateofthemap.org/2016/trending-places-on-openstreetmap/ and Twitter @trending_places.
2. Trending Places on OpenStreetMap
• A big data project with a Twitter bot
• @trending_places (and github)
• Goal: Find significant viewing activity
worldwide on the main web map (“slippy
map”) of OpenStreetMap (OSM)
• This activity may be indicative of popular
news or events in that region
3. Log data
• A web map consists of map tiles at
different zoom levels
• The views of these tiles are logged daily
and published in an anonymized form with
a delay of 2 days
• http://planet.openstreetmap.org/tile_logs/
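To make the tile addressing concrete, here is a minimal sketch of how a longitude/latitude maps to a tile index at a given zoom level. It assumes the usual OSM slippy-map numbering (Web Mercator, y counted from the north); the function name lonlat_to_tile is only illustrative and does not come from the project.

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the slippy-map tile containing lon/lat."""
    n = 2 ** zoom  # number of tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

# Example: the zoom-14 tile covering central Brussels.
print(lonlat_to_tile(4.3517, 50.8503, 14))
```

Each additional zoom level doubles the number of tiles per axis, so a tile at zoom z is covered by four tiles at zoom z + 1.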
4. Log count
for each line in all the logs
{
    z, x, y = extract coordinate from line
    ip = extract source IP address from line
    counter[z, x, y] += 1
    source_addresses[z, x, y].append(ip)
}
for each (z, x, y) key in counter
{
    if counter[z, x, y] >= 10 {
        if count_unique(source_addresses[z, x, y]) >= 3 {
            print z, x, y, counter[z, x, y]
        }
    }
}
File format (as TSV): date, z, x, y where z = zoom and x/y = TMS index
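The pseudocode above can be sketched in runnable Python as follows. The thresholds (10 views, 3 unique source addresses) come from the slide; the input line format "<ip> <z> <x> <y>" and the function name count_tile_views are assumptions made only for this illustration, since the raw server logs are not public.

```python
from collections import defaultdict

MIN_VIEWS = 10       # thresholds taken from the pseudocode above
MIN_UNIQUE_IPS = 3

def count_tile_views(lines):
    """Yield (z, x, y, views) for tiles that pass both thresholds."""
    counter = defaultdict(int)
    source_addresses = defaultdict(set)  # a set keeps only unique IPs
    for line in lines:
        # Hypothetical input format: "<ip> <z> <x> <y>", whitespace-separated.
        ip, z, x, y = line.split()
        key = (int(z), int(x), int(y))
        counter[key] += 1
        source_addresses[key].add(ip)
    for key in sorted(counter):
        enough_views = counter[key] >= MIN_VIEWS
        enough_sources = len(source_addresses[key]) >= MIN_UNIQUE_IPS
        if enough_views and enough_sources:
            yield (*key, counter[key])
```

Storing the addresses in a set instead of a list makes the unique count a simple len() call and keeps memory bounded by the number of distinct sources per tile.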
5. How?
• For the previous 7 days, the tile view logs are aggregated
up to zoom level 14
• A T-score is calculated to standardize the data
• Values above a certain threshold are selected in order to
catch spikes
• These spikes are ranked relative to the mean increase
in views overall (compensates for the growth of OSM)
• Clustering eliminates locations that are near one
another
• Tile coordinates are reverse geocoded using
Nominatim in order to get geographic names
• A Twitter bot @trending_places announces the top 10
each day after 10 a.m. (or an error message if something fails)
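The first three steps above can be sketched as follows. This is only an illustration under several assumptions: deeper tiles are folded into their zoom-14 ancestor and coarser tiles are ignored, the "T-score" is taken to be (latest views - mean of earlier days) / sample standard deviation of earlier days, and THRESHOLD = 3.0 is a made-up value. The slides do not state the exact formula or cut-off, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

TARGET_ZOOM = 14
THRESHOLD = 3.0  # hypothetical cut-off; the slides do not state the value used

def to_target_zoom(z, x, y, target=TARGET_ZOOM):
    """Return the ancestor tile at the target zoom, or None for coarser tiles."""
    if z < target:
        return None
    shift = z - target  # each zoom level doubles the tile index range
    return (x >> shift, y >> shift)

def aggregate(day_rows):
    """Sum one day's (z, x, y, views) rows per target-zoom tile."""
    totals = defaultdict(int)
    for z, x, y, views in day_rows:
        tile = to_target_zoom(z, x, y)
        if tile is not None:
            totals[tile] += views
    return totals

def find_spikes(days, threshold=THRESHOLD):
    """days: per-day row lists, oldest first; the last day is tested."""
    daily = [aggregate(rows) for rows in days]
    *history, latest = daily
    spikes = []
    for tile, views in latest.items():
        past = [d.get(tile, 0) for d in history]
        spread = stdev(past)  # needs at least two earlier days
        if spread == 0:
            continue  # flat history: the score is undefined in this sketch
        score = (views - mean(past)) / spread
        if score > threshold:
            spikes.append((tile, score))
    return sorted(spikes, key=lambda item: item[1], reverse=True)
```

The ranking against the overall growth in views, the clustering of neighbouring tiles and the tweeting are left out here. Note that skipping tiles with a flat history also skips spikes that follow zero activity, which is one of the possible biases to keep in mind.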
10. Challenges: Reverse Geocoding
• Given coordinates (from the tile boundary)
• Find the most relevant geographic name
inside / nearby
• Using geographic place names
=> Nominatim
• (no POIs yet)
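A minimal sketch of this lookup, assuming the public Nominatim /reverse endpoint with its format, lat, lon and zoom parameters. It uses the tile centre rather than the tile boundary for simplicity, and the helper names and the default detail_zoom=10 (roughly city level) are choices made for this illustration, not the project's code. Only the request URL is built; no network call is made.

```python
import math
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

def tile_center(z, x, y):
    """Return (lat, lon) of the centre of slippy-map tile z/x/y."""
    n = 2 ** z
    lon = (x + 0.5) / n * 360.0 - 180.0
    # Inverse Web Mercator projection for the tile's middle row.
    lat = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * (y + 0.5) / n))))
    return lat, lon

def reverse_url(z, x, y, detail_zoom=10):
    """Build a Nominatim reverse-geocoding URL for the centre of a tile."""
    lat, lon = tile_center(z, x, y)
    query = urlencode({
        "format": "jsonv2",
        "lat": f"{lat:.5f}",
        "lon": f"{lon:.5f}",
        "zoom": detail_zoom,  # level of detail of the returned place
    })
    return f"{NOMINATIM_REVERSE}?{query}"

print(reverse_url(14, 8385, 5477))
```

When querying the public Nominatim instance, its usage policy (identifying User-Agent, low request rate) applies; ten lookups per day fit well within it.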
11. Ex. of strong correlation:
Fort McMurray (CA)
1-3 May 2016: Wildfire across approximately 5900 square km
(about 1/6 of Belgium or 2x Luxembourg), destroying ~2,400 homes
12. Ex. of strong correlation #2:
Flüelen (CH)
1 June 2016: Switzerland celebrated the world's longest railway
tunnel (“The Gotthard Base Tunnel”) through the Alps…
13. Example of strong corr. #3:
San Severino Marche (IT)
24 August 2016: Earthquake of 6.2 on the moment magnitude scale hit
Central Italy. Its epicentre was southeast of Perugia and north of L'Aquila,
in an area near the borders of the Umbria, Lazio, Abruzzo and Marche
regions. As of 16 September 2016, 297 people have been killed
14. More statistics…
• Processing time: 5h (using SQLite / Python)
• Reporting period: 2016-04-11 - 2016-09-18
• No. reports: 125 (out of 160 days)
• Top 10 countries overall: RU 293, US 131, DE
70, UA 67, FR 46, PL 44, NO 43, ES 35, RO 33,
GB 31
• Top 10 place names overall: Saratovsky District
(RU) 16, 57.04.53.26 (RU) 13, Stara Emetivka
(UA) 13, Tatarstan (RU) 13, Jambyl Province (KZ)
11, Johor Bahru (MY) 11, Odessa (UA)
11, Shimen (TW) 11, N.N. 11, Black Point (US) 10
15. Open questions
• Why so many Russian places (and places
from post-Soviet states)?
• Influence of crawling?
• Bias towards places with spikes after zero
activity vs. crowded places?
• Other bias?
• Something better than the T-score? E.g. with a Poisson
distribution (multivariate ARIMA?)
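To illustrate the Poisson idea from the last question, here is a hedged sketch of one possible alternative score: the probability of seeing at least the observed number of views if daily views were Poisson-distributed around the historical mean. This is not something the project implements; the function name poisson_tail and the choice of a simple upper-tail probability are assumptions for illustration.

```python
import math

def poisson_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected), with expected > 0."""
    # Sum P(X = k) for k < observed, then take the complement.
    # exp(-expected) underflows for very large means; fine for a sketch.
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term
        term *= expected / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)

# A day with 50 views against a historical mean of 10.5 is extremely unlikely,
# while 11 views is entirely ordinary.
print(poisson_tail(50, 10.5), poisson_tail(11, 10.5))
```

Unlike a score based on the standard deviation, this stays defined when the history is flat, as long as the mean is above zero. It does not address crawling or growth effects.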
16. Final open questions…
• Do you know…
– Sea Cliff (US),
– Sitionuevo (CO),
– Athens (GR),
– Sacele (RO), or
– Pretoria (ZA) ………?
• Wonder why!
18. Thanks
Also to Bhavya Chandra (main author, NTU
Singapore), Matt Amos, Lukas Martinelli
(@lukasmartinelli), Pavel Tyslacki (@tbicr),
Joost Schouppe (joostjakob)
Stefan Keller
Geometa Lab at HSR
University of Applied Sciences
Rapperswil (Switzerland)
www.hsr.ch/geometalab
@sfkeller