Video: https://media.ccc.de/browse/conferences/mrmcd/mrmcd15/MRMCD15-6986-i_like_trains.html
Accessing undocumented APIs from big companies is fun, especially if you get loads of data to store and analyze from them. Even more so if that data bears humiliating evidence.
When you fetch half a million records per day from a well-known railroad company, it's many power, much Bahn, so data. And it only gets better the more you play with that data.
Buried between the first and last stations of many railway tracks lies priceless data for customers and competitors... or rather: this is our pot of gold at the end of the rainbow, with just a bit less magic.
Where are the most delays? Will I catch the next train? Who drives the Hogwarts Express, and how do I get to platform nine and three quarters without hurting myself while running into a wall?
We have the answers for all these questions. Well, for most of them.
19. How much we talkin’?
~160 million
datasets per year
...stored as JSON… yikes!
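A quick back-of-the-envelope check connects the slide's ~160 million records per year to the "half a million records per day" from the abstract and the 80 GB from the next slide (the per-record size is our own derived estimate, not a figure from the talk):

```python
# Figures from the slides; derived values are rough estimates.
records_per_year = 160_000_000
records_per_day = records_per_year / 365
print(f"{records_per_day:,.0f} records/day")  # in the "half a million" ballpark

bytes_total = 80 * 10**9  # ~80 GB of JSON on disk
avg_record_size = bytes_total / records_per_year
print(f"~{avg_record_size:.0f} bytes per record on average")
```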
20. What to do...
Full search over
80 GB worth of JSON?
Nope.
SomeSQL?
NOPE NOPE NOPE
21. What to do?
no budget, high expectations
ElasticSearch
→ performs well with
large datasets
→ easy clustering
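Elasticsearch ingests large datasets through its bulk API, which takes newline-delimited JSON: one action/metadata line followed by one document line per record. A minimal sketch of building such a payload (the index name `trains` and the record fields are placeholders, not from the talk):

```python
import json

def to_bulk_ndjson(records, index="trains"):
    """Serialize records into Elasticsearch's bulk NDJSON format:
    an action line, then the document itself, for each record.
    ("trains" and the record fields are our own placeholders.)"""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"  # bulk requests must end with a newline

payload = to_bulk_ndjson([{"train": "ICE 123", "delay_min": 7}])
print(payload)
```

The same payload can then be POSTed to the cluster's `_bulk` endpoint.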
22. How it works
Collect / Normalize
→ request all that data
→ fix formats
→ convert location
Store
→ save everything to a file
Import
→ import to ES
→ import everything again because you forgot something
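The normalize-then-store steps above can be sketched roughly as follows. The raw field names and the "lat,lon" coordinate string are assumptions for illustration; the real API's shape is not shown in the talk:

```python
import json

def normalize(raw):
    """Fix formats and convert the location field, as on the slide.
    (Field names and the "lat,lon" string format are hypothetical.)"""
    lat, lon = (float(p) for p in raw["coords"].split(","))
    return {
        "train": raw["train"].strip(),
        "delay_min": int(raw["delay"]),        # the API may deliver numbers as strings
        "location": {"lat": lat, "lon": lon},  # ES geo_point-style object
    }

def store(records, path):
    """Save everything to a file, one JSON document per line, so a later
    bulk import (and the inevitable re-import) can simply replay it."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(normalize(rec)) + "\n")

store([{"train": " ICE 123 ", "delay": "7", "coords": "52.525,13.369"}],
      "/tmp/records.ndjson")
print(open("/tmp/records.ndjson").read())
```

Keeping the normalized file around is what makes the "import everything again" step cheap.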
23. Current stack
3 ES servers
~3.4 GB res/srv
~40 GB disk/srv
~2 CPUs/srv
1 nginx + kibana
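The "easy clustering" from the earlier slide mostly amounts to giving all three servers the same cluster name and pointing them at each other. A hedged sketch of an ES-1.x/2.x-era `elasticsearch.yml` (node names and hostnames are placeholders, not from the talk):

```yaml
# elasticsearch.yml — minimal 3-node cluster sketch (placeholders throughout)
cluster.name: trains
node.name: es-1
network.host: 0.0.0.0
# Zen unicast discovery, as used in the 1.x/2.x era: list the other two servers
discovery.zen.ping.unicast.hosts: ["es-2", "es-3"]
```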