Lessons learned while building a solution to crunch 100 billion+ positions for better navigation algorithms. This talk highlights how you can employ big data technology on commodity hardware without spending a fortune on it.
More details on: http://2013.howtoweb.co/
2. What do I know about big data?
- skobbler logs all positions from our users (100 billion+)
- > 10 TB of data from users
- Products / revenues significantly improved with Business Intelligence
Big data on a small budget
@apphil #2
3. Why should you learn about big data?
- Harvard Business Review: "Data Scientist: The Sexiest Job of the 21st Century"
- Obama won the US presidency in large part thanks to the use of big data…
- World-class sports teams enhance their performance with big data
- Amazon, Google, Facebook, etc. have by now made all their dev processes data-driven
4. What are some great use cases for big data?
- Analyzing log files and user behavior (and predicting future behavior)
- A/B testing and automatic optimization of functionality
- Improving monetization (e.g. ad optimization)
- Checking adoption and usage of new features
5. When is it better not to rely on big data?
- When qualitative feedback is better than quantitative feedback (e.g. very early-stage companies)
- When you don't have enough users yet to get statistically relevant results
- When you do not know what you are optimizing for
6. What does a solid and simple workflow for big data analysis look like?
A cycle: Log -> Process -> Analyse -> Improve -> Eval / Test, and back to Log
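The cycle above can be sketched as five plain functions wired together. Everything here is an illustrative stub (function names, the toy KPI, the toy data), not skobbler's actual pipeline:

```python
# Minimal sketch of the Log -> Process -> Analyse -> Improve -> Evaluate loop.
# All bodies are illustrative stubs; a real setup would back them with
# MongoDB/Cassandra (log) and Hadoop/Storm (process & analyse).

def log(events):
    """#1 Log: collect raw events."""
    return list(events)

def process(raw):
    """#2 Process: enrich each event so it is easy to query later."""
    return [{**e, "country": e.get("country", "unknown")} for e in raw]

def analyse(enriched):
    """#3 Analyse: compute a KPI per cluster (here: events per country)."""
    kpi = {}
    for e in enriched:
        kpi[e["country"]] = kpi.get(e["country"], 0) + 1
    return kpi

def improve(kpi):
    """#4 Improve: derive a change from the KPIs (stub)."""
    return {"focus_country": max(kpi, key=kpi.get)}

def evaluate(change, kpi):
    """#5 Evaluate: accept the change if the KPI backs it, else reject."""
    return kpi.get(change["focus_country"], 0) > 0

events = [{"user": 1, "country": "DE"}, {"user": 2, "country": "RO"},
          {"user": 3, "country": "DE"}]
kpi = analyse(process(log(events)))
change = improve(kpi)
accepted = evaluate(change, kpi)   # then: go back to step #1
```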
7. Tools / technologies for a good big data setup
- Logging: MongoDB, VoltDB, Cassandra
- Processing & analyzing / storing: Hadoop & HBase (batch), Storm (real-time), Samza (real-time)
- Optimizing: Mahout (machine learning)
8. How can you build this without breaking the bank?
- Analyse / process asynchronously
- Cheap dedicated servers (vs. cloud)
- Use open / free software
9. Key cost factor: real-time / near-time vs. batch
- Real-time is much more expensive than batch
- Leverage as much pre-processing as possible
- Try using in-memory technology for real-time analytics
10. #1 Log: Initially, as much data as feasible should be logged so it's available later
- Define interesting data (rather log too much if unsure)
- Upload / collect the data
- Decide on real-time, near-time or batch processing in the chain
11. #2 Process: Enhance the data to make it as rich as possible and easy to query
- Move data to the processing environment
- Run logged data through the processing chain so it can be queried
- Enhance the logged data with any additional data available (e.g. geography, social data, user data)
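The enrichment step can be sketched in a few lines. The lookup table here is a stand-in for a real reverse geocoder, and all field names are made up for illustration:

```python
# Sketch: enrich raw log records with extra dimensions (geography, time
# of day) so they are easy to query later. COUNTRY_BY_CITY is a toy
# stand-in for a real reverse-geocoding service.

COUNTRY_BY_CITY = {"Berlin": "DE", "Cluj": "RO"}

def enrich(record):
    enriched = dict(record)
    enriched["country"] = COUNTRY_BY_CITY.get(record.get("city"), "unknown")
    enriched["hour_of_day"] = record["timestamp"] // 3600 % 24  # UTC hour
    return enriched

raw = {"city": "Berlin", "timestamp": 1380000000, "speed_kmh": 52}
rich = enrich(raw)
```

In a batch setup this function would run as a map step over the whole log; the point is that every later query can group by `country` or `hour_of_day` without re-deriving them.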
12. #3 Analyse: Cluster the data into meaningful groups and compare them
- Define Key Performance Indicators (KPIs)
- Cluster data in a meaningful way (e.g. by geography, time of day, customers' past behaviour)
- Compare data vs. reference sets
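A minimal sketch of the cluster-and-compare step, using average route rating as the KPI. The data and the reference values are invented for illustration:

```python
# Sketch: cluster drives by country and compare a KPI (average route
# rating) against a reference set. Pure stdlib; data is made up.
from collections import defaultdict
from statistics import mean

drives = [
    {"country": "DE", "rating": 4.5},
    {"country": "DE", "rating": 3.5},
    {"country": "RO", "rating": 5.0},
]
reference = {"DE": 4.2, "RO": 4.6}  # baseline KPI per cluster

by_country = defaultdict(list)
for d in drives:
    by_country[d["country"]].append(d["rating"])

kpi = {c: mean(ratings) for c, ratings in by_country.items()}
delta = {c: kpi[c] - reference[c] for c in kpi}  # + = better than baseline
```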
13. #4 Improve: Learn from the analysis where your challenges are and optimize behavior
- Manually / automatically adjust features (e.g. lower prices in certain regions)
- Develop A/B testing scenarios and formulate improvement theories
14. #5 Evaluate
- Check if the KPIs improve after applying the changes
- Accept changes that improved your users' behavior / reject changes that kept it the same
- Define which additional logs you might need to better cluster / identify behaviour
- Go back to step #1
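The accept/reject decision can be sketched as a simple before/after KPI comparison. The relative-lift threshold is an illustrative guard against accepting noise, not a value from the talk:

```python
# Sketch: accept a change only if the KPI improved by more than a
# minimum relative lift; otherwise reject it and keep the baseline.

def accept_change(kpi_before, kpi_after, min_lift=0.01):
    """Return True if the KPI improved by more than min_lift (relative)."""
    lift = (kpi_after - kpi_before) / kpi_before
    return lift > min_lift
```

With enough users (see slide 5), a real setup would add a statistical significance test on top of this raw threshold.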
15. #1 Log: A practical example of how this works at skobbler
- Software version
- Routing profile used
- Device
- Raw positions
- Geography (e.g. country)
- Rating of the route (optional)
- Destination reached (yes / no)
- Etc.
16. #2 Process: Enhance and split the data based on drives and segments
- Combine the data on a per-drive basis (= session)
- Combine the data on a per-segment basis (= how fast people are driving on a street versus our estimate)
- Identify key behavior across the route (e.g. re-routings)
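The per-drive split can be sketched as sessionization by time gap. The gap threshold and the tuple layout are illustrative assumptions, not skobbler's actual rules:

```python
# Sketch: split time-sorted raw positions into drives (sessions); a long
# gap between consecutive positions starts a new drive. Per-segment
# speed can then be aggregated within each drive.

GAP_SECONDS = 300  # assumed: >5 min without positions = new drive

def split_into_drives(positions):
    """positions: list of (timestamp_s, segment_id, speed_kmh), sorted."""
    drives, current = [], []
    for p in positions:
        if current and p[0] - current[-1][0] > GAP_SECONDS:
            drives.append(current)
            current = []
        current.append(p)
    if current:
        drives.append(current)
    return drives

positions = [(0, "seg_a", 50), (60, "seg_a", 55), (1000, "seg_b", 30)]
drives = split_into_drives(positions)   # the 940 s gap splits two drives
```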
17. Example: Real-time analysis with the Twitter Storm framework to detect road changes
[Figure: example visualization of drives in the last five minutes (real-time)]
18. Example: Historic driving patterns (processed with Hadoop / HBase)
19. #3 Analyse: Try to see in which areas our routing is not optimal
- KPIs:
  - Route rating (if given)
  - # of re-routings (the fewer, the better)
  - Time to destination vs. the routing estimate
- Cluster the data by:
  - Routing algorithm (and parameters used)
  - Geography
20. #4 Improve: Come up with strategies to improve the routing experience based on data
- For future routes, improve the estimate of time taken on a segment vs. time actually travelled
- Alter routing parameters based on country specifics to get better results (e.g. in Germany people drive faster on the Autobahn)
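One simple way to pull a segment's time estimate toward observed reality is an exponential moving average. The blend factor is an illustrative tuning parameter; the talk does not specify how skobbler updates its estimates:

```python
# Sketch: nudge a segment's estimated travel time toward what drivers
# actually needed, using an exponential moving average so a single
# outlier drive cannot swing the estimate.

def update_estimate(estimated_s, observed_s, alpha=0.2):
    """Blend the old estimate with a newly observed travel time."""
    return (1 - alpha) * estimated_s + alpha * observed_s

# Drivers consistently need ~90 s on a segment we estimated at 60 s:
estimate = 60.0
for _ in range(10):
    estimate = update_estimate(estimate, 90.0)
# after ten observations the estimate has moved most of the way to 90 s
```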
21. #5 Evaluate: Deploy the changes and compare them to reference data
- Deploy changes to production and compare ratings / timings vs. base values (~weekly)
- Verify whether other parameters, such as usage, also improve
22. Summary: Big data can drive big value while staying affordable
Simple formula:
Log -> Process -> Analyze -> Improve -> Evaluate = Success
23. Thank you for your attention!
Get in Touch: philipp.kandal@skobbler.com
Phone: +49-172-4597015
Follow me on Twitter: @apphil