4. Solution 1
- Send tracking events as query-string params to a server hosted on Rackspace
- Hourly job to parse the log files and insert summary data into SQL (sketch below)
- Problems:
  - Network bottleneck – dropping events
  - Managing SQL Server drive space
  - No scalability
  - Because of sizing problems we limited what we collected – poor analytics
  - No enrichment process
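For context, a minimal sketch of what that hourly parsing job could look like; the log format, the `event`/`pid` parameter names, and the `event_summary` table are illustrative assumptions, not the original implementation:

```python
# Hypothetical sketch of the hourly summarization job described above.
# Assumes a combined-style access log whose request line carries the
# tracking event as query-string params, e.g. GET /track?event=play&pid=123
from collections import Counter
from urllib.parse import urlparse, parse_qs
import re

REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/1\.[01]"')

def summarize(log_path: str) -> Counter:
    """Count events per (event, pid) pair for one hourly log file."""
    counts: Counter = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = REQUEST_RE.search(line)
            if not match:
                continue  # not a tracking request
            qs = parse_qs(urlparse(match.group("path")).query)
            event = qs.get("event", [None])[0]
            pid = qs.get("pid", [None])[0]
            if event:
                counts[(event, pid)] += 1
    return counts

if __name__ == "__main__":
    # Each (event, pid) count would then become one summary row in SQL.
    for (event, pid), n in summarize("access-2015-01-01-00.log").items():
        print(f"INSERT INTO event_summary VALUES ('{event}', '{pid}', {n});")
```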
5. Solution 2
- Distribute collection of the tracking events to the Akamai cloud (GET requests to a CDN endpoint)
- Akamai aggregates the logs and sends a batch every 4 hours via FTP (ingest sketch below)
- Hadoop – Hive – SQL summary tables, all hosted in the Azure cloud
- Problems:
  - Need for faster end-to-end reporting
  - To stay scalable we need summary tables – we lose granular reporting
  - Changes to the data we need to report on require re-building and possibly re-importing the raw data – data modeling
(Slide diagram: Akamai → Hadoop/Hive → SQL)
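A rough sketch of that 4-hourly ingest step, assuming a plain FTP drop of gzipped log files; the host, credentials, and directory names are placeholders rather than Akamai's actual delivery setup:

```python
# Hypothetical sketch of the 4-hourly batch ingest described above.
# Host, credentials, and paths are placeholders; the real log delivery
# details were configured per account.
import ftplib
import os

def pull_akamai_batch(local_dir: str = "/data/raw/akamai") -> list[str]:
    """Download any new gzipped CDN log files, ready for the Hadoop/Hive step."""
    os.makedirs(local_dir, exist_ok=True)
    fetched = []
    with ftplib.FTP("ftp.example-cdn-logs.net") as ftp:
        ftp.login(user="cdn_logs", passwd=os.environ["CDN_FTP_PASSWORD"])
        ftp.cwd("/logs")
        for name in ftp.nlst():
            if not name.endswith(".gz"):
                continue
            local_path = os.path.join(local_dir, name)
            if os.path.exists(local_path):
                continue  # already ingested in a previous run
            with open(local_path, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            fetched.append(local_path)
    return fetched

if __name__ == "__main__":
    # The downloaded batch would then be copied into Hadoop storage and
    # aggregated by Hive into the SQL summary tables.
    print(pull_akamai_batch())
```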
6. Requirements doc for the new solution
- Work with Flash and JavaScript trackers
- Robust data modeling – ability to change business requirements on the fly
- No need for summary data – granular reporting
- Robust and reliable enrichment process
- Fast and flexible end-to-end solution
3rd-party solution:
- Ability to send unlimited events and unstructured data (example payload below)
- Pricing not based on event volume (779 million events in Dec.)
- We own the data
- Hand-holding – managed service
- Beautiful and useful visualizations and a data export API (may require an additional 3rd party)
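To illustrate what "unstructured data" means in this requirement, a hypothetical event payload whose properties are arbitrary JSON rather than a fixed column set; the event name and fields below are made up for illustration:

```python
# Hypothetical example of an "unstructured" tracking event: the payload is
# arbitrary JSON describing the event, not a fixed set of columns.
import json

video_error_event = {
    "event": "video_error",
    "data": {
        "player_version": "3.2.1",
        "stream_url": "https://cdn.example.com/live/channel-7.m3u8",
        "error_code": 2032,
        "buffer_seconds": 4.7,
        "tags": ["live", "fallback-cdn"],
    },
}

print(json.dumps(video_error_event, indent=2))
```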
7. How'd we do? Solution: Snowplow
The requirements:
- Work with Flash and JavaScript trackers
- Pricing not based on event volume (779 million events in Dec.)
- Ability to send unlimited events and unstructured data
- Hand-holding
- Fast and flexible end-to-end solution
- We own the data
- Robust data modeling – ability to change business requirements on the fly
- No need for summary data – granular reporting
- Robust and reliable enrichment process
- Beautiful and useful visualizations and a data export API (may require an additional 3rd party)
How Snowplow answered them:
- We wrote an open-source AS3 tracker (tracker call sketch below)
- Fixed monthly fee + AWS usage
- No limits on size or event type
- Amazing customer service
- Pipeline can be adjusted based on needs
- Sits in our AWS account
- Because all raw data is stored, we can change the pipeline rules at any time and re-run
- We learned to live with summary data
- Constantly growing – today it surpasses our needs
- Today using BIME Analytics – soon to be in-house charting components or Amazon QuickSight
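A rough sketch of the kind of GET request such a tracker fires at the collector; the parameter names only approximate the Snowplow tracker protocol and the collector host is a placeholder, so treat this as illustration rather than the tracker's actual code:

```python
# Rough sketch of a tracker call to a Snowplow collector. Param names
# approximate the Snowplow tracker protocol (e=se is a structured event);
# the collector host and values are placeholders.
from urllib.parse import urlencode
from urllib.request import urlopen

COLLECTOR = "https://collector.example.com/i"  # placeholder endpoint

def track_structured_event(category: str, action: str, label: str = "") -> None:
    params = {
        "e": "se",            # event type: structured event
        "se_ca": category,    # event category
        "se_ac": action,      # event action
        "se_la": label,       # event label
        "p": "web",           # platform
        "tv": "as3-0.1.0",    # tracker name/version (our AS3 tracker)
    }
    # The collector responds with a 1x1 pixel; we only care that it was logged.
    urlopen(f"{COLLECTOR}?{urlencode(params)}", timeout=5).close()

if __name__ == "__main__":
    track_structured_event("video", "play", "homepage-hero")
```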
8. Gotchas we ran into
- Errors in the raw data being sent in – garbage in, garbage out!
- The solution, at the time, was not auto-scaling.
- Redshift is not MS SQL Server – you need to understand the nuances of columnar database queries and optimizations (sketch below)
- Real data analysts don't want charts – they want data. We spent a lot of time and money perfecting our charts when ultimately our customers want CSV exports. Today our charts are about 95% for marketing purposes.
- AWS cost forecasting and control
- Data modeling – ultimately we do need to summarize, but at an acceptable level.
  - Invest heavily in this stage.
  - Overestimate your needs – you don't know what you don't know.
  - Work with Snowplow (at extra cost) to get it right.
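A small illustration of the Redshift nuance above, assuming a hypothetical summary table: distribution and sort keys (not row-store indexes) drive performance, and queries should name only the columns they need.

```python
# Illustrative sketch only: table and column names are hypothetical, and the
# SQL would be run with any Redshift client (e.g. psycopg2) against your
# own cluster.

CREATE_EVENTS_TABLE = """
CREATE TABLE atomic.events_summary (
    event_date   DATE        NOT NULL,
    app_id       VARCHAR(64),
    event_name   VARCHAR(128),
    user_id      VARCHAR(64),
    event_count  BIGINT
)
DISTKEY (user_id)                -- co-locate a user's rows on one slice for joins
SORTKEY (event_date, app_id);    -- date-range scans can skip whole blocks
"""

# Select only the needed columns: Redshift reads per-column blocks, so
# "SELECT *" pays for every column even if you use two of them.
DAILY_COUNTS = """
SELECT event_date, event_name, SUM(event_count) AS events
FROM atomic.events_summary
WHERE event_date BETWEEN %s AND %s
GROUP BY 1, 2
ORDER BY 1, 2;
"""

if __name__ == "__main__":
    print(CREATE_EVENTS_TABLE)
    print(DAILY_COUNTS)
```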
9. What value do our analytics provide?
"It's not that big data is bad, but by looking for the big wins, we risk losing the most exciting potential of big data: the very small actionable insights that are unique to each individual. The real future potential of big data isn't in its capacity to be big, but rather in just how small it can get."
– Glen Tullman, Forbes