
Clickstream Analysis with Apache Spark


Apache Big Data Conference 2016, Vancouver BC: Talk by Andreas Zitzelsberger (@andreasz82, Principal Software Architect at QAware)

Abstract: On large-scale web sites, users leave thousands of traces every second. Businesses need to process and interpret these traces in real time to be able to react to the behavior of their users. In this talk, Andreas shows a real-world example of the power of a modern open-source stack. He will walk you through the design of a real-time clickstream analysis PaaS solution based on Apache Spark, Kafka, Parquet and HDFS. Andreas explains our decision-making and presents our lessons learned.



  3. ONE POT TO RULE THEM ALL
     Web Tracking, Ad Tracking, ERP, CRM
     ▪ Products ▪ Inventory ▪ Margins ▪ Customer ▪ Orders ▪ Creditworthiness ▪ Ad Impressions ▪ Ad Costs ▪ Clicks & Views ▪ Conversions
  4. ONE POT TO RULE THEM ALL
     Retention, Reach, Monetization
     Steer … ▪ Campaigns ▪ Offers ▪ Contents
  5. REACT TO WEBSITE TRAFFIC IN REAL TIME
     Image: https://www.flickr.com/photos/nick-m/3663923048
  6. SAMPLE RESULTS
     ▪ Geolocated and gender-specific conversions
     ▪ Frequency of visits
     ▪ Performance of an ad campaign
  7. THE CONCEPTS
     Image: Randy Paulino
  8. THE FIRST SKETCH
     (= real-time) SQL
  9. CALCULATING USER JOURNEYS
     Event stream → User journeys → Web / Ad tracking KPIs:
     ▪ Unique users ▪ Conversions ▪ Ad costs / conversion value ▪ …
     Legend: C = Click, V = View, VT = View Time, X = Conversion
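One way to read the user-journey slide, as a minimal sketch in plain Java without Spark: events are grouped per user in stream order, and a journey is closed when a conversion arrives. The legend mapping (C = Click, V = View, VT = View Time, X = Conversion), the rule that a conversion ends a journey, and all names (`JourneySketch`, `Event`, `journeys`, `conversions`) are assumptions for illustration and are not taken from the talk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class JourneySketch {
    // Assumed legend: C = click, V = view, VT = view time, X = conversion.
    enum Type { C, V, VT, X }

    // Hypothetical event shape: only the fields this sketch needs.
    static final class Event {
        final long userId;
        final Type type;
        Event(long userId, Type type) { this.userId = userId; this.type = type; }
    }

    private static <T> T lastOf(List<T> list) { return list.get(list.size() - 1); }

    /** Splits a time-ordered stream into per-user journeys; a conversion closes a journey (assumption). */
    static Map<Long, List<List<Type>>> journeys(List<Event> stream) {
        Map<Long, List<List<Type>>> byUser = new LinkedHashMap<>();
        for (Event e : stream) {
            List<List<Type>> userJourneys = byUser.computeIfAbsent(e.userId, id -> new ArrayList<>());
            // Start a new journey for a new user or after a conversion.
            if (userJourneys.isEmpty() || lastOf(lastOf(userJourneys)) == Type.X) {
                userJourneys.add(new ArrayList<>());
            }
            lastOf(userJourneys).add(e.type);
        }
        return byUser;
    }

    /** KPI: number of journeys that ended in a conversion. */
    static long conversions(Map<Long, List<List<Type>>> byUser) {
        long count = 0;
        for (List<List<Type>> userJourneys : byUser.values()) {
            for (List<Type> journey : userJourneys) {
                if (lastOf(journey) == Type.X) count++;
            }
        }
        return count;
    }
}
```

With this shape, the unique-user KPI is simply the number of keys in the returned map.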
  10. THE ARCHITECTURE
      Big Data
  11. "LARRY & FRIENDS" ARCHITECTURE
      Does not run well with more than 1 TB of data in terms of ingestion speed, query time and optimization effort.
  12. Nope. Sorry, no Big Data.
      Image: adweek.com
  13. "HADOOP & FRIENDS" ARCHITECTURE
      ▪ Aggregation takes too long
      ▪ Cumbersome programming model (can be solved with Pig, Cascading et al.)
      ▪ Not interactive enough
  14. Nope. Too sluggish.
  15. Κ-ARCHITECTURE
      ▪ Cumbersome programming model
      ▪ Over-engineered: We only need 15min real-time ;-)
      ▪ Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.
  16. Λ-ARCHITECTURE
      ▪ Cumbersome programming model
      ▪ Complex architecture
      ▪ Redundant logic
  17. FEELS OVER-ENGINEERED…
      http://www.brainlazy.com/article/random-nonsense/over-engineered
  18. The Final Architecture*
      *) Maybe called µ-architecture one day ;-)
  19. FUNCTIONAL ARCHITECTURE
      Components: Strange Events, Ingestion, Raw Event Stream, Collection, Events, Processing, Analytics Warehouse, Fact Entries, Atomic Event Frames, Data Lake, Master Data Integration
      ▪ Buffers load peaks
      ▪ Ensures message delivery (fire & forget for client)
      ▪ Create user journeys and unique user sets
      ▪ Enrich dimensions
      ▪ Aggregate events to KPIs
      ▪ Ability to replay for schema evolution
      ▪ The representation of truth
      ▪ Multidimensional data model
      ▪ Interactive queries for actions in real time and data exploration
      ▪ Eternal memory for all events (even strange ones)
      ▪ One schema per event type. Time partitioned.
      ▪ Fault-tolerant message handling
      ▪ Event handling: apply schema, time-partitioning, de-dup, sanity checks, pre-aggregation, filtering, fraud detection
      ▪ Tolerates delayed events
      ▪ High throughput, moderate latency (~ 1min)
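The event-handling bullets of the functional architecture (sanity checks, de-dup, a separate place for strange events) could look roughly like the following plain-Java sketch. The class and field names, the concrete sanity rules and the in-memory de-dup set are all assumptions made for illustration; the talk does not show this code, and a production system would bound the de-dup state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IngestionSketch {
    // Hypothetical raw event: id for de-dup, user id and event time for sanity checks.
    static final class RawEvent {
        final String id;
        final long userId;
        final long eventTimeMillis;
        RawEvent(String id, long userId, long eventTimeMillis) {
            this.id = id;
            this.userId = userId;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    final List<RawEvent> atomicEvents = new ArrayList<>();   // clean events
    final List<RawEvent> strangeEvents = new ArrayList<>();  // kept, but set aside
    private final Set<String> seenIds = new HashSet<>();     // unbounded here; a sketch only

    /** Routes one event: strange events are set aside, duplicates dropped, the rest accepted. */
    void ingest(RawEvent e, long nowMillis) {
        // Assumed sanity rules: an id must exist, user ids are positive, no events from the future.
        if (e.id == null || e.userId <= 0 || e.eventTimeMillis > nowMillis) {
            strangeEvents.add(e);
            return;
        }
        // De-dup: Set.add returns false if the id was already seen.
        if (!seenIds.add(e.id)) {
            return;
        }
        atomicEvents.add(e);
    }
}
```

Keeping strange events instead of dropping them matches the "eternal memory for all events (even strange ones)" bullet.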
  20. SERIAL CONNECTION OF STREAMING AND BATCHING
      Components: Ingestion, Raw Event Stream, Collection, Event Data Lake, Processing, Analytics Warehouse, Fact Entries, SQL Interface, Atomic Event Frames
      ▪ Cool programming model
      ▪ Uniform dev&ops
      ▪ Simple solution
      ▪ High compression ratio due to column-oriented storage
      ▪ High scan speed
      ▪ Cool programming model
      ▪ Uniform dev&ops
      ▪ High performance
      ▪ Interface to R out-of-the-box
      ▪ Useful libs: MLlib, GraphX, NLP, …
      ▪ Good connectivity (JDBC, ODBC, …)
      ▪ Interactive queries
      ▪ Uniform ops
      ▪ Can easily be replaced due to Hive Metastore
      ▪ Obvious choice for cloud-scale messaging
      ▪ By far the best throughput and scalability of all evaluated alternatives
  21. public Map<Long, UserJourney> sessionize(JavaRDD<AtomicEvent> events) {
          return events
              // Convert to a pair RDD with the userId as key
              .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
              // Build user journeys
              .<UserJourneyAcc>combineByKey(
                  UserJourneyAcc::create, UserJourneyAcc::add, UserJourneyAcc::combine)
              // Unwrap each accumulator (hypothetical accessor; the slide text breaks off before this step)
              .mapValues(UserJourneyAcc::toUserJourney)
              // Convert to a Java map
              .collectAsMap();
      }
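The `combineByKey` call in `sessionize` takes three functions: one that creates an accumulator from the first value of a key, one that adds a further value, and one that merges two accumulators built on different partitions. The following plain-Java sketch mimics that contract without Spark. The `Acc` class is a hypothetical stand-in for `UserJourneyAcc`, whose real implementation is not shown on the slide, and events are reduced to strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CombineByKeySketch {
    /** Hypothetical accumulator; stands in for the UserJourneyAcc named on the slide. */
    static final class Acc {
        final List<String> events = new ArrayList<>();

        // createCombiner: first value seen for a key on a partition.
        static Acc create(String first) {
            Acc acc = new Acc();
            acc.events.add(first);
            return acc;
        }

        // mergeValue: a further value for the same key on the same partition.
        Acc add(String next) {
            events.add(next);
            return this;
        }

        // mergeCombiners: accumulators for the same key from different partitions.
        Acc combine(Acc other) {
            events.addAll(other.events);
            return this;
        }
    }

    /** Folds one partition of (userId, event) pairs into per-key accumulators. */
    static Map<Long, Acc> foldPartition(List<Map.Entry<Long, String>> partition) {
        Map<Long, Acc> accs = new HashMap<>();
        for (Map.Entry<Long, String> kv : partition) {
            Acc acc = accs.get(kv.getKey());
            accs.put(kv.getKey(), acc == null ? Acc.create(kv.getValue()) : acc.add(kv.getValue()));
        }
        return accs;
    }

    /** Merges two per-partition results; accumulators from the left map are reused and mutated. */
    static Map<Long, Acc> merge(Map<Long, Acc> left, Map<Long, Acc> right) {
        Map<Long, Acc> out = new HashMap<>(left);
        right.forEach((key, acc) -> out.merge(key, acc, Acc::combine));
        return out;
    }
}
```

Spark does not guarantee the order in which values or partitions are combined, so a real journey accumulator would presumably sort its events by timestamp before interpreting them.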
  22. STREAM VERSUS BATCH
      Images: https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpg and https://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/
  23. APACHE FLINK
      ■ Also has a nice, Spark-like API
      ■ Promises similar or better performance than Spark
      ■ Looks like the best solution for a κ-architecture
      ■ But it’s also the newest kid on the block
  24. EVENT VERSUS PROCESSING TIME
      ■ There’s a difference between event time (te) and processing time (tp).
      ■ Events arrive out of order even during normal operation.
      ■ Events may arrive arbitrarily late.
      Apply a grace period before processing events. Allow arbitrary update windows of metrics.
  25. EXAMPLE
      Figure: insert (I) and update (U) of facts per resolution in time (Minute, Hour, Day, Week, Month, Quarter, Year), plotted over time.
      Legend: tp: processing time, ti: ingestion time, te: event time, dtp: aggregation time frame, dtw: grace period, I: insert fact, U: update fact
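The interplay of event time, ingestion time, aggregation time frame and grace period can be sketched as a small decision function. The rule used here is an assumption about how the figure is meant to be read: a frame of length `dtp` is first processed once the grace period `dtw` after its end has passed; events ingested by then are part of the initial insert, later ones cause an update of the existing fact. All names are hypothetical and times are plain millisecond values.

```java
final class GracePeriodSketch {
    enum Action { INSERT, UPDATE }

    /** Start of the aggregation frame of length dtp that contains event time te. */
    static long frameStart(long te, long dtp) {
        return Math.floorDiv(te, dtp) * dtp;
    }

    /**
     * Decides how an event contributes to the fact of its frame.
     * Assumed rule: the frame is first written at frame end + grace period (dtw).
     */
    static Action actionFor(long te, long ti, long dtp, long dtw) {
        long firstProcessing = frameStart(te, dtp) + dtp + dtw;
        return ti <= firstProcessing ? Action.INSERT : Action.UPDATE;
    }
}
```

For example, with a 15-minute frame and a one-minute grace period, an event that is ingested two minutes after its frame ended falls into the update path.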
  26. LESSONS LEARNED
      Image: http://hochmeister-alpin.at
  27. BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS
      ETL, Analytics, Realtime Analytics, Slice & Dice, Data Exploration
      Polyglot Processing: http://datadventures.ghost.io/2014/07/06/polyglot-processing
  28. POLYGLOT ANALYTICS
      Data Lake, Analytics Warehouse
      SQL lane, R lane, Timeseries lane
      Reporting, Data Exploration, Data Science
  29. NO RETENTION PARANOIA
      Events, Strange Events
      Data Lake:
      ▪ Eternal memory
      ▪ Close to raw events
      ▪ Allows replays and refills into warehouse
      Analytics Warehouse: aggressive forgetting with a clearly defined retention policy per aggregation level, like:
      ▪ 15min:30d
      ▪ 1h:4m
      ▪ …
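A retention policy per aggregation level, as in the `15min:30d` and `1h:4m` examples, can be sketched as a lookup table plus an expiry check. Reading `4m` as four months is an assumption, as are the class and method names and the choice to keep facts of unknown levels.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.Period;
import java.time.temporal.TemporalAmount;
import java.util.LinkedHashMap;
import java.util.Map;

final class RetentionSketch {
    /** Aggregation level -> retention, mirroring "15min:30d" and "1h:4m" ("4m" read as months). */
    static final Map<Duration, TemporalAmount> POLICY = new LinkedHashMap<>();
    static {
        POLICY.put(Duration.ofMinutes(15), Period.ofDays(30));
        POLICY.put(Duration.ofHours(1), Period.ofMonths(4));
    }

    /** True if a fact of the given aggregation level is older than its retention. */
    static boolean isExpired(Duration level, LocalDateTime factTime, LocalDateTime now) {
        TemporalAmount retention = POLICY.get(level);
        if (retention == null) {
            return false; // unknown level: keep the fact (assumed default)
        }
        return factTime.isBefore(now.minus(retention));
    }
}
```

Because the data lake keeps the raw events, forgetting fine-grained facts in the warehouse is reversible: they can be refilled by a replay.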
  30. BEWARE OF THE HIPSTERS
      Image: h&m
  31. ENSURE YOUR SOFTWARE RUNS LOCALLY
      The entire architecture must be able to run locally. Keep the round trips short for development and testing. Throughput and reaction times need to be monitored continuously. Tune your software and the underlying frameworks as needed.
  32. TUNE CONTINUOUSLY
      Components: Ingestion, Raw Event Stream, Collection, Event Data Lake, Processing, Analytics Warehouse, Fact Entries, SQL Interface, Atomic Event Frames
      ▪ Load generator
      ▪ Throughput & latency probes
      ▪ System, container and process monitoring
  33. IN NUMBERS
      ▪ Overall dev effort until the first release: 250 person days
      ▪ Dimensions: 10
      ▪ KPIs: 26
      ▪ Integrated 3rd party systems: 7
      ▪ Inbound data volume per day: 80 GB
      ▪ New data in DWH per day: 2 GB
      ▪ Total price of cheapest cluster which is able to handle production load:
  34. THANK YOU
      @andreasz82, andreas.zitzelsberger@qaware.de
  35. BONUS SLIDES
  36. CALCULATING UNIQUE USERS
      ■ We need an exact unique user count.
      ■ If you can, you should use an approximation such as HyperLogLog.
      Figure: users U1, U2, U3, U1, U4 over time; 3 UU, 2 UU, 4 UU.
      Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
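The figure's numbers suggest why exact unique users are a stateful aggregation: if the first window contains U1, U2, U3 and the second U1, U4 (an assumed reading of the figure), the windows have 3 and 2 unique users, but the total is 4, not 5. A minimal plain-Java sketch with hypothetical names:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueUsersSketch {
    /** Exact unique users within one time window. */
    static int uniqueUsers(List<String> userIds) {
        return new HashSet<>(userIds).size();
    }

    /** Exact unique users across windows: union the user sets, do not add the counts. */
    static int uniqueUsersAcross(List<List<String>> windows) {
        Set<String> all = new HashSet<>();
        for (List<String> window : windows) {
            all.addAll(window);
        }
        return all.size();
    }
}
```

The price of exactness is that the user sets themselves must be kept per window, which is the memory cost an approximation such as HyperLogLog avoids.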
  37. CHARTING TECHNOLOGY
      https://github.com/qaware/big-data-landscape
  38. CHOOSING WHERE TO AGGREGATE
      Components: Ingestion, Event Data Lake, Processing, Analytics Warehouse, Fact Entries, Analytics, Atomic Event Frames
      1. Enrichment, preprocessing, validation
      2. The heavy lifting
      3. Processing steps that can be done at query time; interactive queries