SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
CLICKSTREAM ANALYSIS WITH
SPARK – UNDERSTANDING
VISITORS IN REAL-TIME
Dr. Josef Adersberger
QAware GmbH, Germany
THE CHALLENGE
One Kettle to Rule ‘em All
Web Tracking Ad Tracking
ERP CRM
One Kettle to Rule ‘em All
Retention Reach
Monetarization
steer …
§ Campaigns
§ Offers
§ Contents
THE CONCEPTS
by Randy Paulino
The First Sketch
(= real-time)
Tableau
User Journey Analysis
C V VT VT VT C X
C V
V V V V V V V
C V V C V V V
VT VT V V V VT C
V X
Event stream: User journeys:
Web / Ad tracking
KPIs:
§ Unique users
§ Conversions
§ Ad costs / conversion value
§ …
V
X
VT
C Click
View
View Time
Conversion
THE ARCHITECTURE
„Larry & Friends“ Architecture
Collector Aggregation
SQL DB
Runs not well for more
than 1 TB data in terms of
ingestion speed, query time
and optimization efforts
by adweek.com
Nope.
Sorry, no Big Data.
„Hadoop & Friends“ Architecture
Collector
Batch Processor
[Hadoop]
Collector Event Data Lake Batch Processor Analytics DB
JSON
Stream
Aggregation
takes too long
Cumbersome
programming model
(can be solved with
pig, cascading et al.)
Not
interactive
enough
Nope.
Too sluggish.
κ-Architecture
Collector Stream Processor
Analytics DB
Persistence
JSON
Stream
Cumbersome
programming model
Over-engineered: We only need
15min real-time ;-)
Stateful aggregations (unique x,
conversions) require a separate DB
with high throughput and fast
aggregations & lookups.
λ-Architecture
Collector
Event Processor
Event Data Lake Batch Processor
Analytics DB
Speed Layer
Batch Layer
JSON
Stream
Cumbersome
programming model
Complex
architecture
Redundant
logic
Feels Over-Engineered…
http://www.brainlazy.com/article/random-nonsense/over-engineered
The Final Architecture*
*) Maybe called μ-architecture one day ;-)
Functional Architecture
Strange Events
IngestionRaw Event
Stream
Collection Events Processing Analytics
Warehouse
Fact
Entries
Atomic Event
Frames
Data Lake
Master Data Integration
§ Buffers load peeks
§ Ensures message
delivery (fire & forget
for client)
§ Create user journeys and
unique user sets
§ Enrich dimensions
§ Aggregate events to KPIs
§ Ability to replay for schema
evolution
§ The representation of truth
§ Multidimensional data
model
§ Interactive queries for
actions in realtime and
data exploration
§ Eternal memory for all
events (even strange
ones)
§ One schema per event
type. Time partitioned.
class Analytics Model
´ factª
WebFact
´ dimensionª
Zeit
´ dimensionª
Kampagne
Jahr
Quartal
Monat
Woche
Tag
Stunde
Minute
Kunde
+ Land: String
Partner
´ dimensionª
Tracking
Tracking Group
SensorTag
+ Typ: SensorTagType
Platzierung
+ Format: ImageSize
+ Kostenmodell: KostenmodellArt
Werbemittel
+ AdGroup: String
+ Format: ImageSize
+ Grˆ fle: KiloBytes
+ LandingPage: URL
+ Motif: URL
Kampagne
´ dimensionª
Client
Kategorie
Dev ice
+ Bezeichner: String
+ Hersteller: String
+ Typ: String
Brow ser
+ Typ: String
+ Version: int
´ dimensionª
Ausspielort
LandRegion
Stadt
´ dimensionª
Kanal
Kanal
´ dimensionª
Vermarktung
´ enumerationª
SensorTagType
ORDER_TAG
MASTER_TAG
CUSTOM_TAG
Betriebssystem
+ Typ: String
+ Version: Version
? Dimension:	Unabh‰ng ig es	Pr‰dikat	auf	Metriken	bei	der	Analyse	("kann	isoliert	dar¸ ber	nachdenken	/	isoliert	dazu	Analysen	
fahren")										
? H ierarchie:	Sub-Pr‰dikat	auf	Metriken.	Erzeug t	mehr	als	eine	(zueinander	diskunkte)	Teilmeng en	der	Metriken.	Entspricht	den	g ‰ng ig en	
Drill-Down-Pfaden	in	den	Reports		bzw.	den	Batch-Ag g reg ate-Up-Pfaden	in	der	Ag g reg ationslog ik.		Semantische	Unterstrukturen:	"ist	Teil	
von	& 	kann	nicht	existieren	ohne".		
? Asssoziation:	Nicht	verwendet.	Separates	Stammdatenmodell.														
? Attribut:	Ermˆ g licht	eine	weitere	(querschneidende)	Einschr‰nkung 	der	Metrikmeng e	erg ‰nzend	zu	den	Hierarchien.											
Domain
Website
Tracking Site
Vermarkter
Auslieferungs-
Domain
Referral
´ enumerati...
KostenmodellArt
CPC
CPM
CPO
CPA
´ abstractª
DimensionValue
+ id: int
+ name: String
+ sourceId: String
WebsiteFact
+ Bounces: int
+ Verweildauer: float
+ Visits: int
BasicAdFact
+ Clicks: int
+ Sichtbare Views: int
+ Validierte Clicks: int
+ View (angefragt): int
+ View (ausgeliefert): int
+ View (gemessen): int
´ dimensionª
Produkt
Shop
Produkt
+ Produktkategorie: String
´ dimensionª
Zeitfenster
Letzte X Tage
´ dimensionª
User
User Segment
´ dimensionª
Order
OrderStatus
+ Status: OrderStatus
´ enumerationª
OrderStatus
IN_BEARBEITUNG
ERFOLGREICH (AKTIVIERT)
ABGELEHNT
NICHT_IN_BEARBEITUNG
UniquesFact
+ Unique Clicks: int
+ Unique Users: int
+ Unique Views: int
AdCostFact
+ CPC: int
+ Kosten: float
Conv ersionFact
+ PC: int
+ PR: int
+ PV: int
+ Umsatz PC: float
+ Umsatz PR: float
+ Umsatz PV: float
AdVisibilityFact
+ Sichtbarkeitsdauer: float
Activ atedOrderFact
+ Orders: int
+ Umsatz: float
TrackingFact
+ Orders: int
+ Page Impressions: int
+ Umsatz: float
X	= 	{7,	14,	28,	30}
§ Fault tolerant message handling
§ Event handling: Apply schema, time-partitioning, De-dup, sanity
checks, pre-aggregation, filtering, fraud detection
§ Tolerates delayed events
§ High throughput, moderate latency (~ 1min)
Series Connection of Streaming
and Batching - all based on Spark.
IngestionRaw Event
Stream
Collection Event Data Lake Processing Analytics
Warehouse
Fact
Entries
SQL Interface
Atomic Event
Frames
§ Cool programming model
§ Uniform dev&ops
§ Simple solution
§ High compression ratio due to
column-oriented storage
§ High scan speed
§ Cool programming model
§ Uniform dev&ops
§ High performance
§ Interface to R out-of-the-box
§ Useful libs: MLlib, GraphX, NLP, …
§ Good connectivity (JDBC,
ODBC, …)
§ Interactive queries
§ Uniform ops
§ Can easily be replaced
due to Hive Metastore
§ Obvious choice for
cloud-scale messaging
§ Way the best throughput
and scalability of all
evaluated alternatives
LESSONS LEARNED
by: http://hochmeister-alpin.at
Technology Mapping
https://github.com/qaware/big-data-landscape
User Interface
Data Lake
Data Warehouse
Ingestion
Processing
Data Science Interactive Analysis Reporting & Dashboards
Data Sources
Analytics
Micro Analytics Services instead of reporting
servers.
Charting Libraries:Dashboards:
Analytics Frontends
Algorithm Libraries
Structured Data Lake: The eternal memory.
Efficient data serialization formats:
§ Integated compression
§ Column-oriented storage
§ Predicate pushdown
Distributed Filesystem or NoSQL DB
Data Workflows ETL Jobs Massive
Parallelization
Pig Open Studio
Data Logicstics Stream
Processing
NewSQL: SQL meets NoSQL.
Polyglott Persistence
Index Machines: Fast aggregation and search.
In-Memory Databases: Fast access.
Time Series Databases
Atlas
Polyglott Analytics
Data Lake
Analytics
Warehouse
SQL
lane
R
lane
Timeseries
lane
Reporting Data Exploration
Data Science
Micro Analytics Services
Microservice
Dashboard
Microservice …
No Retention Paranoia
Data Lake
Analytics
Warehouse
§ Eternal memory
§ Close to raw events
§ Allows replays and refills
into warehouse
Aggressive forgetting with clearly defined
retention policy per aggregation level like:
§ 15min:30d
§ 1h:4m
§ …
Events
Strange Events
Continuous Tuning
IngestionRaw Event
Stream
Collection Event Data Lake Processing Analytics
Warehouse
Fact
Entries
SQL Interface
Atomic Event
Frames
Load
Generator Throughput & latency probes
In Numbers
Overall dev effort until the first release: 250 person days
Dimensions: 10 KPIs: 26
Integrated 3rd party systems: 7
Inbound data volume per day: 80GB
New data in DWH per day: 2GB
Total price of cheapest cluster which is able to handle production load:
THANK YOU.
@adersberger
josef.adersberger@qaware.de
Download slides
Bonus Topic: Roadmap
SparkKafka HDFS
IaaS
Simplify Ops with Mesos
Faster aggregation & easier updates
with Spark-on-Solr
http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html
Bonus Topic: Smart Aggregation
Ingestion Event Data Lake Processing Analytics
Warehouse
Fact
Entries
Analytics
Atomic Event
Frames
1 2
3
Architecture follows requirements
class Analytics Model
´ factª
WebFact
´ dimensionª
Zeit
´ dimensionª
Kampagne
Jahr
Quartal
Monat
Woche
Tag
Stunde
Minute
Kunde
+ Land: String
Partner
´ dimensionª
Tracking
Tracking Group
SensorTag
+ Typ: SensorTagType
Platzierung
+ Format: ImageSize
+ Kostenmodell: KostenmodellArt
Werbemittel
+ AdGroup: String
+ Format: ImageSize
+ Grˆ fle: KiloBytes
+ LandingPage: URL
+ Motif: URL
Kampagne
´ dimensionª
Client
Kategorie
Dev ice
+ Bezeichner: String
+ Hersteller: String
+ Typ: String
Browser
+ Typ: String
+ Version: int
´ dimensionª
Ausspielort
LandRegion
Stadt
´ dimensionª
Kanal
Kanal
´ dimensionª
Vermarktung
´ enumerationª
SensorTagType
ORDER_TAG
MASTER_TAG
CUSTOM_TAG
Betriebssystem
+ Typ: String
+ Version: Version
? Dimension:	Unabh‰ngiges	Pr‰dikat	auf	Metriken	bei	der	Analyse	("kann	isoliert	dar¸ ber	nachdenken	/	isoliert	dazu	Analysen	
fahren")										
? H ierarchie:	Sub-Pr‰dikat	auf	Metriken.	Erzeugt	mehr	als	eine	(zueinander	diskunkte)	Teilmengen	der	Metriken.	Entspricht	den	g‰ngigen	
Drill-Down-Pfaden	in	den	Reports		bzw.	den	Batch-Aggregate-Up-Pfaden	in	der	Aggregationslogik.		Semantische	Unterstrukturen:	"ist	Teil	
von	&	kann	nicht	existieren	ohne".		
? Asssoziation:	Nicht	verwendet.	Separates	Stammdatenmodell.														
? Attribut:	Ermˆ glicht	eine	weitere	(querschneidende)	Einschr‰nkung	der	Metrikmenge	erg‰nzend	zu	den	Hierarchien.											
Domain
Website
Tracking Site
Vermarkter
Auslieferungs-
Domain
Referral
´ enumerati...
KostenmodellArt
CPC
CPM
CPO
CPA
´ abstractª
DimensionValue
+ id: int
+ name: String
+ sourceId: String
WebsiteFact
+ Bounces: int
+ Verweildauer: float
+ Visits: int
BasicAdFact
+ Clicks: int
+ Sichtbare Views: int
+ Validierte Clicks: int
+ View (angefragt): int
+ View (ausgeliefert): int
+ View (gemessen): int
´ dimensionª
Produkt
Shop
Produkt
+ Produktkategorie: String
´ dimensionª
Zeitfenster
Letzte X Tage
´ dimensionª
User
User Segment
´ dimensionª
Order
OrderStatus
+ Status: OrderStatus
´ enumerationª
OrderStatus
IN_BEARBEITUNG
ERFOLGREICH (AKTIVIERT)
ABGELEHNT
NICHT_IN_BEARBEITUNG
UniquesFact
+ Unique Clicks: int
+ Unique Users: int
+ Unique Views: int
AdCostFact
+ CPC: int
+ Kosten: float
Conv ersionFact
+ PC: int
+ PR: int
+ PV: int
+ Umsatz PC: float
+ Umsatz PR: float
+ Umsatz PV: float
AdVisibilityFact
+ Sichtbarkeitsdauer: float
Activ atedOrderFact
+ Orders: int
+ Umsatz: float
TrackingFact
+ Orders: int
+ Page Impressions: int
+ Umsatz: float
X	= 	{7,	14,	28,	30}
act Processing
Processing
WebsiteOrderActivationAdCostConversionUniques+
Overlap
AdVisibilityBasicAdsTracking
Dimensionsv erzeichnis
aktualisieren
Page
Impressions
und Orders
zählen
Aggregation
TrackingFact
Ad
Impressions
zählen
Aggregation
BasicAdFact
CTR und
Sichtbarkeitsrate
berechnen
Inferenz der
Sichtbarkeitsdauer
Aggregation
AdVisibilityFact
Dimensionsraum
aufspannen
Pro Vektor die Ev ents und
die dort enthaltenen UserIds
und Interaktionsarten
ermitteln
Menge der UserIds pro
Interaktionsart erzeugen und deren
Mächtigkeit bestimmen
UniquesFact
User Journeys
erstellen (bzw.
LV/LC pro User)
Attribution v on Conv ersion
auf Basis v orkonfiguriertem
Conv ersion Modell
Conv ersions
zählen (PV, PC, PR)
Warenkorbwert ermitteln
bei Order-Conv ersion
(PV, PC, PR)
Aggregation
Conv ersionFact
Kosten pro
Ev ent
ermitteln Aggregation
Nicht zuweisbare
Kosten ermitteln
CostFact
Inferenz der Visits
Visits zählen
Bounces zählen
Analyse der Verweildauer
Aggregation
WebsiteFact
Order Status
ermitteln
Anzahl der nicht getrackten Orders ermitteln
Aggregation
Activ atedOrderFact
:Analytics Warehouse:Event Data Lake
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
Data Processing WorkflowMultidimensional Data Model
Sample results
Geolocated and gender-
specific conversions.
Frequency of visits
Performance of an ad campaign

Weitere ähnliche Inhalte

Was ist angesagt?

Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 

Was ist angesagt? (20)

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 

Andere mochten auch

Andere mochten auch (6)

Clickstream Analysis
Clickstream AnalysisClickstream Analysis
Clickstream Analysis
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark
 
Clickstream ppt copy
Clickstream ppt   copyClickstream ppt   copy
Clickstream ppt copy
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream
 
clickstream analysis
 clickstream analysis clickstream analysis
clickstream analysis
 
Click Stream Analysis
Click Stream AnalysisClick Stream Analysis
Click Stream Analysis
 

Ähnlich wie Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef Adersberger

Analytic powerhouse parallel data warehouse und r
Analytic powerhouse parallel data warehouse und rAnalytic powerhouse parallel data warehouse und r
Analytic powerhouse parallel data warehouse und r
Marcel Franke
 
Skalierung & Performance
Skalierung & PerformanceSkalierung & Performance
Skalierung & Performance
glembotzky
 
Automatisierung von Client-seitigen Web-Performance-Optimierungen
Automatisierung von Client-seitigen Web-Performance-OptimierungenAutomatisierung von Client-seitigen Web-Performance-Optimierungen
Automatisierung von Client-seitigen Web-Performance-Optimierungen
Jakob
 
Data Is The New Oil
Data Is The New OilData Is The New Oil
Data Is The New Oil
ParStream
 
Einsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher Software
Einsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher SoftwareEinsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher Software
Einsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher Software
Andreas Schreiber
 

Ähnlich wie Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef Adersberger (20)

Clickstream Analysis with Spark - Understanding Visitors in Real Time
Clickstream Analysis with Spark - Understanding Visitors in Real TimeClickstream Analysis with Spark - Understanding Visitors in Real Time
Clickstream Analysis with Spark - Understanding Visitors in Real Time
 
Analytic powerhouse parallel data warehouse und r
Analytic powerhouse parallel data warehouse und rAnalytic powerhouse parallel data warehouse und r
Analytic powerhouse parallel data warehouse und r
 
Dataservices - Data Processing mit Microservices
Dataservices - Data Processing mit MicroservicesDataservices - Data Processing mit Microservices
Dataservices - Data Processing mit Microservices
 
EventDB - Hamburg 2013
EventDB - Hamburg 2013EventDB - Hamburg 2013
EventDB - Hamburg 2013
 
Amazon Redshift
Amazon RedshiftAmazon Redshift
Amazon Redshift
 
worldiety GmbH - Datenanalyse
worldiety GmbH - Datenanalyse worldiety GmbH - Datenanalyse
worldiety GmbH - Datenanalyse
 
Big Data Webinar (Deutsch)
Big Data Webinar (Deutsch)Big Data Webinar (Deutsch)
Big Data Webinar (Deutsch)
 
Skalierung & Performance
Skalierung & PerformanceSkalierung & Performance
Skalierung & Performance
 
Kennst du ein Unternehmen, dass erfolgreich die QS outtasked hat?“
Kennst du einUnternehmen, dass erfolgreichdie QS outtasked hat?“Kennst du einUnternehmen, dass erfolgreichdie QS outtasked hat?“
Kennst du ein Unternehmen, dass erfolgreich die QS outtasked hat?“
 
Roadshow: What's new in Microsoft SQL Server 2016
Roadshow: What's new in Microsoft SQL Server 2016Roadshow: What's new in Microsoft SQL Server 2016
Roadshow: What's new in Microsoft SQL Server 2016
 
Make Developers Fly: Principles for Platform Engineering
Make Developers Fly: Principles for Platform EngineeringMake Developers Fly: Principles for Platform Engineering
Make Developers Fly: Principles for Platform Engineering
 
Meet Magento - High performance magento
Meet Magento - High performance magentoMeet Magento - High performance magento
Meet Magento - High performance magento
 
Der Status Quo des Chaos Engineerings
Der Status Quo des Chaos EngineeringsDer Status Quo des Chaos Engineerings
Der Status Quo des Chaos Engineerings
 
Automatisierung von Client-seitigen Web-Performance-Optimierungen
Automatisierung von Client-seitigen Web-Performance-OptimierungenAutomatisierung von Client-seitigen Web-Performance-Optimierungen
Automatisierung von Client-seitigen Web-Performance-Optimierungen
 
JavaFX Real-World Apps
JavaFX Real-World AppsJavaFX Real-World Apps
JavaFX Real-World Apps
 
Data Is The New Oil
Data Is The New OilData Is The New Oil
Data Is The New Oil
 
Data Vault DWH Automation
Data Vault DWH AutomationData Vault DWH Automation
Data Vault DWH Automation
 
Python in der Luft- und Raumfahrt
Python in der Luft- und RaumfahrtPython in der Luft- und Raumfahrt
Python in der Luft- und Raumfahrt
 
TDD für Testmuffel
TDD für TestmuffelTDD für Testmuffel
TDD für Testmuffel
 
Einsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher Software
Einsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher SoftwareEinsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher Software
Einsatz von Subversion bei der Entwicklung technisch-wissenschaftlicher Software
 

Mehr von Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef Adersberger

  • 1. CLICKSTREAM ANALYSIS WITH SPARK – UNDERSTANDING VISITORS IN REAL-TIME Dr. Josef Adersberger QAware GmbH, Germany
  • 3. One Kettle to Rule ‘em All Web Tracking Ad Tracking ERP CRM
  • 4. One Kettle to Rule ‘em All Retention Reach Monetarization steer … § Campaigns § Offers § Contents
  • 5.
  • 7. The First Sketch (= real-time) Tableau
  • 8. User Journey Analysis C V VT VT VT C X C V V V V V V V V C V V C V V V VT VT V V V VT C V X Event stream: User journeys: Web / Ad tracking KPIs: § Unique users § Conversions § Ad costs / conversion value § … V X VT C Click View View Time Conversion
  • 10. „Larry & Friends“ Architecture Collector Aggregation SQL DB Runs not well for more than 1 TB data in terms of ingestion speed, query time and optimization efforts
  • 12. „Hadoop & Friends“ Architecture Collector Batch Processor [Hadoop] Collector Event Data Lake Batch Processor Analytics DB JSON Stream Aggregation takes too long Cumbersome programming model (can be solved with pig, cascading et al.) Not interactive enough
  • 14. κ-Architecture Collector Stream Processor Analytics DB Persistence JSON Stream Cumbersome programming model Over-engineered: We only need 15min real-time ;-) Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.
  • 15. λ-Architecture Collector Event Processor Event Data Lake Batch Processor Analytics DB Speed Layer Batch Layer JSON Stream Cumbersome programming model Complex architecture Redundant logic
  • 17. The Final Architecture* *) Maybe called μ-architecture one day ;-)
  • 18. Functional Architecture Strange Events IngestionRaw Event Stream Collection Events Processing Analytics Warehouse Fact Entries Atomic Event Frames Data Lake Master Data Integration § Buffers load peeks § Ensures message delivery (fire & forget for client) § Create user journeys and unique user sets § Enrich dimensions § Aggregate events to KPIs § Ability to replay for schema evolution § The representation of truth § Multidimensional data model § Interactive queries for actions in realtime and data exploration § Eternal memory for all events (even strange ones) § One schema per event type. Time partitioned. class Analytics Model ´ factª WebFact ´ dimensionª Zeit ´ dimensionª Kampagne Jahr Quartal Monat Woche Tag Stunde Minute Kunde + Land: String Partner ´ dimensionª Tracking Tracking Group SensorTag + Typ: SensorTagType Platzierung + Format: ImageSize + Kostenmodell: KostenmodellArt Werbemittel + AdGroup: String + Format: ImageSize + Grˆ fle: KiloBytes + LandingPage: URL + Motif: URL Kampagne ´ dimensionª Client Kategorie Dev ice + Bezeichner: String + Hersteller: String + Typ: String Brow ser + Typ: String + Version: int ´ dimensionª Ausspielort LandRegion Stadt ´ dimensionª Kanal Kanal ´ dimensionª Vermarktung ´ enumerationª SensorTagType ORDER_TAG MASTER_TAG CUSTOM_TAG Betriebssystem + Typ: String + Version: Version ? Dimension: Unabh‰ng ig es Pr‰dikat auf Metriken bei der Analyse ("kann isoliert dar¸ ber nachdenken / isoliert dazu Analysen fahren") ? H ierarchie: Sub-Pr‰dikat auf Metriken. Erzeug t mehr als eine (zueinander diskunkte) Teilmeng en der Metriken. Entspricht den g ‰ng ig en Drill-Down-Pfaden in den Reports bzw. den Batch-Ag g reg ate-Up-Pfaden in der Ag g reg ationslog ik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne". ? Asssoziation: Nicht verwendet. Separates Stammdatenmodell. ? Attribut: Ermˆ g licht eine weitere (querschneidende) Einschr‰nkung der Metrikmeng e erg ‰nzend zu den Hierarchien. Domain Website Tracking Site Vermarkter Auslieferungs- Domain Referral ´ enumerati... KostenmodellArt CPC CPM CPO CPA ´ abstractª DimensionValue + id: int + name: String + sourceId: String WebsiteFact + Bounces: int + Verweildauer: float + Visits: int BasicAdFact + Clicks: int + Sichtbare Views: int + Validierte Clicks: int + View (angefragt): int + View (ausgeliefert): int + View (gemessen): int ´ dimensionª Produkt Shop Produkt + Produktkategorie: String ´ dimensionª Zeitfenster Letzte X Tage ´ dimensionª User User Segment ´ dimensionª Order OrderStatus + Status: OrderStatus ´ enumerationª OrderStatus IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG UniquesFact + Unique Clicks: int + Unique Users: int + Unique Views: int AdCostFact + CPC: int + Kosten: float Conv ersionFact + PC: int + PR: int + PV: int + Umsatz PC: float + Umsatz PR: float + Umsatz PV: float AdVisibilityFact + Sichtbarkeitsdauer: float Activ atedOrderFact + Orders: int + Umsatz: float TrackingFact + Orders: int + Page Impressions: int + Umsatz: float X = {7, 14, 28, 30} § Fault tolerant message handling § Event handling: Apply schema, time-partitioning, De-dup, sanity checks, pre-aggregation, filtering, fraud detection § Tolerates delayed events § High throughput, moderate latency (~ 1min)
  • 19. Series Connection of Streaming and Batching - all based on Spark. IngestionRaw Event Stream Collection Event Data Lake Processing Analytics Warehouse Fact Entries SQL Interface Atomic Event Frames § Cool programming model § Uniform dev&ops § Simple solution § High compression ratio due to column-oriented storage § High scan speed § Cool programming model § Uniform dev&ops § High performance § Interface to R out-of-the-box § Useful libs: MLlib, GraphX, NLP, … § Good connectivity (JDBC, ODBC, …) § Interactive queries § Uniform ops § Can easily be replaced due to Hive Metastore § Obvious choice for cloud-scale messaging § Way the best throughput and scalability of all evaluated alternatives
  • 21. Technology Mapping https://github.com/qaware/big-data-landscape User Interface Data Lake Data Warehouse Ingestion Processing Data Science Interactive Analysis Reporting & Dashboards Data Sources Analytics Micro Analytics Services instead of reporting servers. Charting Libraries:Dashboards: Analytics Frontends Algorithm Libraries Structured Data Lake: The eternal memory. Efficient data serialization formats: § Integated compression § Column-oriented storage § Predicate pushdown Distributed Filesystem or NoSQL DB Data Workflows ETL Jobs Massive Parallelization Pig Open Studio Data Logicstics Stream Processing NewSQL: SQL meets NoSQL. Polyglott Persistence Index Machines: Fast aggregation and search. In-Memory Databases: Fast access. Time Series Databases Atlas
  • 24. No Retention Paranoia Data Lake Analytics Warehouse § Eternal memory § Close to raw events § Allows replays and refills into warehouse Aggressive forgetting with clearly defined retention policy per aggregation level like: § 15min:30d § 1h:4m § … Events Strange Events
  • 25. Continuous Tuning IngestionRaw Event Stream Collection Event Data Lake Processing Analytics Warehouse Fact Entries SQL Interface Atomic Event Frames Load Generator Throughput & latency probes
  • 26. In Numbers Overall dev effort until the first release: 250 person days Dimensions: 10 KPIs: 26 Integrated 3rd party systems: 7 Inbound data volume per day: 80GB New data in DWH per day: 2GB Total price of cheapest cluster which is able to handle production load:
  • 27.
  • 29. Bonus Topic: Roadmap SparkKafka HDFS IaaS Simplify Ops with Mesos Faster aggregation & easier updates with Spark-on-Solr http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html
  • 30. Bonus Topic: Smart Aggregation Ingestion Event Data Lake Processing Analytics Warehouse Fact Entries Analytics Atomic Event Frames 1 2 3
  • 31. Architecture follows requirements class Analytics Model ´ factª WebFact ´ dimensionª Zeit ´ dimensionª Kampagne Jahr Quartal Monat Woche Tag Stunde Minute Kunde + Land: String Partner ´ dimensionª Tracking Tracking Group SensorTag + Typ: SensorTagType Platzierung + Format: ImageSize + Kostenmodell: KostenmodellArt Werbemittel + AdGroup: String + Format: ImageSize + Grˆ fle: KiloBytes + LandingPage: URL + Motif: URL Kampagne ´ dimensionª Client Kategorie Dev ice + Bezeichner: String + Hersteller: String + Typ: String Browser + Typ: String + Version: int ´ dimensionª Ausspielort LandRegion Stadt ´ dimensionª Kanal Kanal ´ dimensionª Vermarktung ´ enumerationª SensorTagType ORDER_TAG MASTER_TAG CUSTOM_TAG Betriebssystem + Typ: String + Version: Version ? Dimension: Unabh‰ngiges Pr‰dikat auf Metriken bei der Analyse ("kann isoliert dar¸ ber nachdenken / isoliert dazu Analysen fahren") ? H ierarchie: Sub-Pr‰dikat auf Metriken. Erzeugt mehr als eine (zueinander diskunkte) Teilmengen der Metriken. Entspricht den g‰ngigen Drill-Down-Pfaden in den Reports bzw. den Batch-Aggregate-Up-Pfaden in der Aggregationslogik. Semantische Unterstrukturen: "ist Teil von & kann nicht existieren ohne". ? Asssoziation: Nicht verwendet. Separates Stammdatenmodell. ? Attribut: Ermˆ glicht eine weitere (querschneidende) Einschr‰nkung der Metrikmenge erg‰nzend zu den Hierarchien. Domain Website Tracking Site Vermarkter Auslieferungs- Domain Referral ´ enumerati... KostenmodellArt CPC CPM CPO CPA ´ abstractª DimensionValue + id: int + name: String + sourceId: String WebsiteFact + Bounces: int + Verweildauer: float + Visits: int BasicAdFact + Clicks: int + Sichtbare Views: int + Validierte Clicks: int + View (angefragt): int + View (ausgeliefert): int + View (gemessen): int ´ dimensionª Produkt Shop Produkt + Produktkategorie: String ´ dimensionª Zeitfenster Letzte X Tage ´ dimensionª User User Segment ´ dimensionª Order OrderStatus + Status: OrderStatus ´ enumerationª OrderStatus IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG UniquesFact + Unique Clicks: int + Unique Users: int + Unique Views: int AdCostFact + CPC: int + Kosten: float Conv ersionFact + PC: int + PR: int + PV: int + Umsatz PC: float + Umsatz PR: float + Umsatz PV: float AdVisibilityFact + Sichtbarkeitsdauer: float Activ atedOrderFact + Orders: int + Umsatz: float TrackingFact + Orders: int + Page Impressions: int + Umsatz: float X = {7, 14, 28, 30} act Processing Processing WebsiteOrderActivationAdCostConversionUniques+ Overlap AdVisibilityBasicAdsTracking Dimensionsv erzeichnis aktualisieren Page Impressions und Orders zählen Aggregation TrackingFact Ad Impressions zählen Aggregation BasicAdFact CTR und Sichtbarkeitsrate berechnen Inferenz der Sichtbarkeitsdauer Aggregation AdVisibilityFact Dimensionsraum aufspannen Pro Vektor die Ev ents und die dort enthaltenen UserIds und Interaktionsarten ermitteln Menge der UserIds pro Interaktionsart erzeugen und deren Mächtigkeit bestimmen UniquesFact User Journeys erstellen (bzw. LV/LC pro User) Attribution v on Conv ersion auf Basis v orkonfiguriertem Conv ersion Modell Conv ersions zählen (PV, PC, PR) Warenkorbwert ermitteln bei Order-Conv ersion (PV, PC, PR) Aggregation Conv ersionFact Kosten pro Ev ent ermitteln Aggregation Nicht zuweisbare Kosten ermitteln CostFact Inferenz der Visits Visits zählen Bounces zählen Analyse der Verweildauer Aggregation WebsiteFact Order Status ermitteln Anzahl der nicht getrackten Orders ermitteln Aggregation Activ atedOrderFact :Analytics Warehouse:Event Data Lake «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» «flow» Data Processing WorkflowMultidimensional Data Model
  • 32. Sample results Geolocated and gender- specific conversions. Frequency of visits Performance of an ad campaign