22. • Rolled out to our highest volume customers
• Processing latencies < 30s (at 99.9th %)
• Allowed key customers to scale from ~2MM/day to >20MM/day
Impact and Results
23. • Mitigating straggler effects on processing delays
• Adding sessionization for web reporting
• Scaling Kafka topics as customers increase volume
• Globally distributed ingestion for a single customer
Future Work
The next phase was validating our newly built event ingestion system
Marketo is a powerful Engagement Marketing Platform. Several applications make up the platform, such as ABM, marketing analytics, predictive content, digital ads, and marketing automation. Marketing automation is our focus today: it enables the marketer to create, automate, and measure marketing campaigns across channels. A simple example of an automated campaign or workflow is:
A user visits your website and fills out a form
Web tracking sees that they spent most of their time looking at pages about Spark Streaming
Automatically send the user an email inviting them to a webinar on Spark Streaming services
If they attend the webinar, register their interest in your CRM and request that a salesperson contact the user
The campaigns can be complex, reaching out to and tracking customers across channels like web, email, mobile, and social
Explain what a known vs anonymous lead is
Known is targetable on other channels, anonymous is only web activity
Speak to how the traffic patterns are heavily skewed toward anonymous given our customer base
Talk about how anonymous converts to known.
Aggregate analytics include company web report, landing page reports, etc.
Speak to the pod
Mention how there are many many pods
An additional complication is the fact that the same two webservers also serve the MLM app, the SOAP APIs, and the landing pages
Although the talk isn’t about the project… we have a few slides up front to set the context around what we are working on
If you have been near technology at all in the last couple of years you know that the world has become very connected.
The number of connected devices blows my mind. It’s not just phones anymore…
Amazon dash buttons, coffee makers, propane tanks, garage doors. These devices are sending 10’s of billions of activities and user interactions every day...
Orion is our platform
Our marketing platform ingests user interactions and processes them into relevant marketing touchpoints
It enables marketers to create marketing campaigns around these activities to build relationships with their customers
Become the fabric for marketers
It’s been a great experience building this
Here are a few of the requirements
Near real time processing
At least 1 billion activities per customer per day.
Customer demands from a growing number of devices caused us to evaluate next-gen queueing and streaming...
Reduction in infrastructure COGS, primarily from expensive enterprise-class filers...
Reduction in people COGS through efficiency gained by consolidating a tech stack that had too many similar technologies...
Multitenant… of course
Secure
Customer isolation and improved resource management
Architecture requirements driven by business requirements
Improve utilization over the existing system
Lots of customers in same infra, without starving
Encryption from day 1 for safe data storage
Aim for horizontal scalability
Coming from standard 3 tier app
Radically reduce processing latency
Eliminate backlogs
Brownout protection
A few words about the architecture
Main goal is to ingest, process and store marketing events
Details overview of Munchkin FE component
Spray.io for MFE
Frontend has the simple job of verifying subscription status, collecting metrics, and persisting to Kafka
Use Avro to allow for schema evolution, strong typing and compact representation in topic
Use Schema Registry to allow the schema to be upgraded by the producer and then automatically picked up by the Spark Streaming component
Use asynchronous API for kafka to allow high throughput.
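As a rough illustration of that asynchronous produce path, here is a minimal sketch. The producer class, topic name, and callbacks are hypothetical: an in-memory buffer stands in for a real Kafka client, and JSON stands in for Avro to keep the sketch dependency-free.

```python
import json
from typing import Callable

# Hypothetical in-memory stand-in for an async Kafka producer:
# send() buffers the record and returns immediately; delivery
# callbacks fire when flush() drains the buffer, mimicking the
# high-throughput asynchronous API described above.
class AsyncProducerSketch:
    def __init__(self):
        self._buffer = []

    def send(self, topic: str, value: dict,
             on_delivery: Callable[[str], None]) -> None:
        # Serialize up front (the real system uses Avro).
        payload = json.dumps(value).encode("utf-8")
        self._buffer.append((topic, payload, on_delivery))

    def flush(self) -> int:
        delivered = 0
        for topic, _payload, callback in self._buffer:
            callback(topic)  # signal delivery to the caller
            delivered += 1
        self._buffer.clear()
        return delivered

producer = AsyncProducerSketch()
acked = []
for i in range(3):
    producer.send("web-events-sub42", {"event": "pageview", "seq": i},
                  on_delivery=acked.append)
count = producer.flush()
print(count, len(acked))  # 3 3
```

The point of the shape is that callers never block on the broker; delivery is confirmed later via callback.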
Details overview of LeadService component
Spray.io for leadservice
Hbase for Cookie and anonymous lead storage
Salted table
Key structure is subscription-cookie-leadid
Secondary index for subscription-lead-createdat
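The salted key layout above can be sketched like this; the salt width, bucket count, and separator are assumptions for illustration, not the real table's parameters.

```python
import hashlib

SALT_BUCKETS = 16  # assumed; the actual bucket count isn't stated

def salted_row_key(subscription: str, cookie: str, lead_id: str) -> str:
    # Prefix the subscription-cookie-leadid key with a short
    # hash-derived salt so writes spread across region servers
    # instead of hot-spotting a single region.
    salt = int(hashlib.md5(cookie.encode()).hexdigest(), 16) % SALT_BUCKETS
    return f"{salt:02d}-{subscription}-{cookie}-{lead_id}"

key = salted_row_key("sub42", "cookie-abc", "lead-7")
print(key)
```

A scan for one subscription then has to fan out across all salt buckets, which is the usual trade-off for avoiding write hot spots.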
MySQL for known lead storage
Masterdata for reverse ip information enrichments
Overall view for the system
Describe how there is a Kafka topic per subscription
Spark streaming transforms the raw events into activities by
Enriching with web page metadata from MySQL
Lead and reverse IP enrichment from LeadService
Persist activities to AS for storage and secondary processing (e.g. triggering and solr indexing)
Push enriched web events to Kafka for the downstream Druid OLAP infrastructure.
High level diagram of our event processor
Enhanced Lambda Architecture
Inbound activities written to Ingestion Processor
Hbase and then Kafka
High volume (e.g. web) activities
First written to Kafka, then enriched
Spark Streaming applications consume events from Kafka
Solr Indexing
Email Reports
Campaign Processing
HBase is used for simple historical queries, and is system of record
While it is not “true” streaming, this is exactly what we need as an optimization
Our multitenant Kafka framework coalesces small Kafka partitions into large Spark RDD partitions to improve batch utilization
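The coalescing idea can be sketched as plain greedy grouping; the function name, partition names, and target size are illustrative, not the framework's actual API.

```python
def coalesce_partitions(partition_sizes: dict, target: int) -> list:
    """Greedy grouping: pack small Kafka partitions together until
    each group approaches `target` messages, yielding fewer, fuller
    Spark partitions per micro-batch."""
    groups, current, current_size = [], [], 0
    # Largest-first keeps one big partition from pairing badly.
    for name, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Four per-subscription Kafka partitions of very different volume:
sizes = {"sub1-0": 50, "sub2-0": 30, "sub3-0": 900, "sub4-0": 20}
print(coalesce_partitions(sizes, target=1000))
```

With a 1000-message target the four Kafka partitions collapse into a single Spark partition, so one task does useful work instead of four mostly idle ones.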
Several components of the event enrichment require outbound RPC calls. Using async clients, performing the calls in parallel, and then composing the futures pipelines the computation and significantly improves throughput.
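A toy version of that pattern, with simulated stand-ins for the LeadService and reverse-IP RPCs (all names hypothetical):

```python
import asyncio

# Simulated enrichment RPCs; each "call" just sleeps briefly to
# mimic network latency.
async def lookup_lead(cookie: str) -> dict:
    await asyncio.sleep(0.01)
    return {"lead_id": f"lead-for-{cookie}"}

async def reverse_ip(ip: str) -> dict:
    await asyncio.sleep(0.01)
    return {"company": f"org-at-{ip}"}

async def enrich(event: dict) -> dict:
    # Issue both RPCs concurrently and compose the results,
    # rather than paying the two latencies back to back.
    lead, geo = await asyncio.gather(
        lookup_lead(event["cookie"]), reverse_ip(event["ip"]))
    return {**event, **lead, **geo}

event = {"cookie": "abc", "ip": "10.0.0.1"}
enriched = asyncio.run(enrich(event))
print(enriched["lead_id"], enriched["company"])
```

The composition keeps total enrichment latency near the slowest single call instead of the sum of all calls.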
Caching web assets and cookies for temporal locality
Cache is > 60% of the executor memory
Enriched events are written out to multiple sinks; being selective about persisting RDDs prevents recomputing expensive transformations (multiple RPC calls or MySQL queries)
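The temporal-locality caching can be illustrated with a simple memoized lookup; the cache size and function name here are illustrative, not the actual >60%-of-executor-memory configuration.

```python
from functools import lru_cache

lookups = {"count": 0}

@lru_cache(maxsize=4096)  # illustrative size only
def page_title(url: str) -> str:
    # Stand-in for the MySQL web-page-metadata lookup; the counter
    # shows how temporal locality turns repeats into cache hits.
    lookups["count"] += 1
    return f"title-of-{url}"

for url in ["/pricing", "/pricing", "/docs", "/pricing"]:
    page_title(url)
print(lookups["count"])  # 2: only the distinct URLs hit the backing store
```

Because web traffic revisits the same pages and cookies in bursts, even a modest cache absorbs most backing-store reads.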
Traditionally, both anonymous and known data were treated equally in MLM. This is problematic because anonymous volumes are usually 10-20x higher than known. Additionally, there is very little intrinsic value in performing downstream processing on anonymous data, since you cannot target anonymous leads for campaigns.
To improve this, in Munchkin V2 we only allow known traffic to flow to downstream processing.
Anonymous data is passed for downstream processing once the lead converts to a known lead via a form fill-out, API calls, etc.
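The gating-and-replay behavior can be sketched as follows; all names are hypothetical.

```python
from collections import defaultdict

# Minimal sketch of the Munchkin V2 gating rule: only known-lead
# activity flows downstream; anonymous history is buffered and
# replayed once the lead converts (form fill-out, API call, etc.).
anonymous_history = defaultdict(list)
downstream = []

def on_activity(cookie: str, activity: str, known: bool) -> None:
    if known:
        downstream.append(activity)
    else:
        anonymous_history[cookie].append(activity)

def on_conversion(cookie: str) -> None:
    # Lead became known: release its buffered anonymous activity.
    downstream.extend(anonymous_history.pop(cookie, []))

on_activity("c1", "view:/pricing", known=False)
on_activity("c1", "view:/docs", known=False)
on_activity("c2", "email:open", known=True)
print(len(downstream))   # 1: anonymous traffic held back
on_conversion("c1")      # c1 fills out a form
print(len(downstream))   # 3: buffered history replayed downstream
```

This keeps the 10-20x anonymous volume out of the expensive downstream path while preserving the full history for leads that do convert.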
This reiterates my points from the last slide. I included it in case you want to look at the slides later
Give a quick overview of the activities architecture. Introduce Kafka in the presentation
Spend more time on this – purple is our code, teal is standard Spark
# SubscriptionRegistry uses ZooKeeper
# OffsetManager is a library that uses the low-level Kafka consumer API
# Provisioning framework – Sirius; a new subscription is provisioned into the registry via Oozie
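A rough sketch of the bookkeeping such an OffsetManager has to perform; the class shape and method names are assumptions, not the actual library API.

```python
# When using the low-level Kafka consumer API, the application
# tracks its own committed offset per (topic, partition) and
# advances it only after a batch is fully processed.
class OffsetManagerSketch:
    def __init__(self):
        self._committed = {}  # (topic, partition) -> next offset to read

    def next_offset(self, topic: str, partition: int) -> int:
        return self._committed.get((topic, partition), 0)

    def commit(self, topic: str, partition: int, last_processed: int) -> None:
        # The commit point only moves forward; a failed batch
        # re-reads from the old offset, giving at-least-once
        # processing.
        key = (topic, partition)
        self._committed[key] = max(self._committed.get(key, 0),
                                   last_processed + 1)

om = OffsetManagerSketch()
print(om.next_offset("web-events", 0))  # 0: fresh partition
om.commit("web-events", 0, last_processed=41)
print(om.next_offset("web-events", 0))  # 42
```

Owning the offsets (instead of relying on auto-commit) is what lets the streaming job tie "processed" to its own batch boundaries.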