SlideShare a Scribd company logo
1 of 24
Download to read offline
Keira	Zhou
Senior	Data	Engineer	@	Capital	One
§ Data	Engineer	@	Capital	One
§ Previous:	
§ Fellow	@	Insight	Data	Engineering
§ BS	+	MS	in	Systems	Engineering	@UVA
1
2
§ Task:	Predict	Phishing	website	in	near	real	time
3
§ Public	Dataset:	Phishing	Websites	Dataset
§ Some	examples
§ Using	URL	Shortening	Services	“TinyURL”:	
§ bit.ly/19DXSk4
§ Age	of	Domain:	
§ minimum	age	of	the	legitimate	domain	is	6	months
§ Adding	Prefix	or	Suffix	Separated	by	(-)	to	the	Domain:
§ http://www.confirme-paypal.com/
4
§ Classification	problem
§ Logistic	Regression	with	Stochastic	Gradient	Descent
5
6
§ Offline
§ Retrain	the	model	with	historical	data	+	new	data
§ Model	fits	the	global	distribution	of	the	data better
§ Can	be	unpractical	for	large	data	sets
§ Online
§ Use	new	observation	to	further	train	your	model
§ Model	is	more	influenced	by	the	recent	data
§ Adapt	to	new	trend	faster
§ Batch	/	Mini-batch
§ Wait	for	a	batch	of	observation	to	further	train	your	model
7
8
§ Near	live..
9
10
§ Spark	2.0	feature:	Save	model	to	file
• Load	model	file
11
§ Online	training
§ Online	algorithms	Pros:
§ Computationally	much	faster
§ Useful	when	dataset	is	too	big
§ Adapt	to	new	trend	faster
§ Online	algorithms	Cons:
§ Majority	of	the	algorithms	only	work	in	batch
§ Some	feature	extractions	are	slow
§ IP	Geo	lookup
§ Hard	to	always	get	it	right	in	automatic	way
12
§ Building	and	Maintaining	a	streaming	
pipeline	can	be	challenging
§ But	why??
13
14
• Message	buffer
• Keeps	logs	of	messages
15
Unordered Duplicates Unbounded
16
Processing	Time
Event	Time
17
Time-agonistic Event	time	matters
• Second	Look:	
generous	tips
• Top	merchants	in	
the	past	hour
• Second	Look:	double	
swipe
• Deduplication
§ At-most-once:	Potential	data	loss
§ e.g.	Video	data	sent	via	UDP
§ At-least-once:	Potential	duplication
§ e.g.	Email	alerts
§ Exactly-once:	Ideal	world
§ Requires	better	configuration	of	your	
infrastructure
§ Consistency	in	light	of	machine	failures
18
§ Challenges	from	data	stream	itself
§ How	to	handle	time	depend	on	use	case
§ Event	time:	reflects	real	life	but	harder	to	implement
§ Challenges	from	your	infrastructure
§ Exactly	once	delivery	is	critical	for	accurate	streaming	analytical	results
§ You	probably	would	want	that	for	your	online	model
§ Streaming	gives	you	more	timely	results
§ Not	everything	needs	to	be	real-time
19
BATCH	OR	STREAMING?
§ Model	Updating
§ Batch:	improve	accuracy	everytime you	retrain	the	model
§ Online:	adapt	to	new	data	points	as	they	comes	in
§ Latency	&	Correctness
§ Batch:	high	latency	but	more	control	of	the	data
§ Streaming:	low	latency	but	less	control
§ Monitoring	
§ Maintainability
§ It	is	easier	to	maintain	one	pipeline	rather	than	two
§ Lambda	vs.	Kappa
20
21
Data	Science Data	Engineering
• Find	the	right	features
• Get	labeled	data
• Manual	labeling
• Develop	a	model
• Guarantee	“exact	once”	for	
the	streaming	pipeline
• Tuning	Spark
• #	of	executors
• Memory
• Event	&	Processing	
Time
• Spark	Programming
• Spark	connector	or	Python	
connector
• Use	MLlib
22
§ Github Repo	for	the	example
§ https://github.com/keiraqz/StreamingLogisticRegression
§ Phishing	Websites	Data	Set
§ https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#
§ Spark	Streaming	ML	Algorithm
§ https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
§ The	World	Beyond	Batch:	Streaming	101
§ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
§ Lambda	Architecture
§ http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
§ Kappa	Architecture
§ https://www.oreilly.com/ideas/questioning-the-lambda-architectur
23

More Related Content

What's hot

What's hot (8)

Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
Tableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of ThoughtTableau & MongoDB: Visual Analytics at the Speed of Thought
Tableau & MongoDB: Visual Analytics at the Speed of Thought
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
Journey to the Real-Time Analytics in Extreme Growth
Journey to the Real-Time Analytics in Extreme GrowthJourney to the Real-Time Analytics in Extreme Growth
Journey to the Real-Time Analytics in Extreme Growth
 
Generic Crawler
Generic CrawlerGeneric Crawler
Generic Crawler
 
Cloud-Con: Integration & Web APIs
Cloud-Con: Integration & Web APIsCloud-Con: Integration & Web APIs
Cloud-Con: Integration & Web APIs
 

Similar to Keira Zhou - Batch and Streaming Processing in the World of Data Engineering and Data Science

Data Services and the Modern Data Ecosystem (Middle East)
Data Services and the Modern Data Ecosystem (Middle East)Data Services and the Modern Data Ecosystem (Middle East)
Data Services and the Modern Data Ecosystem (Middle East)
Denodo
 

Similar to Keira Zhou - Batch and Streaming Processing in the World of Data Engineering and Data Science (20)

Batch and Streaming
Batch and StreamingBatch and Streaming
Batch and Streaming
 
Data Services and the Modern Data Ecosystem (Middle East)
Data Services and the Modern Data Ecosystem (Middle East)Data Services and the Modern Data Ecosystem (Middle East)
Data Services and the Modern Data Ecosystem (Middle East)
 
Practical Machine Learning in Information Security
Practical Machine Learning in Information SecurityPractical Machine Learning in Information Security
Practical Machine Learning in Information Security
 
Transform Your Mainframe with Microsoft Azure
Transform Your Mainframe with Microsoft AzureTransform Your Mainframe with Microsoft Azure
Transform Your Mainframe with Microsoft Azure
 
Enhance Your Flask Web Project With a Database Python Guide.pdf
Enhance Your Flask Web Project With a Database  Python Guide.pdfEnhance Your Flask Web Project With a Database  Python Guide.pdf
Enhance Your Flask Web Project With a Database Python Guide.pdf
 
The New Basics of Business Intelligence Lesson 3: Multi Source Analysis
The New Basics of Business Intelligence Lesson 3: Multi Source AnalysisThe New Basics of Business Intelligence Lesson 3: Multi Source Analysis
The New Basics of Business Intelligence Lesson 3: Multi Source Analysis
 
Migrating Data with SAP Hybris Cloud for Customer Concepts and Best Practices
Migrating Data with SAP Hybris Cloud for Customer Concepts and Best PracticesMigrating Data with SAP Hybris Cloud for Customer Concepts and Best Practices
Migrating Data with SAP Hybris Cloud for Customer Concepts and Best Practices
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Bigdata-Intro.pptx
Bigdata-Intro.pptxBigdata-Intro.pptx
Bigdata-Intro.pptx
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
 
10 Reasons Snowflake Is Great for Analytics
10 Reasons Snowflake Is Great for Analytics10 Reasons Snowflake Is Great for Analytics
10 Reasons Snowflake Is Great for Analytics
 
Designing Modern Web Applications
Designing Modern Web ApplicationsDesigning Modern Web Applications
Designing Modern Web Applications
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
 
Managing Complexity and Privacy Debt with Drupal
Managing Complexity and Privacy Debt with DrupalManaging Complexity and Privacy Debt with Drupal
Managing Complexity and Privacy Debt with Drupal
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited
 
Unlocking the value of your data assets with talend 6
Unlocking the value of your data assets with talend 6Unlocking the value of your data assets with talend 6
Unlocking the value of your data assets with talend 6
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 

More from PyData

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Keira Zhou - Batch and Streaming Processing in the World of Data Engineering and Data Science