SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Downloaden Sie, um offline zu lesen
data science
and enterprise
engineering
how data scientists and engineers work in tandem to achieve
real-time personalization at overstock.com
overstock
© Overstock2
pushing boundaries of retail since 1999
Midvale, Utah
5 million unique products for sale
billions of visits and page views
160+ countries
team in the beginning.....
© Overstock
4
• pigs
• data science team – 3 data scientists
• engineering team - 2 developers, 1 QA, and a
product owner
• chickens
• plenty of business owners
• lots of channel managers
remember animal farm?
overstock
marketing
dev
the problems we were up against
© Overstock
5
• data scientists were not working with the
business to solve business problems
• engineers were not working with data
scientists
• engineering was in a relational data mindset
• not regularly delivering business value
• solving yesterday's problems - today
overstock
marketing
dev
days
minutes
seconds
© Overstock6
1st problem we came together to solve
© Overstock
7
• real-time bidding (RTB) is a means by
which advertising inventory is bought and sold on a
per-impression basis via real-time bidding, similar to
financial markets
• low-latency operation – need to bid >10ms
• pre-compute scores nightly
• push scored to cache in AWS
• partnered with Bees Wax
• replace existing 3rd party partners
This is where I add text. If
I want it to keep going, it
looks like this.
business
problem
propensity to purchase
© Overstock
8
• identifying unique patterns and tendencies that indicate
a user is ready to purchase
• user score: represents a customer’s likelihood of
purchase
• collected at the visit/user level
• basic steps
• turn raw user interactions into ‘features’
• train classifiers on months of data with label =
purchase vs. no purchase
• predict on new users/visits
This is where I add text. If
I want it to keep going, it
looks like this.
problem to
solve
class imbalance
© Overstock
9
• many new customers
• billions of unique page views in a calendar year
• many users we are seeing for the first time
(potentially)
• low conversion
• a small percentage of sessions end in a purchase
• sparse web logs mean we have to digest an
enormous amount of data to generate useful features
• we are interested in accuracy on the positive label
• recall over precision
data science
challenges
what the team was up against
© Overstock
10
• Constraints with current infrastructure and
processes
• used to pulling data instead of streaming it
• data scientists are a scarce resource
• working on many non-relative tasks
• not a lot of experience with the care and feeding of
enterprise software
• a bit set in the academia mindset
challenges
© Overstock
11
Faster Scotty
© Overstock
12
what we accomplished in 6 months
• we were solving today's problems... by the end of
the day
• engineering was thinking about streaming data
• had the pieces in place to start scoring users in
minutes
recap
days
minutes
seconds
© Overstock
13
needed to move faster
© Overstock
14
• score users faster
• moved from daily to hourly
• picked up a fraud project
• assign fraud score within 5 mins
data science +
engineering
business
challenges
moving from daily batches to more regular micro-batches
© Overstock
15
• hourly jobs
• much more responsive scoring = larger server footprint
• can still train offline in batch, but must have pipelines better
tuned.
• every 5 minutes
• ETL must be tightly honed
• at this point you hit the edge of what is possible in the batch
setting
• feedback loops become critical for success
data science +
engineering
challenges
balancing business goals with enterprise engineering
© Overstock
16
• as adoption increases visibility increases
• need standardized data representations
• have to make sure architecture is hardened and
robust
• need fail-safe mechanisms for critical processes
• MOST IMPORTANT
• you can never, ever slow down overstock.com
business vs engineering
challenges
© Overstock
17
Asset Data Platform
User ID
Content
UserID
AttributesAttributes
Content
User Training Data
Asset Training Data
Experience
Training Data
Trained
Models
User ML
Asset ML
needed to move faster
© Overstock
18
• what was accomplished in 6 months
• fraud scores in a minute or less
• users were being scored in a minute
• what was left
• still not fast enough for real time personalization on
overstock.com
data science +
engineering
recap
days
minutes
seconds
© Overstock
19
near real-time personalization on the shopping site
© Overstock
20
• near real-time personalization on overstock
• putting custom recs in front of the user.
• can't slow the site down
This is where I add text. If
I want it to keep going, it
looks like this.
business
problem
from micro-batches to near real-time
© Overstock
21
• near real-time can limit the size of models and
complexity of calculations
• we aren’t trading stocks, we sell home goods
• we may not know much about a user we are
trying to personalize for
• what can you personalize for a new user as they
initially interact with your site
• heuristics moving towards full models
data science +
engineering +
business
challenges
from micro-batches to near real-time
© Overstock
22
• real-time data flow requires rea-time analytics
and strategy
• introspection into processes becomes critical
• must have automated process control to prevent
algorithms from running wild
• empower business owners to operate strategically
without suffering from information overload
data science +
engineering +
business
challenges
from micro-batches to near real-time
© Overstock
23
• every piece of the process needs to function as
a cohesive flow
• do we need to wait for the data to get there, then act
(fraud?)
• or act on the data we have (recommendations for a
shopping site)
data science +
engineering +
business
challenges
© Overstock
24
asset
data platform
live rex
pcm
asset exhaust
user exhaust
audiences &
user attributes
targeted
assets
userattributesassetattributes
tracking
deep links
tracking
deep links
id
bids
bids
auction, bid, & win logs
campaign logs
trking
deep
links
userattributes
assets
campaign
logs
assetevents
userevents
raw
data
ml
features
ml
features
raw
data
what was done and undone
© Overstock
25
• still not quite to near-real-time/low-latency recs for
shopping site
• fraud score and email recommendations are fast
enough
data science +
engineering +
business
recap
roadmap
© Overstock
26
• page-less personalization
• deep learning for fraud
• balancing real-time needs with efficient gains
We all need to be daring and collaborative to
accelerate innovation!
data science +
engineering +
business
where we
are going
lessons learned
© Overstock
27
• the cloud presents a whole new set of problems
• tech is hard, politics are harder
• we can’t operate in silos
• we must be aligned with the business
• the process must be iterative
data science +
engineering +
business
take
always
questions?
and of course, we are hiring!
Data Science and Enterprise Engineering with Michael Finger and Chris Robison

Weitere ähnliche Inhalte

Was ist angesagt?

Successful AI/ML Projects with End-to-End Cloud Data Engineering
Successful AI/ML Projects with End-to-End Cloud Data EngineeringSuccessful AI/ML Projects with End-to-End Cloud Data Engineering
Successful AI/ML Projects with End-to-End Cloud Data EngineeringDatabricks
 
Spark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark Summit
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanycOpen Analytics
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoSpark Summit
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Big Data Ecosystem- Impetus Technologies
Big Data Ecosystem-  Impetus TechnologiesBig Data Ecosystem-  Impetus Technologies
Big Data Ecosystem- Impetus TechnologiesImpetus Technologies
 
Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...
Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...
Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...Sanjay Sharma
 
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...Databricks
 
In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017SingleStore
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixData Con LA
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlSpark Summit
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Spark Summit
 
Operationalizing analytics to scale
Operationalizing analytics to scaleOperationalizing analytics to scale
Operationalizing analytics to scaleLooker
 
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...Databricks
 
NYC Cassandra March 13- lighting talk
NYC Cassandra March 13- lighting talkNYC Cassandra March 13- lighting talk
NYC Cassandra March 13- lighting talkSanjay Sharma
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesSingleStore
 

Was ist angesagt? (20)

Successful AI/ML Projects with End-to-End Cloud Data Engineering
Successful AI/ML Projects with End-to-End Cloud Data EngineeringSuccessful AI/ML Projects with End-to-End Cloud Data Engineering
Successful AI/ML Projects with End-to-End Cloud Data Engineering
 
Spark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony Baer
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott Cordo
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Big Data Ecosystem- Impetus Technologies
Big Data Ecosystem-  Impetus TechnologiesBig Data Ecosystem-  Impetus Technologies
Big Data Ecosystem- Impetus Technologies
 
Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...
Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...
Cloud expo june 2013: Building a Real Time Analytics Platform on Big Data in ...
 
Zero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using HadoopZero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using Hadoop
 
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...
Transforming Devon’s Data Pipeline with an Open Source Data Hub—Built on Data...
 
In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
 
Operationalizing analytics to scale
Operationalizing analytics to scaleOperationalizing analytics to scale
Operationalizing analytics to scale
 
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko...
 
NYC Cassandra March 13- lighting talk
NYC Cassandra March 13- lighting talkNYC Cassandra March 13- lighting talk
NYC Cassandra March 13- lighting talk
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming Architectures
 

Ähnlich wie Data Science and Enterprise Engineering with Michael Finger and Chris Robison

Assessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use CasesAssessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use CasesDATAVERSITY
 
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...Nelson Petracek
 
Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...
Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...
Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...Precisely
 
Le big data à l'épreuve des projets d'entreprise
Le big data à l'épreuve des projets d'entrepriseLe big data à l'épreuve des projets d'entreprise
Le big data à l'épreuve des projets d'entrepriseRubedo, a WebTales solution
 
5 Amazing Reasons DBAs Need to Love Extended Events
5 Amazing Reasons DBAs Need to Love Extended Events5 Amazing Reasons DBAs Need to Love Extended Events
5 Amazing Reasons DBAs Need to Love Extended EventsJason Strate
 
GraphTalk Berlin - Einführung in Graphdatenbanken
GraphTalk Berlin - Einführung in GraphdatenbankenGraphTalk Berlin - Einführung in Graphdatenbanken
GraphTalk Berlin - Einführung in GraphdatenbankenNeo4j
 
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsLooker
 
Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...
Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...
Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...Precisely
 
Using Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce CostsUsing Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce CostsConnotate
 
Wanta OConnell Presentation 2012 v4
Wanta OConnell Presentation 2012 v4Wanta OConnell Presentation 2012 v4
Wanta OConnell Presentation 2012 v4Becky Wanta
 
Enable Advanced Analytics with Hadoop and an Enterprise Data Hub
Enable Advanced Analytics with Hadoop and an Enterprise Data HubEnable Advanced Analytics with Hadoop and an Enterprise Data Hub
Enable Advanced Analytics with Hadoop and an Enterprise Data HubCloudera, Inc.
 
Using Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce CostsUsing Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce CostsConnotate
 
Playing to Win: Turbocharged Tableau with a GPU Database
Playing to Win: Turbocharged Tableau with a GPU DatabasePlaying to Win: Turbocharged Tableau with a GPU Database
Playing to Win: Turbocharged Tableau with a GPU DatabaseKinetica
 
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4jNeo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4jNeo4j
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsDenodo
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructureSimon Belak
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Building Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-hBuilding Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-hPrecisely
 
Future of Making Things
Future of Making ThingsFuture of Making Things
Future of Making ThingsJC Davis
 

Ähnlich wie Data Science and Enterprise Engineering with Michael Finger and Chris Robison (20)

Assessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use CasesAssessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use Cases
 
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
TIBCO Innovation Workshop Series: Reducing Decision Latency with Streaming An...
 
Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...
Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...
Learnings from 7 Years of Integrating Mission-Critical IBM Z® and IBM i with ...
 
Le big data à l'épreuve des projets d'entreprise
Le big data à l'épreuve des projets d'entrepriseLe big data à l'épreuve des projets d'entreprise
Le big data à l'épreuve des projets d'entreprise
 
5 Amazing Reasons DBAs Need to Love Extended Events
5 Amazing Reasons DBAs Need to Love Extended Events5 Amazing Reasons DBAs Need to Love Extended Events
5 Amazing Reasons DBAs Need to Love Extended Events
 
GraphTalk Berlin - Einführung in Graphdatenbanken
GraphTalk Berlin - Einführung in GraphdatenbankenGraphTalk Berlin - Einführung in Graphdatenbanken
GraphTalk Berlin - Einführung in Graphdatenbanken
 
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
 
Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...
Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...
Financial Services Technology Leader Turns Mainframe Logs into Real-Time Insi...
 
Using Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce CostsUsing Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce Costs
 
Wanta OConnell Presentation 2012 v4
Wanta OConnell Presentation 2012 v4Wanta OConnell Presentation 2012 v4
Wanta OConnell Presentation 2012 v4
 
Enable Advanced Analytics with Hadoop and an Enterprise Data Hub
Enable Advanced Analytics with Hadoop and an Enterprise Data HubEnable Advanced Analytics with Hadoop and an Enterprise Data Hub
Enable Advanced Analytics with Hadoop and an Enterprise Data Hub
 
Using Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce CostsUsing Web Data to Drive Revenue and Reduce Costs
Using Web Data to Drive Revenue and Reduce Costs
 
Playing to Win: Turbocharged Tableau with a GPU Database
Playing to Win: Turbocharged Tableau with a GPU DatabasePlaying to Win: Turbocharged Tableau with a GPU Database
Playing to Win: Turbocharged Tableau with a GPU Database
 
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4jNeo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
 
PHP + Business = Money!
PHP + Business = Money!PHP + Business = Money!
PHP + Business = Money!
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructure
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Building Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-hBuilding Your Enterprise Data Marketplace with DMX-h
Building Your Enterprise Data Marketplace with DMX-h
 
Future of Making Things
Future of Making ThingsFuture of Making Things
Future of Making Things
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 

Kürzlich hochgeladen (16)

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 

Data Science and Enterprise Engineering with Michael Finger and Chris Robison

  • 1. data science and enterprise engineering how data scientists and engineers work in tandem to achieve real-time personalization at overstock.com
  • 2. overstock © Overstock2 pushing boundaries of retail since 1999 Midvale, Utah 5 million unique products for sale billions of visits and page views 160+ countries
  • 3. team in the beginning..... © Overstock 4 • pigs • data science team – 3 data scientists • engineering team - 2 developers, 1 QA, and a product owner • chickens • plenty of business owners • lots of channel managers remember animal farm? overstock marketing dev
  • 4. the problems we were up against © Overstock 5 • data scientists were not working with the business to solve business problems • engineers were not working with data scientists • engineering was in a relational data mindset • not regularly delivering business value • solving yesterday's problems - today overstock marketing dev
  • 6. 1st problem we came together to solve © Overstock 7 • real-time bidding (RTB) is a means by which advertising inventory is bought and sold on a per-impression basis via real-time bidding, similar to financial markets • low-latency operation – need to bid >10ms • pre-compute scores nightly • push scored to cache in AWS • partnered with Bees Wax • replace existing 3rd party partners This is where I add text. If I want it to keep going, it looks like this. business problem
  • 7. propensity to purchase © Overstock 8 • identifying unique patterns and tendencies that indicate a user is ready to purchase • user score: represents a customer’s likelihood of purchase • collected at the visit/user level • basic steps • turn raw user interactions into ‘features’ • train classifiers on months of data with label = purchase vs. no purchase • predict on new users/visits This is where I add text. If I want it to keep going, it looks like this. problem to solve
  • 8. class imbalance © Overstock 9 • many new customers • billions of unique page views in a calendar year • many users we are seeing for the first time (potentially) • low conversion • a small percentage of sessions end in a purchase • sparse web logs mean we have to digest an enormous amount of data to generate useful features • we are interested in accuracy on the positive label • recall over precision data science challenges
  • 9. what the team was up against © Overstock 10 • Constraints with current infrastructure and processes • used to pulling data instead of streaming it • data scientists are a scarce resource • working on many non-relative tasks • not a lot of experience with the care and feeding of enterprise software • a bit set in the academia mindset challenges
  • 11. Faster Scotty © Overstock 12 what we accomplished in 6 months • we were solving today's problems... by the end of the day • engineering was thinking about streaming data • had the pieces in place to start scoring users in minutes recap
  • 13. needed to move faster © Overstock 14 • score users faster • moved from daily to hourly • picked up a fraud project • assign fraud score within 5 mins data science + engineering business challenges
  • 14. moving from daily batches to more regular micro-batches © Overstock 15 • hourly jobs • much more responsive scoring = larger server footprint • can still train offline in batch, but must have pipelines better tuned. • every 5 minutes • ETL must be tightly honed • at this point you hit the edge of what is possible in the batch setting • feedback loops become critical for success data science + engineering challenges
  • 15. balancing business goals with enterprise engineering © Overstock 16 • as adoption increases visibility increases • need standardized data representations • have to make sure architecture is hardened and robust • need fail-safe mechanisms for critical processes • MOST IMPORTANT • you can never, ever slow down overstock.com business vs engineering challenges
  • 16. © Overstock 17 Asset Data Platform User ID Content UserID AttributesAttributes Content User Training Data Asset Training Data Experience Training Data Trained Models User ML Asset ML
  • 17. needed to move faster © Overstock 18 • what was accomplished in 6 months • fraud scores in a minute or less • users were being scored in a minute • what was left • still not fast enough for real time personalization on overstock.com data science + engineering recap
  • 19. near real-time personalization on the shopping site © Overstock 20 • near real-time personalization on overstock • putting custom recs in front of the user. • can't slow the site down This is where I add text. If I want it to keep going, it looks like this. business problem
  • 20. from micro-batches to near real-time © Overstock 21 • near real-time can limit the size of models and complexity of calculations • we aren’t trading stocks, we sell home goods • we may not know much about a user we are trying to personalize for • what can you personalize for a new user as they initially interact with your site • heuristics moving towards full models data science + engineering + business challenges
  • 21. from micro-batches to near real-time © Overstock 22 • real-time data flow requires rea-time analytics and strategy • introspection into processes becomes critical • must have automated process control to prevent algorithms from running wild • empower business owners to operate strategically without suffering from information overload data science + engineering + business challenges
  • 22. from micro-batches to near real-time © Overstock 23 • every piece of the process needs to function as a cohesive flow • do we need to wait for the data to get there, then act (fraud?) • or act on the data we have (recommendations for a shopping site) data science + engineering + business challenges
  • 23. © Overstock 24 asset data platform live rex pcm asset exhaust user exhaust audiences & user attributes targeted assets userattributesassetattributes tracking deep links tracking deep links id bids bids auction, bid, & win logs campaign logs trking deep links userattributes assets campaign logs assetevents userevents raw data ml features ml features raw data
  • 24. what was done and undone © Overstock 25 • still not quite to near-real-time/low-latency recs for shopping site • fraud score and email recommendations are fast enough data science + engineering + business recap
  • 25. roadmap © Overstock 26 • page-less personalization • deep learning for fraud • balancing real-time needs with efficient gains We all need to be daring and collaborative to accelerate innovation! data science + engineering + business where we are going
  • 26. lessons learned © Overstock 27 • the cloud presents a whole new set of problems • tech is hard, politics are harder • we can’t operate in silos • we must be aligned with the business • the process must be iterative data science + engineering + business take always
  • 27. questions? and of course, we are hiring!