1
SPARK AT ZILLOW
Steven Hoelscher, Data Scientist
Alex Chang, Senior BI Developer
David Fagnan, Senior Data Scientist
2
DATA LAKE
Steven Hoelscher, Data Scientist
3
Goals for the Data Lake
• Convergence of disparate data
– Ability to store data in any format
– Raw data is readily available for machine learning scenarios
• Separation of compute and storage
– Faster ingestion of the data
– Ease of scalability
• Centralized data store for the entirety of Zillow
– One place for all data, fostering cross-company sharing
4
Data Lake High Level Architecture
[Architecture diagram. Labels: Data Science, Predictive Analytics, BI Use Cases; Zillow Group Data Lake; Data Models; Data Marts; Dictionaries; Metadata Lake; Databases; Business Reporting; API; Web/Mobile Applications; Future Sources]
5
Use Case: Historical Data Storage (HDS)
Kinesis Streams → Raw Data Lake → HDS Spark Job → HDS Output
• Maintain a history of the data so that events could, in theory, be replayed
• Generate property information tables by coalescing across multiple data sources
• Dedupe records according to unique keys or special, event-specific logic
• Standardize the output format (Parquet) and partition for downstream jobs
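The dedupe step can be illustrated without Spark. The following is a minimal sketch, assuming each event is a dict and that "latest event wins" is the rule; the field names `regionid`, `uid`, and `event_time` are hypothetical and not taken from the slides.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Dict[str, Any]

def dedupe_latest(events: Iterable[Event],
                  key_fields: Tuple[str, ...] = ("regionid", "uid"),
                  order_field: str = "event_time") -> List[Event]:
    """Keep one event per unique key: the one with the largest order_field."""
    latest: Dict[tuple, Event] = {}
    for event in events:
        key = tuple(event[f] for f in key_fields)
        current = latest.get(key)
        if current is None or event[order_field] > current[order_field]:
            latest[key] = event
    return list(latest.values())

# Hypothetical sample events; two of them share the key (1, "a").
events = [
    {"regionid": 1, "uid": "a", "event_time": 10, "price": 100},
    {"regionid": 1, "uid": "a", "event_time": 12, "price": 110},
    {"regionid": 2, "uid": "b", "event_time": 11, "price": 200},
]
deduped = dedupe_latest(events)
```

Event-specific logic would replace the simple "largest `order_field`" comparison with whatever rule a given event stream needs.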
6
Stumbling Block #1: Ingesting data from S3
Example raw data path:
s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_X.json
s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_SUCCESS
Goals:
• Ingest all successfully completed data that has not been processed before.
• Prevent Spark from inferring schema from JSON data
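The first goal amounts to selecting prefixes that contain a `_SUCCESS` marker and have not been ingested yet. Here is a minimal sketch of that selection over a plain list of object keys; the keys, the `processed` set, and the function name are illustrative assumptions, not the team's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

def completed_unprocessed(keys: Iterable[str], processed: Set[str]) -> Dict[str, List[str]]:
    """Map each completed, not-yet-processed prefix to its JSON data files."""
    files_by_prefix: Dict[str, List[str]] = defaultdict(list)
    completed: Set[str] = set()
    for key in keys:
        prefix, _, filename = key.rpartition("/")
        if filename.endswith("_SUCCESS"):
            completed.add(prefix)          # marker file: the batch finished
        elif filename.endswith(".json"):
            files_by_prefix[prefix].append(key)
    return {
        prefix: sorted(files_by_prefix[prefix])
        for prefix in sorted(completed - processed)
    }

# Hypothetical keys following the foo/<regionid>/<processid>/<eventname>/ layout.
keys = [
    "foo/10/p1/listing/listing_0.json",
    "foo/10/p1/listing/listing_1.json",
    "foo/10/p1/listing/listing_SUCCESS",
    "foo/11/p1/listing/listing_0.json",   # no _SUCCESS marker yet
    "foo/12/p0/listing/listing_0.json",
    "foo/12/p0/listing/listing_SUCCESS",  # completed, but already processed
]
todo = completed_unprocessed(keys, processed={"foo/12/p0/listing"})
```

Only `foo/10/p1/listing` is selected: region 11 has no marker, and region 12 was already handled.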
7
Solution: Ingestion Queue and Schema Artifact
• Lambda function: monitors the S3 raw bucket for _SUCCESS files
• DynamoDB: stores S3 paths with region, event type, and process history metadata
• Upstream team creates a versioned artifact of Avro schemas that can be used at read time
For example:
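The example shown on the slide is not preserved in this transcript. As a rough sketch of the idea only, the code below reads JSON lines against a declared schema instead of inferring types from the data. The schema, its field names, and the `read_with_schema` helper are hypothetical, and only a few Avro primitive types are handled.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical Avro record schema; field names are illustrative only.
LISTING_SCHEMA_V1 = json.loads("""
{
  "type": "record",
  "name": "ListingEvent",
  "fields": [
    {"name": "uid", "type": "string"},
    {"name": "regionid", "type": "long"},
    {"name": "price", "type": "double"}
  ]
}
""")

# Casters for a subset of Avro primitive types.
CASTERS: Dict[str, Callable[[Any], Any]] = {
    "string": str, "int": int, "long": int, "float": float, "double": float,
}

def read_with_schema(lines: List[str], schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Parse JSON lines using declared field types; a missing field raises KeyError."""
    fields = [(f["name"], CASTERS[f["type"]]) for f in schema["fields"]]
    rows = []
    for line in lines:
        raw = json.loads(line)
        rows.append({name: cast(raw[name]) for name, cast in fields})
    return rows

rows = read_with_schema(
    ['{"uid": "a1", "regionid": 7, "price": 550000}'], LISTING_SCHEMA_V1
)
```

Here `price` arrives as a JSON integer but is stored as a float because the schema declares `double`. In Spark, the analogous step is passing an explicit schema to the reader (for instance `spark.read.schema(...).json(...)`) so that no inference pass over the JSON is needed.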
8
Stumbling Block #2: Partitioning output data
Goals:
• Partition each event stream by region
• Ensure one partition per region
First thought: DataFrameWriter’s partitionBy method looks attractive
But: a coalesce(1) call would be required to have one partition per regionid.
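Why `partitionBy` alone is not enough can be seen with a small simulation. The sketch assumes the usual writer behaviour, where each in-memory partition writes its own file for every distinct regionid it holds. The data and the function name are made up for illustration.

```python
from typing import Dict, List

def files_per_region(partitions: List[List[int]]) -> Dict[int, int]:
    """Count output files per regionid, assuming each in-memory partition
    writes one file for every distinct regionid it contains."""
    counts: Dict[int, int] = {}
    for rows in partitions:
        for regionid in set(rows):
            counts[regionid] = counts.get(regionid, 0) + 1
    return counts

# Each inner list holds the regionid of every row in one in-memory partition.
spread = [[1, 2, 3], [1, 2], [2, 3, 3]]
single = [[rid for part in spread for rid in part]]  # the effect of coalesce(1)

spread_counts = files_per_region(spread)
single_counts = files_per_region(single)
```

With the rows spread over three partitions, region 2 ends up in three files. Collapsing to a single partition yields one file per region, but at the cost of writing everything from one task.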
9
Solution: Partition intelligently using HiveQL
(New) Goals:
• Ensure records with the same uid are located on the same partition
• Reduce the read and shuffle cost of downstream Spark jobs
The total number of partitions will equal spark.sql.shuffle.partitions
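The HiveQL statement itself is not preserved in this transcript. A clause such as `DISTRIBUTE BY uid` would have the described effect, but that is an assumption. The property being relied on is that a deterministic hash of the uid, taken modulo the partition count, always sends the same uid to the same partition. A minimal sketch, using CRC32 as a stand-in for Spark's own hash function:

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_PARTITIONS = 4  # stands in for spark.sql.shuffle.partitions

def partition_for(uid: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic hash partitioner (CRC32 here; Spark uses its own hash)."""
    return zlib.crc32(uid.encode("utf-8")) % num_partitions

# Hypothetical (uid, event) records.
records = [("u1", "view"), ("u2", "save"), ("u1", "contact"), ("u3", "view"), ("u2", "view")]

partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
for uid, event in records:
    partitions[partition_for(uid)].append((uid, event))

# Record which partitions each uid landed in; every set should have one element.
homes: Dict[str, Set[int]] = defaultdict(set)
for pid, rows in partitions.items():
    for uid, _ in rows:
        homes[uid].add(pid)
```

Because downstream jobs that group or join on uid find matching records already co-located, they can avoid part of the shuffle.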
10
USER SEGMENTATION
Alex Chang, Senior BI Developer
11
Who Am I
• Background as a data warehouse engineer
• Started the Spark journey roughly a year ago, and it is ongoing
• Previously at drugstore.com and Walgreens
12
Big Data? Small Data?
13
Zillow Mission
Power to the People
Build the most vibrant and trusted real estate marketplace
14
How does the model work?
15
How does the model work?
Persona
16
Architecture
17
Making Spark work for Zillow
• Spark is an MPP (massively parallel processing) engine
• Model
– Requires all activity belonging to the same user in order to predict a persona
• Partition
– Distribute the data by user ID
– All activity of the same user will be contained in a single partition
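Once each partition holds complete user histories, the model can be applied partition by partition. The sketch below shows the shape of such a per-partition function. `predict_persona` is a hypothetical stand-in for the real model, and the `(user_id, action)` row shape is assumed.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Activity = Tuple[str, str]  # (user_id, action); hypothetical row shape

def predict_persona(actions: List[str]) -> str:
    """Hypothetical stand-in for the real persona model."""
    return "buyer" if "contact_agent" in actions else "browser"

def score_partition(rows: Iterable[Activity]) -> Iterator[Tuple[str, str]]:
    """Score every user in one partition; assumes the partition holds all
    activity for each of its users."""
    ordered = sorted(rows, key=itemgetter(0))  # groupby needs adjacent keys
    for user_id, group in groupby(ordered, key=itemgetter(0)):
        yield user_id, predict_persona([action for _, action in group])

partition = [("u1", "view"), ("u2", "view"), ("u1", "contact_agent"), ("u2", "save")]
personas = dict(score_partition(partition))
```

A function with this iterator-in, iterator-out shape is the kind of callable that `RDD.mapPartitions` accepts.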
18
Evolution of Spark process
• Started with pipe() to the R model with Spark 1.3
– Works after a fashion, but with a poor handshake mechanism
• Standard input/output (stdin/stdout)
• Does not wait for actual completion of the R script
• Use of rpy2
– Works much better. Returns all predictions
• Serialization of data
• Yet another potential source of memory issues
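The handshake problem with a stdin/stdout pipe is that the caller needs to know the child really finished. The sketch below shows one way to make that explicit in plain Python: wait for the process to exit, check its return code, and confirm the row count. A small Python child process stands in for the R script, and none of this is taken from the presenters' code.

```python
import subprocess
import sys
from typing import List

# Stand-in for an R scoring script: reads numbers on stdin, writes doubled values.
CHILD_CODE = "import sys\nfor line in sys.stdin:\n    print(float(line) * 2)\n"

def score_via_child(values: List[float], timeout_s: float = 30.0) -> List[float]:
    """Send rows over stdin, wait for the child to exit, then read stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input="\n".join(str(v) for v in values) + "\n",
        capture_output=True,
        text=True,
        timeout=timeout_s,  # do not hang forever on a stuck child
    )
    if result.returncode != 0:
        raise RuntimeError(f"child failed: {result.stderr.strip()}")
    predictions = [float(line) for line in result.stdout.splitlines()]
    if len(predictions) != len(values):
        raise RuntimeError("child returned a different number of rows than it was sent")
    return predictions

predictions = score_via_child([1.0, 2.5])
```

An in-process bridge such as rpy2 avoids the pipe entirely, which fits the slide's note that it worked much better, at the price of serializing data between Python and R.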
19
Lessons Learned
• Broadcast join
– Increasing spark.sql.autoBroadcastJoinThreshold is not a one-size-fits-all fix
– df1.join(broadcast(df2), key)
• Narrow vs. wide transformations
– When possible, use narrow transformations
– Shuffle better
• Cache only when the data will be used multiple times
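The `broadcast(df2)` hint asks Spark to ship the small table to every executor, so the large table can be joined in place without shuffling it. The sketch below illustrates that map-side hash join in plain Python. The tables and the function name are made up, and real Spark handles distribution and memory limits that are ignored here.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

def broadcast_hash_join(
    large: Iterable[Tuple[int, str]],
    small: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Inner join on the first tuple element; `small` becomes a lookup table."""
    lookup: Dict[int, List[str]] = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Stream over the large side; no regrouping of its rows is required.
    for key, left_value in large:
        for right_value in lookup.get(key, []):
            yield key, left_value, right_value

activity = [(1, "view"), (2, "save"), (1, "contact"), (9, "view")]
regions = [(1, "Seattle"), (2, "Portland")]
joined = list(broadcast_hash_join(activity, regions))
```

The lookup table must fit in memory on every worker, which is why raising the automatic threshold globally is risky and an explicit hint on a table known to be small is often the safer choice.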
20
ZESTIMATE
David Fagnan, Senior Data Scientist
21
What are Zestimates?
• Provide an independent estimate of the value of homes
– A starting point for determining a home's value
• A price on every rooftop (Over 100 million Zestimates)
– Rent as well!
22
Making the Zestimates
• Goals:
– Independent
– Transparency
– Accuracy
– Bias
– Changes to user edits
– Diagnostics
ZESTIMATE: Value: $550,000 Range: $525,000-575,000
• Data from multiple sources: physical attributes, listings, sales, user updates
• CLEANING: Reconciling property attributes
• TRAINING: Models trained with recent sales
• SCORING: Models applied to all homes every day
23
Zestimates before Spark
24
Zestimates in Spark
[Pipeline diagram. Labels: Data lake; Spark SQL; Partition by region/other; mapPartitions; Custom ML models (Python, R via rpy2); Save]
25
Zestimate Model Design
• Multiple models
• Single app + shared RDDs/DataFrames in memory
• Run stages in parallel
– Multi-threading + Spark scheduling
– Optional saving/resuming for stages
– Design every piece to run off data within sparkContext or external
[Model design diagram. Labels: Data lake; User Data; Listing Data; Awesome Data; Public data; Messy Data; Core Model; Special Model; Simple Model; Awesome Model; Tax Model; Messy Model; Stage 1: Models A, B, C; Stage 2: Models D, E; Stage 3: Model F; Stage 4: Models F; Final Model]
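Running the models of one stage concurrently while keeping the stages in order can be sketched with a thread pool. The stage layout below loosely follows the slide's labels, and `run_model` is a hypothetical stand-in; the real dependencies between models are not given in the slides.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical stage layout loosely following the slide's labels.
STAGES: List[List[str]] = [["A", "B", "C"], ["D", "E"], ["F"]]

def run_model(name: str, shared: Dict[str, str]) -> str:
    """Stand-in for fitting one model; reads earlier results from `shared`."""
    return f"model {name} saw {len(shared)} earlier results"

def run_stages(stages: List[List[str]]) -> Dict[str, str]:
    """Run stages in order; models inside a stage run concurrently."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        for stage in stages:
            snapshot = dict(results)  # models in a stage see only earlier stages
            futures = {name: pool.submit(run_model, name, snapshot) for name in stage}
            for name, future in futures.items():
                results[name] = future.result()  # wait before starting next stage
    return results

results = run_stages(STAGES)
```

In a single Spark application, jobs submitted from separate threads can be scheduled concurrently, which is presumably what "multi-threading + Spark scheduling" refers to. Saving and resuming stages is not shown here.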
26
USE CASES USING PREDICTIVE ANALYTICS
Home Valuation
• Zestimate
• Pricing Tool
• Turbo Zestimate
• Zillow Home Value Index
• Rent Zestimate
• Zillow Rent Index
• Zestimate Forecast
• Market Report
• Best Time to List
Personalization & Search
• Personalized Homes for Sale
• Homes You’ll Love
• Nearby Similar Sales
• Search – Homes for Sale
• Search – Homes for Rent
• Trending Homes
• Collections
B2B
• Ad Campaigns
• Agent Leads
• Search Engine Marketing (SEM)
Deep Learning
• Videos
• Photos
Content
• Digs Photo Recommendations
• Content Recommendations
User Profiles
• Persona Predictions
• Home Owner Predictions
• Lender Recommendations
• People Also Viewed
27
THANK YOU!
We are hiring!
• Big Data Engineer
• BI Developer
• Software Dev Engineer, Machine Learning
• Data Scientist
• Architect
• Director, Data Science
• Director, Analytic Applications
http://www.zillow.com/jobs/
