1
SPARK AT ZILLOW
Steven Hoelscher, Data Scientist
Alex Chang, Senior BI Developer
David Fagnan, Senior Data Scientist
2
DATA LAKE
Steven Hoelscher, Data Scientist
3
Goals for the Data Lake
• Convergence of disparate data
– Ability to store data in any format
– Raw data is readily available for machine learning scenarios
• Separation of compute and storage
– Faster ingestion of the data
– Ease of scalability
• Centralized data store for the entirety of Zillow
– One place for all data, which fosters sharing across the company
4
Data Lake High Level Architecture
Data Science, Predictive Analytics, BI Use Cases
Zillow Group Data Lake
Data Models
Data Marts
Dictionaries
Metadata Lake
Databases
Business Reporting
API
Web/Mobile Applications
Future Sources
5
Use Case: Historical Data Storage (HDS)
Kinesis Streams → Raw Data Lake → HDS Spark Job → HDS Output
• Maintain history of data so that events could, in theory, be replayed
• Generate property information tables by coalescing across multiple data
sources
• Dedupe records according to unique keys or special, event-specific logic
• Standardize output format (Parquet) and partition for downstream jobs
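The dedupe step can be illustrated without Spark. The following is a minimal sketch, assuming each record carries a timestamp-like field and that "unique key" means a tuple of fields; the field names (`property_id`, `event`, `ts`) and the keep-the-latest rule are hypothetical and not taken from the talk.

```python
from typing import Iterable


def dedupe_latest(records: Iterable[dict], key_fields: tuple, order_field: str) -> list:
    """Keep one record per unique key: the one with the largest order_field."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        kept = latest.get(key)
        if kept is None or rec[order_field] > kept[order_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical events; field names are illustrative only.
events = [
    {"property_id": 1, "event": "listed", "ts": 10},
    {"property_id": 1, "event": "listed", "ts": 12},  # later duplicate of the same key
    {"property_id": 2, "event": "sold", "ts": 11},
]
deduped = dedupe_latest(events, key_fields=("property_id", "event"), order_field="ts")
```

Event-specific logic would replace the simple "largest `ts` wins" comparison with whatever rule a given event type needs.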
6
Stumbling Block #1: Ingesting data from S3
Example raw data path:
s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_X.json
s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_SUCCESS
Goals:
• Ingest all successfully completed data that has not been
processed before.
• Prevent Spark from inferring schema from JSON data
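The first goal amounts to a set difference: directories that contain a `_SUCCESS` marker, minus directories already processed. A minimal sketch, assuming the object keys have already been listed and the processed set is available in memory; the region, process, and event names are hypothetical.

```python
import posixpath


def completed_unprocessed(keys, processed):
    """Return directory prefixes that contain a *_SUCCESS marker and are not yet processed."""
    done = {posixpath.dirname(k) for k in keys if k.endswith("_SUCCESS")}
    return sorted(done - set(processed))


# Hypothetical keys following the raw data path layout shown above.
keys = [
    "foo/r1/p1/sale/sale_0.json",
    "foo/r1/p1/sale/sale_SUCCESS",
    "foo/r1/p2/sale/sale_0.json",  # no marker yet: still being written
    "foo/r2/p1/sale/sale_0.json",
    "foo/r2/p1/sale/sale_SUCCESS",
]
pending = completed_unprocessed(keys, processed={"foo/r2/p1/sale"})
```

Here `pending` contains only `foo/r1/p1/sale`: the `p2` directory has no marker and the `r2` directory was already handled.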
7
Solution: Ingestion Queue and Schema Artifact
• Lambda function: monitors the S3 raw bucket for _SUCCESS files
• DynamoDB: stores S3 paths with region, event type, and process history metadata
• Upstream team creates a versioned artifact of Avro schemas that can be used at read time
For example:
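The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal sketch of the idea of applying a declared schema at read time instead of inferring one from the JSON. The schema contents, field names, and the `parse_with_schema` helper are hypothetical; in Spark itself a schema can be supplied through `DataFrameReader.schema(...)`, which this plain-Python sketch only imitates.

```python
import json

# Hypothetical Avro-style record schema; the real artifact's contents are not shown in the slides.
SCHEMA = {
    "type": "record",
    "name": "SaleEvent",
    "fields": [
        {"name": "uid", "type": "string"},
        {"name": "regionid", "type": "int"},
        {"name": "price", "type": "double"},
    ],
}

# Only the primitive types used above are handled in this sketch.
CASTS = {"string": str, "int": int, "long": int, "double": float}


def parse_with_schema(line: str, schema: dict) -> dict:
    """Parse one JSON line, keeping only declared fields and casting each to its declared type."""
    raw = json.loads(line)
    out = {}
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        out[name] = CASTS[ftype](raw[name])
    return out


row = parse_with_schema('{"uid": "a1", "regionid": "42", "price": 550000, "extra": 1}', SCHEMA)
```

Because the types come from the schema, a string `"42"` becomes the integer `42` and the undeclared `extra` field is dropped, regardless of what any individual file happens to contain.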
8
Stumbling Block #2: Partitioning output data
Goals:
• Partition each event stream by region
• Ensure one partition per region
First thought:
DataFrameWriter’s partitionBy method looks attractive
But:
A coalesce(1) call would be required to have one partition per
regionid.
9
Solution: Partition intelligently using HiveQL
(New) Goals:
• Ensure records with same uid are located on same
partition
• Reduce read and shuffle cost of downstream Spark jobs
Number of total partitions will be equal to spark.sql.shuffle.partitions
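The effect of distributing by uid can be sketched in plain Python. This is an illustration under assumptions: the slide's HiveQL is not preserved (a clause such as `DISTRIBUTE BY uid` is one plausible form), CRC32 stands in for Spark's own hash function, and `num_partitions` plays the role of spark.sql.shuffle.partitions.

```python
import zlib
from collections import defaultdict


def partition_for(uid: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (CRC32 here; Spark uses a different hash)."""
    return zlib.crc32(uid.encode("utf-8")) % num_partitions


def distribute_by_uid(records, num_partitions):
    """Assign every record to the partition chosen by its uid."""
    parts = defaultdict(list)
    for rec in records:
        parts[partition_for(rec["uid"], num_partitions)].append(rec)
    return parts


# Hypothetical records: the same uid appears under different regions.
records = [
    {"uid": u, "regionid": r}
    for u, r in [("a", 1), ("b", 1), ("a", 2), ("c", 3), ("b", 3)]
]
parts = distribute_by_uid(records, num_partitions=4)
```

Every record with a given uid lands in the same partition no matter which region it came from, so a downstream job keyed on uid can avoid another shuffle.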
10
USER SEGMENTATION
Alex Chang, Senior BI Developer
11
Who Am I
• Background as Data Warehouse Engineer
• Started the Spark journey roughly a year ago, and it is ongoing
• Previously drugstore.com, Walgreens
12
Big Data? Small Data?
13
Zillow Mission
Power to the People
Build the most vibrant and trusted real estate marketplace
14
How does the model work?
15
How does the model work?
Persona
16
Architecture
17
Making Spark work for Zillow
• Spark is an MPP (massively parallel processing) engine
• Model
– Requires all activity belonging to the same user in order to predict a persona
• Partition
– Distribute the data by user id
– All activity of the same user is contained in a single partition
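Once all of a user's activity sits in one partition, the model can be applied partition by partition. A minimal sketch, assuming a per-partition function that receives an iterator of events; `predict_persona` is a hypothetical stand-in for the real model, and the field and action names are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def predict_persona(events):
    """Hypothetical stand-in for the persona model: label by most frequent action."""
    counts = {}
    for e in events:
        counts[e["action"]] = counts.get(e["action"], 0) + 1
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(counts, key=lambda a: (-counts[a], a))


def score_partition(rows):
    """Per-partition function: rows is an iterator over one partition's events."""
    ordered = sorted(rows, key=itemgetter("user_id"))
    for user_id, events in groupby(ordered, key=itemgetter("user_id")):
        yield user_id, predict_persona(list(events))


# Hypothetical contents of a single partition.
partition = [
    {"user_id": "u1", "action": "view_rental"},
    {"user_id": "u2", "action": "view_sale"},
    {"user_id": "u1", "action": "view_rental"},
    {"user_id": "u1", "action": "save_home"},
]
personas = dict(score_partition(iter(partition)))
```

The grouping is only correct because no events for `u1` or `u2` live in another partition, which is exactly what distributing by user id guarantees.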
18
Evolution of Spark process
• Started with pipe() to R model with Spark 1.3
– Works well after a fashion, but the handshake mechanism is poor
• Standard in/out
• Does not wait for actual completion of the R script
• Use of rpy2
– Works much better. Returns all predictions
• Serialization of data
• Yet another potential memory issue source
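The handshake problem with a stdin/stdout pipe can be seen outside Spark. The sketch below is not the team's code: it uses a small Python child process as a stand-in for the R script, sends input over stdin, blocks until the child exits, and checks the exit status before trusting the output.

```python
import subprocess
import sys

# Stand-in for the external model script (an R script in the talk); it reads
# numbers from stdin and writes one "prediction" per line to stdout.
CHILD = "import sys\nfor line in sys.stdin:\n    print(float(line) * 2)\n"


def score_external(values):
    """Send values over stdin, wait for the child to exit, and return its stdout values."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input="\n".join(str(v) for v in values) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return [float(x) for x in result.stdout.split()]


predictions = score_external([1, 2.5])
```

Waiting for the exit code is the explicit completion signal that a bare stream of lines lacks; an in-process binding such as rpy2 sidesteps the pipe but pays for it in data serialization and memory.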
19
Lessons Learned
• Broadcast Join
– Increasing spark.sql.autoBroadcastJoinThreshold is not one-size-fits-all
– df1.join(broadcast(df2), key)
• Narrow vs Wide transformation
– When possible, use narrow transformations
– Shuffle better
• Cache only when data will be used multiple times
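What a broadcast join buys can be sketched in plain Python: the small side becomes an in-memory lookup table and the large side is streamed past it, so the large side never has to be shuffled. The table and column names below are hypothetical, and this is an illustration of the idea rather than Spark's implementation.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a lookup from the small side, then stream the large side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical data: a large fact table and a small dimension table.
page_views = [
    {"region_id": 1, "user": "a"},
    {"region_id": 2, "user": "b"},
    {"region_id": 1, "user": "c"},
    {"region_id": 9, "user": "d"},  # no matching region: dropped by the inner join
]
regions = [{"region_id": 1, "name": "Seattle"}, {"region_id": 2, "name": "Portland"}]
joined = list(broadcast_hash_join(page_views, regions, key="region_id"))
```

The lookup table must fit in memory on every worker, which is why raising the automatic threshold globally is risky and an explicit `broadcast(df2)` hint on a table known to be small is the safer choice.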
20
ZESTIMATE
David Fagnan, Senior Data Scientist
21
What are Zestimates?
• Provide an independent estimate of the value of homes
– Starting point to determine a home's value
• A price on every rooftop (Over 100 million Zestimates)
– Rent as well!
22
Making the Zestimates
• Goals:
– Independent
– Transparency
– Accuracy
– Bias
– Changes to user edits
– Diagnostics
[Diagram: data from multiple sources (physical attributes, listings, sales, user updates) → CLEANING (reconciling property attributes) → TRAINING (models trained with recent sales) → SCORING (models applied to all homes every day) → ZESTIMATE (Value: $550,000; Range: $525,000-575,000)]
23
Zestimates before Spark
24
Zestimates in Spark
Data lake → Spark SQL → Partition by Region/Other → mapPartitions (custom ML models: Python, R via rpy2) → Save
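The mapPartitions step can be sketched as a function that consumes one region's rows and yields one estimate per home. Everything model-related here is a hypothetical stand-in: the "model" is just the median sale price per square foot, and the field names and figures are illustrative, not Zillow data.

```python
from statistics import median


def fit_and_score(rows):
    """mapPartitions-style function: consume one region's rows, yield (home_id, estimate)."""
    rows = list(rows)
    sold = [r for r in rows if r.get("sale_price") is not None]
    if not sold:
        return  # nothing to train on in this partition
    # Hypothetical stand-in model: median sale price per square foot in the region.
    ppsf = median(r["sale_price"] / r["sqft"] for r in sold)
    for r in rows:
        yield r["home_id"], round(ppsf * r["sqft"])


# Hypothetical rows for a single region partition.
region_partition = [
    {"home_id": "h1", "sqft": 1000, "sale_price": 300000},
    {"home_id": "h2", "sqft": 2000, "sale_price": 500000},
    {"home_id": "h3", "sqft": 1500, "sale_price": None},
]
estimates = dict(fit_and_score(iter(region_partition)))
```

Because training and scoring happen inside the same per-partition call, any Python or R model library can be used there without Spark needing to know about it.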
25
Zestimate Model Design
• Multiple models
• Single App + Share RDDs/DataFrames in memory
• Run stages in parallel
– Multi-threading + Spark scheduling
– Optional Saving/Resuming for stages
– Design every piece to run off data within sparkContext or external
[Diagram: Data lake inputs (User Data, Listing Data, Awesome Data, Public Data, Messy Data) feed the Core, Special, Simple, Awesome, Tax, and Messy models. Stage labels: Stage 1, Models A, B, C; Stage 2, Models D, E; Stage 3, Model F; Stage 4, Models F; Final Model.]
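Running the models of one stage in parallel from a single application can be sketched with a thread pool. This is a plain-Python illustration, not the team's code: the lambdas are hypothetical stand-ins for models, and in a real Spark application each thread would submit its own Spark jobs, which Spark permits within one application.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(models, inputs):
    """Run every model of a stage concurrently and collect outputs by model name."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, inputs) for name, fn in models.items()}
        return {name: f.result() for name, f in futures.items()}


# Hypothetical stand-ins for the models; each reads the outputs available so far.
stage1 = {
    "A": lambda d: d["base"] + 1,
    "B": lambda d: d["base"] + 2,
    "C": lambda d: d["base"] + 3,
}
stage2 = {
    "D": lambda d: d["A"] + d["B"],
    "E": lambda d: d["C"] * 2,
}

state = {"base": 10}
state.update(run_stage(stage1, state))  # A, B, C run in parallel
state.update(run_stage(stage2, state))  # D, E start only after stage 1 finishes
```

Keeping each stage's outputs in a shared `state` mapping mirrors the idea of sharing RDDs or DataFrames in memory; persisting that mapping between stages would be one way to support optional saving and resuming.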
26
USE CASES USING PREDICTIVE ANALYTICS
Home Valuation
• Zestimate
• Pricing Tool
• Turbo Zestimate
• Zillow Home Value Index
• Rent Zestimate
• Zillow Rent Index
• Zestimate Forecast
• Market Report
• Best Time to List
Personalization & Search
• Personalized Homes for Sale
• Homes You’ll Love
• Nearby Similar Sales
• Search – Homes for Sale
• Search – Homes for Rent
• Trending Homes
• Collections
B2B
• Ad Campaigns
• Agent Leads
• Search Engine Marketing (SEM)
Deep Learning
• Videos
• Photos
Content
• Digs Photo Recommendations
• Content Recommendations
User Profiles
• Persona Predictions
• Home Owner Predictions
• Lender Recommendations
• People Also Viewed
27
THANK YOU!
We are hiring!
• Big Data Engineer
• BI Developer
• Software Dev Engineer, Machine Learning
• Data Scientist
• Architect
• Director, Data Science
• Director, Analytic Applications
http://www.zillow.com/jobs/

Spark at Zillow

  • 1. Spark at Zillow
      • Steven Hoelscher, Data Scientist
      • Alex Chang, Senior BI Developer
      • David Fagnan, Senior Data Scientist
  • 3. Goals for the Data Lake
      • Convergence of disparate data
          – Ability to store data in any format
          – Raw data is readily available for machine learning scenarios
      • Separation of compute and storage
          – Faster ingestion of the data
          – Ease of scalability
      • Centralized data store for the entirety of Zillow
          – One place for all data that fosters sharing across the company
  • 4. Data Lake High Level Architecture (diagram labels): Data Science, Predictive Analytics, BI Use Cases; Zillow Group Data Lake; Data Models; Data Marts; Dictionaries; Metadata Lake; Databases; Business Reporting; API; Web/Mobile Applications; Future Sources
  • 5. Use Case: Historical Data Storage (HDS)
      • Pipeline: Kinesis Streams → Raw Data Lake → HDS Spark Job → HDS Output
      • Maintain history of data such that events could theoretically be replayed
      • Generate property information tables by coalescing across multiple data sources
      • Dedupe records according to unique keys or special, event-specific logic
      • Standardize output format (Parquet) and partition for downstream jobs
  • 6. Stumbling Block #1: Ingesting data from S3
      • Example raw data paths:
          – s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_X.json
          – s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_SUCCESS
      • Goals:
          – Ingest all successfully completed data that has not been processed before
          – Prevent Spark from inferring schema from JSON data
  • 7. Solution: Ingestion Queue and Schema Artifact
      • Lambda function: monitors the S3 raw bucket for _SUCCESS files
      • DynamoDB: stores S3 paths with region, event type, and process history metadata
      • Upstream team creates a versioned artifact of Avro schemas that can be used at read time
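
The ingestion-queue idea from slide 7 can be sketched in plain Python. This is an illustration only, not the talk's implementation: an in-memory dict stands in for the DynamoDB table, a list of keys stands in for S3 notifications, and the function and field names (`enqueue_new`, `pending`, `processed`) are hypothetical.

```python
from typing import Dict, Iterable, List

SUCCESS_SUFFIX = "_SUCCESS"


def completed_prefixes(keys: Iterable[str]) -> List[str]:
    """Return the directory prefix of every key that is a _SUCCESS marker."""
    return sorted({k.rsplit("/", 1)[0] for k in keys if k.endswith(SUCCESS_SUFFIX)})


def enqueue_new(queue: Dict[str, dict], keys: Iterable[str]) -> List[str]:
    """Record completed prefixes not seen before and return the newly added ones."""
    added = []
    for prefix in completed_prefixes(keys):
        if prefix not in queue:
            # Path layout taken from the slide: .../foo/<regionid>/<processid>/<eventname>
            _, region, process, event = prefix.split("/")[-4:]
            queue[prefix] = {
                "region": region,
                "process": process,
                "event": event,
                "processed": False,  # hypothetical flag for process history
            }
            added.append(prefix)
    return added


def pending(queue: Dict[str, dict]) -> List[str]:
    """Prefixes that completed upstream but have not been ingested yet."""
    return sorted(p for p, meta in queue.items() if not meta["processed"])
```

A Spark job would then read only the prefixes returned by `pending` and mark them processed once it finishes, which covers the goal of ingesting completed data exactly once.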
  • 8. Stumbling Block #2: Partitioning output data
      • Goals:
          – Partition each event stream by region
          – Ensure one partition per region
      • First thought: DataFrameWriter's partitionBy method looks attractive
      • But: a coalesce(1) call would be required to have one partition per regionid
  • 9. Solution: Partition intelligently using HiveQL
      • (New) Goals:
          – Ensure records with the same uid are located on the same partition
          – Reduce read and shuffle cost of downstream Spark jobs
      • The total number of partitions will be equal to spark.sql.shuffle.partitions
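
The query from slide 9 is not part of this transcript. As a rough illustration of the underlying idea, the sketch below hash-partitions records by `uid` in plain Python: the partition count is fixed up front (playing the role of spark.sql.shuffle.partitions) and every record with the same `uid` lands in the same partition. In Spark SQL a `DISTRIBUTE BY uid` clause has this effect, but whether the talk used exactly that clause is an assumption. The helper names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_of(uid: str, num_partitions: int) -> int:
    """Deterministically map a uid to a partition index."""
    return zlib.crc32(uid.encode("utf-8")) % num_partitions


def distribute_by_uid(records: List[Dict], num_partitions: int) -> List[List[Dict]]:
    """Split records into exactly num_partitions lists, keyed by hash of uid."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_of(record["uid"], num_partitions)].append(record)
    # Always return num_partitions lists, even if some are empty.
    return [partitions[i] for i in range(num_partitions)]
```

Because placement depends only on the key, a downstream job that groups or dedupes by `uid` can work partition by partition without another shuffle.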
  • 10. User Segmentation: Alex Chang, Senior BI Developer
  • 11. Who Am I
      • Background as Data Warehouse Engineer
      • Started the Spark journey roughly a year ago; still ongoing
      • Previously drugstore.com, Walgreens
  • 13. Zillow Mission
      • Power to the People
      • Build the most vibrant and trusted real estate marketplace
  • 14. How does the model work?
  • 15. How does the model work? (Persona)
  • 17. Making Spark work for Zillow
      • Spark is an MPP engine
      • Model
          – Requires all activity belonging to the same user in order to predict Persona
      • Partition
          – Distribute the data by user id
          – All activity of the same user will be contained in a single partition
  • 18. Evolution of Spark process
      • Started with pipe() to R model with Spark 1.3
          – Works well after a fashion; poor handshake mechanism
              • Standard IN/OUT
              • Does not wait for actual completion of R script
      • Use of rpy2
          – Works much better; returns all predictions
              • Serialization of data
              • Yet another potential source of memory issues
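
The standard IN/OUT handshake described on slide 18 can be illustrated outside Spark with the standard library. In this sketch a small Python child process stands in for the R script (an assumption made so the example is self-contained), rows go in on stdin, one result per row comes back on stdout, and `subprocess.run` blocks until the child has actually exited, which is the guarantee the slide says was missing. The function names are hypothetical.

```python
import subprocess
import sys
from typing import List

# Stand-in for the external model script (hypothetical): one output line per input line.
CHILD_SOURCE = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())\n"


def default_command() -> List[str]:
    """Command line for the stand-in child process."""
    return [sys.executable, "-c", CHILD_SOURCE]


def pipe_rows(rows: List[str], command: List[str]) -> List[str]:
    """Send rows to the child on stdin and collect its stdout lines."""
    payload = "".join(row + "\n" for row in rows)
    # run() waits for the child to exit; check=True raises if it exits non-zero.
    proc = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return proc.stdout.splitlines()
```

Calling `pipe_rows(["a", "b"], default_command())` returns the child's output only after it has finished, and a failing script surfaces as an exception rather than as silently missing predictions.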
  • 19. Lessons Learned
      • Broadcast Join
          – Increasing spark.sql.autoBroadcastJoinThreshold is not one size fits all
          – df1.join(broadcast(df2), key)
      • Narrow vs Wide transformation
          – When possible use narrow transformations
          – Shuffle better
      • Cache only when data will be used multiple times
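
To show why a broadcast join avoids a shuffle, here is a minimal plain-Python sketch of the mechanism, not Spark's implementation: the small table is turned into a lookup dict that every partition of the large table can consult locally, so no rows of the large side move between partitions. It assumes the small side has unique keys and performs an inner join; `broadcast_hash_join` is a hypothetical name.

```python
from typing import Dict, List


def broadcast_hash_join(
    large_partitions: List[List[Dict]], small_rows: List[Dict], key: str
) -> List[List[Dict]]:
    """Inner-join each partition of the large side against a copied lookup table."""
    # The "broadcast" step: build the lookup once and share it with every partition.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for partition in large_partitions:
        out = []
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                out.append({**row, **match})
        joined.append(out)  # partition boundaries of the large side are preserved
    return joined
```

The cost is that the whole small table must fit in memory on every executor, which is consistent with the slide's warning that raising the threshold is not a universal fix.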
  • 21. What are Zestimates?
      • Provide an independent estimate of the value of homes
          – Starting point to determine a home's value
      • A price on every rooftop (over 100 million Zestimates)
          – Rent as well!
  • 22. Making the Zestimates
      • Goals:
          – Independent
          – Transparency
          – Accuracy
          – Bias
          – Changes to user edits
          – Diagnostics
      • Example Zestimate: Value $550,000; Range $525,000-575,000
      • Data from multiple sources: physical attributes, listings, sales, user updates
      • Cleaning: reconciling property attributes
      • Training: models trained with recent sales
      • Scoring: models applied to all homes every day
  • 24. Zestimates in Spark (diagram labels): Data lake; Spark SQL; Partition by Region/Other; Save; mapPartitions; Custom ML models: Python, R (via rpy2)
  • 25. Zestimate Model Design
      • Multiple models
      • Single App + Share RDDs/DataFrames in memory
      • Run stages in parallel
          – Multi-threading + Spark scheduling
          – Optional Saving/Resuming for stages
          – Design every piece to run off data within sparkContext or external
      • Diagram labels: Data lake; User Data; Listing Data; Awesome Data; Public data; Messy Data; Core Model; Special Model; Simple Model; Awesome Model; Tax Model; Messy Model; Stage 1 Models A,B,C; Stage 2 Models D,E; Stage 3 Model F; Stage 4 Models F; Final Model
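
The "run stages in parallel" design from slide 25 can be sketched with a thread pool. This is a simplified illustration under stated assumptions: plain Python callables stand in for models, each model receives the results of all earlier stages, and models within one stage run concurrently. In a real Spark application each thread would submit jobs against the shared SparkContext; the staging layout and the name `run_stages` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Model = Callable[[Dict[str, object]], object]


def run_stages(stages: List[Dict[str, Model]], max_workers: int = 4) -> Dict[str, object]:
    """Run each stage's models concurrently; later stages see earlier results."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for stage in stages:
            # Snapshot so models in the same stage only see output of earlier stages.
            snapshot = dict(results)
            futures = {name: pool.submit(fn, snapshot) for name, fn in stage.items()}
            # Barrier: wait for the whole stage before starting the next one.
            for name, future in futures.items():
                results[name] = future.result()
    return results
```

Saving `results` after each stage would give the optional save/resume behavior the slide mentions, since a restarted run could skip any stage whose outputs already exist.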
  • 26. Use Cases Using Predictive Analytics
      • Home Valuation: Zestimate; Pricing Tool; Turbo Zestimate; Zillow Home Value Index; Rent Zestimate; Zillow Rent Index; Zestimate Forecast; Market Report; Best Time to List
      • Personalization & Search: Personalized Homes for Sale; Homes You'll Love; Nearby Similar Sales; Search – Homes for Sale; Search – Homes for Rent; Trending Homes; Collections
      • B2B: Ad Campaigns; Agent Leads; Search Engine Marketing (SEM)
      • Deep Learning: Videos; Photos
      • Content: Digs Photo Recommendations; Content Recommendations
      • User Profiles: Persona Predictions; Home Owner Predictions; Lender Recommendations; People Also Viewed
  • 27. Thank You! We are hiring!
      • Big Data Engineer
      • BI Developer
      • Software Dev Engineer, Machine Learning
      • Data Scientist
      • Architect
      • Director, Data Science
      • Director, Analytic Applications
      • http://www.zillow.com/jobs/