3. Goals for the Data Lake
• Convergence of disparate data
– Ability to store data in any format
– Raw data is readily available for machine learning scenarios
• Separation of compute and storage
– Faster ingestion of the data
– Ease of scalability
• Centralized data store for the entirety of Zillow
– One place for all data that fosters sharing cross company
4. Data Lake High-Level Architecture
[Diagram: the Zillow Group Data Lake (Data Models, Data Marts, Dictionaries, Metadata Lake) connects Databases, Business Reporting, APIs, Web/Mobile Applications, and Future Sources with Data Science, Predictive Analytics, and BI use cases]
5. Use Case: Historical Data Storage (HDS)
Pipeline: Kinesis Streams → Raw Data Lake → HDS Spark Job → HDS Output
• Maintain a history of data so that events can, in theory, be replayed
• Generate property-information tables by coalescing across multiple data sources
• Dedupe records according to unique keys or special, event-specific logic
• Standardize the output format (Parquet) and partition it for downstream jobs
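The dedupe step above can be sketched in plain Python (a Spark-free stand-in; the `uid` and `ts` field names are illustrative, not from the deck):

```python
def dedupe(records, key="uid", order="ts"):
    """Keep only the most recent record per unique key,
    mimicking the HDS dedupe-by-unique-key step."""
    latest = {}
    for rec in records:
        k = rec[key]
        # Later timestamp wins; ties keep the first record seen.
        if k not in latest or rec[order] > latest[k][order]:
            latest[k] = rec
    return list(latest.values())

events = [
    {"uid": "p1", "ts": 1, "price": 500},
    {"uid": "p1", "ts": 3, "price": 525},
    {"uid": "p2", "ts": 2, "price": 900},
]
print(dedupe(events))  # one record per uid; p1 keeps its ts=3 version
```

In the real job the same idea runs per partition in Spark, with event-specific logic substituted where a plain "latest wins" rule is not enough.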
6. Stumbling Block #1: Ingesting data from S3
Example raw data path:
s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_X.json
s3://data-lake-raw/foo/<regionid>/<processid>/<eventname>/<eventname>_SUCCESS
Goals:
• Ingest all successfully completed data that has not been processed before
• Prevent Spark from inferring schema from the JSON data
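The first goal can be sketched as a path-selection function in plain Python (a stand-in for listing the real S3 bucket; the example keys are invented):

```python
def paths_to_ingest(s3_keys, processed):
    """Return event prefixes that have a _SUCCESS marker and
    have not been ingested before."""
    success_prefixes = {
        key.rsplit("/", 1)[0]
        for key in s3_keys
        if key.endswith("_SUCCESS")
    }
    return sorted(success_prefixes - processed)

keys = [
    "data-lake-raw/foo/1/42/pageview/pageview_0.json",
    "data-lake-raw/foo/1/42/pageview/pageview_SUCCESS",
    "data-lake-raw/foo/2/43/click/click_0.json",  # no _SUCCESS yet
]
print(paths_to_ingest(keys, processed=set()))
# only the pageview prefix qualifies; the click upload is incomplete
```

In production, `processed` corresponds to state kept outside the job (the deck's solution uses DynamoDB), so reruns never re-ingest the same prefix.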
7. Solution: Ingestion Queue and Schema Artifact
• Lambda function: monitors the S3 raw bucket for _SUCCESS files
• DynamoDB: stores S3 paths with region, event type, and process-history metadata
• The upstream team creates a versioned artifact of Avro schemas that can be used at read time
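As an illustration of read-time schema enforcement, here is a plain-Python sketch; the real pipeline applies versioned Avro schemas in Spark, and the schema and field names below are invented:

```python
import json

# Illustrative stand-in for one entry in the versioned schema artifact;
# the real artifact holds Avro schemas applied at read time.
PAGEVIEW_SCHEMA_V1 = {"uid": str, "regionid": int, "ts": int}

def parse_with_schema(line, schema):
    """Parse a JSON record and coerce it to the declared schema,
    instead of letting the reader infer types from the data."""
    raw = json.loads(line)
    return {field: typ(raw[field]) for field, typ in schema.items()}

rec = parse_with_schema('{"uid": "u1", "regionid": "7", "ts": 123}',
                        PAGEVIEW_SCHEMA_V1)
print(rec)  # regionid arrives as a string but is coerced to int
```

Declaring the schema up front keeps types stable across files, which inference over heterogeneous JSON cannot guarantee.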
8. Stumbling Block #2: Partitioning output data
Goals:
• Partition each event stream by region
• Ensure one partition per region
First thought:
DataFrameWriter’s partitionBy method looks attractive
But:
A coalesce(1) call would be required to get one partition per regionid.
9. Solution: Partition intelligently using HiveQL
(New) Goals:
• Ensure records with the same uid are located on the same partition
• Reduce read and shuffle cost of downstream Spark jobs
The total number of partitions will equal spark.sql.shuffle.partitions.
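Hash partitioning is what makes these goals achievable: every record with the same uid hashes to the same bucket. A plain-Python sketch of the idea (not Spark's actual HashPartitioner implementation; the crc32 hash and bucket count are illustrative):

```python
import zlib

NUM_PARTITIONS = 200  # analogous to spark.sql.shuffle.partitions

def partition_for(uid, num_partitions=NUM_PARTITIONS):
    """Stable hash partitioning: records sharing a uid always
    map to the same partition index."""
    return zlib.crc32(uid.encode()) % num_partitions

uids = ["home-1", "home-2", "home-1", "home-3"]
print([partition_for(u) for u in uids])
# the two "home-1" records land on the same partition index
```

Because co-location is by hash rather than by forcing one file per region, downstream joins and aggregations on uid avoid a shuffle without collapsing parallelism the way coalesce(1) would.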
17. Making Spark work for Zillow
• Spark is an MPP engine
• Model
– Requires all of a user's activity together to predict their persona
• Partition
– Distribute the data by user ID
– All activity for the same user is then contained in a single partition
18. Evolution of the Spark process
• Started with pipe() to the R model on Spark 1.3
– Works, after a fashion, but has a poor handshake mechanism
• Only standard IN/OUT
• Does not wait for actual completion of the R script
• Use of rpy2
– Works much better; returns all predictions
• Requires serialization of data
• Yet another potential source of memory issues
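The handshake problem with pipe() over standard IN/OUT can be illustrated with a toy subprocess that emits an explicit end-of-output sentinel; this is a hypothetical mechanism in plain Python, not the deck's actual R integration:

```python
import subprocess
import sys

# Stand-in "model" script; the real pipeline piped rows to an R model.
CHILD = r"""
import sys
for line in sys.stdin:
    print(float(line) * 2)
print("DONE")  # explicit sentinel so the parent knows we finished
"""

proc = subprocess.run(
    [sys.executable, "-c", CHILD],
    input="1\n2\n3\n",
    capture_output=True, text=True,
)
lines = proc.stdout.splitlines()
# Without a sentinel (or exit-code check), the parent cannot tell a
# completed run from one that died mid-stream: stdout just ends.
assert lines[-1] == "DONE", "child exited before finishing its output"
predictions = [float(x) for x in lines[:-1]]
print(predictions)
```

rpy2 sidesteps the problem entirely by calling R in-process and returning all predictions as values, at the cost of serialization overhead and extra memory pressure.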
19. Lessons Learned
• Broadcast join
– Increasing spark.sql.autoBroadcastJoinThreshold is not one-size-fits-all
– df1.join(broadcast(df2), key) forces the hint explicitly
• Narrow vs. wide transformations
– Use narrow transformations when possible
– They avoid shuffles
• Cache only when data will be used multiple times
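The broadcast-join idea can be shown with plain-Python stand-ins: ship the small table to every partition of the big one, so the big side never needs to shuffle (toy data and names, invented for illustration):

```python
def broadcast_join(partitions, small_table, key):
    """Toy map-side join: the small table (a dict) is available to
    every partition, so each partition joins locally."""
    return [
        [{**row, **small_table[row[key]]} for row in part]
        for part in partitions
    ]

big = [  # the large side, already split into partitions
    [{"uid": "u1", "page": "home"}],
    [{"uid": "u2", "page": "search"}, {"uid": "u1", "page": "saved"}],
]
regions = {"u1": {"regionid": 7}, "u2": {"regionid": 9}}  # small side
print(broadcast_join(big, regions, "uid"))
```

This is why the threshold is not one-size-fits-all: the "small" side must fit in every executor's memory, so an explicit broadcast() hint on a table you know is small beats raising the global threshold.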
21. What are Zestimates?
• Provide an independent estimate of the value of homes
– A starting point for determining a home's value
• A price on every rooftop (over 100 million Zestimates)
– Rents as well!
22. Making the Zestimates
• Goals:
– Independent
– Transparency
– Accuracy
– Bias
– Changes to user edits
– Diagnostics
[Diagram: data from multiple sources (physical attributes, listings, sales, user updates) flows through CLEANING (reconciling property attributes), TRAINING (models trained with recent sales), and SCORING (models applied to all homes every day) to produce a Zestimate, e.g. Value: $550,000, Range: $525,000-$575,000]
24. Zestimates in Spark
Pipeline: Data lake → Spark SQL → partition by region/other → mapPartitions (custom ML models in Python and R via rpy2) → Save
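The mapPartitions pattern, sketched in plain Python: the per-partition function pays any setup cost (such as loading a model) once per partition rather than once per row (the toy model and data below are invented):

```python
def score_partition(rows):
    """Runs once per partition: set up the model a single time,
    then score every row in that partition."""
    model = lambda r: r["sqft"] * 250  # hypothetical toy model
    return [{**r, "zestimate": model(r)} for r in rows]

partitions = [  # data already partitioned by region
    [{"home": "a", "sqft": 1000}],
    [{"home": "b", "sqft": 2000}, {"home": "c", "sqft": 1500}],
]
# Plain-Python stand-in for rdd.mapPartitions(score_partition)
scored = [score_partition(p) for p in partitions]
print(scored)
```

This is why partitioning by region first matters: each partition holds everything one model invocation needs, and the expensive Python/R model startup is amortized across the whole partition.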
25. Zestimate Model Design
• Multiple models
• Single app + shared RDDs/DataFrames in memory
• Run stages in parallel
– Multi-threading + Spark scheduling
– Optional saving/resuming for stages
– Design every piece to run off data within the SparkContext or externally
[Diagram: data-lake inputs (User Data, Listing Data, Awesome Data, Public Data, Messy Data) flow through staged models: Core, Special, and Simple (Stage 1: Models A, B, C); Awesome, Tax, and Messy (Stage 2: Models D, E); Stage 3: Model F; and a Stage 4 Final Model]
26. USE CASES USING PREDICTIVE ANALYTICS
Home Valuation
• Zestimate
• Pricing Tool
• Turbo Zestimate
• Zillow Home Value Index
• Rent Zestimate
• Zillow Rent Index
• Zestimate Forecast
• Market Report
• Best Time to List
Personalization & Search
• Personalized Homes for Sale
• Homes You’ll Love
• Nearby Similar Sales
• Search – Homes for Sale
• Search – Homes for Rent
• Trending Homes
• Collections
B2B
• Ad Campaigns
• Agent Leads
• Search Engine Marketing (SEM)
Deep Learning
• Videos
• Photos
Content
• Digs Photo Recommendations
• Content Recommendations
User Profiles
• Persona Predictions
• Home Owner Predictions
• Lender Recommendations
• People Also Viewed
27. THANK YOU!
We are hiring!
• Big Data Engineer
• BI Developer
• Software Dev Engineer, Machine Learning
• Data Scientist
• Architect
• Director, Data Science
• Director, Analytic Applications
http://www.zillow.com/jobs/