Power Real Estate Property Analytics with MongoDB + Spark
1. © 2017 CoreLogic, Inc. [NYSE:CLGX] All Rights Reserved. Proprietary.
Powering Real Estate Property Analytics with MongoDB + Spark
Gheni Abla for DART, CoreLogic
June 21, 2017
2. Learning Objectives
Managing and storing data for real estate properties in MongoDB at CoreLogic®
Distributing large-scale analytics processing using Spark
Utilizing MongoDB replication for implementing high availability between two geographically dispersed data centers
3. CoreLogic – Provider of Property Data, Financial Data, Analytics and Services
Market Cap: $3.3 billion*
Operations: 8 countries
Employees: 6,000+ worldwide*
Principal Markets: U.S. & Australia
Headquarters: Irvine, CA
Principal Businesses: Property Intelligence; Risk Mgmt. & Workflow
* as of Feb 27, 2017
4. Unique Insights and Reach Across the Housing Ecosystem
Rental Properties
  Consumer experience: Rents an apartment
  Clients: Property managers, property owners
  CoreLogic reach: 1 of 3 rental properties
Real Estate
  Consumer experience: Decides to buy a house
  Clients: Realtors, property information services, contractors
  CoreLogic reach: 70% of real estate agents
Mortgage & Capital Markets
  Consumer experience: Needs financing or refinancing
  Clients: Lenders, servicers, capital markets, GSEs
  CoreLogic reach: 3 out of every 4 loans
Insurance
  Consumer experience: Needs insurance and makes claims
  Clients: Insurance carriers, re-insurance
  CoreLogic reach: 70% of homeowner insurance policies
Government
  Consumer experience: Expects regulatory protection
  Clients: Government, regulators
  CoreLogic reach: Almost every housing regulator
CoreLogic Solutions: Underwriting, Risk Management, Valuations, Market Intelligence
5. Property Information Differentiators
Accuracy: 99.5% standard of accuracy driven by automated keying processes
Breadth of coverage: 99.9% of U.S. property records
Deep property data: 4.5B+ records spanning more than 50 years
Depth of detail: 5,000+ data fields
3,100+ counties
Rapid daily data refresh
6. Complete. Current. Connected.
7. MongoDB: Repository for Multiple Data Sets
A new addition, not the entire data warehouse
Data for every real estate property in the US
  Location, address, zip, city, county, characteristics, owner…
  Sale transaction history, loans, payments
Computation results
  Estimated values
  Confidence scores
  Statistical distributions
[Diagram: MongoDB holding Property Data, Property Transactions, MLS Data, AVM, Building Permits and Payments, grouped by ZIP code (ZIP1 to ZIP4)]
8. MongoDB as a Repository for Real Estate Property Data
MongoDB over HDFS
  Used both for the batch computation process and for serving data to real-time applications
  The majority of applications are read-heavy
  Data is updated daily, weekly or monthly
  More frequent analytics
Schema-less
  Not all records are the same; good support for storing sparse data
  Property information and disclosure rules differ state by state and county by county
  Data sets are getting richer every day, but not at the same rate for every property
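The sparse-data point above can be illustrated with a minimal sketch, written in Python for brevity (the talk's own code is Scala). The field names (`apn`, `zip`, `pool`, `permits`) and all values are hypothetical, not CoreLogic's actual schema. The documents are plain dicts of the kind a MongoDB driver stores without requiring a fixed set of columns.

```python
from collections import Counter

# Hypothetical property documents: field names and values are illustrative only.
# Each record carries only the fields that are known for that property.
properties = [
    {"apn": "001", "zip": "92618", "beds": 3, "baths": 2, "pool": True},
    {"apn": "002", "zip": "92618", "beds": 4},
    {"apn": "003", "zip": "73301", "permits": [{"year": 2016, "type": "roof"}]},
]

def field_coverage(docs):
    """Return the fraction of documents in which each field is present."""
    counts = Counter(field for doc in docs for field in doc)
    return {field: n / len(docs) for field, n in counts.items()}

coverage = field_coverage(properties)
```

In this sketch `coverage` shows that only the identifying fields appear in every record, while attributes such as `pool` or `permits` exist only where the source data provides them.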
9. MongoDB as a Repository for Real Estate Property Data
Support for replication
  Data is replicated within data centers and across multiple data centers
  Support for parallel reads
Support for sharding
  Used to separate data across storage media
  SSDs for frequently accessed data and rotational disks for less frequently accessed data
10. Efficient Support for Geospatial Information
Store latitude, longitude and other geo data
Search by location or area (e.g., polygon, circle)
Geospatial operators used:
  $geoWithin
  $geoIntersects
  $near
Specialized geospatial index for fast search
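The geospatial operators listed above take query documents of a fixed shape. The following Python sketch builds such documents as plain dicts; the field name `location` and the helper function names are assumptions made for illustration, and nothing here reflects the queries CoreLogic actually runs.

```python
EARTH_RADIUS_KM = 6378.1  # equatorial radius, the value used in MongoDB's documentation examples

def within_circle(field, lng, lat, radius_km):
    """$geoWithin with $centerSphere: the radius is expressed in radians."""
    return {field: {"$geoWithin": {
        "$centerSphere": [[lng, lat], radius_km / EARTH_RADIUS_KM]}}}

def within_polygon(field, ring):
    """$geoWithin with a GeoJSON Polygon; the ring is closed if it is not already."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    return {field: {"$geoWithin": {
        "$geometry": {"type": "Polygon", "coordinates": [ring]}}}}

def near_point(field, lng, lat, max_meters):
    """$near with a GeoJSON Point: nearest-first results, $maxDistance in meters."""
    return {field: {"$near": {
        "$geometry": {"type": "Point", "coordinates": [lng, lat]},
        "$maxDistance": max_meters}}}

# Index specification in the (key, type) form accepted by drivers such as PyMongo.
GEO_INDEX = [("location", "2dsphere")]

circle_query = within_circle("location", -117.8, 33.7, 2.0)
polygon_query = within_polygon("location", [[0, 0], [0, 1], [1, 1]])
near_query = near_point("location", -117.8, 33.7, 500)
```

With a driver such as PyMongo, dicts like these could be passed to `collection.find(...)` once `collection.create_index(GEO_INDEX)` has built the 2dsphere index. Coordinates are in longitude, latitude order, as GeoJSON requires.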
11. Computations and Analytics
Multiple analytics processes depend on MongoDB
  Examples: Residential Property Appraisal, Automatic Home Valuation, Marketing and Propensity Models, etc.
  Executed on a daily, weekly or monthly basis
Spark cluster used for distributed computation
  Computation is distributed by geographical entities, e.g., zip codes, counties, states
  Computation results are also stored in MongoDB
[Diagram: Spark cluster with a master and four workers, one per ZIP code (ZIP1 to ZIP4), e.g. "Data for ZIP2"; data sets: Property Data, Property Transactions, MLS Data, AVM, Building Permits and Payments]
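Distributing computation by geographical entity amounts to grouping records by a key such as ZIP code and processing each group independently. The sketch below shows that pattern in Python, with a thread pool standing in for Spark workers. It is not Spark code, and the record fields and the per-ZIP statistic (median price) are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Hypothetical sale records; the field names are illustrative.
sales = [
    {"zip": "ZIP1", "price": 300_000},
    {"zip": "ZIP1", "price": 340_000},
    {"zip": "ZIP2", "price": 510_000},
    {"zip": "ZIP2", "price": 490_000},
    {"zip": "ZIP2", "price": 700_000},
]

def partition_by_zip(records):
    """Group records so that each ZIP code can be processed on its own."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["zip"]].append(rec)
    return groups

def zip_stats(item):
    """Per-partition work: needs only the records of a single ZIP code."""
    zip_code, records = item
    prices = [r["price"] for r in records]
    return {"zip": zip_code, "count": len(prices), "median_price": median(prices)}

def run(records, workers=4):
    """Process every ZIP partition in parallel and collect the results."""
    groups = partition_by_zip(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zip_stats, groups.items()))

results = run(sales)
```

Because no partition depends on another, the same structure maps onto a cluster: each worker receives the data for its ZIP codes, and the per-ZIP results can be written back to the database.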
12. Example: Appraisal Adjustment Model
A software tool that helps home appraisers quickly complete their assessments and produce higher-quality appraisals
Two components
  Static Model (batch process)
    Calculates data required by the regression model (market price, tiers, etc.)
  Dynamic Model (model service)
    Comparable property search
    Regression-based analytics
[Diagram: a data feed supplies the Static Model, a weekly batch job (comparable selection, market price, tier calculations, outlier detection). The database holds Comps Stats, Property Data, HPI and ZIPs. In the service tier, the Dynamic Model (REST web service: comparable search, regression) acts as the model service for the mobile app.]
13. Static Model
Runs via Spark
  Spark: cluster computing platform software
    Task scheduling
    Memory management
    Fault recovery
    Support for database access (including MongoDB)
    Support for Scala, Java, Python
  The model code is written in Scala
  MongoDB access is done via the MongoDB-Spark connector
    Good performance
    Direct dataset inference: MongoDB data as Spark DataFrames
Computation performed:
  Hundreds of millions of regressions
  Hundreds of millions of geo searches for comparable properties
  ~100 million properties analyzed and provided with key statistics
  ~10 hours to execute for all house properties in the US
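The two figures above imply a rough average throughput for the batch run. A short worked calculation in Python, using only the approximate numbers from the slide:

```python
properties = 100_000_000   # ~100 million properties (from the slide)
hours = 10                 # ~10 hours per full run (from the slide)

per_hour = properties / hours             # properties processed per hour
per_second = properties / (hours * 3600)  # properties processed per second
```

Since both inputs are approximate, the result (about 10 million properties per hour, or a few thousand per second across the cluster) is an order-of-magnitude estimate only.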
14. Static Model: Batch Processing
[Diagram: within one data center, data sources feed an ETL server that loads a MongoDB replica set (one primary, two replicas), each member holding Comps Stats, Property Data, HPI and ZIPs. A Spark cluster (one master, four workers) processes the data partitioned by address and ZIP code (Addr a/ZIP1 to Addr d/ZIP4).]
15. Dynamic Model: Service Tier
RESTful API
  Request includes:
    Standardized address
    Neighborhood boundaries
  Regression with four independent variables to calculate adjusted price
    Number of beds, number of baths, building square feet, land square feet
  Response returns:
    Adjusted value of the house
    Comparable houses used in regression
    Statistics of the adjustments to measure the confidence
Model code is written in Scala and data access to MongoDB is via Casbah
Configured for multi-datacenter redundancy
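A regression with the four independent variables named above can be sketched as ordinary least squares. The slides do not specify the estimator or the adjustment formula, so the Python code below is an assumption-laden illustration: it fits OLS through the normal equations and then applies a conventional comparable-sales adjustment. The comparable data is synthetic and exactly linear, prices are in thousands of dollars, and areas are in thousands of square feet to keep the arithmetic well scaled.

```python
def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting (a must be non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(features, prices):
    """Least squares via the normal equations (X^T X) beta = X^T y, with an intercept term."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, prices)) for i in range(k)]
    return solve(xtx, xty)

def adjusted_price(comp_price, comp, subject, beta):
    """Assumed adjustment: shift the comparable's price by its differences from the subject."""
    return comp_price + sum(b * (s - c) for b, s, c in zip(beta[1:], subject, comp))

# Synthetic comparables: (beds, baths, building kSqFt, land kSqFt) and price in $ thousands.
comps = [(3, 2, 1.5, 5.0), (4, 2, 2.0, 6.0), (3, 1, 1.2, 4.0), (5, 3, 2.8, 7.5),
         (4, 3, 2.2, 5.5), (2, 1, 0.9, 6.5), (3, 2, 1.8, 8.0)]
prices = [285.0, 350.0, 235.0, 462.5, 382.5, 207.5, 330.0]

beta = fit_ols(comps, prices)            # [intercept, beds, baths, building, land]
subject = (3, 2, 1.6, 5.0)               # hypothetical subject property
adjusted = adjusted_price(prices[0], comps[0], subject, beta)
```

Here the first comparable differs from the subject only in building area, so its adjusted price moves by the building-area coefficient times that difference. A production model would also report statistics of the adjustments, which the slide lists as part of the response.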
16. Dynamic Model
[Diagram: a Global Traffic Manager routes requests to the service tier, four Tomcat instances each running the Dynamic Model, spread across Data Center 1 and Data Center 2. A MongoDB replica set of one primary and four replicas spans both data centers; each member holds Comps Stats, Property Data, HPI and ZIPs. Requests are sent to the topologically nearest MongoDB server.]
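One way to have a service read from a nearby replica set member is the standard `readPreference=nearest` connection string option, which selects the member with the lowest network latency. Whether the deployment shown here used this option or another mechanism is not stated in the slides, so the Python sketch below is an assumption. The host names, replica set name and database name are hypothetical.

```python
from urllib.parse import urlencode

def replica_set_uri(hosts, replica_set, database):
    """Build a MongoDB connection string that lists every member of the replica set."""
    options = urlencode({"replicaSet": replica_set, "readPreference": "nearest"})
    return f"mongodb://{','.join(hosts)}/{database}?{options}"

# Hypothetical members spread over two data centers.
hosts = [
    "dc1-mongo1.example.com:27017",
    "dc1-mongo2.example.com:27017",
    "dc2-mongo1.example.com:27017",
]

uri = replica_set_uri(hosts, "rs0", "properties")
```

Listing members from both data centers lets the driver discover the full replica set even if one site is unreachable. Writes still go to the primary regardless of the read preference.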
17. Conclusion
MongoDB provided powerful support for storing and searching location-based real estate property data
MongoDB’s replication capability provided high availability across data centers
MongoDB supports the data needs of both batch-oriented distributed computation and real-time web services
The Scala language, the Casbah library and the MongoDB-Spark connector facilitated seamless integration between data access and analytics