A living hell - lessons learned in eight years of parsing real estate data
1. A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle
CSVConf Berlin
15 July 2014
2. Residential property search engine in nine markets
3-4 million unique users per month
Processing close to 20M listings daily
Extensive experience / painful lessons in ETL, geocoding, deduping, ...
http://www.nestoria.com
3. What we do
Real estate is a complex, high-value transaction. Our goal is to be:
Simple
Comprehensive
Fast (user time and time to market)
5. Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
6. Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
9. Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Plenty of chances for data to go bad
12. India
Very, very good at:
Cricket
Amazing cuisine
World’s largest democracy
Too many other things to list here
Utterly fucking terrible at:
Real Estate data quality
Addresses / Geodata
14. Must garbage in be garbage out?
Can we turn multiple bits of shit into something useful?
What we really do
something
useful
Chaos
15. Caveat: I love our clients
The examples you are about to see are all theoretical *wink, wink*
Examples / Horror stories
16. Us: “Please set up an automated data transfer. Thx!”
Them: “It’s impossible to export the data from the database”
Them: “Just crawl our website”
Them: “Let’s do incremental updates to save bandwidth”
Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”
Getting the data
17. zip or tar full of subdirs, names of which change with each upload
filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
One file per agent; when a file is not supplied, there is no way to know whether it is missing due to an error or intentionally
Format A on Monday, B on Tuesday, ...
Fun with files
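A minimal sketch of coping with the file problems above, assuming (hypothetically) that each agent's file is named "<agent_id>.xml". The function names and the naming convention are illustrative, not taken from any real feed. The code ignores the ever-changing subdirectory names, strips a query string that leaked into a file name, and lists the agents with no file. It cannot tell you why a file is missing; it only makes the gap visible so someone can ask.

```python
from pathlib import PurePosixPath


def clean_filename(path):
    """Return the bare file name, dropping directories and any leaked '?query'."""
    without_query = path.split("?", 1)[0]
    return PurePosixPath(without_query).name


def missing_agents(expected_agents, received_paths):
    """List expected agent ids for which no '<agent_id>.xml' file arrived."""
    # Hypothetical convention: the file stem is the agent id.
    received = {PurePosixPath(clean_filename(p)).stem for p in received_paths}
    return sorted(set(expected_agents) - received)


if __name__ == "__main__":
    paths = ["upload_2014_07_15/a1.xml", "x/y/a3.xml?key=abc"]
    print(missing_agents(["a1", "a2", "a3"], paths))
```

Agents reported here are candidates for a follow-up question, not automatic deletions of their listings.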
18. <Description>Residential Plot available in Suncity&lt;br /&gt; &lt;br&gt;&lt;br /&gt; &lt;br&gt;SUNCITY PROJECT&lt;br /&gt; &lt;br&gt;&lt;br /&gt; &lt;br&gt;A complete township...
"&gt;" - for when you really, really want to be sure you've escaped your XML
anyone?
XML, LOL
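One way to clean up descriptions like the one above, sketched with Python's standard library only. The helper names and the five-round limit are assumptions made for illustration. The idea is to unescape repeatedly until the text stops changing, then drop whatever markup remains. Note the trade-off: text that was meant to contain a literal "&lt;" will be unescaped too.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def fully_unescape(text, max_rounds=5):
    """Apply html.unescape until the text is stable (handles double escaping)."""
    for _ in range(max_rounds):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text


def to_plain_text(text):
    """Unescape, replace tags with spaces, and collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", fully_unescape(text))
    return " ".join(without_tags.split())


if __name__ == "__main__":
    raw = "Residential Plot available in Suncity&lt;br /&gt; &lt;br&gt;SUNCITY PROJECT"
    print(to_plain_text(raw))
```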
19. One 500 MB file of XML
On a single line … to save space
Go grep yourself
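When grep is no help, a streaming parser can be. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library, which does not care whether the file has line breaks. The element name "listing" and the flat child structure are assumptions; a real feed will differ.

```python
import io
import xml.etree.ElementTree as ET


def iter_listings(source, tag="listing"):
    """Yield one dict per <listing> element without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy the data out before clearing the element.
            yield {child.tag: (child.text or "") for child in elem}
            # Free the element's children so memory use stays modest.
            elem.clear()


if __name__ == "__main__":
    data = b"<feed><listing><id>1</id></listing><listing><id>2</id></listing></feed>"
    for listing in iter_listings(io.BytesIO(data)):
        print(listing)
```

`source` can be a file name or an open binary file object. The emptied elements stay attached to the root, so for truly huge files you may also want to detach them from their parent.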
20. Newlines, newlines,
newlines
Choose your delimiter wisely - ^B
So simple even a child could get it wrong
Microsoft quotes vs. ASCII quotes
Excel vs. CSV
CSV, LOL
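A small sketch of two of the points above, using Python's `csv` module: a quoted field may contain newlines, and the delimiter can be any single character, including ^B (`\x02`). The function names are illustrative. Quote normalisation is applied to field values after parsing, on the assumption that the curly quotes are content and not the file's quoting characters.

```python
import csv
import io

# Map Microsoft "smart" quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_quotes(value):
    """Replace curly quotes with plain ASCII quotes."""
    return value.translate(QUOTE_MAP)


def read_rows(text, delimiter="\x02"):
    """Parse delimited text; embedded newlines inside quoted fields survive."""
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [[normalize_quotes(field) for field in row] for row in reader]


if __name__ == "__main__":
    sample = 'id\x02description\r\n1\x02"two\nlines"\r\n'
    print(read_rows(sample))
```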
21. Them: “We will send the data in X format” (where X is a large industry player)
Us: “Not even X uses that format”
Them: “We use X format, but changed it slightly so we could …”
Us: *sigh*
Wrong tool for the right job
22. Are they really unique?
Are they unique across time?
Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
Unique identifiers
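One defensive approach, sketched under assumptions: prefix every partner id with its source so ids from different partners cannot collide, and keep a fingerprint of fields that should not change for a given property. If the same id comes back with a different fingerprint, the id was probably reused. The field names ("address", "property_type") and function names are hypothetical.

```python
import hashlib


def listing_key(source, partner_id):
    """Namespace the partner's id with the source it came from."""
    return f"{source}:{partner_id}"


def fingerprint(listing, fields=("address", "property_type")):
    """Hash the fields that should stay stable for one property."""
    raw = "|".join(str(listing.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def find_reused_ids(seen, source, listings):
    """Return keys whose stable fields changed since they were last seen."""
    reused = []
    for listing in listings:
        key = listing_key(source, listing["id"])
        fp = fingerprint(listing)
        if key in seen and seen[key] != fp:
            reused.append(key)
        seen[key] = fp  # remember the latest fingerprint
    return reused
```

`seen` is a plain dict here; in practice it would have to persist between feed runs, since reuse across time is the whole problem.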
23. I’m ranting
Topics we haven’t yet even touched upon:
Character encodings
Geocoding / Parsing addresses
Image processing/classification at scale
Parsing free text descriptions
Deduplication
Too many other things to list here
24. Never trust, check everything, every single time
Tests, tests, tests, tests
Embrace UNIX philosophy of many small tools in a chain
Reuse rather than reinvent (but not always)
Technology helps manage the problem; it is not “the solution”.
Problems are almost always cultural, not technical
What have we learned?
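"Check everything, every single time" can start as one small tool in the chain. A sketch, with hypothetical field names and rules: a validator that returns a list of problems for each listing instead of raising, so one bad record does not stop the feed and the problems can be reported back to the partner.

```python
def validate_listing(listing):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    price = listing.get("price")
    try:
        if float(price) <= 0:
            problems.append("price must be positive")
    except (TypeError, ValueError):
        problems.append(f"price is not a number: {price!r}")

    # Coordinates must be numeric and inside the valid range.
    for name, low, high in (("lat", -90, 90), ("lon", -180, 180)):
        value = listing.get(name)
        try:
            if not low <= float(value) <= high:
                problems.append(f"{name} out of range: {value!r}")
        except (TypeError, ValueError):
            problems.append(f"{name} is not a number: {value!r}")

    if not str(listing.get("title") or "").strip():
        problems.append("title is empty")

    return problems
```

Each check doubles as a test case: feed the validator known-good and known-bad records and assert on the problems it reports.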
26. Tricked you - there is of course no single perfect solution
The closest thing is dialog, ideally face to face.
People generally want to do the right thing; they need help to know why and how to do it.
One five-minute conversation is often more useful than five months of email
The solution
27. Unless you hate life, do NOT try to scrape real estate data
Re-read the line above.
Our API: http://nestoria.com/api
One more thing
28. http://nestoria.com and http://nestoria.com/api
http://devblog.nestoria.com - our dev blog
http://www.lokku.com - our parent company
http://opencagedata.com - all your geocoding are belong to us
Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
Slides will be on http://slideshare.net/lokku later today
Learn more