A living hell - lessons learned in eight years of parsing real estate data

lokku
A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle
CSVConf Berlin
15 July 2014
Residential property search engine in nine markets
3-4 million unique users per month
Processing close to 20M listings daily
Extensive experience / painful lessons in ETL, geocoding, deduping, ...
http://www.nestoria.com
What we do
Real estate is a complex, high-value transaction. Our goal is to be:
Simple
Comprehensive
Fast (user time and time to market)
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Plenty of chances for data to go bad
Where we do it
India
Very, very good at:
Cricket
Amazing cuisine
World’s largest democracy
Too many other things to list here
Utterly fucking terrible at:
Real Estate data quality
Addresses / Geodata
Must garbage in be garbage out?
Can we turn multiple bits of shit into something useful?
What we really do
something
useful
Chaos
Caveat: I love our clients
All the examples you are about to see are theoretical *wink, wink*
Examples / Horror stories
Us: “Please set up an automated data transfer. Thx!”
Them: “It’s impossible to export the data from the database”
Them: “Just crawl our website”
Them: “Let’s do incremental updates to save bandwidth”
Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”
Getting the data
zip or tar full of subdirs, names of which change with each upload
filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
One file per agent; when a file is not supplied, there is no way to know whether it is missing due to an error or intentionally
Format A on Monday, B on Tuesday, ...
Fun with files
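When the directory names inside an archive change with every upload and filenames carry query strings, one option is to ignore the path entirely and match on the cleaned-up base name. A minimal sketch, assuming zip uploads and feeds that end in ".xml"; the function name and the archive layout in the example are made up for illustration:

```python
import io
import zipfile
from pathlib import PurePosixPath

def find_feed_members(archive, suffix=".xml"):
    """Return archive members that look like feed files, whatever directory they sit in."""
    found = []
    for name in archive.namelist():
        if name.endswith("/"):
            continue  # skip directory entries
        # Drop anything after "?" so "feed.xml?key=..." still counts as "feed.xml".
        base = PurePosixPath(name).name.split("?", 1)[0]
        if base.lower().endswith(suffix):
            found.append(name)
    return sorted(found)

# Build a small in-memory archive with a hypothetical layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("upload_2014_07_15/agent1/feed.xml?key=abc", "<feed/>")
    zf.writestr("upload_2014_07_15/readme.txt", "hello")

with zipfile.ZipFile(buffer) as zf:
    members = find_feed_members(zf)
```

This does nothing for the missing-file problem: an absent file still looks the same whether it was dropped by accident or on purpose.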
<Description>Residential Plot available in Suncity&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;A complete township...
"&amp;gt;" - for when you really, really want to be sure you've escaped your XML
&#13; anyone?
XML, LOL
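One way to cope with multiply-escaped markup like the sample above is to unescape repeatedly until the text stops changing. A minimal sketch using Python's standard `html.unescape`; the function name, the round limit and the shortened sample string are illustrative assumptions, not what Nestoria runs:

```python
import html

def unescape_fully(text, max_rounds=5):
    """Unescape HTML/XML entities repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

# Shortened version of the description above, as it appears in the raw file.
sample = "Suncity&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT"

# "&#13;" decodes to a carriage return, which is stripped afterwards.
cleaned = unescape_fully(sample).replace("\r", "")
```

The trade-off: text that was meant to contain a literal "&amp;lt;" gets over-decoded too, so this only suits feeds already known to be over-escaped.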
One 500 MB file of XML
On a single line … to save space
Go grep yourself
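A file with no newlines defeats line-oriented tools, but a streaming XML parser does not care about line breaks. A minimal sketch with the standard library's `iterparse`, assuming each record is a `<listing>` element with flat child fields (a hypothetical structure; the slides do not show the real one):

```python
import io
import xml.etree.ElementTree as ET

def iter_listings(stream, tag="listing"):
    """Yield one dict per <tag> element without reading the whole file at once."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()  # drop this element's children and text once consumed

# A tiny single-line stand-in for the 500 MB file.
feed = io.BytesIO(
    b"<feed><listing><id>1</id><price>100</price></listing>"
    b"<listing><id>2</id><price>250</price></listing></feed>"
)
listings = list(iter_listings(feed))
```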
Newlines, newlines, newlines
Choose your delimiter wisely - ^B
So simple even a child could get it wrong
Microsoft quotes vs. ASCII quotes
Excel vs. CSV
CSV, LOL
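Embedded newlines and odd delimiters are survivable if a real CSV parser does the work instead of splitting on line breaks by hand. A minimal sketch with Python's `csv` module, assuming ^B (`\x02`) as the delimiter and standard ASCII double quotes around fields; the function name and sample data are made up:

```python
import csv
import io

def read_rows(text, delimiter="\x02", expected_columns=None):
    """Parse delimited text, letting quoted fields span several lines."""
    # newline="" keeps line endings inside quoted fields untouched.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = []
    for row in reader:
        if expected_columns is not None and len(row) != expected_columns:
            raise ValueError(
                f"line {reader.line_num}: expected {expected_columns} columns, got {len(row)}"
            )
        rows.append(row)
    return rows

raw = 'id\x02description\n1\x02"Two bed flat\nnear station"\n2\x02"Plot, corner"\n'
rows = read_rows(raw, expected_columns=2)
```

Checking the column count on every row is a cheap way to notice when the sender's quoting has gone wrong. Fields wrapped in Microsoft "smart" quotes are not handled here and need separate treatment.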
Them: “We will send the data in X (where X is a large industry player) format”
Us: “Not even X uses that format”
Them: “We use X format, but changed it slightly so we could ….”
Us *sigh*
Wrong tool for right job
Are they really unique?
Are they unique across time?
Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
Unique identifiers
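One heuristic for spotting a recycled id is to compare fields that should never change for a given property between two snapshots of the same feed. A minimal sketch, assuming each snapshot is a dict keyed by the partner's id; the field names `address` and `property_type` are hypothetical:

```python
def find_reused_ids(previous, current, stable_fields=("address", "property_type")):
    """Return ids present in both snapshots whose 'stable' fields differ."""
    reused = []
    for listing_id, new in current.items():
        old = previous.get(listing_id)
        if old is None:
            continue  # genuinely new id
        if any(old.get(field) != new.get(field) for field in stable_fields):
            reused.append(listing_id)
    return sorted(reused)

monday = {
    "101": {"address": "12 Hill Road", "property_type": "flat", "price": 100},
    "102": {"address": "3 Lake View", "property_type": "plot", "price": 80},
}
tuesday = {
    "101": {"address": "12 Hill Road", "property_type": "flat", "price": 95},
    "102": {"address": "77 Market Street", "property_type": "house", "price": 300},
}
suspects = find_reused_ids(monday, tuesday)
```

A price change alone is not flagged, since prices legitimately move. A corrected address would be flagged, so the output is a list of suspects to review, not a verdict.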
I’m ranting
Topics we haven’t yet even touched upon:
Character encodings
Geocoding / Parsing addresses
Image processing/classification at scale
Parsing free text descriptions
Deduplication
Too many other things to list here
Never trust, check everything, every single time
Tests, tests, tests, tests
Embrace UNIX philosophy of many small tools in a chain
Reuse rather than reinvent (but not always)
Technology helps manage the problem, it is not “the solution”.
Problems are almost always cultural not technical
What have we learned?
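The "many small tools in a chain" idea carries over to code as a pipeline of small generator stages, each doing one job and checking its input. A minimal sketch; the stage names, the two-field record format and the sample lines are all invented for illustration:

```python
def parse(lines):
    """Split 'id, price' lines into raw records."""
    for line in lines:
        listing_id, price = line.split(",")
        yield {"id": listing_id.strip(), "price": price.strip()}

def validate(records, rejects):
    """Pass on records with an id and a numeric price; collect the rest."""
    for record in records:
        if record["id"] and record["price"].isdigit():
            yield record
        else:
            rejects.append(record)

def normalize(records):
    """Convert validated fields to their final types."""
    for record in records:
        yield {"id": record["id"], "price": int(record["price"])}

rejects = []
lines = ["a1, 1000", "a2, n/a", ", 500"]
result = list(normalize(validate(parse(lines), rejects)))
```

Keeping the rejects rather than silently dropping them gives something concrete to bring to the conversation with the data supplier.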
Misaligned incentives
Technology laggards
Apathy
Ignorance
Why do they hate us?
Tricked you - there is of course no single perfect solution
Closest thing is dialog, ideally face to face.
People generally want to do the right thing; they need help to know why and how to do it.
One five minute conversation often more useful than five months of email
The solution
Unless you hate life, do NOT try to scrape real estate data
Re-read the line above.
Our API: http://nestoria.com/api
One more thing
http://nestoria.com and http://nestoria.com/api
http://devblog.nestoria.com - our dev blog
http://www.lokku.com - our parent company
http://opencagedata.com - all your geocoding are belong to us
Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
Slides will be on http://slideshare.net/lokku later today
Learn more