SlideShare a Scribd company logo
1 of 28
Download to read offline
A living hell: lessons learned in eight years of processing real estate listings
Ed Freyfogle
CSVConf Berlin
15 July 2014
Residential property search engine in nine markets
3-4 million unique users per month
Processing close to 20M listings daily
Extensive experience / painful lessons in ETL, geocoding, deduping, ...
http://www.nestoria.com
What we do
Real estate is complex, high value transaction. Our goal is :
Simple
Comprehensive
Fast (user time and time to market)
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Where does the data come from?
Seller
Agent 1
Agent 2
Agent 3
Portal 1
Portal 2
Portal 3
Plenty of chances for data to go bad
Where we do it
India
Very, very good at:
Cricket
Amazing cuisine
World’s largest democracy
Too many other things to list here
India
Very, very good at:
Cricket
Amazing cuisine
World’s largest democracy
Too many other things to list here
Utterly fucking terrible at:
Real Estate data quality
Addresses / Geodata
Must garbage in be garbage out?
Can we turn multiple bits of shit into something useful?
What we really do
Must garbage in be garbage out?
Can we turn multiple bits of shit into something useful?
What we really do
something
useful
Chaos
Caveat: I love our clients
All the examples you are about to see are all theoretical *wink, wink*
Examples / Horror stories
Us: “Please set up an automated data transfer. Thx!”
Them: “It’s impossible to export the data from the database”
Them: “Just crawl our website”
Them: “Let’s do incremental updates to save bandwidth”
Them: “I’ll just send you an email when there is new stuff … starting when I get back from
holiday”
Getting the data
zip or tar full of subdirs, names of which change with each upload
filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”
One file per agent, when file not supplied no way to know if missing due to error or
intentionally
Format A on Monday, B on Tuesday, ...
Fun with files
<Description>Residential Plot available in Suncity&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br
/&amp;gt;&#13;&amp;lt;br&amp;gt;A complete township...
"&amp;gt;" - for when you really, really want to be sure you've escaped
your XML
&#13; anyone?
XML, LOL
One 500 MB file of XML
On a single line … to save space
Go grep yourself
Newlines, newlines,
newlines
Choose your delimiter wisely - ^B
So simple even a child could get it wrong
Microsoft quotes vs. ASCII quotes
Excel vs. CSV
CSV, LOL
Them “we will send the data in X (where X is large industry player) format”
Us “not even X uses that format”
Them “We use X format, but changed it slightly so we could ….”
Us *sigh*
Wrong tool for right job
Are they really unique?
Are the unique across time?
Partner re-uses numeric unique ids … in case there is ever a shortage of numbers
Unique identifiers
I’m ranting
Topics we haven’t yet even touched upon:
Character encodings
Geocoding / Parsing addresses
Image processing/classification at scale
Parsing free text descriptions
Deduplication
Too many other things to list here
Never trust, check everything, every single time
Tests, tests, tests, tests
Embrace UNIX philosophy of many small tools in a chain
Reuse rather than reinvent (but not always)
Technology helps manage the problem, it is not “the solution”.
Problems are almost always cultural not technical
What have we learned?
Misaligned incentives
Technology laggards
Apathy
Ignorance
Why do they hate us?
Tricked you - there is of course no single perfect solution
Closest thing is dialog, ideally face to face.
People generally want to do right thing, need help to know why and how to do it.
One five minute conversation often more useful than five months of email
The solution
Unless you hate life, do NOT try to scrape real estate data
Re-read the line above.
Our API: http://nestoria.com/api
One more thing
http://nestoria.com and http://nestoria.com/api
http://devblog.nestoria.com - our dev blog
http://www.lokku.com - our parent company
http://opencagedata.com - all your geocoding are belong to us
Twitter: @nestoria, @lokku, @opencagedata, @freyfogle
Slides will be on http://slideshare.net/lokku later today
Learn more

More Related Content

Similar to A living hell - lessons learned in eight years of parsing real estate data

Roelof Temmingh FIRST07 slides
Roelof Temmingh FIRST07 slidesRoelof Temmingh FIRST07 slides
Roelof Temmingh FIRST07 slidesLeon Kuunders
 
Drew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data ScienceDrew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data Sciencemortardata
 
Apps as Machines — at FH Potsdam
Apps as Machines — at FH PotsdamApps as Machines — at FH Potsdam
Apps as Machines — at FH PotsdamMartin Jordan
 
Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Diego Oppenheimer
 
From DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysFrom DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysOri Pekelman
 
OpenFest 2012 : Leveraging the public internet
OpenFest 2012 : Leveraging the public internetOpenFest 2012 : Leveraging the public internet
OpenFest 2012 : Leveraging the public internettkisason
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️Ori Pekelman
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerVMware Tanzu
 
What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century Human Capital Media
 
Cybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 SofiaCybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 SofiaSteve Poole
 
Log Mining: Beyond Log Analysis
Log Mining: Beyond Log AnalysisLog Mining: Beyond Log Analysis
Log Mining: Beyond Log AnalysisAnton Chuvakin
 
Narrative Essay If I Were Invisible
Narrative Essay If I Were InvisibleNarrative Essay If I Were Invisible
Narrative Essay If I Were InvisiblePatty Loen
 
Pc magazine january 2015 usa
Pc magazine   january 2015  usaPc magazine   january 2015  usa
Pc magazine january 2015 usaNhóc Nhóc
 
State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)Liad Magen
 
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...Dr. Haxel Consult
 
How Did We End up Here?
 How Did We End up Here? How Did We End up Here?
How Did We End up Here?C4Media
 

Similar to A living hell - lessons learned in eight years of parsing real estate data (20)

Roelof Temmingh FIRST07 slides
Roelof Temmingh FIRST07 slidesRoelof Temmingh FIRST07 slides
Roelof Temmingh FIRST07 slides
 
Drew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data ScienceDrew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data Science
 
Apps as Machines — at FH Potsdam
Apps as Machines — at FH PotsdamApps as Machines — at FH Potsdam
Apps as Machines — at FH Potsdam
 
Better the devil you know
Better the devil you knowBetter the devil you know
Better the devil you know
 
Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"
 
From DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysFrom DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed Apidays
 
Talk_AI_in_Africa
Talk_AI_in_AfricaTalk_AI_in_Africa
Talk_AI_in_Africa
 
OpenFest 2012 : Leveraging the public internet
OpenFest 2012 : Leveraging the public internetOpenFest 2012 : Leveraging the public internet
OpenFest 2012 : Leveraging the public internet
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
 
What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century What your employees need to learn to work with data in the 21 st century
What your employees need to learn to work with data in the 21 st century
 
Cybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 SofiaCybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 Sofia
 
Log Mining: Beyond Log Analysis
Log Mining: Beyond Log AnalysisLog Mining: Beyond Log Analysis
Log Mining: Beyond Log Analysis
 
Narrative Essay If I Were Invisible
Narrative Essay If I Were InvisibleNarrative Essay If I Were Invisible
Narrative Essay If I Were Invisible
 
Pc magazine january 2015 usa
Pc magazine   january 2015  usaPc magazine   january 2015  usa
Pc magazine january 2015 usa
 
State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)
 
Tf wiads
Tf wiadsTf wiads
Tf wiads
 
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
 
How Did We End up Here?
 How Did We End up Here? How Did We End up Here?
How Did We End up Here?
 

More from lokku

Geocoding Overview
Geocoding OverviewGeocoding Overview
Geocoding Overviewlokku
 
OpenCage Data and sustainable business models for open data
OpenCage Data and sustainable business models for open data OpenCage Data and sustainable business models for open data
OpenCage Data and sustainable business models for open data lokku
 
Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014
Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014
Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014lokku
 
Geo-search-location-based-results-for-site-search
Geo-search-location-based-results-for-site-searchGeo-search-location-based-results-for-site-search
Geo-search-location-based-results-for-site-searchlokku
 
Geocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR event
Geocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR eventGeocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR event
Geocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR eventlokku
 
Nestoria new design
Nestoria new designNestoria new design
Nestoria new designlokku
 
CSS::SpriteMaker in action!
CSS::SpriteMaker in action!CSS::SpriteMaker in action!
CSS::SpriteMaker in action!lokku
 
Reducing the technical hurdle - why we started OpenCage Data
Reducing the technical hurdle - why we started OpenCage DataReducing the technical hurdle - why we started OpenCage Data
Reducing the technical hurdle - why we started OpenCage Datalokku
 
Css sprite_maker-1
Css  sprite_maker-1Css  sprite_maker-1
Css sprite_maker-1lokku
 
Nestoria case study - The effective use of geo-data for search marketing
Nestoria case study - The effective use of geo-data for search marketingNestoria case study - The effective use of geo-data for search marketing
Nestoria case study - The effective use of geo-data for search marketinglokku
 
The Nestoria GeoChallenge
The Nestoria GeoChallengeThe Nestoria GeoChallenge
The Nestoria GeoChallengelokku
 
Geo-Data for Search Marketing SEM & SEO
Geo-Data for Search Marketing SEM & SEOGeo-Data for Search Marketing SEM & SEO
Geo-Data for Search Marketing SEM & SEOlokku
 
Making using OSM data simpler - OpenCage Data
Making using OSM data simpler - OpenCage Data Making using OSM data simpler - OpenCage Data
Making using OSM data simpler - OpenCage Data lokku
 
What’s next in mapping for portals? ppw2012
What’s next in mapping for portals? ppw2012What’s next in mapping for portals? ppw2012
What’s next in mapping for portals? ppw2012lokku
 
How Nestoria switched to OpenStreetMap maps
How Nestoria switched to OpenStreetMap mapsHow Nestoria switched to OpenStreetMap maps
How Nestoria switched to OpenStreetMap mapslokku
 
Remote Geocoding
Remote GeocodingRemote Geocoding
Remote Geocodinglokku
 
Mapstraction
MapstractionMapstraction
Mapstractionlokku
 
Bar Camp London 7
Bar Camp London 7Bar Camp London 7
Bar Camp London 7lokku
 
The path ahead for property portals
The path ahead for property portalsThe path ahead for property portals
The path ahead for property portalslokku
 
How People Search For Locations
How People Search For LocationsHow People Search For Locations
How People Search For Locationslokku
 

More from lokku (20)

Geocoding Overview
Geocoding OverviewGeocoding Overview
Geocoding Overview
 
OpenCage Data and sustainable business models for open data
OpenCage Data and sustainable business models for open data OpenCage Data and sustainable business models for open data
OpenCage Data and sustainable business models for open data
 
Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014
Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014
Presenting the OpenCage Geocoder at #londonapi 17 Sept 2014
 
Geo-search-location-based-results-for-site-search
Geo-search-location-based-results-for-site-searchGeo-search-location-based-results-for-site-search
Geo-search-location-based-results-for-site-search
 
Geocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR event
Geocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR eventGeocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR event
Geocoding India - talk delivered on 31 Jan 2014 at the Bangalore goeBLR event
 
Nestoria new design
Nestoria new designNestoria new design
Nestoria new design
 
CSS::SpriteMaker in action!
CSS::SpriteMaker in action!CSS::SpriteMaker in action!
CSS::SpriteMaker in action!
 
Reducing the technical hurdle - why we started OpenCage Data
Reducing the technical hurdle - why we started OpenCage DataReducing the technical hurdle - why we started OpenCage Data
Reducing the technical hurdle - why we started OpenCage Data
 
Css sprite_maker-1
Css  sprite_maker-1Css  sprite_maker-1
Css sprite_maker-1
 
Nestoria case study - The effective use of geo-data for search marketing
Nestoria case study - The effective use of geo-data for search marketingNestoria case study - The effective use of geo-data for search marketing
Nestoria case study - The effective use of geo-data for search marketing
 
The Nestoria GeoChallenge
The Nestoria GeoChallengeThe Nestoria GeoChallenge
The Nestoria GeoChallenge
 
Geo-Data for Search Marketing SEM & SEO
Geo-Data for Search Marketing SEM & SEOGeo-Data for Search Marketing SEM & SEO
Geo-Data for Search Marketing SEM & SEO
 
Making using OSM data simpler - OpenCage Data
Making using OSM data simpler - OpenCage Data Making using OSM data simpler - OpenCage Data
Making using OSM data simpler - OpenCage Data
 
What’s next in mapping for portals? ppw2012
What’s next in mapping for portals? ppw2012What’s next in mapping for portals? ppw2012
What’s next in mapping for portals? ppw2012
 
How Nestoria switched to OpenStreetMap maps
How Nestoria switched to OpenStreetMap mapsHow Nestoria switched to OpenStreetMap maps
How Nestoria switched to OpenStreetMap maps
 
Remote Geocoding
Remote GeocodingRemote Geocoding
Remote Geocoding
 
Mapstraction
MapstractionMapstraction
Mapstraction
 
Bar Camp London 7
Bar Camp London 7Bar Camp London 7
Bar Camp London 7
 
The path ahead for property portals
The path ahead for property portalsThe path ahead for property portals
The path ahead for property portals
 
How People Search For Locations
How People Search For LocationsHow People Search For Locations
How People Search For Locations
 

Recently uploaded

pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfJOHNBEBONYAP1
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋nirzagarg
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdfMatthew Sinclair
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableSeo
 
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...tanu pandey
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrHenryBriggs2
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdfMatthew Sinclair
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdfMatthew Sinclair
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...SUHANI PANDEY
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)Delhi Call girls
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 

Recently uploaded (20)

pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
💚😋 Bilaspur Escort Service Call Girls, 9352852248 ₹5000 To 25K With AC💚😋
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...Katraj ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For S...
Katraj ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For S...
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Pollachi 7001035870 Whatsapp Number, 24/07 Booking
 
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
Thalassery Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call G...
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
📱Dehradun Call Girls Service 📱☎️ +91'905,3900,678 ☎️📱 Call Girls In Dehradun 📱
 

A living hell - lessons learned in eight years of parsing real estate data

  • 1. A living hell: lessons learned in eight years of processing real estate listings Ed Freyfogle CSVConf Berlin 15 July 2014
  • 2. Residential property search engine in nine markets 3-4 million unique users per month Processing close to 20M listings daily Extensive experience / painful lessons in ETL, geocoding, deduping, ... http://www.nestoria.com
  • 3. What we do Real estate is complex, high value transaction. Our goal is : Simple Comprehensive Fast (user time and time to market)
  • 4.
  • 5. Where does the data come from? Seller Agent 1 Agent 2 Agent 3
  • 6. Where does the data come from? Seller Agent 1 Agent 2 Agent 3 Portal 1 Portal 2 Portal 3
  • 7. Where does the data come from? Seller Agent 1 Agent 2 Agent 3 Portal 1 Portal 2 Portal 3
  • 8. Where does the data come from? Seller Agent 1 Agent 2 Agent 3 Portal 1 Portal 2 Portal 3
  • 9. Where does the data come from? Seller Agent 1 Agent 2 Agent 3 Portal 1 Portal 2 Portal 3 Plenty of chances for data to go bad
  • 11. India Very, very good at: Cricket Amazing cuisine World’s largest democracy Too many other things to list here
  • 12. India Very, very good at: Cricket Amazing cuisine World’s largest democracy Too many other things to list here Utterly fucking terrible at: Real Estate data quality Addresses / Geodata
  • 13. Must garbage in be garbage out? Can we turn multiple bits of shit into something useful? What we really do
  • 14. Must garbage in be garbage out? Can we turn multiple bits of shit into something useful? What we really do something useful Chaos
  • 15. Caveat: I love our clients All the examples you are about to see are all theoretical *wink, wink* Examples / Horror stories
  • 16. Us: “Please set up an automated data transfer. Thx!” Them: “It’s impossible to export the data from the database” Them: “Just crawl our website” Them: “Let’s do incremental updates to save bandwidth” Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday” Getting the data
  • 17. zip or tar full of subdirs, names of which change with each upload filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc” One file per agent, when file not supplied no way to know if missing due to error or intentionally Format A on Monday, B on Tuesday, ... Fun with files
  • 18. <Description>Residential Plot available in Suncity&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br /&amp;gt;&#13;&amp;lt;br&amp;gt;A complete township... "&amp;gt;" - for when you really, really want to be sure you've escaped your XML &#13; anyone? XML, LOL
  • 19. One 500 MB file of XML On a single line … to save space Go grep yourself
  • 20. Newlines, newlines, newlines Choose your delimiter wisely - ^B So simple even a child could get it wrong Microsoft quotes vs. ASCII quotes Excel vs. CSV CSV, LOL
  • 21. Them “we will send the data in X (where X is large industry player) format” Us “not even X uses that format” Them “We use X format, but changed it slightly so we could ….” Us *sigh* Wrong tool for right job
  • 22. Are they really unique? Are the unique across time? Partner re-uses numeric unique ids … in case there is ever a shortage of numbers Unique identifiers
  • 23. I’m ranting Topics we haven’t yet even touched upon: Character encodings Geocoding / Parsing addresses Image processing/classification at scale Parsing free text descriptions Deduplication Too many other things to list here
  • 24. Never trust, check everything, every single time Tests, tests, tests, tests Embrace UNIX philosophy of many small tools in a chain Reuse rather than reinvent (but not always) Technology helps manage the problem, it is not “the solution”. Problems are almost always cultural not technical What have we learned?
  • 26. Tricked you - there is of course no single perfect solution Closest thing is dialog, ideally face to face. People generally want to do right thing, need help to know why and how to do it. One five minute conversation often more useful than five months of email The solution
  • 27. Unless you hate life, do NOT try to scrape real estate data Re-read the line above. Our API: http://nestoria.com/api One more thing
  • 28. http://nestoria.com and http://nestoria.com/api http://devblog.nestoria.com - our dev blog http://www.lokku.com - our parent company http://opencagedata.com - all your geocoding are belong to us Twitter: @nestoria, @lokku, @opencagedata, @freyfogle Slides will be on http://slideshare.net/lokku later today Learn more