Talk given at the 2015 Fall Regional in Oshkosh WI.
"An Approach to Address Parsing and Data Standardization"
Abstract:
Maintaining fully parsed address elements in your database can be one of the most beneficial steps toward
achieving quality and consistency in addressing. Parsed address elements also serve as a preparatory step in
modeling an address toward NG9-1-1-supporting formats such as the FGDC address standard. In this talk,
we’ll take a look at the approach we’ve used for parsing site addresses for the V1 Statewide Parcel Map, the
role regular expressions played in this approach, and will unveil a suite of (free) ArcPy tools that can help you
parse addresses, standardize field values, and achieve other tasks.
Presenters:
Codie See
David Vogel
1. An Approach to
Address Parsing
and Data
Standardization
Codie See
David Vogel
WLIA Fall Regional
Conference – Oshkosh, WI
October 2015
3. A short history of parsing
Wisconsin addresses at SCO…
• LinkWISCONSIN Address Point and Parcel Mapping Project
- Built understanding of FGDC address standard
- Built understanding of Wisconsin Addresses
- Built a tool to handle this as flexibly as possible
• V1 Statewide Parcel Project
- Improved understandings
- Improved upon our parsing tool
…So we had a Wisconsin parsing tool, but it was at its tipping
point….
4. … and then one day on
GitHub
Parserator – a Python toolkit for making domain-specific
probabilistic parsers.
• Tendency-Based Parsing, not Rule-Based Parsing
• Trainable to a specific domain
• A flexible framework to build your own parser –
not just for addresses, but anything really!
5. Parserator - usaddress
usaddress - a child project built on Parserator:
https://github.com/datamade/usaddress
• Impressive out of the box performance
• Embraces the FGDC-endorsed US Postal
Address Data Standard
• Which is well suited for NG9-1-1 and
adopted by the parcel initiative schema.
6. Rules … Tendencies
A typical parser is often rigid, adhering to
very discrete & specific classifications…
…how do we anticipate deviations from the norm?
Statistically-driven educated guesses, based on 3 concepts:
- Tokenizing the input: 2554 | CTH | J
- Relative order of tokens: 2554 | CTH | J
- Content of tokens: 2554 | CTH | J
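The three concepts above can be sketched in plain Python. This is a minimal illustration of the kinds of evidence a probabilistic parser weighs, not the usaddress implementation:

```python
def tokenize(address):
    """Split an address string into whitespace-delimited tokens."""
    return address.split()

def token_features(tokens):
    """For each token, record its relative order and simple content cues:
    the raw evidence a statistically-driven parser can weigh."""
    features = []
    for i, tok in enumerate(tokens):
        features.append({
            "token": tok,
            "position": i,                  # relative order of the token
            "is_numeric": tok.isdigit(),    # content: all digits?
            "is_alpha": tok.isalpha(),      # content: all letters?
            "length": len(tok),
        })
    return features

feats = token_features(tokenize("2554 CTH J"))
# "2554" is numeric and first, so it tends to be an address number;
# "CTH" and "J" are alphabetic and follow it, so they tend to be street name parts.
```

A trained model turns tendencies like these into label probabilities rather than hard rules, which is what lets it absorb deviations from the norm.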
9. Training: Process Overview
Address Parsing Tool Uses:
• Trained CRFsuite file (the statistical portion of the parse) – consumed by
usaddress.py
• Hard coded expressions
• Regex for grid addresses
• Directionals
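The hard-coded portion can be sketched as follows. The grid-number pattern and the directional set here are illustrative assumptions; the talk does not show the tool's actual expressions:

```python
import re

# Matches grid-style address numbers such as "W204N11912" or "N89W16758":
# a directional letter, digits, a second directional letter, more digits.
# Illustrative only; the tool's actual regex may differ.
GRID_NUMBER = re.compile(r"^[NSEW]\d+[NSEW]\d+$", re.IGNORECASE)

# Common directional abbreviations, checked after stripping periods.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}

def is_grid_number(token):
    return bool(GRID_NUMBER.match(token))

def is_directional(token):
    return token.upper().strip(".") in DIRECTIONALS
```

Handling these cases with fixed expressions keeps them out of the statistical model, where rare but perfectly regular patterns are hard to learn.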
10. The tool is based on ~2,000 addresses
(the number of records in the training data)
GOAL:
• Produce the best results with the least amount
of training data
We focused on selecting addresses for the training data that accounted
for the greatest number of addresses across the state,
then shifted our focus to more specific addresses and special
cases where we noticed issues occurring.
11. Element Focused Training
Created training files specific to
particular elements
Street Types
Unit Types & Unit IDs
Address Number Suffixes
Uncaught Street Names
12. Workflow of Training Process
After initially adding our
state specific training data,
we went through the data
provided with the library
and corrected issues that
were resulting in incorrect
parses.
**This was the most
time-consuming part of
developing this tool.
13. Wisconsin has 2.28+ million site
addresses associated with parcels!
-The tool does an impressive job flexibly parsing these addresses
-BUT: it is not feasible to accommodate all potential address
options
-We built four additional flag fields into the output to help identify where errors or incorrect
parses may have occurred & what the issue may be
Flags include:
1. Parse Error Flag (identifies addresses the parser was unable to parse)
2. Extraneous Data Flag (identifies data not commonly found in address elements)
3. Character Flag (identifies improper or uncommon special characters)
4. Incomplete Data Flag (identifies addresses that appear to be missing elements)
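A sketch of how flag fields like these could be computed. The field names and rules below are illustrative stand-ins, not the tool's actual logic:

```python
import re

def flag_address(parsed, raw):
    """Return a dict of flag fields for one address.
    `parsed` maps element names to parsed values; `raw` is the input string.
    The checks below are simplified examples of the four flag categories."""
    return {
        # Parse error: the parser produced no elements at all
        "PARSE_ERROR": not parsed,
        # Character flag: special characters uncommon in addresses
        "CHAR_FLAG": bool(re.search(r"[^A-Za-z0-9\s\.\-#/&']", raw)),
        # Incomplete data: missing an address number or street name
        "INCOMPLETE": not (parsed.get("AddressNumber") and parsed.get("StreetName")),
    }

flags = flag_address({"AddressNumber": "2554", "StreetName": "CTH J"}, "2554 CTH J")
```

Flagging rather than rejecting lets the 2.28+ million records flow through in bulk while the problem cases surface for review.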
14. Other Tools:
XML PARSING TOOL
• Input: Directory of County DOR XML Files
• Converts DOR validated data to .dbf format
• Note: FMKV still needs to be joined after dbf creation
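The core of that conversion step can be sketched with the standard library. The element names in the sample are assumptions for illustration; the real DOR schema and the .dbf write (via ArcPy) are not shown:

```python
import xml.etree.ElementTree as ET

def xml_records(xml_text, record_tag, fields):
    """Yield one dict per record element, pulling the listed child fields.
    Missing fields come back as empty strings, as a .dbf column would."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(record_tag):
        yield {f: (rec.findtext(f) or "") for f in fields}

# Hypothetical sample; actual DOR element names may differ.
sample = """<Parcels>
  <Parcel><PIN>123</PIN><SiteAddress>2554 CTH J</SiteAddress></Parcel>
</Parcels>"""

rows = list(xml_records(sample, "Parcel", ["PIN", "SiteAddress"]))
```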
STANDARDIZE TOOL
• Efficient method for standardizing various attributes
• Leverages the InMemory workspace to perform the standardization quickly
• Developed for use with 1) Prefix, 2) Street Type, 3) Suffix
• Other Uses: School Districts, Class of Property, etc…
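The dictionary-lookup core of such a standardization pass might look like this. The mapping follows common USPS-style suffix abbreviations, but the table and function are illustrative, not the tool's own:

```python
# Illustrative lookup table mapping raw street-type values to a standard form.
STREET_TYPE_MAP = {
    "AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "ROAD": "RD", "RD": "RD",
    "DRIVE": "DR", "DR": "DR",
}

def standardize(value, mapping):
    """Return the standardized form of a field value, or the cleaned
    original if it is not in the lookup table."""
    key = value.strip().upper().rstrip(".")
    return mapping.get(key, key)

out = standardize(" Avenue. ", STREET_TYPE_MAP)
```

Because it is just a table lookup, the same pass works for any coded attribute (school districts, class of property, and so on) by swapping in a different mapping.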
COMING SOON!!
• Condo Stack Tool
-Stacks relationally related condos using common PINs/join keys
-Estimated release: mid-November