Text Tagging with Finite State Transducers
David Smiley
Software Systems Engineer, Lead
Lucene/Solr Revolution 2013
© 2012 The MITRE Corporation. All rights reserved.
About David Smiley
• Working at MITRE for 13 years
• Web development, Java, search
• Published 1st book on Solr, then a 2nd edition (2009, 2011)
• Apache Lucene / Solr committer and PMC member (2012)
• Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
• Taught Solr classes at MITRE (2010, 2011, 2012)
• Solr search consultant within MITRE and for its sponsors, and privately
What is “Text Tagging” and “FSTs”?
• First, I need to establish the context:
  • JIEDDO’s OpenSextant project
  • Though this presentation is not about OpenSextant or geotagging
• Ultimately, I want to convey how cool Lucene’s FSTs are
  • And you may have a need for a text tagger
  • Or a geotagger (like OpenSextant)
OpenSextant
A DoD Funded Project: JIEDDO/COIC & NGA
Open Source approval recently obtained
OpenSextant Project
• A geotagging solution for unstructured text
• Finds place name references in natural language
  • “… I live near Boston …”
  • Finds “Boston” with input character offsets
  • Often resolves to multiple gazetteer entries: “Boston” has 73
• What’s a gazetteer?
  • A dictionary of place names with metadata such as latitude & longitude
How does it work?
The “Naïve” Tagger
• AKA “Text Tagger”
• Simply consults a dictionary/gazetteer; no fancy NLP
  • There’s nothing geospatial about it
  • Subsequent NLP processing eliminates low-confidence tags
• Actually, not so simple
  • Names vary in word length
  • Must find overlapping names
  • But not names within names
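The last two rules can be sketched as a filter over candidate matches. This is only an illustration, not OpenSextant's code: the span representation (word offsets, end exclusive), the function name `drop_nested`, and the sample phrase are all assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive


def drop_nested(spans: List[Span]) -> List[Span]:
    """Keep every span that is not fully contained in a different span."""
    unique = sorted(set(spans))
    kept = []
    for span in unique:
        nested = any(
            other != span and other[0] <= span[0] and span[1] <= other[1]
            for other in unique
        )
        if not nested:
            kept.append(span)
    return kept


# Hypothetical input "New York City Hall", word offsets 0..3.
candidates = [
    (0, 2),  # "New York"      -> inside "New York City", dropped
    (0, 3),  # "New York City" -> kept
    (1, 2),  # "York"          -> inside "New York City", dropped
    (2, 4),  # "City Hall"     -> overlaps but is not nested, kept
]
surviving = drop_nested(candidates)
print(surviving)
```

The quadratic scan is fine for a handful of candidates per sentence; a streaming tagger would apply the same rule while matching instead of afterwards.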
The Gazetteer
• 13 million place name records
• 8.1M distinct place names
  • Why not 13M?
    • Ambiguous names (e.g. San Diego)
    • Text analysis normalization (e.g. diacritic removal)
  • 2.8M are single-word names (about 1/3)
  • 2.3 avg. words / name
  • 14 avg. chars / name
3 Naïve Tagger Implementations
• GATE’s Tagger
  • In-memory Aho-Corasick string-matching algorithm
  • Requires an estimated 80 GB RAM !! (for our data)
  • FAST
• A JIEDDO-developed MySQL-based Tagger
  • “Reasonable” RAM requirements: ~4 GB
  • SLOW (~15x, 20x? not certain). ~1 doc/second
• A JIEDDO-developed Solr/FST-based Tagger …
Finite State Transducers
Applied to text tagging
Finite State Automata (FSA)
• SortedSet<char[]>:
  • mop, moth, pop, slop, sloth, stop, top
Note: a “Trie” data structure is similar, but it shares only prefixes; an FSA also shares suffixes
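To make the trie comparison concrete, here is a minimal sketch that builds a trie for the seven words and then merges equivalent subtrees, so that common suffixes are shared too. It is a simple bottom-up merge written for illustration, not Lucene's incremental construction; the class and function names are assumptions.

```python
from typing import Dict, Iterable, Tuple

WORDS = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]


class Node:
    def __init__(self) -> None:
        self.arcs: Dict[str, "Node"] = {}
        self.final = False


def build_trie(words: Iterable[str]) -> Node:
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.arcs.setdefault(ch, Node())
        node.final = True
    return root


def count_tree_states(node: Node) -> int:
    """Count states when every path has its own nodes (prefix sharing only)."""
    return 1 + sum(count_tree_states(child) for child in node.arcs.values())


def minimize(node: Node, registry: Dict[Tuple, Node]) -> Node:
    """Bottom-up: replace each subtree by one canonical, shared copy."""
    for ch in list(node.arcs):
        node.arcs[ch] = minimize(node.arcs[ch], registry)
    # Two states are equivalent if they agree on finality and on
    # (label, canonical target) for every outgoing arc.
    signature = (
        node.final,
        tuple(sorted((ch, id(child)) for ch, child in node.arcs.items())),
    )
    return registry.setdefault(signature, node)


def accepts(root: Node, word: str) -> bool:
    node = root
    for ch in word:
        node = node.arcs.get(ch)
        if node is None:
            return False
    return node.final


trie = build_trie(WORDS)
trie_states = count_tree_states(trie)  # must be counted before merging

registry: Dict[Tuple, Node] = {}
fsa = minimize(trie, registry)
fsa_states = len(registry)  # one registry entry per distinct state

print(trie_states, fsa_states)
```

After merging, "mo" and "slo" lead to the same state, as do "po", "sto" and "to", which is the suffix sharing a plain trie lacks.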
Finite State Transducer (FST)
• Adds an optional output to each arc
• SortedMap<char[],int>
  • mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
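A small sketch of what "output on each arc" means: a key's value is the sum of the outputs along its path. To keep it short this builds a trie-shaped transducer (no suffix sharing) and pushes each value as close to the root as possible. The names `State`, `build` and `lookup` are illustrative assumptions, not Lucene's API.

```python
from typing import Dict, Optional, Tuple

MAPPING = {"mop": 0, "moth": 1, "pop": 2, "slop": 3,
           "sloth": 4, "stop": 5, "top": 6}


class State:
    def __init__(self) -> None:
        # label -> (output emitted on this arc, target state)
        self.arcs: Dict[str, Tuple[int, "State"]] = {}
        # None means "not an accepting state"
        self.final_output: Optional[int] = None


def _subtree_min(state: State) -> int:
    """Smallest value stored anywhere at or below this state."""
    values = [] if state.final_output is None else [state.final_output]
    values += [_subtree_min(target) for _, target in state.arcs.values()]
    return min(values)


def _push_outputs(state: State, carried: int) -> None:
    """Give each arc the part of the value not yet emitted higher up."""
    if state.final_output is not None:
        state.final_output -= carried
    for label, (_, target) in list(state.arcs.items()):
        smallest = _subtree_min(target)
        state.arcs[label] = (smallest - carried, target)
        _push_outputs(target, smallest)


def build(mapping: Dict[str, int]) -> State:
    root = State()
    for key, value in mapping.items():
        state = root
        for ch in key:
            if ch not in state.arcs:
                state.arcs[ch] = (0, State())
            state = state.arcs[ch][1]
        state.final_output = value  # full value parked at the end for now
    _push_outputs(root, 0)
    return root


def lookup(root: State, key: str) -> Optional[int]:
    state, total = root, 0
    for ch in key:
        arc = state.arcs.get(ch)
        if arc is None:
            return None
        total += arc[0]
        state = arc[1]
    if state.final_output is None:
        return None
    return total + state.final_output


fst = build(MAPPING)
print(lookup(fst, "sloth"), lookup(fst, "slo"))
```

In this sketch the root arc for "s" carries 3 (the smallest value among slop, sloth and stop), and the remaining 1 for "sloth" is emitted deeper on its path.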
Lucene’s FST Implementation
• FST encoded as a byte[]
  • Memory efficient! And fast to load from disk.
• Write-once API (immutable)
• Builds a minimal, acyclic FST from pre-sorted inputs
  • Fast (linear time in the input size), low memory
  • Optional two-pass packing can shrink it by ~25%
• SortedMap<int[],T>: arcs are sorted by label
  • getByOutput also possible if outputs are sorted
• http://s.apache.org/LuceneFSTs
• Based on a Mihov & Maurel paper, 2001
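Why reverse lookup needs sorted outputs can be sketched as follows. If keys arrive pre-sorted and each key's output is its ordinal, the arc outputs leaving any state increase in label order, so the arc to follow is the last one whose output does not exceed the remaining target. This is a trie-shaped toy with dict-based states, not Lucene's byte[] encoding; `build` and `get_by_output` are illustrative names only.

```python
from typing import List, Optional

WORDS = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]  # pre-sorted


def _new_state() -> dict:
    return {"final": None, "arcs": {}}  # arcs: label -> [output, target]


def build(sorted_keys: List[str]) -> dict:
    """Single pass over sorted keys; each key maps to its position."""
    root = _new_state()
    for ordinal, key in enumerate(sorted_keys):
        state, remaining = root, ordinal
        for ch in key:
            arc = state["arcs"].get(ch)
            if arc is None:
                # First new arc on this path emits everything still owed.
                arc = [remaining, _new_state()]
                state["arcs"][ch] = arc
            remaining -= arc[0]
            state = arc[1]
        state["final"] = remaining
    return root


def get_by_output(root: dict, target: int) -> Optional[str]:
    state, remaining, labels = root, target, []
    while True:
        if state["final"] is not None and state["final"] == remaining:
            return "".join(labels)
        best = None
        for label in sorted(state["arcs"]):
            output, nxt = state["arcs"][label]
            if output <= remaining:
                best = (label, output, nxt)  # keep the last arc that fits
        if best is None:
            return None
        labels.append(best[0])
        remaining -= best[1]
        state = best[2]


fst = build(WORDS)
print(get_by_output(fst, 4), get_by_output(fst, 7))
```

With unsorted outputs there would be no way to tell which arc leads to the target without exploring all of them.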
FSTs and Text Tagging
• My approach involves two layers of FSTs:
• A word dictionary FST to hold each unique word
  • Enables using integers as substitutes for char[]
  • Via getByOutput(12345) -> “New”
  • Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
• A word phrase FST composed of word-id string keys
  • Ex: “New York City” -> [12345, 5522111, 345]
  • Values are arrays of gazetteer primary keys
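The two layers can be sketched with plain dictionaries standing in for the two FSTs. The gazetteer records, primary keys, and function names below are made up for illustration; only the shape (word -> id, id sequence -> primary keys) follows the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer records: (primary key, place name).
GAZETTEER = [
    (101, "New York City"),
    (102, "New York"),
    (103, "New York"),  # an ambiguous name: same words, another record
    (104, "York"),
]


def build_layers(records: List[Tuple[int, str]]):
    # Layer 1: every unique word gets an ordinal id (sorted order).
    vocabulary = sorted({word for _, name in records for word in name.split()})
    word_ids: Dict[str, int] = {word: i for i, word in enumerate(vocabulary)}

    # Layer 2: each name becomes a sequence of word ids that maps to
    # the primary keys of all records sharing that name.
    phrases: Dict[Tuple[int, ...], List[int]] = {}
    for primary_key, name in records:
        key = tuple(word_ids[word] for word in name.split())
        phrases.setdefault(key, []).append(primary_key)
    return word_ids, phrases


def decode(word_ids: Dict[str, int], key: Tuple[int, ...]) -> str:
    """Reverse mapping, the role getByOutput plays on the slide."""
    id_to_word = {i: word for word, i in word_ids.items()}
    return " ".join(id_to_word[i] for i in key)


word_ids, phrases = build_layers(GAZETTEER)
print(word_ids)
print(phrases)
```

Storing phrases as short integer sequences rather than characters means each word's spelling is stored once, in the word dictionary.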
Memory Use
• Word Dict FST:
  • 3.3M words with ordinal ids in 26 MB of RAM
• Name Phrase FST:
  • 8.1M word-id phrases in 90 MB of RAM
  • Plus 82 MB of arrays of gazetteer primary key ids
• Total: 198 MB (compare to 80 GB for GATE's Aho-Corasick)
• Building it consumes ~1.5 GB of Java heap, for 2 minutes
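As a quick check of these figures, the arithmetic below recomputes the total and derives rough per-entry costs. It assumes 1 MB = 1024 × 1024 bytes and 1 GB = 1024 MB, which the slide does not specify, so the derived numbers are approximate.

```python
MB = 1024 * 1024  # assumption: binary megabytes

word_dict_mb = 26      # 3.3M words
name_phrase_mb = 90    # 8.1M word-id phrases
key_arrays_mb = 82     # arrays of gazetteer primary key ids

total_mb = word_dict_mb + name_phrase_mb + key_arrays_mb
gate_mb = 80 * 1024    # GATE's estimated 80 GB, assuming 1 GB = 1024 MB
ratio = gate_mb / total_mb

bytes_per_word = word_dict_mb * MB / 3.3e6
bytes_per_phrase = name_phrase_mb * MB / 8.1e6

print(f"total: {total_mb} MB, about {ratio:.0f}x smaller than the GATE estimate")
print(f"~{bytes_per_word:.1f} bytes per word, ~{bytes_per_phrase:.1f} bytes per phrase")
```

Under those unit assumptions the word dictionary costs less than the 14 characters of an average name per entry, which gives a feel for how much sharing the FST achieves.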
Experimental measurements
• Single FST Experiment
  • 1 FST of analyzed character word phrase -> int id
  • “new york city” -> 6344207
  • Theory: more than 2x the memory
  • Result: 69 MB! (compare to 26 + 90 = 116 MB), a 41% reduction
• Retrospective: What I would have done differently
  • Index a field of concatenated terms (custom TokenFilter). More disk needed, but reduces build time & memory requirements. Unclear effect on tagging performance.
  • Potential to use MemoryPostingsFormat, a Lucene codec that uses an FST internally + vInt doc ids, instead of custom FST code.
Tagging Algorithm
It’s complicated! Single-pass (streaming) algorithm
• For each input word, look up its ordinal id, then:
  1. Create an FST arc iterator for the name phrase FST
  2. Append the iterator onto a queue of active ones
  3. Try to advance all iterators
     • Remove those that don’t advance
Iterator linked-list queue:
  Head: New, York, City ✔
  Head+1: York, City
  Head+2: City …
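The three steps can be sketched as a single pass that keeps a list of active cursors, one per possible start word. A nested dict keyed by word id stands in for the name phrase FST, and the sketch reports every match; suppressing names within names is left out. The word ids, phrases, and names such as `tag` and `END` are assumptions for illustration.

```python
from typing import Dict, List, Tuple

END = "$keys"  # hypothetical marker: this node ends a complete name


def build_phrase_trie(phrases: Dict[Tuple[int, ...], List[int]]) -> dict:
    root: dict = {}
    for id_sequence, primary_keys in phrases.items():
        node = root
        for word_id in id_sequence:
            node = node.setdefault(word_id, {})
        node[END] = primary_keys
    return root


def tag(words: List[str], word_ids: Dict[str, int], trie: dict):
    """Return (start word, end word exclusive, primary keys) per match."""
    matches = []
    active: List[Tuple[int, dict]] = []  # queue of (start index, cursor)
    for i, word in enumerate(words):
        word_id = word_ids.get(word)  # None if not in the dictionary
        # Steps 1-2: a new cursor starts at this word and joins the queue.
        active.append((i, trie))
        # Step 3: advance every cursor; those that cannot advance are dropped.
        advanced = []
        for start, node in active:
            nxt = node.get(word_id) if word_id is not None else None
            if nxt is None:
                continue
            if END in nxt:
                matches.append((start, i + 1, nxt[END]))
            advanced.append((start, nxt))
        active = advanced
    return matches


WORD_IDS = {"new": 0, "york": 1, "city": 2, "hall": 3}
PHRASES = {(0, 1, 2): [101], (0, 1): [102], (1,): [104], (2, 3): [105]}

trie = build_phrase_trie(PHRASES)
found = tag("i love new york city hall".split(), WORD_IDS, trie)
print(found)
```

Each input word is read once, and the queue never holds more cursors than the number of words in the longest name.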
Speed Benchmarks
Tagger                            Docs/Sec   RAM (GB)
OpenSextant: GATE Tagger          ?          80
OpenSextant: MySQL based Tagger   1.1        4
OpenSextant: Solr/FST Tagger      15.9       2*
Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
Integrated with Solr
• As a custom Solr Request Handler
• Builds the FSTs from the index (the gazetteer)
• Configurable
  • Text analysis (e.g. phonetic)
  • Exclude gazetteer docs by configured query
  • Optional partial word phrase matching
  • Optional sub-tags tagging
• Solr integration benefits
  • Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
~$ curl -XPOST 'http://localhost:8983/solr/tag?fl=*&wt=json&indent=2' \
     -H 'Content-Type:text/plain' -d "I live near Boston"
{
  "responseHeader":{
    "status":0,
    "QTime":1898},
  "tagsCount":1,
  "tags":[[
      "startOffset",12,
      "endOffset",18,
      "ids",[1190927,
        1099063,
        2562742,
        2667203,
        2684629,
        2695904,
        2653982,
        2657690,
        2585165,
        2597292,
        …
        11890986,
        11891415]]],
  "matchingDocs":{"numFound":73,"start":0,"docs":[
      {
        "id":12719030,
        "place_id":"USGS1893700",
        "name":"Boston",
        "lat":65.01667,
        "lon":-163.28333,
        "feat_class":"L",
        "feat_code":"AREA",
        "FIPS_cc":"US",
        "ISO_cc":["US"],
        "cc":"US",
        "ISO3_cc":"USA",
        "adm1":"US02",
        "adm2":"US02.0180",
        "name_bias":0.0,
        "id_bias":0.04,
        "geo":"65.01667,-163.28333"},
      …
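Each entry in "tags" is a flat list of alternating names and values. A client could turn it into a dictionary and slice the original text by the offsets, roughly as sketched here. The sample response is cut down from the one above (only two of the ids are kept), and `tags_to_dicts` is an illustrative helper, not part of the tagger.

```python
import json
from typing import Any, Dict, List


def tags_to_dicts(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn each flat [name, value, name, value, ...] tag into a dict."""
    result = []
    for flat in response["tags"]:
        result.append(dict(zip(flat[0::2], flat[1::2])))
    return result


# Trimmed sample: the real response lists many more ids.
SAMPLE = """
{
  "tagsCount": 1,
  "tags": [["startOffset", 12, "endOffset", 18, "ids", [1190927, 1099063]]]
}
"""

text = "I live near Boston"
tags = tags_to_dicts(json.loads(SAMPLE))
for tag in tags:
    matched = text[tag["startOffset"]:tag["endOffset"]]
    print(matched, tag["ids"])
```

The offsets are character positions in the submitted text with an exclusive end, so slicing by them recovers "Boston".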
Where can you get this?
• https://github.com/openSextant/SolrTextTagger
• An independent module of OpenSextant
• Might seek incubator status at http://www.osgeo.org
• Includes documentation, tests
Concluding Remarks
• Lucene FSTs are awesome!
  • Great for storing large numbers of strings in memory
  • Or other string-like data: e.g. IP addresses, geohashes
  • The API is hard to use, however
• The text tagger should be useful independent of OpenSextant
  • Tag people/org names or special keywords
  • Might be ported to Lucene as an alternative to its synonym token filter
• I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
CONTACT
David Smiley
dsmiley@mitre.org

 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 

Kürzlich hochgeladen (20)

fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 

Text tagging with finite state transducers

  • 1. TEXT TAGGING WITH FINITE STATE TRANSDUCERS David Smiley Software Systems Engineer, Lead
  • 2. Text Tagging with Finite State Transducers David Smiley Lucene/Solr Revolution 2013 © 2012 The MITRE Corporation. All rights reserved.
  • 3. About David Smiley
      • Working at MITRE for 13 years
      • Web development, Java, search
      • Published 1st book on Solr; then 2nd edition (2009, 2011)
      • Apache Lucene / Solr committer/PMC member (2012)
      • Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
      • Taught Solr classes at MITRE (2010, 2011, 2012)
      • Solr search consultant within MITRE and its sponsors, and privately
  • 4. What is “Text Tagging” and “FSTs”?
      • First, I need to establish the context:
          • JIEDDO’s OpenSextant project
          • Though this presentation is not about OpenSextant or geotagging
      • Ultimately, I want to convey how cool Lucene’s FSTs are
      • And you may have a need for a text tagger
          • Or a geotagger (like OpenSextant)
  • 5. OpenSextant
      • A DoD Funded Project: JIEDDO/COIC & NGA
      • Open Source approval recently obtained
  • 6. OpenSextant Project
      • A geotagging solution for unstructured text
      • Finds place name references in natural language
          • “… I live near Boston … ”
          • Finds “Boston” with input character offset #s
          • Often resolves to multiple gazetteer entries: “Boston” has 73
      • What’s a Gazetteer?
          • A dictionary of place names with metadata like latitude & longitude
  • 7.
  • 8. How does it work?
  • 9. The “Naïve” Tagger
      • AKA “Text Tagger”
      • Simply consults a dictionary/gazetteer; no fancy NLP
          • There’s nothing geospatial about it
          • Subsequent NLP processing eliminates low-confidence tags
      • Actually, not so simple
          • Names vary in word length
          • Must find overlapping names
          • but not names within names
  • 10. The Gazetteer
      • 13 million place name records
      • 8.1M distinct place names
          • Why not 13M?
              • Ambiguous names (e.g. San Diego)
              • Text analysis normalization (e.g. diacritic removal, etc.)
      • 2.8M are single-word names (1/3rd)
      • 2.3 avg. words / name
      • 14 avg. chars / name
  • 11. 3 Naïve Tagger Implementations
      • GATE’s Tagger
          • In-memory Aho-Corasick string-matching algorithm
          • Requires an estimated 80 GB RAM !! (for our data)
          • FAST
      • A JIEDDO developed MySQL based Tagger
          • “Reasonable” RAM requirements ~4GB
          • SLOW (~15x, 20x? not certain). ~1 doc/second
      • A JIEDDO developed Solr/FST based Tagger …
  • 13. Finite State Automata (FSA)
      • SortedSet<char[]>:
          • mop, moth, pop, slop, sloth, stop, top
      • Note: a “Trie” data structure is similar but only shares prefixes
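The note about tries can be made concrete with a minimal sketch, assuming the slide's seven words. It builds a plain trie (prefix sharing only) and then merges nodes that accept the same set of suffixes, which is what a minimal automaton adds. The names `FsaSketch`, `buildTrie` and `countMinimalStates` are hypothetical, and this is not how Lucene constructs its automata; it only illustrates suffix sharing.

```java
import java.util.*;

final class FsaSketch {
    static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<>();
        boolean accept;
    }

    // Plain trie: words share prefixes only.
    static Node buildTrie(Collection<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.arcs.computeIfAbsent(c, k -> new Node());
            }
            n.accept = true;
        }
        return root;
    }

    static int countTrieNodes(Node n) {
        int total = 1;
        for (Node child : n.arcs.values()) total += countTrieNodes(child);
        return total;
    }

    // Bottom-up merge: two nodes are equivalent when they have the same accept
    // flag and the same labelled arcs to equivalent children. The signature
    // format assumes labels are letters (not digits or ';').
    static int minimize(Node n, Map<String, Integer> registry) {
        StringBuilder sig = new StringBuilder(n.accept ? "1" : "0");
        for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
            sig.append(e.getKey()).append(minimize(e.getValue(), registry)).append(';');
        }
        String key = sig.toString();
        Integer id = registry.get(key);
        if (id == null) {
            id = registry.size();
            registry.put(key, id);
        }
        return id;
    }

    static int countMinimalStates(Node root) {
        Map<String, Integer> registry = new HashMap<>();
        minimize(root, registry);
        return registry.size();
    }

    public static void main(String[] args) {
        Node root = buildTrie(Arrays.asList("mop", "moth", "pop", "slop", "sloth", "stop", "top"));
        System.out.println("trie nodes:     " + countTrieNodes(root));
        System.out.println("minimal states: " + countMinimalStates(root));
    }
}
```

For these seven words the trie has 21 nodes, while merging equivalent nodes leaves 8 states.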
  • 14. Finite State Transducer (FST)
      • Adds optional output to each arc
      • SortedMap<char[],int>
          • mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
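One way to picture "output on each arc" is a sketch in which each arc carries an integer and a key's value is the sum of the outputs along its path. The version below is a simplification over a plain trie, with hypothetical names (`ArcOutputSketch`, `lookup`); Lucene's FST also shares suffixes and has its own output encoding, so treat this as an illustration of the idea rather than its implementation.

```java
import java.util.*;

final class ArcOutputSketch {
    static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<>();
        final Map<Character, Integer> arcOutput = new HashMap<>();
        boolean accept;
    }

    static Node build(SortedSet<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.arcs.computeIfAbsent(c, k -> new Node());
            }
            n.accept = true;
        }
        assignOutputs(root);
        return root;
    }

    // Each arc's output is the number of words under this node that sort
    // before the arc's subtree. Returns the number of words in the subtree.
    static int assignOutputs(Node n) {
        int seen = n.accept ? 1 : 0;
        for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
            n.arcOutput.put(e.getKey(), seen);
            seen += assignOutputs(e.getValue());
        }
        return seen;
    }

    // Sums arc outputs along the path; returns -1 if the word is absent.
    static int lookup(Node root, String word) {
        Node n = root;
        int sum = 0;
        for (char c : word.toCharArray()) {
            Node next = n.arcs.get(c);
            if (next == null) return -1;
            sum += n.arcOutput.get(c);
            n = next;
        }
        return n.accept ? sum : -1;
    }

    public static void main(String[] args) {
        SortedSet<String> words = new TreeSet<>(
            Arrays.asList("mop", "moth", "pop", "slop", "sloth", "stop", "top"));
        Node root = build(words);
        for (String w : words) System.out.println(w + ": " + lookup(root, w));
    }
}
```

With the slide's word list, the summed outputs reproduce the ordinals 0 through 6 shown above.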
  • 15. Lucene’s FST Implementation
      • FST encoded as a byte[]
          • Memory efficient! And fast to load from disk.
      • Write-once API (immutable)
      • Build minimal, acyclic FST from pre-sorted inputs
          • Fast (linear time with input size), low memory
          • Optional two-pass packing can shrink by ~25%
      • SortedMap<int[],T>: arcs are sorted by label
          • getByOutput also possible if outputs are sorted
      • http://s.apache.org/LuceneFSTs
      • Based on a Mihov & Maurel paper, 2001
  • 16. FSTs and Text Tagging
      • My approach involves two layers of FSTs:
      • A word dictionary FST to hold each unique word
          • Enables using integers as substitutes for char[]
          • Via getByOutput(12345) -> “New”
          • Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
      • A word phrase FST composed of word id string keys
          • Ex: “New York City” -> [12345, 5522111, 345]
          • Values are arrays of gazetteer primary keys
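The two layers can be sketched with ordinary Java collections standing in for the FSTs. This is an assumption-heavy illustration: `TwoLayerSketch`, `encode` and `lookup` are hypothetical names, a `HashMap` and an array replace the word dictionary FST and its `getByOutput`, and another `HashMap` replaces the phrase FST. Only the data flow (word -> id, id sequence -> gazetteer keys) follows the slide.

```java
import java.util.*;

final class TwoLayerSketch {
    final String[] idToWord;                                        // stand-in for getByOutput(id)
    final Map<String, Integer> wordToId = new HashMap<>();          // stand-in for the word dictionary FST
    final Map<List<Integer>, int[]> phraseToKeys = new HashMap<>(); // stand-in for the phrase FST

    // gazetteer maps an analyzed name (space-separated words) to primary keys.
    TwoLayerSketch(Map<String, int[]> gazetteer) {
        TreeSet<String> words = new TreeSet<>();
        for (String name : gazetteer.keySet()) {
            words.addAll(Arrays.asList(name.split(" ")));
        }
        idToWord = words.toArray(new String[0]);   // ordinal id = rank in sorted order
        for (int i = 0; i < idToWord.length; i++) {
            wordToId.put(idToWord[i], i);
        }
        for (Map.Entry<String, int[]> e : gazetteer.entrySet()) {
            phraseToKeys.put(encode(e.getKey()), e.getValue());
        }
    }

    // Replaces each word with its integer id; null if any word is unknown.
    List<Integer> encode(String phrase) {
        List<Integer> ids = new ArrayList<>();
        for (String w : phrase.split(" ")) {
            Integer id = wordToId.get(w);
            if (id == null) return null;
            ids.add(id);
        }
        return ids;
    }

    // Returns the gazetteer primary keys for a phrase, or null if absent.
    int[] lookup(String phrase) {
        List<Integer> ids = encode(phrase);
        return ids == null ? null : phraseToKeys.get(ids);
    }

    public static void main(String[] args) {
        Map<String, int[]> gaz = new HashMap<>();
        gaz.put("new york city", new int[] {101, 102});
        gaz.put("york", new int[] {103});
        TwoLayerSketch t = new TwoLayerSketch(gaz);
        System.out.println(t.encode("new york city") + " -> "
            + Arrays.toString(t.lookup("new york city")));
    }
}
```

A word missing from the dictionary short-circuits the phrase lookup, so a candidate phrase can be rejected before the phrase layer is consulted.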
  • 17. Memory Use
      • Word Dict FST:
          • 3.3M words with ordinal ids in 26MB of RAM
      • Name Phrase FST:
          • 8.1M word id phrases in 90 MB of RAM
          • Plus 82MB of arrays of gazetteer primary key ids
      • Total: 198 MB (compare to 80GB GATE Aho-Corasick)
      • Building it consumes ~1.5GB Java heap, for 2 minutes
  • 18. Experimental measurements
      • Single FST Experiment
          • 1 FST of analyzed character word phrase -> int id
          • “new york city” -> 6344207
          • Theory: more than 2x the memory
          • Result: 69 MB! (compare to 26+90) 41% reduction
      • Retrospective: What I would have done differently
          • Index a field of concatenated terms (custom TokenFilter).
          • More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.
          • Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
  • 19. Tagging Algorithm
      • It’s complicated!
      • Single-pass (streaming) algorithm
      • For each input word, lookup its ordinal id, then:
          1. Create an FST arc iterator for name phrase
          2. Append the iterator onto a queue of active ones
          3. Try to advance all iterators
              • Remove those that don’t advance
      • Iterator linked-list queue:
          Head: New, York, City ✔
          Head+1: York, City
          Head+2: City
          …
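The three steps above can be sketched as a single pass over the input words with a queue of active cursors. This is a simplified illustration under stated assumptions, not the SolrTextTagger code: a word-keyed trie stands in for the name phrase FST, words are used directly as arc labels instead of ordinal ids, and the names `TaggerSketch`, `Cursor` and `dropContained` are hypothetical. Tags are reported as word positions `[start, endExclusive]`; overlapping names are kept, and names fully inside a longer name are dropped.

```java
import java.util.*;

final class TaggerSketch {
    static final class Node {
        final Map<String, Node> arcs = new HashMap<>();
        boolean accept;
    }

    static final class Cursor {
        Node node;
        final int start;
        int lastMatchEnd = -1;   // exclusive end of the longest name seen so far
        Cursor(Node node, int start) { this.node = node; this.start = start; }
    }

    static Node buildPhraseTrie(Collection<String> names) {
        Node root = new Node();
        for (String name : names) {
            Node n = root;
            for (String w : name.split(" ")) {
                n = n.arcs.computeIfAbsent(w, k -> new Node());
            }
            n.accept = true;
        }
        return root;
    }

    static List<int[]> tag(Node root, String[] words) {
        List<int[]> tags = new ArrayList<>();
        LinkedList<Cursor> active = new LinkedList<>();
        for (int i = 0; i < words.length; i++) {
            active.addLast(new Cursor(root, i));          // steps 1 and 2
            Iterator<Cursor> it = active.iterator();
            while (it.hasNext()) {                        // step 3
                Cursor c = it.next();
                Node next = c.node.arcs.get(words[i]);
                if (next == null) {                       // cannot advance: emit and remove
                    if (c.lastMatchEnd >= 0) tags.add(new int[] {c.start, c.lastMatchEnd});
                    it.remove();
                } else {
                    c.node = next;
                    if (next.accept) c.lastMatchEnd = i + 1;
                }
            }
        }
        for (Cursor c : active) {                         // flush cursors alive at end of input
            if (c.lastMatchEnd >= 0) tags.add(new int[] {c.start, c.lastMatchEnd});
        }
        return dropContained(tags);
    }

    // Drops tags that lie fully inside a strictly longer tag; keeps overlaps.
    static List<int[]> dropContained(List<int[]> tags) {
        List<int[]> kept = new ArrayList<>();
        for (int[] t : tags) {
            boolean inside = false;
            for (int[] o : tags) {
                if (o[0] <= t[0] && t[1] <= o[1] && (o[1] - o[0]) > (t[1] - t[0])) inside = true;
            }
            if (!inside) kept.add(t);
        }
        kept.sort(Comparator.comparingInt((int[] t) -> t[0]));
        return kept;
    }

    public static void main(String[] args) {
        Node root = buildPhraseTrie(Arrays.asList(
            "new york city", "new york", "york", "city hall", "boston"));
        String[] words = "i live in new york city hall near boston".split(" ");
        for (int[] t : tag(root, words)) {
            System.out.println(String.join(" ", Arrays.copyOfRange(words, t[0], t[1])));
        }
    }
}
```

On the sample sentence this keeps "new york city" and the overlapping "city hall", and drops "york" because it sits inside "new york city".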
  • 20. Speed Benchmarks

      Tagger                             Docs/Sec   RAM (GB)
      OpenSextant: GATE Tagger           ?          80
      OpenSextant: MySQL based Tagger    1.1        4
      OpenSextant: Solr/FST Tagger       15.9       2*

      Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
  • 21. Integrated with Solr
      • As a custom Solr Request Handler
          • Builds the FSTs from the index (the gazetteer)
      • Configurable
          • Text analysis (e.g. phonetic)
          • Exclude gazetteer docs by configured query
          • Optional partial word phrase matching
          • Optional sub-tags tagging
      • Solr integration benefits
          • Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
  • 22.
      ~$ curl -XPOST 'http://localhost:8983/solr/tag?fl=*&wt=json&indent=2' \
           -H 'Content-Type:text/plain' -d "I live near Boston"
      {
        "responseHeader":{
          "status":0,
          "QTime":1898},
        "tagsCount":1,
        "tags":[[
            "startOffset",12,
            "endOffset",18,
            "ids",[1190927, 1099063, 2562742, 2667203, 2684629, 2695904,
              2653982, 2657690, 2585165, 2597292, … …
              11890986, 11891415]]],
        "matchingDocs":{"numFound":73,"start":0,"docs":[
            {
              "id":12719030,
              "place_id":"USGS1893700",
              "name":"Boston",
              "lat":65.01667,
              "lon":-163.28333,
              "feat_class":"L",
              "feat_code":"AREA",
              "FIPS_cc":"US",
              "ISO_cc":["US"],
              "cc":"US",
              "ISO3_cc":"USA",
              "adm1":"US02",
              "adm2":"US02.0180",
              "name_bias":0.0,
              "id_bias":0.04,
              "geo":"65.01667,-163.28333"},
            …
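In the response above, each tag is a flat list that alternates names and values. A small sketch of reading one such tag, assuming the JSON has already been parsed into Java lists by some JSON library (not shown), is given below. `TagResponseSketch`, `toMap` and `surface` are hypothetical helper names and not part of Solr or the tagger.

```java
import java.util.*;

final class TagResponseSketch {
    // Turns ["startOffset", 12, "endOffset", 18, "ids", [...]] into a map.
    static Map<String, Object> toMap(List<?> flat) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < flat.size(); i += 2) {
            m.put((String) flat.get(i), flat.get(i + 1));
        }
        return m;
    }

    // Extracts the tagged text; endOffset is used as an exclusive bound.
    static String surface(String input, Map<String, Object> tag) {
        int start = (Integer) tag.get("startOffset");
        int end = (Integer) tag.get("endOffset");
        return input.substring(start, end);
    }

    public static void main(String[] args) {
        List<Object> flat = Arrays.<Object>asList(
            "startOffset", 12, "endOffset", 18, "ids", Arrays.asList(1190927, 1099063));
        System.out.println(surface("I live near Boston", toMap(flat)));
    }
}
```

With the offsets from the slide (12 and 18), the extracted text is "Boston", which indicates the end offset is exclusive.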
  • 23. Where can you get this?
      • https://github.com/openSextant/SolrTextTagger
      • An independent module of OpenSextant
          • Might seek incubator status at http://www.osgeo.org
      • Includes documentation, tests
  • 24. Concluding Remarks
      • Lucene FSTs are awesome!
          • Great for storing large amounts of strings in-memory
          • Or other string-like data: e.g. IP addresses, geohashes
          • The API is hard to use, however
      • The text tagger should be useful independent of OpenSextant
          • Tag people/org names or special keywords
          • Might be ported to Lucene as an alternative to its synonym token filter
      • I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
  • 25. CONFERENCE PARTY
      • The Tipsy Crow: 770 5th Ave
      • Starts after Stump The Chump
      • Your conference badge gets you in the door
    TOMORROW
      • Breakfast starts at 7:30
      • Keynotes start at 8:30
    CONTACT
      • David Smiley, dsmiley@mitre.org