Open Source Software for Geospatial Analytics on Unstructured Big Data
Charlie Greenbacker, Principal Data Scientist
Background




     About Me:
            Data Scientist
            Natural Language Processing
            Unstructured Text  Information

     Berico Technologies:
            Veteran-owned Small Business
            Big Data Analytics in the Cloud
            Defense & Intel Community



All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.   2
The Problem: geotagging unstructured text




     Growing demand for geospatial analytics

     Most of human knowledge remains “trapped” in text

     Existing solutions are expensive and don’t scale

     Need an open source solution



The Solution: an open source geoparser



     1. Data Ingestion
            Input: unstructured text
     2. Entity Extraction
            Named entity recognition
            Find location names in text
     3. Entity Resolution
            Match against a gazetteer
            “The Springfield Problem”
     4. Data Enrichment
            Output: structured geo data
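
The four steps above can be sketched end to end in a few lines. This is only an illustration, not CLAVIN's implementation: the gazetteer is a hypothetical three-entry table with approximate coordinates, the capitalized-word regex stands in for a real named entity recognizer, and the names `GAZETTEER`, `ResolvedLocation`, `extract_candidates`, `resolve`, and `geoparse` are invented for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical toy gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Reston": (38.96, -77.36),
    "Arlington": (38.88, -77.10),
    "Chicago": (41.88, -87.63),
}


@dataclass
class ResolvedLocation:
    name: str    # place name as it appeared in the text
    offset: int  # character offset of the mention in the text
    lat: float
    lon: float


def extract_candidates(text):
    """Step 2 stand-in: treat capitalized words as candidate location names.

    A real system would use a trained named entity recognizer here.
    """
    return [(m.group(), m.start()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def resolve(candidates):
    """Step 3: keep only the candidates that match a gazetteer entry."""
    resolved = []
    for name, offset in candidates:
        if name in GAZETTEER:
            lat, lon = GAZETTEER[name]
            resolved.append(ResolvedLocation(name, offset, lat, lon))
    return resolved


def geoparse(text):
    """Steps 1-4: unstructured text in, structured geo data out."""
    return resolve(extract_candidates(text))


text = "Our offices are in Reston and Arlington."
locations = geoparse(text)
for loc in locations:
    print(loc)
```

Each resolved record carries the original mention, its position in the text, and coordinates, which is the kind of structured output the later map and search slides build on.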



Data Ingestion: unstructured text




     photo: Flickr user NS Newsflash


Entity Extraction: named entity recognition




Entity Resolution: match against a gazetteer




Data Enrichment: structured geo data




“The Springfield Problem”




Dealing with Ambiguity


     Intelligent Context-based Heuristics
            First: rank by population
            Next: look for other locations mentioned in the same document
                   “Springfield” + “Chicago” = Illinois
                   “Springfield” + “Boston” = Massachusetts
            Soon: calculate distance based on lat/lons
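
The first two heuristics can be sketched as a single ranking rule: prefer a candidate that shares a state with an unambiguous place mentioned in the same document, and fall back to population otherwise. This is a minimal sketch under stated assumptions, not CLAVIN's actual algorithm: the `Place` class, the `CANDIDATES` table, and `resolve_all` are hypothetical, and the population figures are rough round numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    state: str
    population: int


# Hypothetical gazetteer slice; populations are rough, illustrative figures.
CANDIDATES = {
    "Springfield": [
        Place("Springfield", "Missouri", 170_000),
        Place("Springfield", "Massachusetts", 155_000),
        Place("Springfield", "Illinois", 115_000),
    ],
    "Chicago": [Place("Chicago", "Illinois", 2_700_000)],
    "Boston": [Place("Boston", "Massachusetts", 650_000)],
}


def resolve_all(names):
    """Resolve every place name mentioned in one document."""
    # Names with a single candidate are unambiguous and define the context.
    context_states = set()
    for name in names:
        candidates = CANDIDATES.get(name, [])
        if len(candidates) == 1:
            context_states.add(candidates[0].state)

    resolved = {}
    for name in names:
        candidates = CANDIDATES.get(name, [])
        if not candidates:
            continue
        # Rank by (shares a state with the context, population).
        resolved[name] = max(
            candidates,
            key=lambda p: (p.state in context_states, p.population),
        )
    return resolved


print(resolve_all(["Springfield", "Chicago"])["Springfield"].state)
print(resolve_all(["Springfield", "Boston"])["Springfield"].state)
print(resolve_all(["Springfield"])["Springfield"].state)
```

With no other locations in the document, the rule reduces to population ranking alone. The planned distance heuristic would replace the same-state test with a comparison of coordinates.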


     Resolve alternate names to same geospatial entity
            “Ivory Coast” = “Côte d’Ivoire”


     Use fuzzy matching to capture misspelled place names
            Including both phonetic spelling & typographical errors
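
Alternate-name resolution and fuzzy matching can be combined in one lookup. The sketch below uses Python's standard `difflib` module, which scores string similarity and therefore covers typographical errors only; phonetic matching would need a separate encoding step that is not shown. The `ALTERNATE_NAMES` and `CANONICAL` tables and the `normalize` function are hypothetical, and the 0.8 cutoff is an arbitrary choice for this example.

```python
import difflib

# Hypothetical alias table: alternate name -> canonical gazetteer name.
ALTERNATE_NAMES = {
    "Ivory Coast": "Côte d’Ivoire",
}

# Hypothetical list of canonical gazetteer names.
CANONICAL = ["Côte d’Ivoire", "Massachusetts", "Springfield", "Chicago"]


def normalize(name, cutoff=0.8):
    """Map a possibly misspelled or alternate name to a canonical name.

    Returns None when nothing in the tables is similar enough.
    """
    # Exact matches first: alias, then canonical name.
    if name in ALTERNATE_NAMES:
        return ALTERNATE_NAMES[name]
    if name in CANONICAL:
        return name

    # Fuzzy match against both canonical names and known aliases.
    choices = CANONICAL + list(ALTERNATE_NAMES)
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    # If the closest string is an alias, follow it to the canonical name.
    return ALTERNATE_NAMES.get(best, best)


print(normalize("Ivory Coast"))
print(normalize("Massachusets"))
print(normalize("Ivory Cost"))
print(normalize("Atlantis"))
```

Matching against aliases as well as canonical names lets a misspelled alternate name, such as "Ivory Cost", still resolve to the same geospatial entity.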

CLAVIN: an open source geoparser




System Architecture




Live Demonstration




Live Demonstration




     What can I do with this data?




Map Visualizations




Hierarchical Geospatial Search




     Virginia
            Reston
            Arlington
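
One way to read the hierarchy above: a search for "Virginia" should also return documents tagged with places inside Virginia, such as Reston or Arlington. The following is a minimal sketch of that idea, assuming each resolved place has a single parent region. The `PARENT` table, the `ancestors` and `search` functions, and the document IDs are all hypothetical.

```python
# Hypothetical containment table: place -> immediate parent region.
PARENT = {
    "Reston": "Virginia",
    "Arlington": "Virginia",
    "Virginia": "United States",
    "Chicago": "Illinois",
    "Illinois": "United States",
}


def ancestors(place):
    """Return the chain of regions that contain a place, nearest first."""
    chain = []
    while place in PARENT:
        place = PARENT[place]
        chain.append(place)
    return chain


def search(region, documents):
    """Return IDs of documents tagged with the region or any place inside it."""
    return sorted(
        doc_id
        for doc_id, places in documents.items()
        if any(p == region or region in ancestors(p) for p in places)
    )


# Hypothetical documents, each tagged with the places resolved from its text.
documents = {
    "doc1": ["Reston"],
    "doc2": ["Chicago"],
    "doc3": ["Arlington", "Chicago"],
}

print(search("Virginia", documents))
print(search("Illinois", documents))
```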




Geospatial Bounding Box Search
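
A bounding box search keeps only the places whose coordinates fall inside a rectangle. The sketch below is an illustration rather than the system's implementation: the `BoundingBox` class, the `PLACES` table with approximate coordinates, and the `search_box` function are hypothetical, and the box is assumed not to cross the antimeridian.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (edges included)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Hypothetical resolved places with approximate (latitude, longitude).
PLACES = {
    "Reston": (38.96, -77.36),
    "Arlington": (38.88, -77.10),
    "Chicago": (41.88, -87.63),
}


def search_box(box, places):
    """Return the names of all places located inside the box."""
    return sorted(name for name, (lat, lon) in places.items() if box.contains(lat, lon))


# A rough box around northern Virginia, chosen only for this example.
northern_virginia = BoundingBox(south=38.5, west=-78.0, north=39.2, east=-76.8)
print(search_box(northern_virginia, PLACES))
```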




Geospatial Analytics on Unstructured Text




Performance Metrics & Features


     CLAVIN: “Cartographic Location And Vicinity INdexer”

     Accurate: 0.75 F-measure
     Fast: 100 locations per sec per cpu
     Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster
     Smart: natural language processing, context-based heuristics, & fuzzy matching
     Easy to use: simple Java-based API
     Open source: Apache License
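
The scalability figure can be restated as a throughput rate with simple arithmetic. The sketch below also shows how an F-measure is computed, assuming the slide means the balanced F1 score (the harmonic mean of precision and recall), which the slide does not state. The precision and recall values passed to `f_measure` are hypothetical inputs, not reported results.

```python
# Figures taken from the slide.
DOCUMENTS = 1_000_000
SECONDS = 60 * 60  # 1 hour
NODES = 9

cluster_rate = DOCUMENTS / SECONDS  # documents per second, whole cluster
node_rate = cluster_rate / NODES    # documents per second, per node

print(f"cluster: {cluster_rate:.1f} docs/sec")
print(f"per node: {node_rate:.1f} docs/sec")


def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs: equal precision and recall of 0.75 give F1 = 0.75.
print(f_measure(0.75, 0.75))
```

That works out to roughly 278 documents per second across the cluster, or about 31 per node.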
     clavin.bericotechnologies.com

     Charlie Greenbacker
     @greenbacker

     meetup.com/DC-NLP
     @DCNLP



Vertex AI Gemini Prompt Engineering Tips
 

Greenbacker open analyticsdc

  • 1. Open Source Software for Geospatial Analytics on Unstructured Big Data Charlie Greenbacker, Principal Data Scientist
  • 2. Background. About Me: Data Scientist; Natural Language Processing; Unstructured Text → Information. Berico Technologies: Veteran-owned Small Business; Big Data Analytics in the Cloud; Defense & Intel Community. All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.
  • 3. The Problem: geotagging unstructured text. Growing demand for geospatial analytics. Most of human knowledge remains “trapped” in text. Existing solutions are expensive and don’t scale. Need an open source solution.
  • 4. The Solution: an open source geoparser. 1. Data Ingestion (input: unstructured text). 2. Entity Extraction (named entity recognition; find location names in text). 3. Entity Resolution (match against a gazetteer; “The Springfield Problem”). 4. Data Enrichment (output: structured geo data).
  • 5. Data Ingestion: unstructured text (photo: Flickr user NS Newsflash)
  • 6. Entity Extraction: named entity recognition
  • 7. Entity Resolution: match against a gazetteer
  • 8. Data Enrichment: structured geo data
  • 9. “The Springfield Problem”
  • 10. Dealing with Ambiguity: Intelligent Context-based Heuristics. First: rank by population. Next: look for other locations mentioned in the same document (“Springfield” + “Chicago” = Illinois; “Springfield” + “Boston” = Massachusetts). Soon: calculate distance based on lat/lons. Resolve alternate names to the same geospatial entity (“Ivory Coast” = “Côte d’Ivoire”). Use fuzzy matching to capture misspelled place names, including both phonetic spelling & typographical errors.
  • 11. CLAVIN: an open source geoparser
  • 12. System Architecture
  • 13. Live Demonstration
  • 14. Live Demonstration: What can I do with this data?
  • 15. Map Visualizations
  • 16. Hierarchical Geospatial Search (Virginia: Reston, Arlington)
  • 17. Geospatial Bounding Box Search
  • 18. Geospatial Analytics on Unstructured Text
  • 19. Performance Metrics & Features. CLAVIN: “Cartographic Location And Vicinity INdexer”. Accurate: 0.75 F-measure. Fast: 100 locations per sec per cpu. Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster. Smart: natural language processing, context-based heuristics, & fuzzy matching. Easy to use: simple Java-based API. Open source: Apache License.
  • 20. clavin.bericotechnologies.com, Charlie Greenbacker, @greenbacker, meetup.com/DC-NLP, @DCNLP
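
The context-based heuristics listed on slide 10 can be illustrated with a small sketch. This is not CLAVIN's code or API (CLAVIN is Java-based); it is a minimal Python illustration under stated assumptions: the `Place` record, the toy `GAZETTEER`, its population figures, and the `candidates` and `resolve` helpers are all hypothetical. It prefers candidates that share a region with unambiguous places in the same document, falls back to population ranking, and uses the standard library's `difflib` as a stand-in for fuzzy matching (typographical errors only, not phonetic matching).

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    admin1: str      # state or province
    population: int  # illustrative value only, not real census data


# Hypothetical toy gazetteer; entries and populations are made up for illustration.
GAZETTEER = [
    Place("Springfield", "Illinois", 110_000),
    Place("Springfield", "Massachusetts", 150_000),
    Place("Springfield", "Missouri", 170_000),
    Place("Chicago", "Illinois", 2_700_000),
    Place("Boston", "Massachusetts", 650_000),
]


def candidates(name):
    """Return gazetteer entries for a name, tolerating small typos."""
    key = name.lower()
    known = sorted({p.name.lower() for p in GAZETTEER})
    if key not in known:
        # Fuzzy fallback: pick the closest known name, if any is similar enough.
        close = difflib.get_close_matches(key, known, n=1, cutoff=0.8)
        if not close:
            return []
        key = close[0]
    return [p for p in GAZETTEER if p.name.lower() == key]


def resolve(names):
    """Map each extracted place name to a single gazetteer entry."""
    cands = {n: candidates(n) for n in names}
    # Names with exactly one candidate supply the document's regional context.
    context = {c[0].admin1 for c in cands.values() if len(c) == 1}
    resolved = {}
    for name, options in cands.items():
        if not options:
            continue  # unknown name: leave unresolved
        # Prefer a candidate in a context region, then the larger population.
        resolved[name] = max(
            options, key=lambda p: (p.admin1 in context, p.population)
        )
    return resolved


print(resolve(["Springfield", "Chicago"])["Springfield"].admin1)
print(resolve(["Springfield", "Boston"])["Springfield"].admin1)
```

With this toy data, "Springfield" next to "Chicago" resolves to Illinois and next to "Boston" to Massachusetts. With no context, the entry with the largest (illustrative) population wins. A distance-based score over lat/lons, which the slide lists as upcoming, could replace the simple shared-region test.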

Editor's Notes

  1. “Berico specializes in building open source software to support analytic missions, and implementing them through our services.” “We help our customers optimize the use of open source solutions for Cloud environments to replace the functionality of traditionally license-based projects.” “All of our products are built to run on and optimize cloud technologies – specifically HBase or Accumulo. We are the first authorized Cloudera partner in the federal sector.” “CLAVIN is one of 7 open source products that we’ve built and implemented with customers in the DoD and IC. We’ve chosen CLAVIN as an example to walk through today to illustrate how Berico’s open source products deliver great, market-leading functionality with no licensing constraints, and at a fraction of the cost of proprietary tools in the market.” (an infinite fraction – it’s free)
  2. Paris, France > Paris, Texas
  3. The interactive live demo will be run offline from the presenter’s laptop. The CLAVIN demo interface accepts plain text as input, and returns a list of geospatial entities (with lat/lons, etc.) corresponding to the place names extracted and resolved from the text, along with a visualization plotting these locations on a map. The example text used in the demo may include the following:
     • the sample text file built into the CLAVIN demo interface
     • “Grover Cleveland was the 22nd president of the United States. He never went to Cuba.” (shows that CLAVIN knows “Grover Cleveland” is not a city in Ohio)
     • “I was born in Boston and grew up in Springfield.” (produces a map of Massachusetts)
     • “I was born in Chicago and grew up in Springfield.” (produces a map of Illinois)
     • “I traveled to London and Oxford last summer.” (produces a map of England)
     • “I traveled to London and Toronto last summer.” (produces a map of Ontario)
     • a random news article from CNN.com (or a similar source)
     • any example text provided by the audience
  4. Geotag 1M documents containing 5.7M place names in under 1 hour on a 9-node Hadoop cluster vs. the prohibitively expensive enterprise licenses of competing solutions like MetaCarta.
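
The scale figures in the last note can be turned into rough rates with simple arithmetic. The sketch below assumes the run took the full hour (the note says "under 1 hour", so the rates are lower bounds) and that work was spread evenly across the 9 nodes; the `throughput` helper is a hypothetical name used only for this illustration. The source does not state how many CPUs each node had, so no per-CPU rate is derived.

```python
def throughput(docs, place_names, seconds, nodes):
    """Derive average rates from the totals quoted in the notes."""
    places_per_sec = place_names / seconds
    return {
        "docs_per_sec": docs / seconds,
        "places_per_sec": places_per_sec,
        # Assumes an even split of work across nodes.
        "places_per_sec_per_node": places_per_sec / nodes,
        "places_per_doc": place_names / docs,
    }


rates = throughput(docs=1_000_000, place_names=5_700_000, seconds=3600, nodes=9)
for key, value in rates.items():
    print(f"{key}: {value:.1f}")
# docs_per_sec: 277.8
# places_per_sec: 1583.3
# places_per_sec_per_node: 175.9
# places_per_doc: 5.7
```

Under these assumptions the cluster resolves at least about 1,580 place names per second, or about 176 per node per second, with an average of 5.7 place names per document.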