Building knowledge graphs in DIG
Pedro Szekely and Craig Knoblock
University of Southern California
Information Sciences Institute
dig.isi.edu
Goal
USC Information Sciences Institute CC-By 2.0 2
raw • messy • disconnected → clean • organized • linked
hard to query, analyze & visualize → easy to query, analyze & visualize
Use Case: Human Trafficking
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
Salient Statistics on Human Trafficking
• Profits per year: $32 billion
• Average age of entry into prostitution in the US: 14
• Pimp’s profit per victim per year: $150,000
• Advertising budget on the Web: $45 million
Task: Tracking the Victim’s Locations
> 100 million pages advertising adult services
Example: Investigating a Reported Victim
San Diego, where else?
DIG Interface: Find the locations where a potential victim was advertised
Steps To Build a DIG
Pipeline steps: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
Diagram labels: Crawling, Extraction, Mapping to Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization
Resources shown: ElasticSearch, graph DB, schema.org, geonames
Data Acquisition
downloading relevant data
batch • real-time
Web pages • Web service • database • CSV • Excel • XML • JSON
Feature Extraction
from raw sources to structured data
• trainable text extractors
• extraction from structured Web pages
• image features
• PDF extractor
Feature Extraction from Text
“YOU don't wanna miss out on
ME :) Perfect lil booty Green
eyes Long curly black hair Im a
Irish,Armenian and Filipino
mixed princess :) ❤ Kim ❤
7○7~7two7~7four77 ❤ HH 80
roses ❤ Hour 120 roses ❤ 15
mins 60 roses”
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min
$80/30min
$120/60min
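The phone number in the ad above is deliberately obfuscated ("7○7~7two7~7four77"). The deck attributes extraction to trainable extractors; the rule-based sketch below is not that method, only an illustration of the normalization problem. The names `DIGIT_WORDS`, `ZERO_LOOKALIKES`, and `normalize_phone` are hypothetical, and the lookalike list is an assumption that covers just this example.

```python
import re

# Spelled-out digits seen in obfuscated ads (illustrative, not exhaustive).
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
# Characters used as stand-ins for zero (assumption for this sketch).
ZERO_LOOKALIKES = "○o"

_WORD_RE = re.compile("|".join(DIGIT_WORDS))


def normalize_phone(token):
    """Return a 10-digit phone as NNN-NNN-NNNN, or None if the token does not fit."""
    text = token.lower()
    # Replace digit words first so the "o" in "two"/"four" is not read as zero.
    text = _WORD_RE.sub(lambda m: DIGIT_WORDS[m.group(0)], text)
    for ch in ZERO_LOOKALIKES:
        text = text.replace(ch, "0")
    digits = re.sub(r"\D", "", text)  # drop separators such as "~"
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(normalize_phone("7○7~7two7~7four77"))  # 707-727-7477
```

The function is meant to run on a candidate token, not on the whole ad: applied to free text, the word replacement would also fire inside ordinary words.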
20 Examples
1,000s of Tasks (2 Cents/Sentence)
Performance of CRF Extractors
Precision, recall and F (in %) for two attributes, regular expressions vs. DIG:
Attribute  Extractor            Precision  Recall  F
Eyes       Regular Expressions  80         10      18
Eyes       DIG                  99         91      94
Hair       Regular Expressions  80         6       12
Hair       DIG                  99         73      84
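As a reading aid for the chart values above, the sketch below assumes that "F" denotes the balanced F-measure (the harmonic mean of precision and recall); the deck does not state this. The helper name `f_measure` is hypothetical. Recomputed values can differ from the chart by about a point, possibly because the chart was built from unrounded precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (same units as inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as read from the chart (percent).
for label, p, r in [
    ("eyes, regular expressions", 80, 10),
    ("eyes, DIG", 99, 91),
    ("hair, regular expressions", 80, 6),
    ("hair, DIG", 99, 73),
]:
    print(f"{label}: F = {f_measure(p, r):.1f}")
```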
Structured Extraction
Automated Extraction
input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (one per cluster) → extractor
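The first stage of this flow, grouping pages by template, can be sketched as follows. This is a deliberately crude stand-in, not the clustering used in DIG: it assumes two pages share a template when their sequence of opening HTML tags is identical. The names `TagSequence`, `template_signature`, and `cluster_by_template` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def template_signature(html):
    """Reduce a page to its tag sequence, ignoring text and attributes."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)


def cluster_by_template(pages):
    """Group page ids whose tag sequences are identical."""
    clusters = defaultdict(list)
    for page_id, html in pages.items():
        clusters[template_signature(html)].append(page_id)
    return sorted(sorted(ids) for ids in clusters.values())


pages = {
    "a1": "<html><body><h1>Ad 1</h1><p>text</p></body></html>",
    "a2": "<html><body><h1>Ad 2</h1><p>other</p></body></html>",
    "b1": "<html><body><table><tr><td>x</td></tr></table></body></html>",
}
print(cluster_by_template(pages))  # [['a1', 'a2'], ['b1']]
```

Real pages from one template rarely have identical tag sequences (lists vary in length, optional blocks come and go), so a practical version would compare signatures with a similarity measure instead of exact equality.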
Unsupervised Extraction Tool
Extraction Evaluation
Results by field (10 websites, 5 pages each):
Field         Perfect      Pretty Good
Title         1.0 (50/50)  1.0 (50/50)
Desc          .76 (37/49)  .98 (48/49)
Seller        .95 (40/42)  .95 (40/42)
Date          .83 (40/48)  .83 (40/48)
Price         .87 (39/45)  .98 (44/45)
Loc           .51 (23/45)  .84 (38/45)
Cat           .68 (34/50)  .88 (44/50)
Member Since  1.0 (35/35)  1.0 (35/35)
Expires       .52 (15/29)  .55 (16/29)
Views         .76 (19/25)  1.0 (25/25)
ID            .97 (35/36)  1.0 (36/36)
Feature Alignment
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas
Karma: Mapping Data to Ontologies
Inputs: Services, Relational Sources, Hierarchical Sources
Ontology: Schema.org
Output: { JSON-LD }
karma.isi.edu
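To make the idea of mapping source fields to a common schema concrete, here is a minimal hand-written sketch. It does not use Karma or its API; `FIELD_MAP`, `to_jsonld`, and the example.org URLs are hypothetical, and the target property names are simply the labels that appear on the deck's graph slides.

```python
import json

# Source field name -> property name in the common domain schema.
# The mapping is an assumption for illustration; Karma builds such models interactively.
FIELD_MAP = {
    "name": "name",
    "eye-color": "eyeColor",
    "hair-color": "hairColor",
    "phone": "phone",
}


def to_jsonld(record, record_id, context="http://example.org/context.jsonld"):
    """Rename mapped fields and wrap the record as a JSON-LD node; unmapped fields are dropped."""
    doc = {"@context": context, "@id": record_id, "@type": "Person"}
    for field, value in record.items():
        prop = FIELD_MAP.get(field)
        if prop is not None:
            doc[prop] = value
    return doc


# Extractor output in the shape shown on the text-extraction slide.
extracted = {
    "name": "Kim",
    "eye-color": "green",
    "hair-color": "black",
    "phone": "707-727-7477",
}
doc = to_jsonld(extracted, "http://example.org/person/1")
print(json.dumps(doc, indent=2))
```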
Karma Solves Feature Alignment
Provenance
Domain Schema
took ~30 minutes to align the output of the Stanford name extractor
Feature Alignment Statistics
• 5 contractors provided data
• ~ 15 datasets
• > 30 Karma models
• > 200 million records
• 1 hour of processing on a 20-node Hadoop cluster
Entity Resolution
merging records that refer to the same entity
Challenges (currently working on techniques to address them):
- missing data
- incorrect data
- scale (~50 million records)
Entity Resolution on Strong Attributes
[Figure: two ad graphs]
Graph 1: AdultService-1, Offer-1, Person-1; name Jessica; eyeColor blue; hairColor red; price 250/hour; startDate 2014-12-07; availableAt Santa Barbara; seller; phone 619-319-7315; itemProvided
Graph 2: AdultService-2, Offer-2, Person-2; name Jessica; eyeColor blue; price 250/hour; startDate 2014-05-28; availableAt Washington DC; seller; phone; email; itemProvided
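One common way to merge records on strong attributes is to treat every shared phone number or email address as a link and take connected components with a union-find structure. The sketch below assumes that approach; the deck does not say how DIG implements it. The function name, record ids, the email address, and the second phone number are hypothetical.

```python
from collections import defaultdict


def resolve_by_strong_attributes(records, strong_keys=("phone", "email")):
    """Cluster record ids that share a value for any strong attribute."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (attribute, value) -> first record id carrying it
    for rid, rec in records.items():
        for key in strong_keys:
            value = rec.get(key)
            if not value:
                continue  # missing data: nothing to link on
            owner = first_seen.setdefault((key, value), rid)
            union(owner, rid)

    clusters = defaultdict(list)
    for rid in records:
        clusters[find(rid)].append(rid)
    return sorted(sorted(ids) for ids in clusters.values())


records = {
    "ad-1": {"phone": "619-319-7315", "availableAt": "Santa Barbara"},
    "ad-2": {"phone": "619-319-7315", "email": "a@example.org"},
    "ad-3": {"email": "a@example.org", "availableAt": "Washington DC"},
    "ad-4": {"phone": "555-0100"},
}
print(resolve_by_strong_attributes(records))
# [['ad-1', 'ad-2', 'ad-3'], ['ad-4']]
```

Note that ad-1 and ad-3 share no attribute directly; they end up together because ad-2 links to both. That transitivity is also the risk: a single wrong phone number can merge two unrelated clusters.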
Linking Using Text Similarity
E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
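The three ads above differ only in the name, one word, and the spacing of "OUTCALLS". A simple way to capture that kind of near-duplication is Jaccard similarity over character shingles after stripping case, spacing, and punctuation. The deck does not name the similarity measure DIG uses, so this is an assumed stand-in; `normalize`, `shingles`, and `jaccard` are hypothetical helpers.

```python
import re


def normalize(text):
    """Lowercase and keep only letters and digits, removing spacing tricks."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def shingles(text, n=3):
    """Set of overlapping character n-grams of the normalized text."""
    s = normalize(text)
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


ad_1 = ("E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. "
        "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")
ad_2 = ("LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. "
        "LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S")
unrelated = "Quiet apartment for rent near the park"

print(round(jaccard(ad_1, ad_2), 2), round(jaccard(ad_1, unrelated), 2))
```

At the scale mentioned in the deck, comparing all pairs is not feasible; a technique such as MinHash with locality-sensitive hashing is a typical way to approximate the same measure, though the deck does not say what is used.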
Linking Using Image Similarity
100 million images; technology: deep learning
[Figure: the same two ad graphs as on the strong-attributes slide, annotated "same victim" and "same trafficker"]
Unsupervised Collective Entity Resolution
Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
Elastic Search Data Model
Adult Service • Offer • Person • Phone • Web Page
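One plausible reading of this data model is that each of these types is stored as its own document, with linked nodes embedded so that a query does not need joins. The sketch below assumes that reading and is not taken from the DIG code base; `build_document`, the edge directions, and the `uri` field are hypothetical. It only builds the nested dictionary and makes no ElasticSearch calls.

```python
from collections import defaultdict


def build_document(root_id, nodes, edges, max_depth=3):
    """Embed nodes reachable from root_id into one nested dict, up to max_depth hops."""
    outgoing = defaultdict(list)
    for subject, predicate, obj in edges:
        outgoing[subject].append((predicate, obj))

    def expand(node_id, depth, seen):
        doc = {"uri": node_id, **nodes.get(node_id, {})}
        if depth == 0:
            return doc
        for predicate, obj in outgoing[node_id]:
            if obj in seen:
                continue  # skip nodes already on this path to avoid cycles
            child = expand(obj, depth - 1, seen | {obj})
            doc.setdefault(predicate, []).append(child)
        return doc

    return expand(root_id, max_depth, {root_id})


# Values taken from the deck's example graph; the edge layout is an assumption.
nodes = {
    "AdultService-1": {"name": "Jessica", "eyeColor": "blue", "hairColor": "red"},
    "Offer-1": {"price": "250/hour", "startDate": "2014-12-07"},
    "Person-1": {"phone": "619-319-7315"},
}
edges = [
    ("Offer-1", "itemProvided", "AdultService-1"),
    ("Offer-1", "seller", "Person-1"),
]
offer_doc = build_document("Offer-1", nodes, edges)
print(offer_doc["seller"][0]["phone"])  # 619-319-7315
```

Embedding trades storage and update cost for query speed: the same Person node is copied into every Offer document that references it, so a change to that node requires reindexing each copy.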
Indexing for High Performance Knowledge Graph Queries
[Chart: average query times in milliseconds, single-user query load, 1.2 billion triples; series: state-of-the-art graph database (RDF) vs. DIG indexing deployed in ElasticSearch]
DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
Deployed to 6 Law Enforcement Agencies and Successfully Used to Prosecute Traffickers
DIG Applications
- Human Trafficking: large, real users
- Material Science Research: 70,000 paper abstracts (built in 1 week)
- Arms Trafficking: identify illegal sales
- Patent Trolls: identify patent trolls
- Cyber Attacks: predict cyber attacks from dark web data
Conclusions
• Complete tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Scales to ~100 million pages, ~3 billion facts
• Deployed to law enforcement
Questions?
dig.isi.edu
Open Source, Apache 2 License

 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 

Building Knowledge Graphs for Investigating Human Trafficking

  • 1. Building knowledge graphs in DIG. Pedro Szekely and Craig Knoblock, University of Southern California, Information Sciences Institute. dig.isi.edu
  • 2. Goal: from raw · messy · disconnected (hard to query, analyze & visualize) to clean · organized · linked (easy to query, analyze & visualize). USC Information Sciences Institute, CC-By 2.0
  • 3. Use Case: Human Trafficking. From raw · messy · disconnected (hard to query, analyze & visualize) to clean · organized · linked (easy to query, analyze & visualize)
  • 4. Use Case: Human Trafficking. 100 million pages; ~100 Web sites; help victims; prosecute traffickers
  • 5. Salient Statistics on Human Trafficking: Profits per Year: $32 Billion; Average Age of Entry To Prostitution in the US: 14; Pimp's Profit Per Victim Per Year: $150,000; Advertising Budget On the Web: $45 Million
  • 6. Task: Tracking the Victim's Locations. > 100 million pages advertising adult services
  • 7. Example: Investigating a Reported Victim. San Diego, where else?
  • 8. DIG Interface: Find the locations where a potential victim was advertised
  • 9. Steps To Build a DIG: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface. Other diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames. Current step: Data Acquisition
  • 10. Data Acquisition: downloading relevant data; batch · real-time; Web pages · Web service · database · CSV · Excel · XML · JSON
  • 11. Steps To Build a DIG (pipeline diagram from slide 9, repeated)
  • 12. Feature Extraction: from raw sources to structured data. Trainable text extractors; extraction from structured Web pages; image features; PDF extractor
  • 13. Feature Extraction from Text. Input: “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”. Extracted: name: Kim; eye-color: green; hair-color: black; phone: 707-727-7477; rate: $60/15min, $80/30min, $120/60min
  • 14. 20 Examples
  • 15. 1,000’s of Tasks (2 Cents/Sentence)
  • 16. Performance of CRF Extractors (bar charts; values in the order shown, Precision / Recall / F). Eyes: Regular Expressions 80 / 10 / 18, DIG 99 / 91 / 94. Hair: Regular Expressions 80 / 6 / 12, DIG 99 / 73 / 84
  • 20. Extraction Evaluation (10 websites, 5 pages each), by field:
      Field          Perfect         Pretty Good
      Title          1.0 (50/50)     1.0 (50/50)
      Desc           .76 (37/49)     .98 (48/49)
      Seller         .95 (40/42)     .95 (40/42)
      Date           .83 (40/48)     .83 (40/48)
      Price          .87 (39/45)     .98 (44/45)
      Loc            .51 (23/45)     .84 (38/45)
      Cat            .68 (34/50)     .88 (44/50)
      Member Since   1.0 (35/35)     1.0 (35/35)
      Expires        .52 (15/29)     .55 (16/29)
      Views          .76 (19/25)     1.0 (25/25)
      ID             .97 (35/36)     1.0 (36/36)
  • 21. Steps To Build a DIG (pipeline diagram from slide 9, repeated)
  • 22. Feature Alignment: from multiple schemas to a common domain schema. Multiple schemas: CSV, Excel; database tables; Web services; extractors; nomenclature; spelling
  • 23. Karma: Mapping Data to Ontologies. Services, relational sources and hierarchical sources go into Karma, which produces JSON-LD using Schema.org. karma.isi.edu
  • 24. Karma Solves Feature Alignment. Provenance; domain schema. It took ~30 minutes to align the output of the Stanford name extractor
  • 25. Feature Alignment Statistics: 5 contractors provided data; ~15 datasets; > 30 Karma models; > 200 million records; 1 hour of processing on a 20-node Hadoop cluster
  • 26. Steps To Build a DIG (pipeline diagram from slide 9, repeated)
  • 27. Entity Resolution: merging records that refer to the same entity. Challenges: missing data; incorrect data; scale (~50 million records). Currently working on techniques to address these
  • 28. Entity Resolution on Strong Attributes (graph diagram, two records). Record 1: nodes AdultService-1, Person-1, Offer-1; edges seller, itemProvided; values: availableAt Santa Barbara, phone 619-319-7315, hairColor red, price 250/hour, startDate 2014-12-07, eyeColor blue, name Jessica. Record 2: nodes AdultService-2, Person-2, Offer-2; edges seller, itemProvided, phone, email; values: availableAt Washington DC, price 250/hour, startDate 2014-05-28, eyeColor blue, name Jessica
  • 29. Linking Using Text Similarity. Three ads: (1) “E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S”; (2) “LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S”; (3) “L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S”
  • 30. Linking Using Image Similarity. 100 Million Images. Technology: Deep Learning
  • 32. Unsupervised Collective Entity Resolution
  • 33. Steps To Build a DIG (pipeline diagram from slide 9, repeated)
  • 34. Graph Construction: assembling the data for efficient query & analysis. ElasticSearch: scalable, efficient query; graph databases: network analytics; NoSQL: scalable analytics; bulk loading: massive data imports; real-time updates: live, changing data
  • 35. Elastic Search Data Model: Adult Service, Offer, Person, Phone, Web Page
  • 36. Indexing for High Performance Knowledge Graph Queries (chart: average query times in milliseconds, single-user query load, 1.2 billion triples). Compares a state-of-the-art graph database (RDF) with DIG indexing deployed in ElasticSearch
  • 37. Steps To Build a DIG (pipeline diagram from slide 9, repeated)
  • 38.
  • 39.
  • 40. DIG Deployment for Human Trafficking: 100 million Web pages; live updates (~5,000 pages/hour); ElasticSearch database (7 nodes); Hadoop workflows (20 nodes). Users: District Attorney; Law Enforcement; NGOs
  • 42. DIG Applications. Human Trafficking: large, real users. Material Science Research: 70,000 paper abstracts (built in 1 week). Arms Trafficking: identify illegal sales. Patent Trolls: identify patent trolls. Cyber Attacks: predict cyber attacks from dark web data
  • 43. Conclusions: Complete tool-chain to build domain-specific knowledge graphs. Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc. Scales to ~100 million pages, ~3 billion facts. Deployed to law enforcement
  • 44. Questions? dig.isi.edu. Open Source, Apache 2 License
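
Linking on strong attributes (slide 28) can be pictured as grouping records that share a phone number or email. The sketch below is only an illustration, not DIG's implementation: the function name, the record layout, the email addresses, and the assumption that Person-2 carries the same phone number as Person-1 are all hypothetical. It uses a plain union-find to merge records transitively.

```python
from collections import defaultdict


def link_by_strong_attributes(records, strong_keys=("phone", "email")):
    """Group record ids that share any strong attribute value (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent pointers to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (attribute, value) -> first record id that carried it
    for r in records:
        for key in strong_keys:
            value = r.get(key)
            if not value:
                continue  # missing data gives no evidence either way
            if (key, value) in first_seen:
                union(first_seen[(key, value)], r["id"])
            else:
                first_seen[(key, value)] = r["id"]

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r["id"])].add(r["id"])
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records; only the phone number itself appears on the slide.
records = [
    {"id": "Person-1", "phone": "619-319-7315"},
    {"id": "Person-2", "phone": "619-319-7315", "email": "a@example.org"},
    {"id": "Person-3", "email": "b@example.org"},
]
clusters = link_by_strong_attributes(records)
```

Because merging is transitive, a record that shares a phone with one record and an email with another pulls all three into one cluster, which is also why incorrect attribute values can over-merge at scale.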

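The text-similarity linking on slide 29 is described in the notes as n-grams followed by Jaccard similarity. A minimal sketch of that idea follows, assuming character trigrams and case-preserving whitespace normalization; both choices are assumptions for illustration (the talk does not give the actual mapping step), and the third ad text is invented as a contrast case.

```python
import re


def char_ngrams(text, n=3):
    """Return the set of character n-grams after collapsing whitespace.

    Case is kept on purpose so that stylistic quirks such as 'bUsTy'
    still contribute to the similarity.
    """
    text = re.sub(r"\s+", " ", text.strip())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


ad_a = "SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me."
ad_b = "SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me."
ad_c = "Quiet studio apartment for rent near campus."  # invented contrast text

sim_ab = jaccard(char_ngrams(ad_a), char_ngrams(ad_b))
sim_ac = jaccard(char_ngrams(ad_a), char_ngrams(ad_c))
```

Here `sim_ab` comes out much higher than `sim_ac`, since the first two ads differ only in the casing of one word while the third shares almost no trigrams with them.
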
Editor's Notes

  1. Karma offers suggestions on how to do the mapping
  2. Simplest kind of linking we do: linking based on strong, explicit attributes (phones, emails, websites, etc.). So person-1 and person-2 might be the same person, but can we find more attributes to improve our confidence?
  3. Estimating text similarity is challenging. Here we are emphasizing stylometric similarity: map → n-grams → Jaccard similarity.
  4. Clever scheme for storing pairwise similarities in a database that can be updated incrementally (so we can bypass hashing that leverages Elasticsearch with Lucene).
  5. Why is linking significant in this domain? Slide shows why.
  6. There are some clever tricks. We produce JSON documents rooted on the classes we care about. Each document contains enough of the graph neighborhood for keyword queries to work, so that I can search for an adult service using a phone number even though the phone number is really part of the seller, or search for all offers that have the same phone number. Basically, we copy over some content.
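
The denormalization described in note 6 can be sketched in plain Python. This is only an illustration under stated assumptions: the field layout, the `ex:` URIs, and the helper names are hypothetical, and the keyword match is a stand-in for what a search index would do, not an Elasticsearch call. The values come from slide 28.

```python
import json


def build_root_document(service, offer, seller):
    """Root a document on the adult service and copy in neighboring values."""
    return {
        "a": "AdultService",
        "uri": service["uri"],
        "name": service.get("name"),
        "eyeColor": service.get("eyeColor"),
        "offer": {
            "uri": offer["uri"],
            "availableAt": offer.get("availableAt"),
            "price": offer.get("price"),
            # The phone belongs to the seller but is copied into this document.
            "seller": {"uri": seller["uri"], "phone": seller.get("phone")},
        },
    }


def leaf_values(node):
    """Yield every non-null leaf value in a nested dict/list as a string."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for value in node:
            yield from leaf_values(value)
    elif node is not None:
        yield str(node)


def keyword_match(doc, keyword):
    """Naive stand-in for a keyword query over the whole document."""
    return any(keyword in value for value in leaf_values(doc))


doc = build_root_document(
    service={"uri": "ex:AdultService-1", "name": "Jessica", "eyeColor": "blue"},
    offer={"uri": "ex:Offer-1", "availableAt": "Santa Barbara", "price": "250/hour"},
    seller={"uri": "ex:Person-1", "phone": "619-319-7315"},
)
found_by_phone = keyword_match(doc, "619-319-7315")
print(json.dumps(doc, indent=2))
```

The trade-off is duplicated content: the same phone number is stored in every document whose neighborhood includes that seller, in exchange for answering the query from a single document.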