Domain Identification for Linked Open Data
Sarasi Lalithsena
Pascal Hitzler
Amit Sheth
Kno.e.sis Center
Wright State University, Dayton, OH

Prateek Jain
IBM T.J. Watson Research Center
Yorktown, NY, USA

WI 2013 Atlanta, GA, USA
Motivation

lod cloud
262 datasets

870 alive datasets

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lodcloud.net/”

2
Motivation

Lingvoj
Climbdata

Need better ways to discover, describe and organize datasets

3
Problem
• How do we identify the relevant datasets from this structured
knowledge space?
– How do we create a registry of topics which describe the
domain of a dataset?

4
State of the Art – Existing Problems with Dataset Lookup
• Rely on manual tagging provided by users and the manual
reviewing process
– CKAN data hub, LOD Diagram
• Rely on keywords and metadata provided by users
– CKAN data hub, LODStats
• Need to know instances to start exploring the datasets
– Semantic Search Engines (SSE) such as Sigma, Swoogle and
Watson
• Need to know seed URIs to find the relevant datasets
– Federated Querying Systems for LOD

5
What do we propose?
• Introduce a systematic and sophisticated way to identify
possible domains, topics and tags (topic domains) to better describe
these datasets
• What can these topic domains be?
– A predefined list
– The types in the schema of each dataset

6
What do we propose?

Knowledge bases + category system

Topic Domains

7
How do we address the previous problems?
• Use the category system of existing knowledge sources as the
vocabulary to describe the domain
– Does not need to rely on a predefined set of tags
– Does not need to rely on metadata and keywords
• Automatic way to identify the topic domains
• Vocabulary can be used to search the datasets and organize the
datasets

8
Our approach - Freebase
• Use Freebase as our knowledge source to identify the topic
domains
• Why Freebase?
– Wide Coverage
Has 39 million topics
– Simple Category Hierarchy System
• The Freebase category system categorizes each topic into types, and
types are grouped into domains
– Example: music (domain), Artist (type)
• Utilized Freebase types and domains as our topic domains

9
Our Approach - STEPS

1. Instance Identification
2. Category Hierarchy Creation
3. Category Hierarchy Merging
4. Candidate Category Hierarchy Selection
5. Frequency Count Generation

10
Our Approach
STEP 1 Instance Identification
– Extract the instances of the dataset with their types
– Extract the human-readable values of the instances and types,
e.g. Granite and its type Rock
– Identify the closely related instance in Freebase for
each instance in our dataset

Ignimbrite, Rock
Slate, Rock
Granite, Rock

http://www.freebase.com/m/01tx7r
http://www.freebase.com/m/01c_9j
http://www.freebase.com/m/03fcm

11
Our Approach
• Instance Identification
We attach the type information as well to the query string

Apple

Apple Company

Apple Fruit

Apple Fruit

12
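The idea of attaching the type to the query string can be illustrated with a minimal sketch. The function name `build_query` and the plain space-joined format are assumptions made for illustration; the slides do not say how the query string is actually assembled or which lookup service receives it.

```python
from typing import Optional


def build_query(label: str, type_label: Optional[str] = None) -> str:
    """Return a lookup string; the type label is appended when one is available.

    Hypothetical helper: the concrete query format is an assumption.
    """
    parts = [label.strip()]
    if type_label and type_label.strip():
        parts.append(type_label.strip())
    return " ".join(parts)


# (label, type) pairs taken from the examples on the slides.
instances = [("Apple", "Company"), ("Apple", "Fruit"), ("Granite", "Rock")]
queries = [build_query(label, type_label) for label, type_label in instances]
print(queries)
```

With the type attached, the two "Apple" instances produce different query strings, which is what lets a lookup tell the company from the fruit.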
Our Approach
• STEP 2 Category Hierarchy Creation
[Figure: each instance is mapped to its Freebase types in {domain/type} form, e.g. Ignimbrite has the type /geology/rock_type. The example shows category hierarchies for Ignimbrite, slate and granite, with the domains geology, geography and music and the types rock type, mountain, mountain range, release track and recording.]

13
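As a rough sketch of category hierarchy creation, the snippet below splits type identifiers in the {domain/type} form into a small domain-to-types mapping for one instance. Only `/geology/rock_type` appears verbatim on the slide; the other two identifiers are assumed from the figure labels, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_type_id(type_id):
    """Split an identifier such as '/geology/rock_type' into (domain, type)."""
    parts = type_id.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError(f"expected '/domain/type', got {type_id!r}")
    domain, type_name = parts
    # Underscores are replaced so the type reads like the slide labels.
    return domain, type_name.replace("_", " ")


def build_hierarchy(type_ids):
    """Group the types of one instance under their domains."""
    hierarchy = defaultdict(set)
    for type_id in type_ids:
        domain, type_name = parse_type_id(type_id)
        hierarchy[domain].add(type_name)
    return dict(hierarchy)


# Assumed identifiers for Ignimbrite (only the first one is shown on the slide).
ignimbrite_types = [
    "/geology/rock_type",
    "/geography/mountain",
    "/geography/mountain_range",
]
print(build_hierarchy(ignimbrite_types))
```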
Our Approach
• STEP 3 Category Hierarchy Merging
[Figure: merged category hierarchies for the example instances]
Ignimbrite: geology (rock type); geography (mountain, mountain range)
slate: geology (rock type); geography (mountain); music (release track, recording)
granite: geology (rock type); geography (mountain)

14
Our Approach
• STEP 4 Candidate Category Hierarchy Selection
Filter out insignificant category hierarchies using a simple
heuristic
[Figure: the category hierarchies for Ignimbrite, slate and granite, with the domains geology, geography and music]

15
Our Approach
• STEP 5 Frequency Count Generation
Count the number of occurrences for each category (number of
instances having the given category)

Term            Frequency  Parent Node
geology         3
rock type       3          geology
mountain range  1          geography
…               …          …

16
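The frequency count can be sketched as follows, counting for each category the number of instances that carry it and remembering its parent domain. The `hierarchies` data is my reading of the example figure and is an assumption, as is the function name; no filtering from the selection step is applied here, so the music domain is still counted.

```python
from collections import Counter

# Assumed per-instance hierarchies (domain -> types), read off the slide example.
hierarchies = {
    "Ignimbrite": {"geology": {"rock type"},
                   "geography": {"mountain", "mountain range"}},
    "slate": {"geology": {"rock type"},
              "geography": {"mountain"},
              "music": {"release track", "recording"}},
    "granite": {"geology": {"rock type"},
                "geography": {"mountain"}},
}


def count_categories(hierarchies):
    """Count, per category, how many instances have it; record each type's parent."""
    freq = Counter()
    parent = {}
    for per_instance in hierarchies.values():
        for domain, types in per_instance.items():
            freq[domain] += 1  # each instance counts once per domain
            for type_name in types:
                freq[type_name] += 1
                parent[type_name] = domain
    return freq, parent


freq, parent = count_categories(hierarchies)
for term, count in freq.most_common():
    print(term, count, parent.get(term, ""))
```

On this assumed input the counts for geology, rock type and mountain range agree with the rows shown in the table above.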
Implementation
• MapReduce deployment
[Figure: <instance, type> pairs are split across mappers (Map 1 … Map n), which run STEP 2 and 3; reducers (Reducer 1 … Reducer m) run STEP 4; STEP 5 runs as post-processing.]
Instances that belong to the same type go to a
single reducer

17
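The routing rule on this slide (instances of the same type reach a single reducer) can be mimicked in plain Python without a cluster. This is only an in-memory sketch of the map, shuffle and reduce stages; the function names are hypothetical and the reducer body is a placeholder, since the slides do not show the actual job code.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (type, instance) pairs so that the type acts as the shuffle key."""
    for instance, type_label in records:
        yield type_label, instance


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(type_label, instances):
    """Placeholder reducer: it only reports the group of instances it received."""
    return type_label, sorted(instances)


records = [("Ignimbrite", "Rock"), ("Slate", "Rock"),
           ("Granite", "Rock"), ("Apple", "Fruit")]
grouped = shuffle(map_phase(records))
results = [reduce_phase(key, values) for key, values in grouped.items()]
print(results)
```

Keying the map output by type is what guarantees that one reducer call sees every instance of that type together.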
Evaluation
• We ran our experiments with 30 datasets in LOD for evaluation

Evaluation
Appropriateness of the identified
domain

Effectiveness in finding the datasets

User Study

18
Appropriateness of the identified domain
• Selected the four most frequent domains and types from our results
• Mixed them with four other randomly selected domains and types
• Asked users to select the terms that best represent the
higher-level domains for the dataset (20 users)

* 50% of the users agreed on 73% of the terms (88 out of 120)

19
Appropriateness of the identified domain

Terms with the highest user agreement for each dataset. A star (*) indicates
that the term was also the highest ranked by our system (for 22 datasets).

20
Evaluation

Evaluation
Appropriateness of the identified
domain

User Study

Effectiveness in finding the datasets

1. User Study with three other SE

21
Effectiveness in finding the datasets
• Developed a search application using the normalized frequency
count
• User study against three existing state-of-the-art systems
– CKAN, LODStats and Sigma
• Term selection
• Top ten results are retrieved
• Asked users to rank which set of results they preferred
– 1 (best) to 4 (worst)
• Calculated a user preference score using a weighted average

22
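One plausible reading of the weighted average is the mean rank, where each rank from 1 to 4 is weighted by the number of users who assigned it. The slides do not state the weights, so this formula, the function name and the vote counts below are all assumptions for illustration.

```python
def preference_score(rank_counts):
    """Mean rank: each rank is weighted by the number of users who chose it.

    rank_counts maps a rank (1 = best, 4 = worst) to a number of users.
    """
    total_votes = sum(rank_counts.values())
    if total_votes == 0:
        raise ValueError("at least one vote is required")
    weighted_sum = sum(rank * count for rank, count in rank_counts.items())
    return weighted_sum / total_votes


# Hypothetical votes for one system and one search term.
votes = {1: 10, 2: 10, 3: 5, 4: 2}
score = preference_score(votes)
print(round(score, 3))
```

Under this reading a score close to 1 means most users ranked that system's results first, and a score of 4 means every user ranked them last.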
Effectiveness in finding the datasets
Term        Our Approach  CKAN   LODStats  Sigma
music       2.037         3.74   3.11      1.333
artist      2.815         3.926  1         2.259
biology     3.481         3.333  1         2.185
animal      2.926         1.63   3.481     1.926
geology     2.852         3.666  1         2.481
drug        2.926         3.148  2         2.555
gene        2.148         3.333  3.074     1.222
university  3.185         3.148  2.37      1.222
food        3.259         2.296  3         1.259
language    3.148         3.74   1         2.11
spacecraft  4             4      1         2
conference  2.814         3.555  1         2.666
astronaut   4             4      1         2
composer    3.815         3.037  1         2.11
tv program  3.666         2.923  1         2.370
instrument  3.852         2      2         3.148
recipe      3.926         2      2         3.074
student     2             3.889  2         3.111
phenotypes  2             3.923  2         3.037
energy      1             3.74   3.26      3.03

23
Evaluation

Evaluation
Appropriateness of the identified
domain

User Study

Effectiveness in finding the datasets

1. User Study with three other SE

2. Evaluate CKAN as the baseline

24
Evaluate CKAN as the baseline
Term        P      R1     F1     R2     F2
music       0.286  1      0.445  0.1    0.148
artist      0.4    1      0.571  0.2    0.267
biology     0.125  1      0.222  0.333  0.182
animal      0      0      n/a    0      n/a
geology     0      0      n/a    0      n/a
drug        0.6    0.667  0.632  0.75   0.667
gene        0.333  1      0.5    0.125  0.182
university  0.5    1      0.667  0.051  0.093
food        0      0      n/a    0      n/a
language    1      1      1      0.045  0.0861
spacecraft  1      1      1      1      1
conference  1      1      1      0.125  0.222
astronaut   1      1      1      1      1
composer    0.25   1      0.4    0.5    0.333
tv program  0      0      n/a    0      n/a
instrument  0      1      0      1      0
recipe      0      1      0      1      0
student     1      0      0      0      0
phenotypes  1      0      0      0      0
energy      1      0      0      0      0

25
Evaluation

Evaluation
Appropriateness of the identified
domain

User Study

Effectiveness in finding the datasets

1. User Study with three other SE

2. Evaluate CKAN as the baseline
3. Evaluate both CKAN and our
approach using a manually curated
gold standard

26
Evaluation with a manually curated gold
standard
            CKAN                          Our Approach
Term        Precision  Recall  F-Measure  Precision  Recall  F-Measure
music       1          0.5     0.667      0.571      1       0.727
artist      1          0.25    0.4        0.8        1       0.9
biology     1          0.2     0.333      0.625      1       0.769
animal      0          0       n/a        0.333      1       0.5
geology     0          0       n/a        1          0.5     0.667
drug        1          0.6     0.75       1          1       1
gene        1          0.333   0.5        1          1       1
university  0.5        0.667   0.572      0.6        1       0.75
food        0          0       n/a        0.25       1       0.4
language    1          1       1          1          1       1
spacecraft  1          1       1          1          1       1
conference  1          1       1          1          1       1
tv program  0          0       n/a        1          1       1
instrument  1          0       0          0.75       1       0.857
astronaut   1          1       1          1          1       1
composer    1          0.25    0.4        1          1       1
recipe      1          0       0          1          1       1
phenotypes  1          1       1          1          0       0
student     1          0.5     0.667      1          0       0
energy      1          0.333   0.5        1          0       0
Mean        0.775      0.432   0.489      0.846      0.825   0.728
27
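For readers who want to recompute figures of this kind, here is a minimal sketch of precision, recall and F-measure for one search term against a gold standard. It assumes the F-measure is the usual harmonic mean, which is consistent with rows such as music under Our Approach (0.571, 1, 0.727); returning `None` when both values are zero mirrors the n/a entries. The function names, the dataset names and the handling of an empty result list are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a gold-standard set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: an empty set yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall; None when both are zero (n/a)."""
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Hypothetical dataset names for a single search term.
retrieved = ["dataset-a", "dataset-b", "dataset-c", "dataset-d"]
gold = ["dataset-a", "dataset-b"]
p, r = precision_recall(retrieved, gold)
print(p, r, f_measure(p, r))
```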
Conclusion and Future Work
• Our approach is helpful for systematically categorizing the
datasets
• Demonstrated the potential of using the categorization for finding
relevant datasets
• Utilized a diverse classification hierarchy such as Freebase
• There are other potential applications for which this work might be
important, such as browsing and interlinking
• Plan to improve the domain coverage by using knowledge
sources such as Wikipedia and Yago
• Compare the interpretations given by multiple knowledge sources
to see which one gives a better interpretation

28
Thank You!

Questions?
http://knoesis.wright.edu/researchers/sarasi
sarasi@knoesis.org

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA

Domain Identification for Linked Open Data

  • 1. Domain Identification for Linked Open Data Sarasi Lalithsena Pascal Hitzler Amit Sheth Kno.e.sis Center Wright State University, Dayton, OH Prateek Jain IBM T.J. Watson Research Center Yorktown, NY, USA WI 2013 Atlanta, GA, USA
  • 2. Motivation lod cloud 262 datasets 870 alive datasets “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lodcloud.net/” 2
  • 3. Motivation Lingvoj Climbdata Need better ways to dataset discovery, description and organization 3
  • 4. Problem • How do we identify the relevant datasets from this structured knowledge space? – How do we create a registry of topics which describe the domain of a dataset? 4
  • 5. State of the Art – Existing Problems to dataset lookup • Rely on manual tagging provided by users and the manual reviewing process – CKAN data hub, LOD Diagram • Rely on keywords and metadata provided by users – CKAN data hub, LODStats • Need to know instances to start explore the datasets – Semantic Search Engines (SSE) such as Sigma, Swoogle and Watson • Need to know seed URIs to find the relevant datasets – Federated Querying Systems for LOD 5
  • 6. What we propose? • Introduce a systematic and sophisticated way to identify possible domains, topics, tags (Topic Domain) to better describe these datasets • What are these topic domain can be? – Predefined set of list – Type of the schema of each dataset 6
  • 7. What we propose? Knowledge bases + category system Topic Domains 7
  • 8. How do we address the previous problems • Use the category system of existing knowledge sources as the vocabulary to describe the domain – Does not need to either rely on a predefined set of tags – Does not need to rely on metadata and keywords • Automatic way to identify the topic domains • Vocabulary can be used to search the datasets and organize the datasets 8
  • 9. Our approach - Freebase • Use Freebase as our knowledge source to identify the topic domains • Why Freebase? – Wide Coverage Has 39 million topics – Simple Category Hierarchy System • Freebase category system categorizes each topic in to types and types are grouped in to domains music Domain Artist Type • Utilized Freebase types and domains as our topic domains 9
  • 10. Our Approach - STEPS 1. 2. 3. 4. 5. Instance Identification Category Hierarchy Creation Category Hierarchy Merging Candidate Category Hierarchy Selection Frequency Count Generation 10
  • 11. Our Approach STEP 1 Instance Identification – Extract the instances of the dataset with its type – Extract the human readable values of the instances and type Granite and its type Rock – Identify the closely related instance from the freebase for each instance in our dataset Ignimbrite, Rock Slate, Rock Granite, Rock http://www.freebase.com/m/ 01tx7r http://www.freebase.com/m/ 01c_9j http://www.freebase.com/m/ 03fcm 11
  • 12. Our Approach • Instance Identification We attach the type information as well to the query string Apple Apple Company Apple Fruit Apple Fruit 12
  • 13. Our Approach • STEP 2 Category Hierarchy Creation Ignimbrite /geology/rock_type geography geology {domain/type} geography Ignimbrite rock type geology mountain geography mountain range music music slate rock type geology mountain release track recording geography granite rock type mountain 13
  • 14. Our Approach • Category Hierarchy Merging geography geology Ignimbrite mountain rock type mountain range geology geography slate music release track rock type mountain recording geology geography granite rock type mountain 14
  • 15. Our Approach • Candidate Category Hierarchy Selection Filter out insignificant category hierarchies using a simple heuristics geography geology Ignimbrite mountain rock type mountain range geology geography slate music release track rock type mountain recording geology geography granite rock type mountain 15
  • 16. Our Approach • Frequency Count Generation Count the number of occurrences for each category (number of instances having the given category) Term Frequency Parent Node geology 3 rock type 3 geology mountain range 1 geography ….. … …. 16
  • 17. Implementation • Map Reduce Deployment STEP 2 and 3 map1 STEP 4 Reducer 1 map2 <Inst, type> …… ....... …… …… Map 3 map4 … STEP 5 Post Processing … … Reducer m … Map n Instances belong to same type will go into a single reducer 17
  • 18. Evaluation • We ran our experiments with 30 datasets in LOD for evaluation Evaluation Appropriateness of the identified domain Effectiveness in finding the datasets User Study 18
  • 19. Appropriateness of the identified domain • Selected the four most frequent domains and types from our results • Mixed them with four other randomly selected domains and types • Asked 20 users to select the terms that best represent the higher-level domains of each dataset • 50% of the users agreed on 73% of the terms (88 out of 120)
  • 20. Appropriateness of the identified domain • Table: terms with the highest user agreement for each dataset; a star (*) indicates that the term was also the highest ranked by our system (for 22 datasets)
  • 21. Evaluation • Appropriateness of the identified domain: user study • Effectiveness in finding the datasets: 1. User study with three other SE
  • 22. Effectiveness in finding the datasets • Developed a search application that uses the normalized frequency count • Ran a user study against three existing state-of-the-art systems: CKAN, LODStats and Sigma • Term selection • The top ten results are retrieved from each system • Asked users to rank the result sets by preference, from 1 (best) to 4 (worst) • Calculated a user preference score as a weighted average
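The slide does not spell out the weighting behind the user preference score. One plausible reading, stated here as an assumption, is the mean rank: each rank from 1 to 4 is weighted by the number of users who assigned it. The function name and the vote counts below are invented for illustration and are not taken from the study.

```python
def preference_score(rank_counts):
    """Weighted average rank for one system and one search term.

    rank_counts maps a rank (1 = best, 4 = worst) to the number of users
    who gave that rank. Lower scores mean the system was preferred.
    """
    total_votes = sum(rank_counts.values())
    if total_votes == 0:
        raise ValueError("at least one vote is required")
    return sum(rank * votes for rank, votes in rank_counts.items()) / total_votes


# Hypothetical votes from 27 users for a single term.
example_votes = {1: 12, 2: 8, 3: 4, 4: 3}
score = preference_score(example_votes)
```

Under this reading a score of 1 means every user ranked the system best and a score of 4 means every user ranked it worst.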
  • 23. Effectiveness in finding the datasets: user preference score per term
      Term         Our Approach   CKAN    LODStats   Sigma
      music        2.037          3.74    3.11       1.333
      artist       2.815          3.926   1          2.259
      biology      3.481          3.333   1          2.185
      animal       2.926          1.63    3.481      1.926
      geology      2.852          3.666   1          2.481
      drug         2.926          3.148   2          2.555
      gene         2.148          3.333   3.074      1.222
      university   3.185          3.148   2.37       1.222
      food         3.259          2.296   3          1.259
      language     3.148          3.74    1          2.11
      spacecraft   4              4       1          2
      conference   2.814          3.555   1          2.666
      astronaut    4              4       1          2
      composer     3.815          3.037   1          2.11
      tv program   3.666          2.923   1          2.370
      instrument   3.852          2       2          3.148
      recipe       3.926          2       2          3.074
      student      2              3.889   2          3.111
      phenotypes   2              3.923   2          3.037
      energy       1              3.74    3.26       3.03
  • 24. Evaluation • Appropriateness of the identified domain: user study • Effectiveness in finding the datasets: 1. User study with three other SE 2. Evaluate CKAN as the baseline
  • 25. Evaluate CKAN as the baseline
      Term         P       R1      F1      R2      F2
      music        0.286   1       0.445   0.1     0.148
      artist       0.4     1       0.571   0.2     0.267
      biology      0.125   1       0.222   0.333   0.182
      animal       0       0       n/a     0       n/a
      geology      0       0       n/a     0       n/a
      drug         0.6     0.667   0.632   0.75    0.667
      gene         0.333   1       0.5     0.125   0.182
      university   0.5     1       0.667   0.051   0.093
      food         0       0       n/a     0       n/a
      language     1       1       1       0.045   0.0861
      spacecraft   1       1       1       1       1
      conference   1       1       1       0.125   0.222
      astronaut    1       1       1       1       1
      composer     0.25    1       0.4     0.5     0.333
      tv program   0       0       n/a     0       n/a
      instrument   0       1       0       1       0
      recipe       0       1       0       1       0
      student      1       0       0       0       0
      phenotypes   1       0       0       0       0
      energy       1       0       0       0       0
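The F1 and F2 columns above are consistent with the standard F-measure, the harmonic mean of precision and recall, applied to P with R1 and to P with R2 respectively. The sketch below checks this against a few rows. The function name is an illustrative choice, and returning `None` when both inputs are zero is an assumption that mirrors the n/a entries in the table.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; None when both are zero."""
    if precision + recall == 0:
        return None  # the table reports this case as n/a
    return 2 * precision * recall / (precision + recall)


# Rows taken from the table: (term, P, R1, R2).
rows = [("music", 0.286, 1, 0.1), ("drug", 0.6, 0.667, 0.75), ("animal", 0, 0, 0)]
for term, p, r1, r2 in rows:
    print(term, f_measure(p, r1), f_measure(p, r2))
```

For music, P = 0.286 and R1 = 1 give 2 · 0.286 / 1.286 ≈ 0.445, which matches the F1 column.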
  • 26. Evaluation • Appropriateness of the identified domain: user study • Effectiveness in finding the datasets: 1. User study with three other SE 2. Evaluate CKAN as the baseline 3. Evaluate both CKAN and our approach using a manually curated gold standard
  • 27. Evaluation with a manually curated gold standard
                   CKAN                              Our Approach
      Term         Precision   Recall   F-Measure   Precision   Recall   F-Measure
      music        1           0.5      0.667       0.571       1        0.727
      artist       1           0.25     0.4         0.8         1        0.9
      biology      1           0.2      0.333       0.625       1        0.769
      animal       0           0        n/a         0.333       1        0.5
      geology      0           0        n/a         1           0.5      0.667
      drug         1           0.6      0.75        1           1        1
      gene         1           0.333    0.5         1           1        1
      university   0.5         0.667    0.572       0.6         1        0.75
      food         0           0        n/a         0.25        1        0.4
      language     1           1        1           1           1        1
      spacecraft   1           1        1           1           1        1
      conference   1           1        1           1           1        1
      tv program   0           0        n/a         1           1        1
      instrument   1           0        0           0.75        1        0.857
      astronaut    1           1        1           1           1        1
      composer     1           0.25     0.4         1           1        1
      recipe       1           0        0           1           1        1
      phenotypes   1           1        1           1           0        0
      student      1           0.5      0.667       1           0        0
      energy       1           0.333    0.5         1           0        0
      Mean         0.775       0.432    0.489       0.846       0.825    0.728
  • 28. Conclusion and Future Work • Our approach helps to categorize the datasets systematically • We demonstrate the potential of using this categorization to find relevant datasets • We utilize a diverse classification hierarchy such as Freebase • This work may also be important for other potential applications, such as browsing and interlinking • We plan to improve the domain coverage by using further knowledge sources such as Wikipedia and YAGO • We plan to compare the interpretations given by multiple knowledge sources to see which one gives a better interpretation
  • 29. Thank You! Questions? http://knoesis.wright.edu/researchers/sarasi sarasi@knoesis.org Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing Wright State University, Dayton, Ohio, USA

Editor's notes

  1. Outdated cloud diagram, last updated in 2011
  2. Wikipedia: 4.3 million articles
  3. User agreement on the appropriateness of the terms. The graph shows how many users agreed on how many terms being appropriate descriptors, out of a total of 20 users (= 100%, horizontal axis) and 120 terms (= 100%, vertical axis).
  4. CKAN ranked best for 12 terms, while our approach ranked best for 9 terms; 27 users participated in the study. We generate the second-best results with only 30 datasets.