Domain Identification for Linked Open Data

Sarasi Lalithsena
Pascal Hitzler
Amit Sheth
Kno.e.sis Center
Wright State University, Dayton, OH

Prateek Jain
IBM T.J. Watson Research Center
Yorktown, NY, USA

WI 2013 Atlanta, GA, USA

Motivation

LOD cloud
262 datasets

870 alive datasets

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lodcloud.net/”

2

Motivation

Lingvoj
Climbdata

Need better ways to discover, describe, and organize datasets

3

Problem
• How do we identify the relevant datasets from this structured
knowledge space?
– How do we create a registry of topics which describe the
domain of a dataset?

4

State of the Art – Existing Problems with Dataset Lookup
• Rely on manual tagging provided by users and the manual
reviewing process
– CKAN data hub, LOD Diagram
• Rely on keywords and metadata provided by users
– CKAN data hub, LODStats
• Need to know instances to start exploring the datasets
– Semantic Search Engines (SSE) such as Sigma, Swoogle and
Watson
• Need to know seed URIs to find the relevant datasets
– Federated Querying Systems for LOD

5

What do we propose?
• Introduce a systematic and sophisticated way to identify
possible domains, topics, tags (Topic Domain) to better describe
these datasets
• What can these topic domains be?
– A predefined list
– The types in the schema of each dataset

6

What do we propose?

Knowledge bases + category system

Topic Domains

7

How do we address the previous problems?
• Use the category system of existing knowledge sources as the
vocabulary to describe the domain
– Does not need to rely on a predefined set of tags
– Does not need to rely on metadata and keywords
• Automatic way to identify the topic domains
• Vocabulary can be used to search the datasets and organize the
datasets

8

Our approach - Freebase
• Use Freebase as our knowledge source to identify the topic
domains
• Why Freebase?
– Wide coverage: has 39 million topics
– Simple category hierarchy system
• The Freebase category system categorizes each topic into types, and
types are grouped into domains
– Example: domain music, type artist
• Utilized Freebase types and domains as our topic domains

9

Our Approach - STEPS

1. Instance Identification
2. Category Hierarchy Creation
3. Category Hierarchy Merging
4. Candidate Category Hierarchy Selection
5. Frequency Count Generation

10

Our Approach
STEP 1 Instance Identification
– Extract the instances of the dataset with their types
– Extract the human-readable values of the instances and types,
e.g., Granite and its type Rock
– Identify the closely related instance in Freebase for
each instance in our dataset

Dataset instances (instance, type): Ignimbrite, Rock; Slate, Rock; Granite, Rock
Freebase topics (http://www.freebase.com/m/): 01tx7r, 01c_9j, 03fcm

11

Our Approach
• Instance Identification
We also attach the type information to the query string
[Figure: the label Apple alone could match either Apple Company or Apple Fruit; with the type attached, the query becomes Apple Fruit]
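
The idea of attaching the type to the query string can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_query` is hypothetical, and the actual lookup against Freebase is not shown because the slides do not describe it.

```python
from typing import Optional


def build_query(label: str, type_label: Optional[str] = None) -> str:
    """Combine an instance label with its type label into one query string.

    Hypothetical helper: the slides only say that the type information
    is attached to the query string, not how the string is formatted.
    """
    parts = [label.strip()]
    if type_label and type_label.strip():
        parts.append(type_label.strip())
    return " ".join(parts)


# Instance/type pairs taken from the slides.
pairs = [("Apple", "Fruit"), ("Granite", "Rock")]
queries = [build_query(label, type_label) for label, type_label in pairs]
print(queries)
```

Without the type, the query for `Apple` stays ambiguous; with it, the query string carries the disambiguating context.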

12

Our Approach
• STEP 2 Category Hierarchy Creation
Ignimbrite has the Freebase type /geology/rock_type, i.e. /{domain}/{type}
[Figure: category hierarchy built for each instance]
– Ignimbrite: geology → rock type; geography → mountain, mountain range
– Slate: geology → rock type; geography → mountain; music → release track, recording
– Granite: geology → rock type; geography → mountain
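
A minimal sketch of building a per-instance category hierarchy from Freebase-style type identifiers is given below. Only `/geology/rock_type` appears in the slides; the other identifiers and the function name `build_hierarchy` are assumptions that follow the same `/domain/type` pattern.

```python
from collections import defaultdict


def build_hierarchy(type_ids):
    """Group "/domain/type" identifiers into a {domain: {type, ...}} mapping."""
    hierarchy = defaultdict(set)
    for type_id in type_ids:
        parts = type_id.strip("/").split("/")
        if len(parts) != 2:
            continue  # skip identifiers that do not follow /domain/type
        domain, type_name = parts
        hierarchy[domain].add(type_name.replace("_", " "))
    return dict(hierarchy)


# Assumed identifiers for the Ignimbrite example; only the first is from the slides.
ignimbrite = build_hierarchy(
    ["/geology/rock_type", "/geography/mountain", "/geography/mountain_range"]
)
print(ignimbrite)
```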

13

Our Approach
• STEP 3 Category Hierarchy Merging
[Figure: the per-instance hierarchies of Ignimbrite, Slate, and Granite are merged; together they cover geology → rock type; geography → mountain, mountain range; music → release track, recording]
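
Merging the per-instance hierarchies can be sketched as a union over domains and types. This is an illustrative reading of the merging step, with the hypothetical name `merge_hierarchies`; the slides do not describe the data structure actually used.

```python
def merge_hierarchies(hierarchies):
    """Union several {domain: {types}} mappings into one mapping."""
    merged = {}
    for hierarchy in hierarchies:
        for domain, types in hierarchy.items():
            merged.setdefault(domain, set()).update(types)
    return merged


# Per-instance hierarchies as read from the slide figure.
ignimbrite = {"geology": {"rock type"}, "geography": {"mountain", "mountain range"}}
slate = {
    "geology": {"rock type"},
    "geography": {"mountain"},
    "music": {"release track", "recording"},
}
granite = {"geology": {"rock type"}, "geography": {"mountain"}}

merged = merge_hierarchies([ignimbrite, slate, granite])
print(merged)
```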

14

Our Approach
• STEP 4 Candidate Category Hierarchy Selection
Filter out insignificant category hierarchies using simple
heuristics
[Figure: the hierarchies of Ignimbrite, Slate, and Granite from the previous steps, covering the geology, geography, and music domains]
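
The slides do not state which heuristic is used for the selection. Purely as an illustration, the sketch below assumes a hypothetical rule: keep a domain only if at least a given share of the instances fall under it. The name `select_candidates` and the threshold `min_share` are assumptions, not part of the original work.

```python
def select_candidates(per_instance, min_share=0.5):
    """Keep domains supported by at least `min_share` of the instances.

    Hypothetical stand-in for the unspecified heuristic in the slides.
    """
    total = len(per_instance)
    support = {}
    for hierarchy in per_instance:
        for domain in hierarchy:
            support[domain] = support.get(domain, 0) + 1
    return {d for d, n in support.items() if total and n / total >= min_share}


per_instance = [
    {"geology": {"rock type"}, "geography": {"mountain", "mountain range"}},
    {"geology": {"rock type"}, "geography": {"mountain"},
     "music": {"release track", "recording"}},
    {"geology": {"rock type"}, "geography": {"mountain"}},
]
kept = select_candidates(per_instance)
print(sorted(kept))
```

Under this assumed rule, a domain that only one of three instances supports (here, music for Slate) would be dropped as noise.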

15

Our Approach
• Frequency Count Generation
Count the number of occurrences for each category (number of
instances having the given category)

Term            Frequency  Parent Node
geology         3
rock type       3          geology
mountain range  1          geography
…               …          …
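
The frequency count can be sketched as counting, for each category, the number of instances whose hierarchy contains it. The function name `frequency_rows` and the row layout are assumptions chosen to mirror the Term / Frequency / Parent Node table.

```python
from collections import Counter


def frequency_rows(per_instance):
    """Return (term, frequency, parent) rows; domains have parent None."""
    counts = Counter()
    for hierarchy in per_instance:
        for domain, types in hierarchy.items():
            counts[(domain, None)] += 1  # one occurrence per instance
            for type_name in types:
                counts[(type_name, domain)] += 1
    return [(term, n, parent) for (term, parent), n in counts.most_common()]


# Hierarchies of Ignimbrite, Slate, and Granite with geology and geography only.
per_instance = [
    {"geology": {"rock type"}, "geography": {"mountain", "mountain range"}},
    {"geology": {"rock type"}, "geography": {"mountain"}},
    {"geology": {"rock type"}, "geography": {"mountain"}},
]
rows = frequency_rows(per_instance)
for row in rows:
    print(row)
```

Keying the counter by (term, parent) keeps a type and a domain apart when they share a name.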

16

Implementation
• MapReduce Deployment
[Figure: <Inst, type> pairs are distributed to mappers (Map 1 … Map n), which run STEP 2 and 3; reducers (Reducer 1 … Reducer m) run STEP 4; STEP 5 runs as post-processing]
Instances belonging to the same type go to a single reducer
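
The routing of instances of the same type to a single reducer can be imitated in plain Python by using the type as the shuffle key. This is a single-process sketch of the idea, not the deployed MapReduce job; the function names are hypothetical and the reducer body is a placeholder.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (type, instance) pairs so that the type acts as the shuffle key."""
    for instance, type_label in records:
        yield type_label, instance


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(type_label, instances):
    """Placeholder reducer: sees all instances of one type together."""
    return type_label, sorted(instances)


records = [("Granite", "Rock"), ("Slate", "Rock"), ("Apple", "Fruit")]
grouped = shuffle(map_phase(records))
results = dict(reduce_phase(key, values) for key, values in grouped.items())
print(results)
```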

17

Evaluation
• We ran our experiments with 30 datasets in LOD for evaluation

[Figure: evaluation overview with the parts: appropriateness of the identified domain, effectiveness in finding the datasets, user study]

18

Appropriateness of the identified domain
• Selected the four most frequent domains and types from our results
• Mixed them with four other randomly selected domains and types
• Asked 20 users to select the terms that best represent the
higher-level domains of the dataset
• 50% of the users agreed on 73% of the terms (88 out of 120)

19

Appropriateness of the identified domain

Terms with the highest user agreement for each dataset. A star (*) indicates
that the term was also the highest ranked by our system (for 22 datasets).

20

Evaluation

1. User Study with three other SE

21

• Developed a search application using the normalized frequency
count
• User study against three existing state-of-the-art systems
– CKAN, LODStats and Sigma
• Term selection
• Top ten results are retrieved
• Asked users to rank which set of results they preferred
– 1 (best) to 4 (worst)
• Calculated a user preference score as a weighted average
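
The slides do not give the formula for the preference score. One plausible reading, assumed here, is that a rank r out of 4 contributes a weight of 5 − r, so that a system ranked best by every user scores 4 and higher scores mean stronger preference. The function name and the example vote counts below are hypothetical.

```python
def preference_score(rank_counts, num_ranks=4):
    """Weighted average of ranks, with rank 1 weighted highest.

    rank_counts maps a rank (1 = best) to the number of users who gave it.
    The weighting (num_ranks + 1 - rank) is an assumption, not from the slides.
    """
    total = sum(rank_counts.values())
    if total == 0:
        raise ValueError("at least one vote is required")
    weighted = sum(
        (num_ranks + 1 - rank) * count for rank, count in rank_counts.items()
    )
    return weighted / total


# Made-up votes for illustration: 10 users ranked the system first, 5 second, ...
score = preference_score({1: 10, 2: 5, 3: 3, 4: 2})
print(round(score, 3))
```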

22

Term         Our Approach  CKAN   LODStat  Sigma
music        2.037         3.74   3.11     1.333
artist       2.815         3.926  1        2.259
biology      3.481         3.333  1        2.185
animal       2.926         1.63   3.481    1.926
geology      2.852         3.666  1        2.481
drug         2.926         3.148  2        2.555
gene         2.148         3.333  3.074    1.222
university   3.185         3.148  2.37     1.222
food         3.259         2.296  3        1.259
language     3.148         3.74   1        2.11
spacecraft   4             4      1        2
conference   2.814         3.555  1        2.666
astronaut    4             4      1        2
composer     3.815         3.037  1        2.11
tv program   3.666         2.923  1        2.370
instrument   3.852         2      2        3.148
recipe       3.926         2      2        3.074
student      2             3.889  2        3.111
phenotypes   2             3.923  2        3.037
energy       1             3.74   3.26     3.03

23

Evaluation

2. Evaluate CKAN as the baseline

24

Evaluate CKAN as the baseline

Term         P      R1     F1     R2     F2
music        0.286  1      0.445  0.1    0.148
artist       0.4    1      0.571  0.2    0.267
biology      0.125  1      0.222  0.333  0.182
animal       0      0      n/a    0      n/a
geology      0      0      n/a    0      n/a
drug         0.6    0.667  0.632  0.75   0.667
gene         0.333  1      0.5    0.125  0.182
university   0.5    1      0.667  0.051  0.093
food         0      0      n/a    0      n/a
language     1      1      1      0.045  0.0861
spacecraft   1      1      1      1      1
conference   1      1      1      0.125  0.222
astronaut    1      1      1      1      1
composer     0.25   1      0.4    0.5    0.333
tv program   0      0      n/a    0      n/a
instrument   0      1      0      1      0
recipe       0      1      0      1      0
student      1      0      0      0      0
phenotypes   1      0      0      0      0
energy       1      0      0      0      0
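
The F values in this table are consistent with the standard harmonic-mean F-measure, with n/a where both precision and recall are 0. A short sketch follows; the function name `f_measure` is illustrative, and returning `None` for the undefined case is a choice made here to mirror the n/a entries.

```python
from typing import Optional


def f_measure(precision: float, recall: float) -> Optional[float]:
    """Harmonic mean of precision and recall; None when both are zero."""
    if precision + recall == 0:
        return None  # undefined, shown as n/a in the table
    return 2 * precision * recall / (precision + recall)


# Rows from the table: music (P, R1), drug (P, R1), animal (P, R1).
for term, p, r in [("music", 0.286, 1.0), ("drug", 0.6, 0.667), ("animal", 0.0, 0.0)]:
    value = f_measure(p, r)
    print(term, "n/a" if value is None else round(value, 3))
```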

25

Evaluation

2. Evaluate CKAN as the baseline
3. Evaluate both CKAN and our
approach using a manually curated
gold standard

26

Evaluation with a manually curated gold standard

             CKAN                             Our Approach
Term         Precision  Recall  F-Measure    Precision  Recall  F-Measure
music        1          0.5     0.667        0.571      1       0.727
artist       1          0.25    0.4          0.8        1       0.9
biology      1          0.2     0.333        0.625      1       0.769
animal       0          0       n/a          0.333      1       0.5
geology      0          0       n/a          1          0.5     0.667
drug         1          0.6     0.75         1          1       1
gene         1          0.333   0.5          1          1       1
university   0.5        0.667   0.572        0.6        1       0.75
food         0          0       n/a          0.25       1       0.4
language     1          1       1            1          1       1
spacecraft   1          1       1            1          1       1
conference   1          1       1            1          1       1
tv program   0          0       n/a          1          1       1
instrument   1          0       0            0.75       1       0.857
astronaut    1          1       1            1          1       1
composer     1          0.25    0.4          1          1       1
recipe       1          0       0            1          1       1
phenotypes   1          1       1            1          0       0
student      1          0.5     0.667        1          0       0
energy       1          0.333   0.5          1          0       0
Mean         0.775      0.432   0.489        0.846      0.825   0.728
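
The Mean row appears to be a plain average over the 20 terms in which n/a entries count as 0. This is an inference from the numbers, not something the slides state. The sketch below checks it for the CKAN F-Measure column; `macro_mean` is a hypothetical helper name.

```python
def macro_mean(values):
    """Average per-term scores, counting undefined (None) scores as 0."""
    if not values:
        raise ValueError("no values to average")
    return sum(0.0 if v is None else v for v in values) / len(values)


# CKAN F-Measure column in table order; None stands for n/a.
ckan_f = [
    0.667, 0.4, 0.333, None, None, 0.75, 0.5, 0.572, None, 1,
    1, 1, None, 0, 1, 0.4, 0, 1, 0.667, 0.5,
]
print(round(macro_mean(ckan_f), 3))
```
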
27

Conclusion and Future Work
• Our approach helps to categorize the datasets systematically
• Demonstrated the potential of using the categorization for finding
relevant datasets
• Utilized a diverse classification hierarchy such as Freebase
• This work may also be useful for other applications, such as
browsing and interlinking
• Plan to improve the domain coverage by using knowledge
sources such as Wikipedia and YAGO
• Plan to compare the interpretations given by multiple knowledge sources
to see which one gives a better interpretation

28

Thank You!

Questions?
http://knoesis.wright.edu/researchers/sarasi
sarasi@knoesis.org

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
