Chifumi Nishioka chni@informatik.uni-kiel.de, K-CAP 2015
Gregor Große-Bölting
Chifumi Nishioka
Ansgar Scherp
A Comparison of Different
Strategies for Automated
Semantic Document Annotation
Motivation [1/2]
• Document annotation
– Helps users and search engines find documents
– Requires a huge amount of human effort
– e.g., subject indexers at ZBW labeled 1.6 million
scientific documents in economics
• Semantic document annotation
– Documents annotated with semantic entities
– e.g., PubMed and MeSH, ACM DL and ACM CCS
Focus on semantic document annotation
Necessity of automated document annotation
Motivation [2/2]
• Small scale experiments so far
– Comparing a small number of strategies
– Datasets containing a few hundred documents
• Comparison of 43 strategies for document annotation
within the developed experiment framework
– The largest number of strategies
• Experiments with three datasets from different domains
– Contain full texts of about 100,000 documents annotated by subject
indexers
– The largest dataset of scientific publications
We conducted the largest-scale experiment
Experiment Framework
Strategies are composed of methods from concept extraction,
concept activation, and annotation selection
1. Concept Extraction
detect concepts (candidate annotations) from each document
2. Concept Activation
compute a score for each concept of a document
3. Annotation Selection
select annotations from concepts for each document
4. Evaluation
measure performance of strategies with ground truth
Research Question
• Research questions answered with the experiment
framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs
best?
Concept Extraction [1/2]
• Entity
– Extract entities from documents using a domain-specific
knowledge base
– Domain-specific knowledge base
• Entities (subjects) in a specific domain (e.g., medicine)
• One or more labels for each entity
• Relationships between entities
– Detect entities by string matching with entity labels
• Tri-gram
– Extract contiguous sequences of one, two, and three
words in a document
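The Tri-gram extraction above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_ngrams`, the lowercasing, and the whitespace tokenization are assumptions made for the example.

```python
def extract_ngrams(text, max_n=3):
    """Return all contiguous word sequences of length 1..max_n."""
    # Assumption: lowercase the text and split on whitespace.
    words = text.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams


candidates = extract_ngrams("Central bank raises rates")
```

For a four-word input this yields four unigrams, three bigrams, and two trigrams as candidate concepts.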
Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword
Extraction) [Rose et al. 10]
– Unsupervised method for extracting keywords
– Incorporates co-occurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
– Unsupervised topic modeling method for inferring latent
topics in a document corpus
– Topic model
• Topic: A probability distribution over words
• Document: A probability distribution over topics
– Treat a topic as a concept
Concept Activation [1/6]
• Three types of concept activation
methods
– Statistical Methods
• Baseline
• Use only directly mentioned concepts
– Hierarchy-based Methods
• Reveal concepts that are not mentioned explicitly using a
hierarchical knowledge base
– Graph-based Methods
• Use only directly mentioned concepts
• Represent concept co-occurrences as a graph
[Figure: the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" drawn as a graph with the nodes Tax, Bank, Interest Rate, Financial Crisis, and Central Bank]
Concept Activation [2/6]
• Statistical Methods
– Frequency
• $freq(c, d)$ depends on Concept Extraction methods
– The number of appearances (Entity and Tri-gram)
– The score output by RAKE (RAKE)
– The probability of a topic for a document 𝑑 (LDA)
– CF-IDF [Goossen et al. 11]
• An extension of TF-IDF replacing words with concepts
• Lower scores for concepts that appear in many documents
$score_{freq}(c, d) = freq(c, d)$
$score_{cfidf}(c, d) = cf(c, d) \cdot \log \frac{|D|}{|\{d \in D : c \in d\}|}$
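A minimal sketch of the CF-IDF score, assuming each document is already a list of extracted concepts. The function name `cf_idf`, the toy documents, and the use of the natural logarithm are assumptions; the slides do not state the base of the logarithm.

```python
import math
from collections import Counter


def cf_idf(docs):
    """docs: list of concept lists. Returns one {concept: score} dict per document."""
    n_docs = len(docs)
    # Number of documents in which each concept appears at least once.
    doc_freq = Counter()
    for concepts in docs:
        doc_freq.update(set(concepts))
    scores = []
    for concepts in docs:
        cf = Counter(concepts)  # concept frequency within this document
        scores.append({c: cf[c] * math.log(n_docs / doc_freq[c]) for c in cf})
    return scores


# Hypothetical toy corpus.
docs = [
    ["Bank", "Bank", "Tax"],
    ["Bank", "Interest Rate"],
    ["Bank", "Financial Crisis", "Tax"],
]
scores = cf_idf(docs)
```

In this toy corpus "Bank" appears in every document, so its CF-IDF score is zero everywhere, which illustrates the lower scores for concepts that appear in many documents.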
Concept Activation [3/6]
• Hierarchy-based Methods [1/2]
– Base Activation
• 𝐶𝑙(𝑐): a set of child concepts of a concept 𝑐
• 𝜆: decay parameter, set 𝜆 = 0.4
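$score_{base}(c, d) = freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} score_{base}(c_i, d)$
• e.g.,
[Figure: concept hierarchy with the nodes World Wide Web, Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis; three concepts are marked $c_1$, $c_2$, and $c_3$]
$score_{base}(c_1, d) = 1.00$
$score_{base}(c_2, d) = 0.40$
$score_{base}(c_3, d) = 0.16$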
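The recursion of Base Activation can be sketched as below. The example numbers 1.00, 0.40, and 0.16 are reproduced under the assumption that $c_1$ is mentioned once, $c_2$ is the parent of $c_1$, and $c_3$ is the parent of $c_2$. That chain, the function name `base_activation`, and the dictionary layout are assumptions for illustration, and the hierarchy is assumed to contain no cycles.

```python
LAMBDA = 0.4  # decay parameter given on the slide


def base_activation(concept, freq, children, decay=LAMBDA):
    """score_base(c, d) = freq(c, d) + decay * sum of score_base over the children of c."""
    child_sum = sum(
        base_activation(child, freq, children, decay)
        for child in children.get(concept, [])
    )
    return freq.get(concept, 0.0) + decay * child_sum


# Hypothetical chain: c3 is the parent of c2, and c2 is the parent of c1.
children = {"c3": ["c2"], "c2": ["c1"]}
freq = {"c1": 1.0}  # only c1 is mentioned in the document

scores = {c: base_activation(c, freq, children) for c in ("c1", "c2", "c3")}
```

Each level up the hierarchy multiplies the activation by the decay parameter, so unmentioned ancestors receive a geometrically shrinking share of the score.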
Concept Activation [4/6]
• Hierarchy-based Methods [2/2]
– Branch Activation
• $BN$: reciprocal of the number of concepts that are located one
level above a concept $c$
$score_{branch}(c, d) = freq(c, d) + \lambda \cdot BN \cdot \sum_{c_i \in C_l(c)} score_{branch}(c_i, d)$
– OneHop Activation
• $C_d$: set of concepts in a document $d$
• Activates concepts in a maximum distance of one hop
$score_{onehop}(c, d) = \begin{cases} freq(c, d) & \text{if } |C_l(c) \cap C_d| \geq 2 \\ freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} freq(c_i, d) & \text{otherwise} \end{cases}$
Concept Activation [5/6]
• Graph-based Methods [1/2]
– Degree [Zouaq et al. 12]
• $degree(c, d)$: the number of edges linked with a concept $c$
$score_{degree}(c, d) = degree(c, d)$
• e.g., $score_{degree}(\text{Bank}, d) = 3$
– HITS [Kleinberg 99; Zouaq et al. 12]
• Link analysis algorithm for search engines [Kleinberg 99]
• $hub(c, d) = \sum_{c_i \in C_n(c)} auth(c_i, d)$
• $auth(c, d) = \sum_{c_i \in C_n(c)} hub(c_i, d)$
$score_{hits}(c, d) = hub(c, d) + auth(c, d)$
[Figure: co-occurrence graph with the nodes Tax, Bank, Interest Rate, Financial Crisis, and Central Bank]
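The Degree score can be illustrated with the concept sequence from the Concept Activation [1/6] example. The slides do not state how edges are created, so this sketch assumes that two concepts are linked when they are adjacent in the sequence. Under that assumption the graph gives Bank a degree of 3, which matches the example on the slide. The function name `cooccurrence_graph` is hypothetical.

```python
from collections import defaultdict


def cooccurrence_graph(sequence):
    """Link each concept to the concept that follows it in the sequence (assumption)."""
    neighbors = defaultdict(set)
    for left, right in zip(sequence, sequence[1:]):
        if left != right:
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


sequence = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]
graph = cooccurrence_graph(sequence)

# score_degree(c, d) is the number of distinct neighbors of c.
degree = {concept: len(linked) for concept, linked in graph.items()}
```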
Concept Activation [6/6]
• Graph-based Methods [2/2]
– PageRank [Page et al. 99; Mihalcea & Paul 04]
• Link analysis algorithm for search engines
• Based on the intuition that a node that is linked from many
important nodes is more important
• $C_{in}(c)$: set of concepts connected with incoming edges to $c$
• $C_{out}(c)$: set of concepts connected with outgoing edges from $c$
• $\mu$: damping factor, $\mu = 0.85$
$score_{page}(c, d) = (1 - \mu) + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{score_{page}(c_i, d)}{|C_{out}(c_i)|}$
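The PageRank score can be computed by iterating the formula until the scores settle. The sketch below is an illustration under assumptions: the function name `pagerank`, the fixed number of iterations, the initial score of 1.0, and the hypothetical co-occurrence graph in which every edge is listed in both directions are not taken from the slides.

```python
def pagerank(out_edges, mu=0.85, iterations=50):
    """Iterate score(c) = (1 - mu) + mu * sum(score(ci) / |C_out(ci)|) over in-neighbors ci."""
    nodes = set(out_edges) | {t for targets in out_edges.values() for t in targets}
    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new_scores = {node: 1.0 - mu for node in nodes}
        for source, targets in out_edges.items():
            if not targets:
                continue
            # Each source distributes its damped score evenly over its outgoing edges.
            share = mu * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical undirected co-occurrence graph: every edge appears in both directions.
out_edges = {
    "Bank": ["Interest Rate", "Financial Crisis", "Central Bank"],
    "Interest Rate": ["Bank", "Financial Crisis", "Tax"],
    "Financial Crisis": ["Interest Rate", "Bank"],
    "Central Bank": ["Bank", "Tax"],
    "Tax": ["Central Bank", "Interest Rate"],
}
scores = pagerank(out_edges)
```

With this non-normalized form of the formula the scores sum to the number of nodes as long as every node has at least one outgoing edge.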
Annotation Selection
• Top-5 and Top-10
– Select concepts whose scores are ranked in top-k
• k Nearest Neighbor (kNN) [Huang et al. 11]
– Based on the assumption that documents with similar
concepts share similar annotations
1. Compute similarity scores between a target document
and all documents with annotations
2. Select union of annotations of k nearest documents
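[Figure: a target document (??) and annotated documents with the similarity scores 0.49, 0.45, 0.42, and 0.60; the annotations shown are Central bank, Law, Financial crisis, Finance, China, Human resource, Leadership, Marketing, and Competition law]
Example
- $k = 2$
- Selected annotations: Finance; China; Marketing; Competition law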
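The two kNN steps can be sketched as follows. The similarity scores and annotations loosely follow the example on this slide, but the document identifiers and the assignment of annotations to the documents scored 0.45 and 0.42 are assumptions, since the figure itself is not recoverable from the text.

```python
def knn_select(similarities, annotations, k):
    """Return the union of the annotations of the k most similar documents."""
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    selected = set()
    for doc_id in nearest:
        selected |= annotations[doc_id]
    return selected


# Hypothetical document ids; similarity of each annotated document to the target.
similarities = {"doc_a": 0.45, "doc_b": 0.49, "doc_c": 0.42, "doc_d": 0.60}
annotations = {
    "doc_a": {"Central bank", "Law", "Financial crisis"},  # assumed grouping
    "doc_b": {"Finance", "China"},
    "doc_c": {"Human resource", "Leadership"},             # assumed grouping
    "doc_d": {"Marketing", "Competition law"},
}
selected = knn_select(similarities, annotations, k=2)
```

With k = 2 the two nearest documents are those scored 0.60 and 0.49, so the selected set is Finance, China, Marketing, and Competition law.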
Configurations [1/5]
[Figure: configuration diagram of the experiment framework]
• Concept Extraction: Entity, Tri-gram, RAKE, LDA
• Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
• Annotation Selection: Top-k (2 methods), kNN (1 method)
Configurations [2/5]
[Figure: configuration diagram as in Configurations [1/5]]
24 strategies
Configurations [3/5]
[Figure: configuration diagram as in Configurations [1/5]]
15 strategies
Configurations [4/5]
[Figure: configuration diagram as in Configurations [1/5], with the Statistical Methods restricted to Frequency]
3 strategies
Configurations [5/5]
[Figure: configuration diagram as in Configurations [1/5], with the Statistical Methods restricted to Frequency]
43 strategies in total
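The count of 43 can be checked by enumerating the combinations. Which activation and selection methods are combined with each extraction method is not recoverable from the diagrams, so the mapping below is an assumption inferred from the result tables in the appendix. It reproduces the subtotals 24, 15, and 3, and leaves one remaining strategy for LDA.

```python
from itertools import product

statistical = ["Frequency", "CF-IDF"]
hierarchy = ["Base Activation", "Branch Activation", "OneHop"]
graph = ["Degree", "HITS", "PageRank"]
selection = ["Top-5", "Top-10", "kNN"]

# Assumed mapping: (activation methods, selection methods) per extraction method.
configurations = {
    "Entity": (statistical + hierarchy + graph, selection),
    "Tri-gram": (statistical + graph, selection),
    "RAKE": (["Frequency"], selection),
    "LDA": (["Frequency"], ["kNN"]),
}

strategies = [
    (extraction, activation, selected)
    for extraction, (activations, selections) in configurations.items()
    for activation, selected in product(activations, selections)
]
counts = {e: sum(1 for s in strategies if s[0] == e) for e in configurations}
```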
Datasets and Metrics of Experiments
 | Economics | Political Science | Computer Science
publication source | ZBW | FIV | SemEval 2010
# of publications | 62,924 | 28,324 | 244
# of annotations per publication (mean ± SD) | 5.26 (± 1.84) | 12.00 (± 4.02) | 5.05 (± 2.41)
knowledge base | STW | European Thesaurus | ACM CCS
# of entities | 6,335 | 7,912 | 2,299
# of labels | 11,679 | 8,421 | 9,086
• Computer Science: SemEval 2010 dataset [Kim et al. 10]
– Publications are annotated with keywords originally
– We converted keywords to entities by string matching
• All publications and labels of entities are in English
• We use full-texts of publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure
(I) Best Performing Strategies
• Economics and Political Science datasets
– The best strategy: Entity × HITS × kNN
– F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
– The best strategy: Entity × Degree × kNN
– F-measure: 0.33 (computer science)
• Graph-based methods do not differ a lot
In general, a document annotation strategy
Entity × Graph-based method × kNN performs best
(II) Influence of Concept Extraction
• Concept Extraction method: Entity
– Use domain-specific knowledge bases
– Knowledge bases: freely available and of high quality
– 32 thesauri listed in W3C SKOS Datasets
For Concept Extraction methods, Entity consistently
outperforms Tri-gram, RAKE, and LDA
(III) Influence of Concept Activation
• Poor performance of hierarchy-based methods
– We use full-texts in the experiments
• Full texts contain so many different concepts (203.80 unique
entities on average, SD: 24.50) that further concepts do not have to be activated
– However, OneHop can work as well as graph-based
methods
• It activates only concepts within a distance of one hop
For Concept Activation methods,
graph-based methods are better than statistical
methods or hierarchy-based methods
(IV) Influence of Annotation Selection
• kNN
– No learning process
– Confirms the assumption that documents with similar
concepts share similar annotations
For Annotation Selection methods, kNN can enhance
the performance
Conclusion
• Large scale experiment for automated semantic
document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
– Novel combination of methods
• Best concept extraction method: Entity
• Best concept activation method: Graph-based
methods
– OneHop can achieve similar performance at a lower
computational cost
Thank you!
Questions?
Appendix
LDA (Latent Dirichlet Allocation)
source: D. M. Blei. Probabilistic topic models, CACM, 2012.
Entity Extraction and Conversion
• Entity extraction
– String matching with entity labels
– Starting with longer entity labels
• e.g., From a text “financial crisis is …”, only an entity “financial
crisis” is detected (not “crisis”).
• Converting to entities
– Words and keywords are extracted in Tri-gram and RAKE
– They are converted to entities by string matching with
entity labels before annotation selection
– If no matched entity label is found, word or keyword is
discarded
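The longest-label-first matching described above can be sketched as follows. The function name `extract_entities`, the label list, and the example sentence are hypothetical. The sketch returns matched labels rather than entity identifiers, whereas a real knowledge base would map several labels to one entity.

```python
import re


def extract_entities(text, labels):
    """Match entity labels in text, trying longer labels first."""
    remaining = text.lower()
    found = []
    for label in sorted(labels, key=len, reverse=True):
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        matches = re.findall(pattern, remaining)
        if matches:
            found.extend([label] * len(matches))
            # Blank out matched spans so that shorter labels cannot match inside them.
            remaining = re.sub(pattern, " ", remaining)
    return found


# Hypothetical labels; "crisis" is contained in the longer label "financial crisis".
labels = ["crisis", "financial crisis", "bank"]
entities = extract_entities("Financial crisis is a risk for every bank", labels)
```

Because "financial crisis" is consumed before "crisis" is tried, only the longer label is reported, as in the example on the slide.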
kNN [1/2]
• Similarity measure
– Each document is represented as a vector where each
element is a score of a concept
– Cosine similarity is used as a similarity measure
[Figure: two documents $d_1$ and $d_2$ represented as score vectors over the concepts GDP, Immigration, Population, Bank, Interest rate, and Canada]
$d_1 = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5)$
$d_2 = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2)$
$sim(d_1, d_2) = 0.52$
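The similarity value on this slide can be reproduced with a few lines of code. The function name `cosine_similarity` and the handling of all-zero vectors are choices made for this sketch; the two vectors are the ones shown in the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a document without concepts is similar to nothing
    return dot / (norm_u * norm_v)


# Concept order: GDP, Immigration, Population, Bank, Interest rate, Canada
d1 = [0.3, 0.5, 0.8, 0.1, 0.0, 0.5]
d2 = [0.6, 0.0, 0.4, 0.8, 0.4, 0.2]
sim = cosine_similarity(d1, d2)
```

The dot product is 0.68 and the product of the norms is about 1.30, which gives a similarity of about 0.52.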
kNN [2/2]
[Figure: the target document (??) and annotated documents with the similarity scores 0.49, 0.45, 0.42, and 0.60, as in the Annotation Selection example]
• k = 1
– Selected annotations: Marketing; Competition law
• k = 2
– Selected annotations: Marketing; Competition law; Finance; China
Evaluation Metrics
• Precision
$precision = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{retrieved annotations}\}|}$
• Recall
$recall = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{relevant annotations}\}|}$
• F-measure
$F_{measure} = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
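A minimal sketch of the three metrics for a single document. The function name `evaluate`, the zero-division handling, and the example annotation sets are assumptions for illustration and are not results from the experiments.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F-measure) for one document."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: annotations chosen by a strategy vs. ground truth.
retrieved = {"Finance", "China", "Marketing", "Competition law"}
relevant = {"Finance", "Marketing", "Leadership"}
precision, recall, f_measure = evaluate(retrieved, relevant)
```

Here two of the four retrieved annotations are relevant, so precision is 2/4, recall is 2/3, and the F-measure is 4/7.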
Datasets
• Economics dataset
– 11 GB
• Political science dataset
– 3.8 GB
Experiments
• Preprocessing documents
– lemmatization
– stop words removal
• 10-fold cross validation
– split a dataset into 10 equal-sized subsets
– 8 subsets for training data
– 1 subset for testing data
– 1 subset for optimizing parameters
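One way to realize the 8/1/1 split is to rotate the roles of the ten subsets. The slides do not say how subsets are assigned to the three roles in each round, so the rotation below, the unshuffled assignment of documents to subsets, and the function name `rotating_splits` are assumptions.

```python
def rotating_splits(n_items, n_folds=10):
    """Yield (train, validation, test) index lists, one triple per round."""
    # Assumption: assign item i to subset i % n_folds, without shuffling.
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for i in range(n_folds):
        test_fold = i
        validation_fold = (i + 1) % n_folds
        train = [
            idx
            for j, fold in enumerate(folds)
            if j not in (test_fold, validation_fold)
            for idx in fold
        ]
        yield train, folds[validation_fold], folds[test_fold]


splits = list(rotating_splits(100))
```

Each round uses eight subsets for training, one for parameter optimization, and one for testing, and every subset serves as test data exactly once.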
Result Table: Entity [1/2]
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .14 (.17) .14 (.15) .13 (.15) .22 (.20) .11 (.10) .14 (.12) .08 (.21) .08 (.21) .08 (.21)
CF-IDF .19 (.19) .18 (.17) .18 (.16) .24 (.21) .12 (.10) .15 (.12) .29 (.32) .30 (.32) .29 (.31)
Base Act. .10 (.14) .09 (.13) .09 (.13) .18 (.19) .09 (.09) .12 (.11) .20 (.30) .20 (.30) .20 (.29)
Branch Act. .08 (.14) .08 (.12) .08 (.12) .17 (.19) .08 (.09) .11 (.11) .17 (.28) .17 (.28) .17 (.27)
OneHop .12 (.16) .12 (.14) .12 (.14) .19 (.19) .09 (.09) .12 (.11) .35 (.34) .36 (.34) .35 (.33)
Degree .15 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
HITS .14 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.10) .14 (.12) .40 (.32) .40 (.32) .39 (.31)
PageRank .14 (.17) .14 (.15) .14 (.15) .22 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.11) .18 (.16) .14 (.12) .15 (.13) .12 (.10) .13 (.10) .14 (.17) .05 (.07) .07 (.09)
CF-IDF .05 (.07) .12 (.16) .07 (.10) .07 (.09) .08 (.10) .07 (.09) .24 (.22) .14 (.14) .17 (.16)
Base Act. .05 (.08) .10 (.13) .07 (.09) .10 (.10) .10 (.09) .09 (.09) .14 (.19) .07 (.10) .09 (.12)
Branch Act. .04 (.07) .08 (.12) .05 (.08) .09 (.09) .09 (.09) .08 (.09) .12 (.17) .06 (.10) .08 (.11)
OneHop .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .27 (.21) .26 (.21) .25 (.19)
Degree .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .29 (.21) .28 (.21) .27 (.19)
HITS .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .30 (.22) .29 (.21) .28 (.20)
PageRank .10 (.09) .20 (.17) .13 (.11) .13 (.10) .14 (.11) .13 (.10) .29 (.22) .29 (.21) .27 (.20)
Result Table: Entity [2/2]
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .18 (.21) .14 (.15) .15 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .24 (.16) .30 (.17)
CF-IDF .02 (.08) .02 (.06) .02 (.06) .03 (.11) .01 (.04) .02 (.05) .47 (.29) .23 (.17) .29 (.18)
Base Act. .17 (.20) .13 (.14) .14 (.15) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .22 (.15) .29 (.17)
Branch Act. .17 (.20) .12 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .50 (.28) .22 (.15) .29 (.17)
OneHop .17 (.20) .13 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .42 (.30) .25 (.21) .29 (.20)
Degree .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .27 (.17) .33 (.18)
HITS .18 (.21) .14 (.15) .15 (.16) .21 (.22) .08 (.08) .11 (.11) .48 (.31) .27 (.18) .32 (.20)
PageRank .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .50 (.29) .25 (.15) .31 (.18)
Result Table: Tri-gram
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.15) .12 (.14) .11 (.14) .19 (.19) .10 (.10) .13 (.12) .08 (.22) .08 (.22) .08 (.21)
CF-IDF .10 (.12) .10 (.12) .09 (.11) .17 (.17) .08 (.10) .12 (.12) .07 (.20) .06 (.22) .06 (.20)
Degree .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .07 (.21) .07 (.21) .07 (.20)
HITS .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .08 (.22) .08 (.22) .07 (.21)
PageRank .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .10 (.20) .04 (.08) .05 (.11)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .06 (.08) .14 (.16) .08 (.10) .10 (.10) .11 (.11) .10 (.09) .08 (.14) .05 (.08) .06 (.09)
CF-IDF .05 (.05) .06 (.07) .05 (.06) .09 (.10) .09 (.10) .08 (.09) .09 (.15) .04 (.08) .06 (.10)
Degree .01 (.03) .03 (.07) .01 (.04) .01 (.03) .03 (.07) .01 (.04) .11 (.14) .03 (.05) .05 (.07)
HITS .01 (.03) .02 (.06) .01 (.03) .01 (.03) .00 (.06) .01 (.03) .12 (.14) .04 (.06) .06 (.08)
PageRank .01 (.04) .03 (.08) .02 (.05) .01 (.04) .03 (.08) .02 (.05) .08 (.12) .03 (.05) .04 (.06)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .26 (.24) .20 (.18) .22 (.19) .54 (.30) .20 (.13) .29 (.17) .44 (.28) .25 (.18) .30 (.19)
CF-IDF .23 (.24) .18 (.18) .19 (.19) .54 (.29) .22 (.14) .30 (.17) .48 (.28) .20 (.14) .26 (.15)
Degree .09 (.15) .07 (.11) .07 (.12) .13 (.19) .05 (.07) .07 (.09) .48 (.29) .23 (.16) .29 (.18)
HITS .05 (.14) .04 (.09) .04 (.10) .11 (.18) .04 (.06) .06 (.09) .39 (.29) .26 (.21) .28 (.19)
PageRank .02 (.06) .02 (.05) .02 (.06) .03 (.08) .01 (.03) .02 (.05) .46 (.29) .25 (.18) .30 (.18)
Result Table: RAKE
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .08 (.14) .08 (.12) .08 (.12) .15 (.18) .07 (.08) .10 (.11) .34 (.33) .34 (.33) .33 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .04 (.07) .08 (.13) .05 (.08) .07 (.09) .08 (.09) .07 (.08) .31 (.23) .18 (.15) .22 (.17)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .24 (.24) .17 (.16) .19 (.17) .42 (.28) .15 (.10) .22 (.14) .42 (.27) .20 (.13) .25 (.15)
Result Table: LDA
Economics
kNN
Recall Precision F
Frequency .19 (.30) .19 (.30) .19 (.30)
Political Science
kNN
Recall Precision F
Frequency .15 (.19) .15 (.18) .14 (.17)
Computer Science
kNN
Recall Precision F
Frequency .28 (.27) .24 (.23) .24 (.22)
Materials
• Code
– https://github.com/ggb/ShortStories
• Datasets
– economics and political science
• not publicly available yet
• contact us directly if you are interested
– computer science
• publicly available
Presentation
• K-CAP 2015
– International Conference on Knowledge Capture
– Scope
• Knowledge Acquisition / Capture
• Knowledge Extraction from Text
• Semantic Web
• Knowledge Engineering and Modelling
• …
• Time slot
– Presentation: 25 minutes
– Q & A: 5 minutes
References
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation, JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender, WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Große-Bölting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases, ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles, JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, International Workshop on Semantic Evaluation, 2010.
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999.
• [Mihalcea & Paul 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts, EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, TR of Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents, Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept detection, ESWC, 2012.

About Multimedia Presentation Generation and Multimedia Metadata: From Synthe...
 
A Framework for Iterative Signing of Graph Data on the Web
A Framework for Iterative Signing of Graph Data on the WebA Framework for Iterative Signing of Graph Data on the Web
A Framework for Iterative Signing of Graph Data on the Web
 
Linked Open Data (Entwurfsprinzipien und Muster für vernetzte Daten)
Linked Open Data (Entwurfsprinzipien und Muster für vernetzte Daten)Linked Open Data (Entwurfsprinzipien und Muster für vernetzte Daten)
Linked Open Data (Entwurfsprinzipien und Muster für vernetzte Daten)
 
Events in Multimedia - Theory, Model, Application
Events in Multimedia - Theory, Model, ApplicationEvents in Multimedia - Theory, Model, Application
Events in Multimedia - Theory, Model, Application
 
Cvpr2007 object category recognition p1 - bag of words models
Cvpr2007 object category recognition   p1 - bag of words modelsCvpr2007 object category recognition   p1 - bag of words models
Cvpr2007 object category recognition p1 - bag of words models
 
Conceptual indexing
Conceptual indexingConceptual indexing
Conceptual indexing
 
Phrase Based Indexing and Information Retrivel
Phrase Based Indexing and Information RetrivelPhrase Based Indexing and Information Retrivel
Phrase Based Indexing and Information Retrivel
 
Semantic Analysis using Wikipedia Taxonomy
Semantic Analysis using Wikipedia TaxonomySemantic Analysis using Wikipedia Taxonomy
Semantic Analysis using Wikipedia Taxonomy
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
 
Concept-Based Information Retrieval using Explicit Semantic Analysis
Concept-Based Information Retrieval using Explicit Semantic AnalysisConcept-Based Information Retrieval using Explicit Semantic Analysis
Concept-Based Information Retrieval using Explicit Semantic Analysis
 
Sequential pattern mining
Sequential pattern miningSequential pattern mining
Sequential pattern mining
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
Patterns number and geometric
Patterns  number and geometricPatterns  number and geometric
Patterns number and geometric
 
Repeating and growing patterns
Repeating and growing patternsRepeating and growing patterns
Repeating and growing patterns
 
Machine Learning in Pathology Diagnostics with Simagis Live
Machine Learning in Pathology Diagnostics with Simagis LiveMachine Learning in Pathology Diagnostics with Simagis Live
Machine Learning in Pathology Diagnostics with Simagis Live
 
NLP
NLPNLP
NLP
 
Pattern recognition
Pattern recognitionPattern recognition
Pattern recognition
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slides
 

A Comparison of Different Strategies for Automated Semantic Document Annotation

  • 1. A Comparison of Different Strategies for Automated Semantic Document Annotation
      Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp
      Chifumi Nishioka, chni@informatik.uni-kiel.de, K-CAP 2015
  • 2. Motivation [1/2]
      • Document annotation
        – Helps users and search engines find documents
        – Requires a huge amount of human effort
        – e.g., subject indexers at ZBW labeled 1.6 million scientific documents in economics
      • Semantic document annotation
        – Documents are annotated with semantic entities
        – e.g., PubMed and MeSH, ACM DL and ACM CCS
      Focus on semantic document annotation
      Necessity of automated document annotation
  • 3. Motivation [2/2]
      • Small-scale experiments so far
        – Comparing a small number of strategies
        – Datasets containing a few hundred documents
      • Comparison of 43 strategies for document annotation within the developed experiment framework
        – The largest number of strategies
      • Experiments with three datasets from different domains
        – Contain full-texts of 100,000 documents annotated by subject indexers
        – The largest dataset of scientific publications
      We conducted the largest-scale experiment
  • 4. Experiment Framework
      Strategies are composed of methods from concept extraction, concept activation, and annotation selection
      1. Concept Extraction: detect concepts (candidate annotations) from each document
      2. Concept Activation: compute a score for each concept of a document
      3. Annotation Selection: select annotations from concepts for each document
      4. Evaluation: measure performance of strategies with ground truth
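The composition of a strategy from one method per stage can be sketched as function composition. This is a minimal illustration, not the authors' framework: the type aliases, `make_strategy`, and the three toy stage functions are hypothetical names, and the selector signature is simplified (a kNN selector would additionally need access to an annotated corpus).

```python
from typing import Callable, Dict, List, Set

# Hypothetical type aliases for the three stages (names are not from the slides).
Extractor = Callable[[str], List[str]]               # text -> concept mentions
Activator = Callable[[List[str]], Dict[str, float]]  # mentions -> concept scores
Selector = Callable[[Dict[str, float]], Set[str]]    # scores -> annotations


def make_strategy(extract: Extractor, activate: Activator,
                  select: Selector) -> Callable[[str], Set[str]]:
    """Compose one method per stage into a single annotation function."""
    def annotate(text: str) -> Set[str]:
        return select(activate(extract(text)))
    return annotate


def extract_words(text: str) -> List[str]:
    """Toy extraction: every lowercased whitespace-separated token is a concept."""
    return text.lower().split()


def activate_frequency(concepts: List[str]) -> Dict[str, float]:
    """Toy activation: score each concept by its number of mentions."""
    scores: Dict[str, float] = {}
    for concept in concepts:
        scores[concept] = scores.get(concept, 0.0) + 1.0
    return scores


def select_top_k(k: int) -> Selector:
    """Toy selection: keep the k best-scored concepts (ties broken alphabetically)."""
    def select(scores: Dict[str, float]) -> Set[str]:
        ranked = sorted(scores, key=lambda c: (-scores[c], c))
        return set(ranked[:k])
    return select


strategy = make_strategy(extract_words, activate_frequency, select_top_k(2))
annotations = strategy("bank tax bank crisis tax bank")
```

Swapping any one of the three arguments yields a different strategy, which is what makes a systematic comparison of combinations possible.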
  • 5. Research Question
      • Research questions addressed with the experiment framework
        (I) Which strategy performs best?
        (II) Which concept extraction method performs best?
        (III) Which concept activation method performs best?
        (IV) Which annotation selection method performs best?
  • 6. Concept Extraction [1/2]
      • Entity
        – Extract entities from documents using a domain-specific knowledge base
        – Domain-specific knowledge base
          • Entities (subjects) in a specific domain (e.g., medicine)
          • One or more labels for each entity
          • Relationships between entities
        – Detect entities by string matching with entity labels
      • Tri-gram
        – Extract contiguous sequences of one, two, and three words in a document
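The two extraction methods on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: the tokenizer, the function names, and the three-entry knowledge base with IDs `E1` to `E3` are hypothetical, and matching is done by exact lookup of lowercased n-grams against entity labels.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_ngrams(text: str, max_n: int = 3) -> List[str]:
    """Return all contiguous sequences of 1..max_n tokens."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def extract_entities(text: str, label_to_entity: Dict[str, str],
                     max_n: int = 3) -> List[str]:
    """Return the entity ID for every n-gram that exactly matches a label."""
    return [label_to_entity[g] for g in extract_ngrams(text, max_n)
            if g in label_to_entity]


# Hypothetical mini knowledge base: lowercased label -> entity ID.
labels = {"central bank": "E1", "bank": "E2", "interest rate": "E3"}
text = "The central bank raised the interest rate."
ngrams = extract_ngrams(text)
entities = extract_entities(text, labels)
```

Note that this naive matcher reports both "central bank" and the nested "bank"; whether nested matches should count is a design decision the slide does not specify.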
  • 7. Concept Extraction [2/2]
      • RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
        – Unsupervised method for extracting keywords
        – Incorporates co-occurrence and frequency of words
      • LDA (Latent Dirichlet Allocation) [Blei et al. 03]
        – Unsupervised topic modeling method for inferring latent topics in a document corpus
        – Topic model
          • Topic: a probability distribution over words
          • Document: a probability distribution over topics
        – Treat a topic as a concept
  • 8. Concept Activation [1/6]
      • Three types of concept activation methods
        – Statistical Methods
          • Baseline
          • Use only directly mentioned concepts
        – Hierarchy-based Methods
          • Reveal concepts that are not mentioned explicitly using a hierarchical knowledge base
        – Graph-based Methods
          • Use only directly mentioned concepts
          • Represent concept co-occurrences as a graph
      [Figure: example concept sequence (Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate) and the resulting co-occurrence graph with nodes Tax, Bank, Interest Rate, Financial Crisis, Central Bank]
  • 9. Concept Activation [2/6]
      • Statistical Methods
        – Frequency
          • $\mathrm{score}_{freq}(c, d) = \mathrm{freq}(c, d)$
          • $\mathrm{freq}(c, d)$ depends on the Concept Extraction method
            – The number of appearances (Entity and Tri-gram)
            – The score output by RAKE (RAKE)
            – The probability of a topic for a document $d$ (LDA)
        – CF-IDF [Goossen et al. 11]
          • An extension of TF-IDF replacing words with concepts
          • Lower scores for concepts that appear in many documents
          • $\mathrm{score}_{cfidf}(c, d) = \mathrm{cf}(c, d) \cdot \log \frac{|D|}{|\{d' \in D : c \in d'\}|}$
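A minimal sketch of CF-IDF as defined on this slide, assuming each document is already given as a list of concept mentions. The function name and the toy corpus are hypothetical, and the natural logarithm is an assumption since the slide does not state the base.

```python
import math
from collections import Counter
from typing import Dict, List


def cf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Score each concept of each document by concept frequency times log(|D| / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which a concept appears at least once.
    doc_freq: Counter = Counter()
    for concepts in docs:
        doc_freq.update(set(concepts))
    return [
        {c: cf * math.log(n_docs / doc_freq[c]) for c, cf in Counter(concepts).items()}
        for concepts in docs
    ]


# Hypothetical toy corpus of extracted concepts.
docs = [
    ["bank", "bank", "tax"],
    ["bank", "crisis"],
    ["bank", "tax", "tax"],
]
scores = cf_idf(docs)
```

In this toy corpus "bank" appears in every document, so its score is zero everywhere regardless of how often it is mentioned, which illustrates the down-weighting of concepts that appear in many documents.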
  • 10. Concept Activation [3/6]
      • Hierarchy-based Methods [1/2]
        – Base Activation
          • $\mathrm{score}_{base}(c, d) = \mathrm{freq}(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} \mathrm{score}_{base}(c_i, d)$
          • $C_l(c)$: the set of child concepts of a concept $c$
          • $\lambda$: decay parameter, set to $\lambda = 0.4$
          • e.g., $\mathrm{score}_{base}(c_1, d) = 1.00$, $\mathrm{score}_{base}(c_2, d) = 0.40$, $\mathrm{score}_{base}(c_3, d) = 0.16$
      [Figure: excerpt of a concept hierarchy with the nodes Social Recommendation, Social Tagging, Web Searching, Web Mining, Site Wrapping, Web Log Analysis, and World Wide Web; three of them are marked $c_1$, $c_2$, $c_3$]
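The recursive definition of Base Activation can be sketched directly. The function name and data layout are hypothetical, and the chain structure below (c1 is a child of c2, which is a child of c3, with only c1 mentioned once) is an assumed reading of the slide's example; it is consistent with the scores 1.00, 0.40, and 0.16 shown there.

```python
from typing import Dict, List


def base_activation(concept: str, freq: Dict[str, float],
                    children: Dict[str, List[str]], decay: float = 0.4) -> float:
    """score_base(c) = freq(c) + decay * sum of score_base over the children of c."""
    child_sum = sum(
        base_activation(child, freq, children, decay)
        for child in children.get(concept, [])
    )
    return freq.get(concept, 0.0) + decay * child_sum


# Assumed toy hierarchy: parent -> list of child concepts.
children = {"c3": ["c2"], "c2": ["c1"]}
# Only c1 is mentioned in the document, once.
freq = {"c1": 1.0}

score_c1 = base_activation("c1", freq, children)
score_c2 = base_activation("c2", freq, children)
score_c3 = base_activation("c3", freq, children)
```

Each level up the hierarchy multiplies the propagated score by the decay parameter, so activation fades with distance from the mentioned concept. The recursion assumes the hierarchy is acyclic.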
  • 11. Concept Activation [4/6]
      • Hierarchy-based Methods [2/2]
        – Branch Activation
          • $\mathrm{score}_{branch}(c, d) = \mathrm{freq}(c, d) + \lambda \cdot BN \cdot \sum_{c_i \in C_l(c)} \mathrm{score}_{branch}(c_i, d)$
          • $BN$: reciprocal of the number of concepts that are located one level above a concept $c$
        – OneHop Activation
          • $\mathrm{score}_{onehop}(c, d) = \begin{cases} \mathrm{freq}(c, d) & \text{if } |C_l(c) \cap C_d| \geq 2 \\ \mathrm{freq}(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} \mathrm{freq}(c_i, d) & \text{otherwise} \end{cases}$
          • $C_d$: set of concepts in a document $d$
          • Activates concepts in a maximum distance of one hop
  • 12. Concept Activation [5/6]
      • Graph-based Methods [1/2]
        – Degree [Zouaq et al. 12]
          • $\mathrm{score}_{degree}(c, d) = \mathrm{degree}(c, d)$
          • $\mathrm{degree}(c, d)$: the number of edges linked with a concept $c$
          • e.g., $\mathrm{score}_{degree}(\text{Bank}, d) = 3$
        – HITS [Kleinberg 99; Zouaq et al. 12]
          • Link analysis algorithm for search engines [Kleinberg 99]
          • $\mathrm{score}_{hits}(c, d) = \mathrm{hub}(c, d) + \mathrm{auth}(c, d)$
          • $\mathrm{hub}(c, d) = \sum_{c_i \in C_n(c)} \mathrm{auth}(c_i, d)$
          • $\mathrm{auth}(c, d) = \sum_{c_i \in C_n(c)} \mathrm{hub}(c_i, d)$
      [Figure: co-occurrence graph with nodes Tax, Bank, Interest Rate, Financial Crisis, Central Bank]
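Degree and HITS can be sketched on the example concept sequence shown on the Concept Activation [1/6] slide. Several points are assumptions rather than statements from the slides: edges are drawn between concepts that are adjacent in the sequence (this reproduces the degree of 3 for Bank), $C_n(c)$ is read as the set of neighbors of $c$, the graph is undirected, and hub and authority vectors are L2-normalized after each update as in the usual formulation of HITS. All function names are hypothetical.

```python
from typing import Dict, List, Set


def build_cooccurrence_graph(concepts: List[str]) -> Dict[str, Set[str]]:
    """Connect concepts that are adjacent in the sequence (undirected, no self-loops)."""
    graph: Dict[str, Set[str]] = {c: set() for c in concepts}
    for left, right in zip(concepts, concepts[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def degree_scores(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """score_degree(c) = number of edges linked with c."""
    return {c: float(len(neighbors)) for c, neighbors in graph.items()}


def _normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale a score vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in values.values()) ** 0.5 or 1.0
    return {c: v / norm for c, v in values.items()}


def hits_scores(graph: Dict[str, Set[str]], iterations: int = 50) -> Dict[str, float]:
    """score_hits(c) = hub(c) + auth(c), computed by alternating updates."""
    hub = {c: 1.0 for c in graph}
    auth = {c: 1.0 for c in graph}
    for _ in range(iterations):
        auth = _normalize({c: sum(hub[n] for n in graph[c]) for c in graph})
        hub = _normalize({c: sum(auth[n] for n in graph[c]) for c in graph})
    return {c: hub[c] + auth[c] for c in graph}


sequence = ["Bank", "Interest Rate", "Financial Crisis", "Bank",
            "Central Bank", "Tax", "Interest Rate"]
graph = build_cooccurrence_graph(sequence)
degree = degree_scores(graph)
hits = hits_scores(graph)
```

On an undirected graph the hub and authority vectors converge to the same values, so the HITS score here mainly reflects how well connected a concept's neighborhood is.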
  • 13. Concept Activation [6/6]
      • Graph-based Methods [2/2]
        – PageRank [Page et al. 99; Mihalcea & Paul 04]
          • Link analysis algorithm for search engines
          • Based on the intuition that a node that is linked from many important nodes is more important
          • $\mathrm{score}_{page}(c, d) = 1 - \mu + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{\mathrm{score}_{page}(c_i, d)}{|C_{out}(c_i)|}$
          • $C_{in}(c)$: set of concepts with edges pointing to $c$
          • $C_{out}(c)$: set of concepts that $c$ points to with outgoing edges
          • $\mu$: damping factor, $\mu = 0.85$
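The PageRank formula above can be evaluated by fixed-point iteration. This is a sketch under assumptions: the function name is hypothetical, every node must appear as a key with at least one outgoing edge, and the example graph is the co-occurrence graph of the deck's running example with each undirected edge treated as two directed edges.

```python
from typing import Dict, Set


def pagerank_scores(out_links: Dict[str, Set[str]], damping: float = 0.85,
                    iterations: int = 100) -> Dict[str, float]:
    """Iterate score(c) = 1 - mu + mu * sum(score(ci) / |C_out(ci)|) over C_in(c)."""
    # Invert the adjacency lists to obtain C_in(c) for every concept.
    in_links: Dict[str, Set[str]] = {c: set() for c in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].add(source)

    scores = {c: 1.0 for c in out_links}
    for _ in range(iterations):
        scores = {
            c: (1.0 - damping)
            + damping * sum(scores[s] / len(out_links[s]) for s in in_links[c])
            for c in out_links
        }
    return scores


# Assumed example graph: each undirected co-occurrence edge listed in both directions.
graph = {
    "Bank": {"Interest Rate", "Financial Crisis", "Central Bank"},
    "Interest Rate": {"Bank", "Financial Crisis", "Tax"},
    "Financial Crisis": {"Interest Rate", "Bank"},
    "Central Bank": {"Bank", "Tax"},
    "Tax": {"Central Bank", "Interest Rate"},
}
scores = pagerank_scores(graph)
```

With this unnormalized form of the formula, the scores sum to the number of nodes rather than to one.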
  • 14. Annotation Selection
      • Top-5 and Top-10
        – Select concepts whose scores are ranked in top-k
      • k Nearest Neighbor (kNN) [Huang et al. 11]
        – Based on the assumption that documents with similar concepts share similar annotations
          1. Compute similarity scores between a target document and all documents with annotations
          2. Select the union of the annotations of the k nearest documents
      [Figure: a target document (??) and annotated documents with similarity scores 0.49, 0.45, 0.42, and 0.60; annotations shown: Central bank, Law, Financial crisis, Finance, China, Human resource, Leadership, Marketing, Competition law]
      Example with $k = 2$: selected annotations are Finance; China; Marketing; Competition law
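The two kNN steps can be sketched as follows. The slide does not name the similarity measure, so cosine similarity over concept score vectors is an assumption here. The concept vectors and the pairing of documents with annotations are invented toy data, chosen so that the selected set mirrors the slide's example for k = 2; the function names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse concept score vectors."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_annotations(target: Dict[str, float],
                    corpus: List[Tuple[Dict[str, float], Set[str]]],
                    k: int) -> Set[str]:
    """Return the union of the annotations of the k most similar documents."""
    ranked = sorted(corpus, key=lambda doc: cosine(target, doc[0]), reverse=True)
    selected: Set[str] = set()
    for _, annotations in ranked[:k]:
        selected |= annotations
    return selected


# Invented toy data: (concept scores, annotations assigned by indexers).
corpus = [
    ({"Bank": 1.0, "Law": 3.0}, {"Central bank", "Law", "Financial crisis"}),
    ({"Bank": 2.0, "Trade": 2.0}, {"Finance", "China"}),
    ({"Leadership": 1.0}, {"Human resource", "Leadership"}),
    ({"Bank": 1.0, "Trade": 1.0, "Marketing": 1.0}, {"Marketing", "Competition law"}),
]
target = {"Bank": 2.0, "Trade": 1.0}
selected = knn_annotations(target, corpus, k=2)
```

Because the result is a union, the number of selected annotations varies per document, unlike Top-5 and Top-10.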
  • 15. Configurations [1/5]
      • Concept Extraction: Entity, Tri-gram, RAKE, LDA
      • Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 16. Configurations [2/5]
      24 strategies
      • Concept Extraction: Entity, Tri-gram, RAKE, LDA
      • Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 17. Configurations [3/5]
      15 strategies
      • Concept Extraction: Entity, Tri-gram, RAKE, LDA
      • Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 18. Configurations [4/5]
      3 strategies
      • Concept Extraction: Entity, Tri-gram, RAKE, LDA
      • Concept Activation: Statistical Method (Frequency), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 19. Configurations [5/5]
      43 strategies in total
      • Concept Extraction: Entity, Tri-gram, RAKE, LDA
      • Concept Activation: Statistical Method (Frequency), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
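The counts on the configuration slides (24, 15, and 3, with 43 in total) can be checked by enumerating combinations. The transcript does not preserve which boxes each diagram highlights, so the mapping below is only one reading that is consistent with the stated numbers: Entity with all eight activation methods, Tri-gram without the hierarchy-based methods, RAKE with Frequency only, and LDA with Frequency and kNN only. Treat the `allowed` table as an assumption.

```python
from itertools import product

statistical = ["Frequency", "CF-IDF"]
hierarchy = ["Base", "Branch", "OneHop"]
graph_based = ["Degree", "HITS", "PageRank"]
selection = ["Top-5", "Top-10", "kNN"]

# Assumed compatibility table: extraction -> (activation methods, selection methods).
allowed = {
    "Entity": (statistical + hierarchy + graph_based, selection),
    "Tri-gram": (statistical + graph_based, selection),
    "RAKE": (["Frequency"], selection),
    "LDA": (["Frequency"], ["kNN"]),
}

strategies = [
    (extraction, activation, select)
    for extraction, (activations, selections) in allowed.items()
    for activation, select in product(activations, selections)
]

# Number of strategies per concept extraction method.
per_extraction = {
    extraction: sum(1 for s in strategies if s[0] == extraction)
    for extraction in allowed
}
```

Under this reading, the single remaining strategy beyond 24 + 15 + 3 = 42 is the LDA configuration.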
  • 20. Datasets and Metrics of Experiments

                           Economics        Political Science     Computer Science
      publication          ZBW              FIV                   SemEval 2010
      # of publications    62,924           28,324                244
      # of annotations     5.26 (± 1.84)    12.00 (± 4.02)        5.05 (± 2.41)
      knowledge base       STW              European Thesaurus    ACM CCS
      # of entities        6,335            7,912                 2,299
      # of labels          11,679           8,421                 9,086

      • Computer Science: SemEval 2010 dataset [Kim et al. 10]
        – Publications are originally annotated with keywords
        – We converted keywords to entities by string matching
      • All publications and labels of entities are in English
      • We use full-texts of publications
      • All annotations are used as ground truth
      • Evaluation metrics: Precision, Recall, F-measure
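The three evaluation metrics can be sketched for a single document as set overlap between predicted and ground-truth annotations. The function name and the toy sets are hypothetical, and the slide does not say how scores are aggregated across documents, so no averaging is shown.

```python
from typing import Set, Tuple


def precision_recall_f(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one document's annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: four predicted annotations, three in the ground truth.
predicted = {"Bank", "Tax", "Trade", "Law"}
gold = {"Bank", "Tax", "Financial crisis"}
precision, recall, f_measure = precision_recall_f(predicted, gold)
```

Here two of the four predictions are correct and two of the three ground-truth annotations are found, so the F-measure balances a precision of 1/2 against a recall of 2/3.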
  • 21. (I) Best Performing Strategies
      • Economics and Political Science datasets
        – The best strategy: Entity × HITS × kNN
        – F-measure: 0.39 (economics), 0.28 (political science)
      • Computer Science dataset
        – The best strategy: Entity × Degree × kNN
        – F-measure: 0.33 (computer science)
      • Graph-based methods do not differ a lot
      In general, the document annotation strategy Entity × Graph-based method × kNN performs best
  • 22. (II) Influence of Concept Extraction
      • Concept Extraction method: Entity
        – Use domain-specific knowledge bases
        – Knowledge bases: freely available and of high quality
        – 32 thesauri listed in W3C SKOS Datasets
      For Concept Extraction methods, Entity consistently outperforms Tri-gram, RAKE, and LDA
  • 23. (III) Influence of Concept Activation
      • Poor performance of hierarchy-based methods
        – We use full-texts in the experiments
          • Full-texts contain so many different concepts (203.80 unique entities, SD: 24.50) that additional concepts do not have to be activated
        – However, OneHop can work as well as graph-based methods
          • It activates concepts within one hop distance
      For Concept Activation methods, graph-based methods are better than statistical methods or hierarchy-based methods
  • 24. (IV) Influence of Annotation Selection
      • kNN
        – No learning process
        – Confirms the assumption that documents with similar concepts share similar annotations
      For Annotation Selection methods, kNN can enhance the performance
  • 25. Conclusion
      • Large scale experiment for automated semantic document annotation for scientific publications
      • Best strategy: Entity × Graph-based method × kNN
        – Novel combination of methods
      • Best concept extraction method: Entity
      • Best concept activation method: Graph-based methods
        – OneHop can achieve similar performance at lower computation cost
  • 26. Thank you! Questions?
  • 28. Research Question
      • Research questions solved with the experiment framework
        (I) Which strategy performs best?
        (II) Which concept extraction method performs best?
        (III) Which concept activation method performs best?
        (IV) Which annotation selection method performs best?
  • 29. LDA (Latent Dirichlet Allocation)
      Figure source: D. M. Blei. Probabilistic topic models, CACM, 2012.
  • 30. Entity Extraction and Conversion
      • Entity extraction
        – String matching with entity labels
        – Starting with longer entity labels
          • e.g., from the text “financial crisis is …”, only the entity “financial crisis” is detected (not “crisis”)
      • Converting to entities
        – Words and keywords are extracted in Tri-gram and RAKE
        – They are converted to entities by string matching with entity labels before annotation selection
        – If no matching entity label is found, the word or keyword is discarded
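The "longer labels first" rule can be sketched as a greedy string matcher. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the label-to-entity dictionary, the case-insensitive word-boundary matching, and the example sentence are all hypothetical choices.

```python
import re


def extract_entities(text, label_to_entity):
    """Detect entities by string matching, trying longer labels first.

    ASSUMPTION: matching is case-insensitive and respects word boundaries.
    """
    remaining = text.lower()
    detected = []
    for label in sorted(label_to_entity, key=len, reverse=True):
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        if re.search(pattern, remaining):
            entity = label_to_entity[label]
            if entity not in detected:
                detected.append(entity)
            # Blank out the matched span so that shorter labels contained
            # in it (e.g. "crisis") are not detected a second time.
            remaining = re.sub(pattern, "\n", remaining)
    return detected


# Hypothetical two-label knowledge base and example sentence.
labels = {"financial crisis": "financial crisis", "crisis": "crisis"}
print(extract_entities("Financial crisis is a recurring topic", labels))
```

Because the matched span is removed before shorter labels are tried, "crisis" is only detected when it also occurs outside a longer match.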
  • 31. kNN [1/2]
      • Similarity measure
        – Each document is represented as a vector where each element is a score of a concept
        – Cosine similarity is used as a similarity measure
             GDP   Immigration   Population   Bank   Interest rate   Canada
      d1     0.3   0.5           0.8          0.1    0.0             0.5
      d2     0.6   0.0           0.4          0.8    0.4             0.2
      Cosine similarity between (0.3, 0.5, 0.8, 0.1, 0.0, 0.5) and (0.6, 0.0, 0.4, 0.8, 0.4, 0.2):
      $\mathrm{sim}(d_1, d_2) = 0.52$
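The similarity value on this slide can be reproduced with a few lines of Python. The function name is hypothetical, and returning 0.0 for an all-zero vector is an assumed convention that the slide does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two concept-score vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # ASSUMPTION: a document without concepts matches nothing
    return dot / (norm_u * norm_v)


# Concept order: GDP, Immigration, Population, Bank, Interest rate, Canada
d1 = [0.3, 0.5, 0.8, 0.1, 0.0, 0.5]
d2 = [0.6, 0.0, 0.4, 0.8, 0.4, 0.2]
print(round(cosine_similarity(d1, d2), 2))  # 0.52
```

The dot product is 0.68 and the vector norms are about 1.114 and 1.166, giving roughly 0.524, which rounds to the 0.52 shown on the slide.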
  • 32. kNN [2/2]
      Example: a document without annotations (“??”) and its annotated neighbors, with similarities 0.49, 0.45, 0.42, and 0.60
      – Annotations of the neighboring documents: Central bank, Law, Financial crisis, Finance, China, Human resource, Leadership, Marketing, Competition law
      • k = 1
        – Selected annotations: Marketing, Competition law
      • k = 2
        – Selected annotations: Marketing, Competition law, Finance, China
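The selection step in this example can be sketched as taking the annotations of the k most similar documents. This is an illustration only. The slide fixes just the two nearest neighbors (0.60 and 0.49, from the selected annotations); which documents carry the similarities 0.45 and 0.42 is an assumption made here for the sake of a runnable example, and the function name is hypothetical.

```python
def select_annotations(neighbors, k):
    """Collect the annotations of the k most similar annotated documents.

    neighbors: list of (similarity, annotations) pairs.
    """
    ranked = sorted(neighbors, key=lambda pair: pair[0], reverse=True)
    selected = []
    for _, annotations in ranked[:k]:
        for annotation in annotations:
            if annotation not in selected:
                selected.append(annotation)
    return selected


neighbors = [
    (0.60, ["Marketing", "Competition law"]),
    (0.49, ["Finance", "China"]),
    # ASSUMPTION: the pairing of the two lower similarities is not
    # recoverable from the slide text and is chosen arbitrarily.
    (0.45, ["Central bank", "Law", "Financial crisis"]),
    (0.42, ["Human resource", "Leadership"]),
]

print(select_annotations(neighbors, k=1))
print(select_annotations(neighbors, k=2))
```

With k = 1 only the annotations of the nearest neighbor are returned; with k = 2 the annotations of the second-nearest neighbor are appended, matching the two cases on the slide.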
  • 33. Evaluation Metrics
      • Precision
        $\mathrm{precision} = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{retrieved annotations}\}|}$
      • Recall
        $\mathrm{recall} = \frac{|\{\text{relevant annotations}\} \cap \{\text{retrieved annotations}\}|}{|\{\text{relevant annotations}\}|}$
      • F-measure
        $F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
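For a single document, the three metrics can be computed from two sets of annotations. The sketch below is a minimal illustration: the function name and the example annotation sets are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f(relevant, retrieved):
    """Per-document precision, recall, and F-measure over annotation sets."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    # ASSUMPTION: empty denominators yield 0.0 instead of raising an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: ground truth from a subject indexer vs. a strategy.
relevant = {"Finance", "China", "Central bank"}
retrieved = {"Finance", "China", "Marketing", "Law"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)
```

Here two of the four retrieved annotations are relevant (precision 0.5) and two of the three relevant annotations are retrieved (recall 2/3), so the F-measure is 4/7.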
  • 34. Datasets
      • Economics dataset
        – 11 GB
      • Political science dataset
        – 3.8 GB
  • 35. Experiments
      • Preprocessing documents
        – lemmatization
        – stop word removal
      • 10-fold cross validation
        – split a dataset into 10 equal-sized subsets
        – 8 subsets for training data
        – 1 subset for testing data
        – 1 subset for optimizing parameters
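The 8/1/1 split can be sketched as follows. This is one possible reading of the slide, not the authors' code: the function name is hypothetical, and using the fold after the test fold for parameter optimization is an assumption, since the slide does not say how that subset is chosen in each round.

```python
def cross_validation_splits(doc_ids, n_folds=10):
    """Yield (train, tuning, test) lists for each of the n_folds rounds."""
    # Striding keeps fold sizes within one document of each other.
    folds = [doc_ids[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        j = (i + 1) % n_folds  # ASSUMPTION: next fold is the tuning fold
        test = folds[i]
        tuning = folds[j]
        train = [
            doc
            for idx, fold in enumerate(folds)
            if idx not in (i, j)
            for doc in fold
        ]
        yield train, tuning, test


splits = list(cross_validation_splits(list(range(100))))
train, tuning, test = splits[0]
print(len(splits), len(train), len(tuning), len(test))
```

Each document appears in the test subset exactly once across the ten rounds, and the three subsets of a round never overlap.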
  • 36. Result Table: Entity [1/2]
      Economics
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .14 (.17)  .14 (.15)  .13 (.15)  .22 (.20)  .11 (.10)  .14 (.12)  .08 (.21)  .08 (.21)  .08 (.21)
      CF-IDF       .19 (.19)  .18 (.17)  .18 (.16)  .24 (.21)  .12 (.10)  .15 (.12)  .29 (.32)  .30 (.32)  .29 (.31)
      Base Act.    .10 (.14)  .09 (.13)  .09 (.13)  .18 (.19)  .09 (.09)  .12 (.11)  .20 (.30)  .20 (.30)  .20 (.29)
      Branch Act.  .08 (.14)  .08 (.12)  .08 (.12)  .17 (.19)  .08 (.09)  .11 (.11)  .17 (.28)  .17 (.28)  .17 (.27)
      OneHop       .12 (.16)  .12 (.14)  .12 (.14)  .19 (.19)  .09 (.09)  .12 (.11)  .35 (.34)  .36 (.34)  .35 (.33)
      Degree       .15 (.17)  .14 (.15)  .14 (.15)  .23 (.20)  .11 (.09)  .14 (.12)  .39 (.33)  .40 (.33)  .38 (.32)
      HITS         .14 (.17)  .14 (.15)  .14 (.15)  .23 (.20)  .11 (.10)  .14 (.12)  .40 (.32)  .40 (.32)  .39 (.31)
      PageRank     .14 (.17)  .14 (.15)  .14 (.15)  .22 (.20)  .11 (.09)  .14 (.12)  .39 (.33)  .40 (.33)  .38 (.32)
      Political Science
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .12 (.11)  .18 (.16)  .14 (.12)  .15 (.13)  .12 (.10)  .13 (.10)  .14 (.17)  .05 (.07)  .07 (.09)
      CF-IDF       .05 (.07)  .12 (.16)  .07 (.10)  .07 (.09)  .08 (.10)  .07 (.09)  .24 (.22)  .14 (.14)  .17 (.16)
      Base Act.    .05 (.08)  .10 (.13)  .07 (.09)  .10 (.10)  .10 (.09)  .09 (.09)  .14 (.19)  .07 (.10)  .09 (.12)
      Branch Act.  .04 (.07)  .08 (.12)  .05 (.08)  .09 (.09)  .09 (.09)  .08 (.09)  .12 (.17)  .06 (.10)  .08 (.11)
      OneHop       .10 (.09)  .21 (.17)  .13 (.11)  .13 (.11)  .14 (.11)  .13 (.10)  .27 (.21)  .26 (.21)  .25 (.19)
      Degree       .10 (.09)  .21 (.17)  .13 (.11)  .13 (.11)  .14 (.11)  .13 (.10)  .29 (.21)  .28 (.21)  .27 (.19)
      HITS         .10 (.09)  .21 (.17)  .13 (.11)  .13 (.11)  .14 (.11)  .13 (.10)  .30 (.22)  .29 (.21)  .28 (.20)
      PageRank     .10 (.09)  .20 (.17)  .13 (.11)  .13 (.10)  .14 (.11)  .13 (.10)  .29 (.22)  .29 (.21)  .27 (.20)
  • 37. Result Table: Entity [2/2]
      Computer Science
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .18 (.21)  .14 (.15)  .15 (.16)  .22 (.22)  .08 (.08)  .12 (.11)  .49 (.28)  .24 (.16)  .30 (.17)
      CF-IDF       .02 (.08)  .02 (.06)  .02 (.06)  .03 (.11)  .01 (.04)  .02 (.05)  .47 (.29)  .23 (.17)  .29 (.18)
      Base Act.    .17 (.20)  .13 (.14)  .14 (.15)  .22 (.22)  .08 (.08)  .12 (.11)  .49 (.28)  .22 (.15)  .29 (.17)
      Branch Act.  .17 (.20)  .12 (.14)  .14 (.15)  .21 (.22)  .08 (.08)  .11 (.11)  .50 (.28)  .22 (.15)  .29 (.17)
      OneHop       .17 (.20)  .13 (.14)  .14 (.15)  .21 (.22)  .08 (.08)  .11 (.11)  .42 (.30)  .25 (.21)  .29 (.20)
      Degree       .17 (.21)  .13 (.15)  .14 (.16)  .22 (.22)  .08 (.08)  .12 (.11)  .49 (.28)  .27 (.17)  .33 (.18)
      HITS         .18 (.21)  .14 (.15)  .15 (.16)  .21 (.22)  .08 (.08)  .11 (.11)  .48 (.31)  .27 (.18)  .32 (.20)
      PageRank     .17 (.21)  .13 (.15)  .14 (.16)  .22 (.22)  .08 (.08)  .12 (.11)  .50 (.29)  .25 (.15)  .31 (.18)
  • 38. Result Table: Tri-gram
      Economics
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .12 (.15)  .12 (.14)  .11 (.14)  .19 (.19)  .10 (.10)  .13 (.12)  .08 (.22)  .08 (.22)  .08 (.21)
      CF-IDF       .10 (.12)  .10 (.12)  .09 (.11)  .17 (.17)  .08 (.10)  .12 (.12)  .07 (.20)  .06 (.22)  .06 (.20)
      Degree       .03 (.09)  .03 (.08)  .03 (.08)  .03 (.09)  .03 (.08)  .03 (.08)  .07 (.21)  .07 (.21)  .07 (.20)
      HITS         .02 (.06)  .02 (.06)  .02 (.06)  .02 (.06)  .02 (.06)  .02 (.06)  .08 (.22)  .08 (.22)  .07 (.21)
      PageRank     .03 (.09)  .03 (.08)  .03 (.08)  .03 (.09)  .03 (.08)  .03 (.08)  .10 (.20)  .04 (.08)  .05 (.11)
      Political Science
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .06 (.08)  .14 (.16)  .08 (.10)  .10 (.10)  .11 (.11)  .10 (.09)  .08 (.14)  .05 (.08)  .06 (.09)
      CF-IDF       .05 (.05)  .06 (.07)  .05 (.06)  .09 (.10)  .09 (.10)  .08 (.09)  .09 (.15)  .04 (.08)  .06 (.10)
      Degree       .01 (.03)  .03 (.07)  .01 (.04)  .01 (.03)  .03 (.07)  .01 (.04)  .11 (.14)  .03 (.05)  .05 (.07)
      HITS         .01 (.03)  .02 (.06)  .01 (.03)  .01 (.03)  .00 (.06)  .01 (.03)  .12 (.14)  .04 (.06)  .06 (.08)
      PageRank     .01 (.04)  .03 (.08)  .02 (.05)  .01 (.04)  .03 (.08)  .02 (.05)  .08 (.12)  .03 (.05)  .04 (.06)
      Computer Science
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .26 (.24)  .20 (.18)  .22 (.19)  .54 (.30)  .20 (.13)  .29 (.17)  .44 (.28)  .25 (.18)  .30 (.19)
      CF-IDF       .23 (.24)  .18 (.18)  .19 (.19)  .54 (.29)  .22 (.14)  .30 (.17)  .48 (.28)  .20 (.14)  .26 (.15)
      Degree       .09 (.15)  .07 (.11)  .07 (.12)  .13 (.19)  .05 (.07)  .07 (.09)  .48 (.29)  .23 (.16)  .29 (.18)
      HITS         .05 (.14)  .04 (.09)  .04 (.10)  .11 (.18)  .04 (.06)  .06 (.09)  .39 (.29)  .26 (.21)  .28 (.19)
      PageRank     .02 (.06)  .02 (.05)  .02 (.06)  .03 (.08)  .01 (.03)  .02 (.05)  .46 (.29)  .25 (.18)  .30 (.18)
  • 39. Result Table: RAKE
      Economics
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .08 (.14)  .08 (.12)  .08 (.12)  .15 (.18)  .07 (.08)  .10 (.11)  .34 (.33)  .34 (.33)  .33 (.32)
      Political Science
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .04 (.07)  .08 (.13)  .05 (.08)  .07 (.09)  .08 (.09)  .07 (.08)  .31 (.23)  .18 (.15)  .22 (.17)
      Computer Science
                   top-5                            top-10                           kNN
                   Recall     Precision  F          Recall     Precision  F          Recall     Precision  F
      Frequency    .24 (.24)  .17 (.16)  .19 (.17)  .42 (.28)  .15 (.10)  .22 (.14)  .42 (.27)  .20 (.13)  .25 (.15)
  • 40. Result Table: LDA
      Economics
                   kNN
                   Recall     Precision  F
      Frequency    .19 (.30)  .19 (.30)  .19 (.30)
      Political Science
                   kNN
                   Recall     Precision  F
      Frequency    .15 (.19)  .15 (.18)  .14 (.17)
      Computer Science
                   kNN
                   Recall     Precision  F
      Frequency    .28 (.27)  .24 (.23)  .24 (.22)
  • 41. Materials
      • Code
        – https://github.com/ggb/ShortStories
      • Datasets
        – economics and political science
          • not publicly available yet
          • contact us directly if you are interested
        – computer science
          • publicly available
  • 42. Presentation
      • K-CAP 2015
        – International Conference on Knowledge Capture
        – Scope
          • Knowledge Acquisition / Capture
          • Knowledge Extraction from Text
          • Semantic Web
          • Knowledge Engineering and Modelling
          • …
      • Time slot
        – Presentation: 25 minutes
        – Q & A: 5 minutes
  • 43. Reference
      • [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation, JMLR, 2003.
      • [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
      • [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender, WIMS, 2011.
      • [Grosse-Bolting et al. 15] G. Große-Bölting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases, ICSC, 2015.
      • [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles, JAMIA, 2011.
      • [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
      • [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, International Workshop on Semantic Evaluation, 2010.
  • 44. Reference
      • [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment, Journal of the ACM, 1999.
      • [Mihalcea & Paul 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts, EMNLP, 2004.
      • [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, TR of Stanford InfoLab, 1999.
      • [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents, Text Mining, 2010.
      • [Zouaq et al. 12] A. Zouaq, D. Gasevic, and M. Hatala. Voting theory for concept detection, ESWC, 2012.