Over the last two decades frequent itemset and association
rule mining has attracted huge attention from the scientific community
which resulted in numerous publications, models, algorithms, and optimizations
of basic frameworks. In this paper we introduce an extension
of the frequent itemset framework, called substitutive itemsets. Substitutive
itemsets allow to discover equivalences between items, i.e., they
represent pairs of items that can be used interchangeably in many contexts.
In the paper we present basic notions pertaining to substitutive
itemsets, describe the implementation of the proposed method available
as a RapidMiner plugin, and illustrate the use of the framework for mining
substitutive object properties in the Linked Data.
RuleML2015: Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked Data
1. Using Substitutive Itemset Mining Framework for
Finding Synonymous Properties in Linked Data
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski
Poznan University of Technology, Poland
August 3rd, 2015
RuleML 2015
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
2. Outline
Motivating Scenario
Substitutive Sets Mining
Finding Synonymous Properties with Substitutive Sets Mining
Summary
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
3. Motivating scenario 1/3
Wiki
(mappings
Wikipedia
infoboxes
-‐>
DBpedia
ontology)
Norah_Jones
Denton,_Texas
dbpedia-‐prop:origin
Mark_Knopfler
Gosforth
dbpedia-‐owl:hometown
Peter_Gabriel
Godalming
dbpedia-‐owl:hometown
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
4. Motivating scenario 2/3
DBpedia 2014 ontology has 1310 object and 1725 data properties
Many large Linked Data use relatively lightweight schemas with a
high number of object properties
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
5. Motivating scenario 3/3
Wiki
(mappings
Wikipedia
infoboxes
-‐>
DBpedia
ontology)
dbpedia-‐owl:MusicalArAst
dbpedia-‐owl:PopulatedPlace
Norah_Jones
Denton,_Texas
dbpedia-‐prop:origin
Mark_Knopfler
Gosforth
dbpedia-‐owl:hometown
Peter_Gabriel
Godalming
dbpedia-‐owl:hometown
dbpedia-‐owl:MusicalArAst
dbpedia-‐owl:PopulatedPlace
dbpedia-‐owl:MusicalArAst
dbpedia-‐owl:PopulatedPlace
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
6. Substitutive Sets Mining Framework
Frequent(
Itemset(Mining(
Subs1tu1ve(Set(
Genera1on(
Transac1on(DB(
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
7. Frequent Itemsets
I = {i1,i2,...,im} - a set of items
DT = {t1,t2,...,tn}, where ∀iti ⊆ I -a database of transactions
support(X) =
{t∈DT X⊆t}
DT
ID Items
1 Nachos, Pepsi, Salsa
2 Nachos, Coca-Cola, Salsa
3 Nachos, Coca-Cola
4 Nachos, Pepsi, Salsa
5 Milk, Bread
Frequent Itemset Support
{Nachos} 80%
{Salsa} 60%
{Coca-Cola} 40%
{Pepsi} 40%
{Nachos, Salsa} 60%
{Nachos, Coca-Cola} 40%
{Nachos, Pepsi} 40%
{Salsa, Pepsi} 40%
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
8. Covering Set
CS(i L) = {X ∈ L {i} ∪ X ∈ L}
coverage(i L) = CS(i L)
Frequent Itemset
{Nachos}
{Salsa}
{Coca-Cola}
{Pepsi}
{Nachos, Salsa}
{Nachos, Coca-Cola}
{Nachos, Pepsi}
{Salsa, Pepsi}
i CS(i) coverage
{Nachos} {{Salsa}, {Coca-Cola}, {Pepsi}} 3
{Salsa} {{Nachos}} 1
{Coca-Cola} {{Nachos}} 1
{Pepsi} {{Nachos}, {Salsa}} 2
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
9. Substitutive Sets
A two-element itemset {x,y} is a substitutive itemset, if:
x ∈ L1,
y ∈ L1,
support({x} ∪ {y}) < ε, where ε is a user-defined threshold
representing the highest amount of noise in the data allowed,
CS(x L)∩CS(y L)
max{ CS L(x) , CS(y L) } ⩾ mincommon.
i CS(i) coverage
{Nachos} {{Salsa}, {Coca-Cola}, {Pepsi}} 3
{Salsa} {{Nachos}} 1
{Coca-Cola} {{Nachos}} 1
{Pepsi} {{Nachos}, {Salsa}} 2
CS(Pepsi)∩CS(Coca−Cola)
max{ CS(Pepsi) , CS(Coca−Cola) }
= 0.5
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
10. Create Substitutive Sets RapidMiner operator
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
11. Use Case: DBpedia
DBpedia knowledge base version 2014
sets of 3–item transactions {c1,p,c2}, where c1 and c2 classes of
subject and object of RDF triple, and p property connecting s and o
SELECT ?c1 ?p ?c2
WHERE {
?s rdf:type dbpedia-owl:Organization .
?s ?p ?o .
?s rdf:type ?c1 .
?o rdf:type ?c2 .
FILTER(?p != dbpedia-owl:wikiPageWikiLink) .
FILTER(?p != rdf:type) .
FILTER(?p != dbpedia-owl:wikiPageExternalLink) .
FILTER(?p != dbpedia-owl:wikiPageID) .
FILTER(?p != dbpedia-owl:wikiPageInterLanguageLink) .
FILTER(?p != dbpedia-owl:wikiPageLength) .
FILTER(?p != dbpedia-owl:wikiPageOutDegree) .
FILTER(?p != dbpedia-owl:wikiPageRedirects) .
FILTER(?p != dbpedia-owl:wikiPageRevisionID)
}
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
13. Experimental Setup
FP-Growth: min number of itemsets = 500, max number of retries =
15, min support: 1.0E-4),
Create Substitutive Sets: min support = 1.0E-4, min common
= 0.7, epsilon =1.0E-5,
a sample of 100k results per each query,
desktop computer with 12GB RAM and CPU Intel(R) Core(TM)
i5-4570 3.20GHz,
a single run of mining substitutive sets (for a single class and 100k
transactions) took several seconds on average (ranging from 2s to
12s)
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
14. Sample substitutive properties for the class Organisation
Item X Item Y Common
dbpprop:parentOrganization dbo:parentOrganisation 1.000
dbpprop:owner dbo:owner 1.000
dbpprop:origin dbo:hometown 1.000
dbpprop:headquarters dbpprop:parentOrganization 1.000
dbpprop:formerAffiliations dbo:formerBroadcastNetwork 1.000
dbo:product dbpprop:products 1.000
dbpprop:keyPeople dbo:keyPerson 0.910
dbpprop:commandStructure dbpprop:branch 0.857
dbo:schoolPatron dbo:foundedBy 0.835
dbpprop:notableCommanders dbo:notableCommander 0.824
dbo:recordLabel dbpprop:label 0.803
dbo:headquarter dbo:locationCountry 0.803
dbpprop:country dbo:state 0.753
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
15. Summary
Introduced a model for substitutive itemsets mining
Preliminary tests of this model within the task of deduplication of
object properties in an RDF dataset (DBpedia)
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15
16. Acknowledgements
Foundation for Polish Science under the POMOST programme,
cofinanced from European Union, Regional Development Fund (No
POMOST/2013-7/8) (2013-2015)
EU FP7 ICT-2007.4.4 (No 231519) ”e-LICO: An e-Laboratory for
Interdisciplinary Collaborative Research in Data Mining and
Data-Intensive Science” (2009-2012)
Thanks to Ewa Kowalczuk for debugging the RapidMiner plugin
Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski ( Poznan University of Technology, Poland )Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked
August 3rd, 2015 RuleML 20
/ 15