PhD Dissertation Talk, 22 April 2011
----
This thesis addresses the problem of mining numerical data, and especially gene expression data. These data characterize the behaviour of thousands of genes in various biological situations (time, cell, etc.).
A difficult task is to cluster genes into classes with similar behaviour, which are assumed to be involved together in a biological process.
Accordingly, we are interested in designing and comparing methods in the field of knowledge discovery from biological data. We study how the conceptual classification method called Formal Concept Analysis (FCA) can handle the problem of extracting interesting classes of genes. For this purpose, we have designed and experimented with several original methods based on an extension of FCA called pattern structures. Furthermore, we show that these methods can enhance decision making in agronomy and crop sanity in the vast formal domain of information fusion.
On the Mining of Numerical Data with Formal Concept Analysis
1. On the Mining of Numerical Data with Formal Concept Analysis
Doctoral thesis in computer science
Mehdi Kaytoue
22 April 2011
Amedeo Napoli, Sébastien Duplessis
2. Somewhere... in a temperate forest...
3. Context
A biological problem: how does symbiosis work at the cellular level?
Analyse biological processes
Find genes involved in symbiosis
Choose a model for understanding symbiosis: Laccaria bicolor
Analysing Gene Expression Data (GED)
F. Martin et al.
The Genome of Laccaria bicolor Provides Insights into Mycorrhizal Symbiosis.
In Nature, 2008.
4. Context
Gene expression data (GED)
A numerical dataset, or data table, with
genes in rows
biological situations in columns
in each cell, the expression value of the row's gene in the column's situation.
A row denotes the expression profile of a gene (GEP)
m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 5 8 5
Biological hypothesis
A group of genes having a similar expression profile interact together within the same biological process
5. Context
With very large datasets...
Gene expression data of Laccaria bicolor
22,294 genes
3 types of biological situations reflecting cells of the organism in various stages of its biological cycle:
free living mycelium
symbiotic tissues
fruiting bodies
Attribute values range over [0, 65000]
6. Context
Knowledge discovery in databases
An iterative and interactive process
U. Fayyad, G. Piatetsky-Shapiro and P. Smyth
The KDD process for Extracting Useful Knowledge from Volumes of Data.
In Commun. ACM, 1996.
7. Context
Mining gene expression data
Extracting (maximal) rectangles in numerical data
A set of genes co-expressed in some biological situations
Local patterns: biological processes may be activated in some situations only
Overlapping patterns: a gene may be involved in several biological processes
m1 m2 m3 m4 m5
g1 1 2 2 1 6
g2 2 1 1 0 6
g3 2 2 1 7 6
g4 8 9 2 6 7
Biclustering: A difficult problem relying on heuristics
R. Peeters
The Maximum Edge Biclique Problem is NP-Complete.
In Discrete Applied Math., vol. 131, no. 3, 2003.
8. Context
Core of the thesis
Mining gene expression data with formal concept analysis
Turning GED into binary data, encoding over/under expression
Bringing the problem into well-known settings
Allowing a complete and mathematically well defined approach
Exploiting algorithms and “tools”
m1 m2 m3 m4 m5
g1 1 2 2 1 6
g2 2 1 1 5 6
g3 2 2 1 7 6
g4 8 9 2 6 7
⇒
m1 m2 m3 m4 m5
g1 0 0 0 0 1
g2 0 0 0 0 1
g3 0 0 0 1 1
g4 1 1 0 1 1
Can we work with FCA directly on numerical data?
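The binarization shown on this slide can be sketched as a simple thresholding. The slide does not state the threshold; the value 6 used below is an assumption, chosen only because it reproduces the binary table above, and the names `THRESHOLD` and `binarize` are illustrative.

```python
# Numerical table from the slide (rows: genes, columns: m1..m5).
ged = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 5, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}

# Assumption: the slide gives no threshold; 6 happens to reproduce its binary table.
THRESHOLD = 6

def binarize(table, threshold):
    """Encode a value as 1 (over-expressed) when it reaches the threshold, else 0."""
    return {g: [int(v >= threshold) for v in row] for g, row in table.items()}

binary = binarize(ged, THRESHOLD)
for gene, row in binary.items():
    print(gene, *row)
```

A different threshold yields a different binary table, which is one reason to ask whether FCA can work on the numbers directly.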
10. Context
Outline
1 Context
2 Formal Concept Analysis
3 Contributions
Interval pattern structures
Introducing similarity
A KDD-oriented discussion
4 Conclusion and perspectives
11. Formal Concept Analysis
A binary table as a formal context
Given by (G, M, I) with
G a set of objects
M a set of attributes
I a binary relation between objects and attributes:
(g, m) ∈ I means that “object g owns attribute m”
     m1  m2  m3
g1   ×       ×
g2   ×   ×
g3       ×   ×
g4       ×   ×
g5   ×   ×   ×
G = {g1, . . . , g5}
M = {m1, m2, m3}
(g1, m3) ∈ I
B. Ganter and R. Wille
Formal Concept Analysis: Mathematical Foundations.
Springer, 1999.
12. Formal Concept Analysis
A maximal rectangle as a formal concept
A Galois connection to characterize formal concepts
A′ = {m ∈ M | ∀g ∈ A : (g, m) ∈ I} for A ⊆ G
B′ = {g ∈ G | ∀m ∈ B : (g, m) ∈ I} for B ⊆ M
(A, B) is a concept with extent A = B′ and intent B = A′
{g3}′ = {m2, m3}
{m2, m3}′ = {g3, g4, g5}
     m1  m2  m3
g1   ×       ×
g2   ×   ×
g3       ×   ×
g4       ×   ×
g5   ×   ×   ×
({g3, g4, g5}, {m2, m3}) is a formal concept
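As a minimal sketch, the two derivation operators of the Galois connection can be written directly over the example context. The cross-table used here is the one that can be read off from the concepts quoted on these slides; the function names `intent`, `extent` and `is_concept` are illustrative, not taken from the thesis.

```python
# Formal context: each object is mapped to the set of attributes it owns.
context = {
    "g1": {"m1", "m3"},
    "g2": {"m1", "m2"},
    "g3": {"m2", "m3"},
    "g4": {"m2", "m3"},
    "g5": {"m1", "m2", "m3"},
}
attributes = {"m1", "m2", "m3"}

def intent(objects):
    """A' : the attributes shared by every object of A."""
    shared = set(attributes)
    for g in objects:
        shared &= context[g]
    return shared

def extent(attrs):
    """B' : the objects owning every attribute of B."""
    return {g for g, owned in context.items() if attrs <= owned}

def is_concept(objects, attrs):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return intent(objects) == attrs and extent(attrs) == objects

print(intent({"g3"}))
print(extent({"m2", "m3"}))
print(is_concept({"g3", "g4", "g5"}, {"m2", "m3"}))
```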
13. Formal Concept Analysis
Concept lattice
Ordered set of concepts...
(A1, B1) ≤ (A2, B2) ⇔ A1 ⊆ A2 (⇔ B2 ⊆ B1)
({g1, g5}, {m1, m3}) ≤ ({g1, g2, g5}, {m1})
... with interesting properties
Maximality of concepts as rectangles
Overlapping of concepts
Specialisation/generalisation hierarchy
Synthetic representation of the data without loss of information
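To make the ordering concrete, the sketch below enumerates every concept of the small example context by closing each subset of attributes, then checks the order relation quoted above. This brute-force enumeration is only meant for tiny contexts and is not the algorithm used in the thesis; the names `all_concepts` and `leq` are illustrative.

```python
from itertools import combinations

# Example context, as read off from the concepts quoted on the slides.
context = {
    "g1": {"m1", "m3"},
    "g2": {"m1", "m2"},
    "g3": {"m2", "m3"},
    "g4": {"m2", "m3"},
    "g5": {"m1", "m2", "m3"},
}
attributes = sorted({m for owned in context.values() for m in owned})

def extent(attrs):
    """Objects owning every attribute in attrs."""
    return {g for g, owned in context.items() if attrs <= owned}

def intent(objects):
    """Attributes shared by every object in objects."""
    shared = set(attributes)
    for g in objects:
        shared &= context[g]
    return shared

def all_concepts():
    """Close every attribute subset; distinct (extent, intent) pairs are the concepts."""
    concepts = set()
    for size in range(len(attributes) + 1):
        for attrs in combinations(attributes, size):
            ext = extent(set(attrs))
            concepts.add((frozenset(ext), frozenset(intent(ext))))
    return concepts

def leq(c1, c2):
    """(A1, B1) <= (A2, B2) iff A1 is a subset of A2."""
    return c1[0] <= c2[0]

concepts = all_concepts()
lower = (frozenset({"g1", "g5"}), frozenset({"m1", "m3"}))
upper = (frozenset({"g1", "g2", "g5"}), frozenset({"m1"}))
print(len(concepts), leq(lower, upper))
```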
14. Formal Concept Analysis
Handling numerical data with FCA?
Initial problem
Extracting groups of genes with similar numerical values
Conceptual scaling (discretization or binarization)
An object has an attribute if its value lies in a predefined interval
m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 5 8 5
     m1 ∈ [4, 5]   m2 ∈ [4, 7]   m3 ∈ [5, 6]
g1        ×             ×             ×
g2
g3        ×                           ×
g4        ×
g5        ×                           ×
Different scalings: different interpretations of the data
General problem of the thesis
How to directly build a concept lattice from numerical data?
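The scaling step of this slide can be sketched as follows: an object receives a binary attribute when its value lies in the interval chosen for that attribute. The intervals are those of the slide; the function name `scale_context` is illustrative.

```python
# Numerical table from the slide (columns m1, m2, m3).
data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}
attribute_order = ["m1", "m2", "m3"]

# Predefined intervals from the slide: one interval per attribute.
scale = {"m1": (4, 5), "m2": (4, 7), "m3": (5, 6)}

def scale_context(table, intervals):
    """Binary context: g owns m when its value for m falls inside the interval of m."""
    scaled = {}
    for g, row in table.items():
        scaled[g] = {
            m for m, v in zip(attribute_order, row)
            if intervals[m][0] <= v <= intervals[m][1]
        }
    return scaled

scaled = scale_context(data, scale)
for g in sorted(scaled):
    print(g, sorted(scaled[g]))
```

Changing `scale` changes the resulting context, which illustrates why different scalings lead to different interpretations of the same data.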
16. Contributions – Interval pattern structures
How to handle complex descriptions
An intersection as a similarity operator
∩ behaves as a similarity operator
{m1, m2} ∩ {m1, m3} = {m1}
∩ induces an ordering relation ⊆
N ∩ O = N ⇐⇒ N ⊆ O
{m1} ∩ {m1, m2} = {m1} ⇐⇒ {m1} ⊆ {m1, m2}
∩ has the properties of a meet ⊓ in a semi-lattice:
a commutative, associative and idempotent operation
c ⊓ d = c ⇐⇒ c ⊑ d
A. Tversky
Features of similarity.
In Psychological Review, 84 (4), 1977.
17. Contributions – Interval pattern structures
Pattern structure
Given by (G, (D, ⊓), δ)
G a set of objects
(D, ⊓) a semi-lattice of descriptions or patterns
δ a mapping such that δ(g) ∈ D describes object g
A Galois connection
A□ = ⊓_{g ∈ A} δ(g) for A ⊆ G
d□ = {g ∈ G | d ⊑ δ(g)} for d ∈ (D, ⊓)
B. Ganter and S. O. Kuznetsov
Pattern Structures and their Projections.
In International Conference on Conceptual Structures, 2001.
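A minimal sketch of this Galois connection, instantiated for interval patterns. The slide defining the interval meet is not part of this transcript, so the meet used below (the smallest interval containing both operands, taken attribute by attribute) is an assumption; it agrees with the diagram 4, 5, 6 / [4,5], [5,6] / [4,6] shown on later slides. All function names are illustrative.

```python
from functools import reduce

# Numerical table of the running example (columns m1, m2, m3).
data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}

# delta(g): each value v is described by the degenerate interval [v, v].
delta = {g: tuple((v, v) for v in row) for g, row in data.items()}

def meet(c, d):
    """Assumed meet: per attribute, the smallest interval containing both operands."""
    return tuple((min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(c, d))

def subsumed(c, d):
    """c is subsumed by d exactly when meet(c, d) == c (d lies inside c)."""
    return meet(c, d) == c

def common_pattern(objects):
    """Meet of the descriptions of a non-empty set of objects."""
    return reduce(meet, (delta[g] for g in objects))

def image(pattern):
    """Objects whose description is subsumed by the pattern."""
    return {g for g in delta if subsumed(pattern, delta[g])}

pattern = common_pattern(["g1", "g2", "g5"])
print(pattern, sorted(image(pattern)))
```

Note that the order is reversed with respect to interval size: larger intervals are more general patterns and cover more objects.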
19. Contributions – Interval pattern structures
Interval pattern concept lattice
Lowest concepts: few objects, small intervals
Highest concepts: many objects, large intervals
An overwhelming number of concepts/patterns
20. Contributions – Interval pattern structures
Links with conceptual scaling
Interordinal scaling [Ganter & Wille]
A scale to encode intervals of attribute values
     m1 ≤ 4   m1 ≤ 5   m1 ≤ 6   m1 ≥ 4   m1 ≥ 5   m1 ≥ 6
4      ×        ×        ×        ×
5               ×        ×        ×        ×
6                        ×        ×        ×        ×
Equivalent concept lattice
Example
({g1, g2, g5}, {m1 ≤ 6, m1 ≥ 4, m1 ≥ 5, ... , ... })
({g1, g2, g5}, ⟨[5, 6], ... , ...⟩)
Why use pattern structures when we already have scaling?
Processing a pattern structure is more efficient
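The correspondence between the interordinal scale and interval patterns can be sketched for attribute m1 of the running example. The encoding of binary attributes as `(operator, value)` tuples and the function names are illustrative choices, not notation from the thesis.

```python
# Values of attribute m1 for g1..g5 in the running example.
m1 = {"g1": 5, "g2": 6, "g3": 4, "g4": 4, "g5": 5}
values = sorted(set(m1.values()))  # observed values: [4, 5, 6]

def scale_value(x):
    """Binary attributes 'm1 <= v' and 'm1 >= v' owned by the value x."""
    return ({("<=", v) for v in values if x <= v}
            | {(">=", v) for v in values if x >= v})

def shared_attributes(objects):
    """Intent of a set of objects in the scaled (binary) context."""
    return set.intersection(*(scale_value(m1[g]) for g in objects))

def to_interval(attrs):
    """Read the interval back from the tightest lower and upper bounds."""
    lower = max(v for op, v in attrs if op == ">=")
    upper = min(v for op, v in attrs if op == "<=")
    return (lower, upper)

shared = shared_attributes(["g1", "g2", "g5"])
print(sorted(shared), to_interval(shared))
```

The binary intent carries the same information as the interval, but with redundant attributes (here m1 ≥ 4 is implied by m1 ≥ 5).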
22. Contributions – Introducing similarity
Introducing a similarity relation
Grouping objects having similar values in the same concept?
A natural similarity relation on numbers
a ≃_θ b ⇔ |a − b| ≤ θ, e.g. 4 ≃_1 5 but 4 ≄_1 6
Similarity operator in pattern structures
[Diagram: semi-lattice of the intervals over the values 4, 5 and 6 (levels: 4, 5, 6; then [4,5], [5,6]; then [4,6]), shown in turn for θ = 2, θ = 1 and θ = 0]
How to consider a similarity relation w.r.t. a distance?
27. Contributions – Introducing similarity
Towards a similarity between values
Introduce an element ∗ ∈ (D, ⊓) denoting dissimilarity
c ⊓ d = ∗ iff c ≄_θ d
c ⊓ d ≠ ∗ iff c ≃_θ d
Example with θ = 1
m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 5 8 5
{g3, g4}□ = ⟨[4, 4], [8, 9], ∗⟩
⟨[4, 4], [8, 9], ∗⟩□ = {g3, g4}
({g3, g4}, ⟨[4, 4], [8, 9], ∗⟩) is a concept:
g3 and g4 have similar values for attributes m1 and m2 only
Is {g3, g4} maximal w.r.t. similarity? We can add g5...
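The example of this slide can be reproduced with a small sketch. It assumes that two intervals are similar when the smallest interval containing both has length at most θ, and that ∗ absorbs everything under the meet; the slide only states the "iff" conditions, so these choices, like the names `STAR` and `meet_component`, are assumptions.

```python
from functools import reduce

STAR = "*"   # stands for the dissimilarity element
THETA = 1    # threshold used in the slide's example

data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}
delta = {g: tuple((v, v) for v in row) for g, row in data.items()}

def meet_component(c, d, theta):
    """Smallest interval containing c and d, or STAR when it is longer than theta."""
    if c == STAR or d == STAR:
        return STAR
    lo, hi = min(c[0], d[0]), max(c[1], d[1])
    return (lo, hi) if hi - lo <= theta else STAR

def meet(p, q, theta=THETA):
    return tuple(meet_component(c, d, theta) for c, d in zip(p, q))

def common_pattern(objects, theta=THETA):
    """Meet of the descriptions of a non-empty set of objects."""
    return reduce(lambda p, q: meet(p, q, theta), (delta[g] for g in objects))

def image(pattern, theta=THETA):
    """Objects g such that meet(pattern, delta(g)) == pattern."""
    return {g for g, desc in delta.items() if meet(pattern, desc, theta) == pattern}

d = common_pattern(["g3", "g4"])
print(d, sorted(image(d)))
print(common_pattern(["g3", "g4", "g5"]))
```

The last line shows why {g3, g4} is not maximal with respect to similarity: adding g5 still gives similar values on m1 and m2.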
28. Contributions – Introducing similarity
Classes of tolerance in numerical data
Towards maximal sets of similar values
≃_θ is a tolerance relation: reflexive, symmetric, not necessarily transitive
Consider an attribute taking values in {6, 8, 11, 16, 17} and θ = 5
8 ≃_5 11, 11 ≃_5 16 but 8 ≄_5 16
A class of tolerance as a maximal set of pairwise similar values
{6, 8, 11}   {11, 16}   {16, 17}
[6, 11]   [11, 16]   [16, 17]
S. O. Kuznetsov
Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research.
In Formal Concept Analysis, Foundations and Applications, 2005.
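For numbers compared with |a − b| ≤ θ, the classes of tolerance are the maximal windows of sorted values whose spread is at most θ. The sliding-window sketch below relies on that observation; it is one possible way to compute them, not necessarily the procedure of the thesis, and `tolerance_classes` is an illustrative name.

```python
def tolerance_classes(values, theta):
    """Maximal sets of pairwise similar values, where a ~ b iff |a - b| <= theta."""
    ws = sorted(set(values))
    classes = []
    end = 0
    prev_end = -1  # right end of the window built for the previous start
    for start in range(len(ws)):
        end = max(end, start)
        # Extend the window to the right while its spread stays within theta.
        while end + 1 < len(ws) and ws[end + 1] - ws[start] <= theta:
            end += 1
        # The window is maximal only if it reaches further right than the
        # previous one; otherwise it could be extended to the left.
        if end > prev_end:
            classes.append(ws[start:end + 1])
        prev_end = end
    return classes

classes = tolerance_classes([6, 8, 11, 16, 17], theta=5)
intervals = [(c[0], c[-1]) for c in classes]
print(classes)
print(intervals)
```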
29. Contributions – Introducing similarity
Tolerance in pattern structures
Projecting the pattern structure
Each value is replaced by the interval characterizing its class of tolerance (if unique)
Each pattern d is projected with a mapping ψ(d) ⊑ d (pre-processing)
Example with θ = 1
m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 5 8 5
{g3, g4}□ = ψ(⟨[4, 4], [8, 9], ∗⟩)
= ⟨[4, 5], [8, 9], ∗⟩
⟨[4, 5], [8, 9], ∗⟩□ = {g3, g4, g5}
30. Contributions – Introducing similarity
Biological results
An extracted pattern among 2,150 others
Genes present a high expression level in the fruiting-body situations
Some of these genes encode metabolic enzymes involved in the remobilization of fungal resources towards the new organ in development
Other genes are unknown but specific to Laccaria bicolor: studying them requires biological experiments
31. Contributions – Introducing similarity
Relevant publications
Interval pattern structures and GED analysis
M. Kaytoue, S. Duplessis, S. O. Kuznetsov, and A. Napoli
Two FCA-Based Methods for Mining Gene Expression Data.
In International Conference on Formal Concept Analysis (ICFCA), 2009.
M. Kaytoue, S. O. Kuznetsov, A. Napoli and S. Duplessis
Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis.
In Information Sciences. Spec. Iss.: Lattices (Elsevier), 2011.
Introducing tolerance relations and information fusion
M. Kaytoue, Z. Assaghir, N. Messai and A. Napoli
Two Complementary Classification Methods for Designing a Concept Lattice from Interval Data.
In Foundations of Information and Knowledge Systems, 6th International Symposium (FoIKS), 2010.
M. Kaytoue, Z. Assaghir, A. Napoli and S. O. Kuznetsov
Embedding Tolerance Relations in Formal Concept Analysis: an Application in Information Fusion.
In ACM Conference on Information and Knowledge Management (CIKM), 2010.
32. Contributions – Other works
Pattern structures are useful for several tasks
Bi-clustering and tolerance relations
M. Kaytoue, S. O. Kuznetsov, and A. Napoli
Biclustering Numerical Data in Formal Concept Analysis.
In International Conference on Formal Concept Analysis (ICFCA), 2011.
Information fusion: enhancing decision making
Z. Assaghir, M. Kaytoue, A. Napoli and H. Prade
Managing Information Fusion with Formal Concept Analysis.
In Modeling Decisions for Artificial Intelligence, 6th International Conference (MDAI), 2010.
KDD: a study of equivalence classes of interval patterns
M. Kaytoue, S. O. Kuznetsov, and A. Napoli
Revisiting Numerical Pattern Mining with Formal Concept Analysis.
In International Joint Conference on Artificial Intelligence (IJCAI), 2011.
34. Contributions – A KDD-oriented discussion
Interval pattern search space
Counting all possible interval patterns
⟨[a_m1, b_m1], [a_m2, b_m2], ...⟩
where a_mi, b_mi ∈ W_mi, the set of values taken by attribute mi, and a_mi ≤ b_mi
m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 5 8 5
∏_{i ∈ {1,...,|M|}} |W_mi| × (|W_mi| + 1) / 2
360 possible interval patterns in our small example
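The count can be checked with a few lines of code, reading W_mi as the set of distinct values of attribute mi in the table (an interpretation that is consistent with the figure of 360 given on the slide). The function name is illustrative.

```python
from math import prod

# Rows of the running example (columns m1, m2, m3).
rows = [
    (5, 7, 6),
    (6, 8, 4),
    (4, 8, 5),
    (4, 9, 8),
    (5, 8, 5),
]

def count_interval_patterns(table):
    """Product over attributes of |W| * (|W| + 1) / 2, W being the distinct values."""
    sizes = [len(set(column)) for column in zip(*table)]
    return prod(w * (w + 1) // 2 for w in sizes)

# |W_m1| = 3, |W_m2| = 3, |W_m3| = 4, hence 6 * 6 * 10 patterns.
total = count_interval_patterns(rows)
print(total)
```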
37. Contributions – A KDD-oriented discussion
Semantics for interval patterns
Interval patterns as (hyper) rectangles
m1 m3
g1 5 6
g2 6 4
g3 4 5
g4 4 8
g5 5 5
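⟨[4, 5], [5, 6]⟩□ = {g1, g3, g5}
⟨[4, 5], [5, 7]⟩□ = {g1, g3, g5}
[Figure: the descriptions δ(g1), ..., δ(g5) plotted as points in the plane, with m1 on the horizontal axis (3 to 6) and m3 on the vertical axis (3 to 8); interval patterns are drawn as rectangles]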
42. Contributions – A KDD-oriented discussion
A condensed representation
Equivalence classes of interval patterns
Two interval patterns with the same image are said to be equivalent
c ≅ d ⇐⇒ c□ = d□
Equivalence class of a pattern d
[d] = {c | c ≅ d}
with a unique closed pattern: the smallest rectangle
and one or several generators: the largest rectangles
Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal.
Mining frequent patterns with counting inference.
SIGKDD Expl., 2(2):66–75, 2000.
In our example: 360 patterns; 18 closed; 44 generators
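On the small example, the equivalence classes can be enumerated by brute force: build all interval patterns, group them by image, and take the bounding box of each image as its closed pattern. This sketch is illustrative only (it does not scale and is not one of the thesis algorithms); it assumes that the class of patterns with an empty image is left out of the count, and it does not compute generators.

```python
from itertools import product

data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}

def intervals(values):
    """All intervals [a, b] with a <= b and bounds among the observed values."""
    ws = sorted(set(values))
    return [(a, b) for i, a in enumerate(ws) for b in ws[i:]]

def image(pattern):
    """Objects whose values all fall inside the pattern's intervals."""
    return frozenset(
        g for g, row in data.items()
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, pattern))
    )

def closure(objects):
    """Closed pattern of a non-empty image: its bounding box (smallest rectangle)."""
    columns = zip(*(data[g] for g in objects))
    return tuple((min(col), max(col)) for col in columns)

columns = list(zip(*data.values()))
patterns = list(product(*(intervals(col) for col in columns)))

# Group patterns by image: each group is one equivalence class.
classes = {}
for p in patterns:
    classes.setdefault(image(p), []).append(p)

# Assumption: the class of patterns with an empty image is not counted.
closed = {closure(ext) for ext in classes if ext}
print(len(patterns), len(closed))
```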
43. Contributions – A KDD-oriented discussion
Algorithms & experiments
Algorithms: MinIntChange, MinIntChangeG[t|h]
4 5 6
[4,5] [5,6]
[4,6]
Experiments
Mining several datasets from Bilkent University Repository
Compression rate varies between 10^7 and 10^9
Interordinal scaling: encodes 30,000 binary patterns
not efficient even with best algorithms (e.g. LCMv2)
redundancy problem discarding its use for generator extraction
45. Contributions – A KDD-oriented discussion
Discussion
Advantages
Minimum description length principle favours generators
Potential applications
Data privacy and k-anonymisation
k-box problem in computational geometry
Quantitative association rule mining
Data summarization
Problem
With very large datasets, compression is not enough
Numerical data are noisy
Need for fault-tolerant condensed representations
47. Conclusion and perspectives
Conclusion
A new insight into the mining of numerical data
Our main tools...
Formal Concept Analysis and conceptual scaling
Pattern structures and projections
Tolerance relation
... for numerical data mining
Conceptual representations of numerical data
Bi-clustering
Information fusion
Applications: GED analysis and agricultural practice assessment
48. Conclusion and perspectives
Conclusion
An application in GED analysis
With FCA and pattern structures
Many ways of extracting patterns in GED
Biological validation of several patterns
We now need a systematic validation step using new knowledge
transcription factors
biological knowledge base, e.g. Gene Ontology
49. Conclusion and perspectives
To be continued...
Short and mid term
Handle other types of biclusters and algorithm comparison
S. C. Madeira and A. L. Oliveira
Biclustering Algorithms for Biological Data Analysis: a survey.
In IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004.
Insert domain knowledge for biological data
Study threshold θ effect w.r.t. the number of tolerance classes
Post-doctoral position
Biclustering (multi-dimensional) numerical data
Numerical pattern based classifier and association rules
Data privacy and pattern projection
Wagner Meira Jr. (Universidade Federal de Minas Gerais, Brazil)
50. Conclusion and perspectives
Cross-domain fertilization
Itemset-mining in KDD
Other frameworks for closed patterns
H. Arimura and T. Uno
Polynomial-Delay and Polynomial-Space Algorithms for Mining Closed Sequences, Graphs, and Pictures in Accessible Set Systems.
In SIAM International Conference on Data Mining, 2009.
G.C. Garriga
Formal Methods for Mining Structured Objects.
PhD Thesis, Universitat Politècnica de Catalunya, 2006.
Condensed representations and fault-tolerant patterns
m1 m2 m3
g1 5 7 6
g2 6 8 4
g3 4 8 5
g4 4 9 8
g5 15 8 5
R. Pensa and J.-F. Boulicaut
Towards Fault-Tolerant Formal Concept Analysis.
In Proc. 9th Congress of the Italian Association for Artificial Intelligence (AI*IA), Springer, 2005.
51. Conclusion and perspectives
Cross-domain fertilization
Data-analysis
Symbolic data analysis and distances
P. Agarwal, M. Kaytoue, S. O. Kuznetsov, A. Napoli and G. Polaillon
Symbolic Galois Lattices with Pattern Structures.
In International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC), 2011.
Information fusion and fuzzy concept analysis
Fuzzy settings and possibility theory
Z. Assaghir, M. Kaytoue, and H. Prade
A Possibility Theory Oriented Discussion of Conceptual Pattern Structures.
In Scalable Uncertainty Management, 4th International Conference (SUM), 2010.