GraphClust: alignment-free structural clustering of
local RNA secondary structures.
Yakovlev Pavel
St-Petersburg Academic University
March 16, 2013
Yakovlev Pavel Graph Clust 1/33
Abstract
Non-protein coding RNAs (ncRNAs)
have become a key research focus in
molecular biology, fundamentally
challenging established paradigms and
rapidly transforming our perception of
genome complexity. There is mounting
evidence that the eukaryotic genome is
pervasively transcribed and that ncRNAs
function at various levels regulating gene
expression and cell biology.
Yakovlev Pavel Graph Clust 2/33
Motivation
Clustering algorithms are applied to store, compare and predict
ncRNAs, because:
in contrast to protein-coding genes, ncRNAs belong to a
diverse array of classes with vastly different structures,
functions, and evolutionary patterns
ncRNAs can be divided into RNA families according to
inherent functional, structural, or compositional similarities
Yakovlev Pavel Graph Clust 3/33
Motivation
Part II
Question:
So, what is the problem with «old» alignment-based algorithms?
Yakovlev Pavel Graph Clust 4/33
Motivation
Part II
Answer:
They are awful and do not work.
Why?
Multiple Sequence Alignment (MSA) is an NP-complete
problem.
Approximation algorithms have very high computational
complexity too (dynamic programming):
Sankoff Algorithm (1985) - O(n^4)
Carrillo-Lipman Algorithm (1989)
Thus, tools based on these algorithms are limited to
relatively small datasets.
LocARNA (2007)
FoldAlign (2007)
Yakovlev Pavel Graph Clust 5/33
Smart quote
Gorodkin: ‘Even using substantially more sophisticated techniques,
genome-scale ncRNA analyses often consume tens to hundreds of
computer years. These high computational costs are one reason
why ncRNA gene finding is still in its infancy.’1
1 Gorodkin, J. et al. (2010) De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol., 28, 9–19.
Yakovlev Pavel Graph Clust 6/33
Solutions
Postulate:
ncRNAs can be also grouped by RNA classes whose members, in
contrast with RNA families, have no discernible homology at the
sequence level, but still share common structural and functional
properties.
Two main classes of clustering approaches:
1 The first class uses simplifications in the representation of structures:
these methods depend heavily on the correctness of the structures
computational prediction of secondary structures is known to be error prone
2 The second class uses sequence information as prior knowledge to speed
up the computation:
sequences are first clustered by sequence alignment
family assignments of structured RNAs obtained from sequence alignments
at pairwise sequence identities below 60% are often wrong
Yakovlev Pavel Graph Clust 7/33
GraphClust
Alignment-free solution
GraphClust (University of Freiburg, Germany):
comparing and clustering RNAs according to both sequences
and structures
alignment-free
linear-time
scales to datasets of hundreds of thousands of sequences
extends a fast graph kernel technique designed for
chemoinformatics
ready-to-use pipeline that is freely available on request
Yakovlev Pavel Graph Clust 8/33
GraphClust
Approach
Approach pipeline:
1 Try a small number of probable, but sufficiently different,
structures for each RNA sequence.
2 Encode each structure as a labeled graph preserving all
information about the nucleotide type and the bond type; the
sequence’s structure is thus represented as a graph with several
disconnected components.
3 Compute the similarity between the representative graphs
using a [chemoinformatics] graph kernel.
4 Evaluate each neighborhood and select as candidate clusters
those that contain very similar elements.
5 Dataset scanning: associate missing sequences to clusters
and remove them from the set.
6 Reiterate the procedure (a minimal loop sketch follows).
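To make the flow of steps 4-6 (phases 4-8 later in the talk) concrete, here is a minimal, hypothetical Python sketch of the iterative loop; it is not the authors' code, and find_candidates, build_model and scan are placeholder callables for the steps described on the following slides.

def graphclust_loop(sequences, find_candidates, build_model, scan, max_iterations=10):
    # Iterative clustering sketch: find candidate clusters, model and scan them,
    # remove the assigned sequences, and repeat until nothing is left.
    remaining, clusters = set(sequences), []
    for _ in range(max_iterations):
        if not remaining:
            break  # termination: empty dataset
        for candidate in find_candidates(remaining):
            members = scan(build_model(candidate), remaining)
            if members:
                clusters.append(members)
                remaining -= members
    return clusters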
Yakovlev Pavel Graph Clust 9/33
GraphClust
Questions
Obscure:
How are structures encoded as graphs?
How can we compare graphs or graph substructures?
Why don't we need a quadratic number of graph comparisons?
Yakovlev Pavel Graph Clust 10/33
Answer 1
Graph encoding for structures
Minimum free energy-based secondary structure prediction has been
shown to be error prone (Gardner et al., 2005). So we want to use a
representation of the entire ensemble of low-energy conformations.
Problem: Resorting to a complete enumeration of near-optimal
structures would yield a tremendously large number of structures.
Solution: We opt for abstract shape analysis methods.
Yakovlev Pavel Graph Clust 11/33
Answer 1
Graph encoding for structures
We consider subsequences
of our ncRNA sequence as
overlapping windows.
Procedure depends on two
parameters:
W - window sizes (30 and 150
nt in tests)
O - overlap parameter for
window start positions
Then we take the l most
representative structures
of each subsequence and
encode them as
disconnected components
(a window-decomposition sketch follows).
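A minimal sketch of the window decomposition; treating O as the fraction of a window shared with its successor is an assumption of this sketch.

def overlapping_windows(seq, sizes=(30, 150), overlap=0.2):
    # For each window size W, slide a window whose start positions are spaced by
    # (1 - overlap) * W, so consecutive windows share about `overlap` of their length.
    result = []
    for W in sizes:
        step = max(1, int(W * (1 - overlap)))
        for start in range(0, max(len(seq) - W, 0) + 1, step):
            result.append((start, seq[start:start + W]))
    return result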
Yakovlev Pavel Graph Clust 12/33
Answer 2
Graph comparisons
But how to compare these graphs?
Graph similarity is the central problem for all learning tasks such as
clustering and classification on graphs.
Subgraph isomorphism is an NP-complete problem.
Yakovlev Pavel Graph Clust 13/33
Answer 2
Graph kernel
Graph kernels:
Compare substructures of graphs that are computable in
polynomial time
Examples: walks, paths, cyclic patterns, trees
Good graph kernel criteria:
Expressive
Efficient to compute
Positive definite
Applicable to a wide range of graphs
Yakovlev Pavel Graph Clust 14/33
Answer 2
Graph kernel
NSPDK - Neighborhood Subgraph Pairwise Distance Kernel (Costa
and De Grave, 2010)
Key features:
very fast kernel
suitable for large datasets
works with sparse graphs with discrete vertex and edge labels
works with special pairs of subgraphs, called «neighborhood»
subgraphs
instance of decomposition kernel (Haussler, 1999)
Yakovlev Pavel Graph Clust 15/33
Answer 2
Decomposition kernel
Definitions:
$G(V, E)$ - a graph, $r \ge 0$. Then $N_r^v(G)$ - the neighborhood subgraph of $G$
rooted in $v$ and induced by the set of vertices within distance at most $r$ from $v$.
$R_{r,d}$ - the neighborhood-pair relation: $R_{r,d}$ indicates that the distance between
the roots of two neighborhood subgraphs of radius $r$ is exactly $d$.
The decomposition kernel is
$$k_{r,d}(G, G') = \sum_{(A,B)\,\in\,R_{r,d}^{-1}(G)} \;\sum_{(A',B')\,\in\,R_{r,d}^{-1}(G')} \mathbf{1}(A \cong A')\,\mathbf{1}(B \cong B'),$$
where $R_{r,d}^{-1}(G)$ denotes all pairs of neighborhood subgraphs of radius $r$
whose roots are at distance $d$ (a brute-force sketch follows).
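The definition can be checked directly on small graphs. Below is a brute-force sketch using networkx (assumed available); it expects vertex and edge labels in a "label" attribute, which is an assumption of this sketch, and the quadratic enumeration is only for illustration, not the fast hashed computation NSPDK actually uses.

import networkx as nx
from networkx.algorithms.isomorphism import categorical_node_match, categorical_edge_match

def neighborhood_pairs(G, r, d):
    # All pairs (A, B) of radius-r neighborhood subgraphs whose roots are at distance d.
    dist = dict(nx.all_pairs_shortest_path_length(G))
    pairs = []
    for u in G:
        for v in G:
            if dist.get(u, {}).get(v) == d:
                pairs.append((nx.ego_graph(G, u, radius=r), nx.ego_graph(G, v, radius=r)))
    return pairs

def k_rd(G1, G2, r, d):
    # Decomposition kernel k_{r,d}: count label-preserving isomorphic pairs of pairs.
    node_match = categorical_node_match("label", None)
    edge_match = categorical_edge_match("label", None)
    total = 0
    for A, B in neighborhood_pairs(G1, r, d):
        for A2, B2 in neighborhood_pairs(G2, r, d):
            if (nx.is_isomorphic(A, A2, node_match=node_match, edge_match=edge_match)
                    and nx.is_isomorphic(B, B2, node_match=node_match, edge_match=edge_match)):
                total += 1
    return total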
Yakovlev Pavel Graph Clust 16/33
Answer 2
NSPDK
Then the NSPDK (non-normalized form) is:
$$K(G, G') = \sum_{r}\sum_{d} k_{r,d}(G, G')$$
For efficiency reasons, the "upper bound" form is used:
$$K_{r^*,d^*}(G, G') = \sum_{r=0}^{r^*}\sum_{d=0}^{d^*} k_{r,d}(G, G')$$
Each term is normalized, as usual for decomposition kernels:
$$\hat{k}_{r,d}(G, G') = \frac{k_{r,d}(G, G')}{\sqrt{k_{r,d}(G, G)\,k_{r,d}(G', G')}}$$
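As a tiny illustration, the normalization is the usual cosine-style normalization and can be written directly on top of any k_{r,d} implementation (such as the sketch on the previous slide):

import math

def normalized_k_rd(k_rd, G1, G2, r, d):
    # Divide by the geometric mean of the self-similarities; guard against empty relations.
    denom = math.sqrt(k_rd(G1, G1, r, d) * k_rd(G2, G2, r, d))
    return k_rd(G1, G2, r, d) / denom if denom else 0.0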
Yakovlev Pavel Graph Clust 17/33
Answer 2
NSPDK
Full isomorphism test is
computationally expensive.
Reducing the task:
1 String encoding: uses a technique
based on the distance information
between pairs of vertices and can
be computed in O(|V|).
2 Hashing procedure: encodes the
string into an integer code (Costa
and De Grave, 2010).
3 Isomorphism test: an equality test
between their integer codes (a simplified sketch follows).
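A simplified stand-in for this idea, not the exact encoding of Costa and De Grave (2010): build a string from root-distance and label information, hash it to an integer, and treat two neighborhood subgraphs as isomorphic when their codes are equal.

import hashlib
import networkx as nx

def subgraph_code(G, root):
    # Approximate canonical code: sort vertices by (distance from root, vertex label),
    # serialize, and hash down to a 32-bit integer.
    dist = nx.single_source_shortest_path_length(G, root)
    tokens = sorted((d, str(G.nodes[v].get("label", ""))) for v, d in dist.items())
    text = ";".join(f"{d}:{label}" for d, label in tokens)
    return int(hashlib.md5(text.encode()).hexdigest(), 16) & 0xFFFFFFFF

def approx_isomorphic(G1, root1, G2, root2):
    # Equality of integer codes replaces the expensive exact test; hash collisions
    # and the lossy encoding mean this is only an approximation.
    return subgraph_code(G1, root1) == subgraph_code(G2, root2)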
Yakovlev Pavel Graph Clust 18/33
Answer 3
Explicit feature representation
NSPDK limits the number of features to O(r* d* |V|^2) pairs of
neighborhood subgraphs (one feature for each pair of vertices times
each possible combination of values for the radius and the distance).
r* ∈ [0, 5], d* ∈ [0, 10]
For sparse graphs, the number of vertices reachable within a
fixed small distance is typically small ⇒ the dependency on the vertex
set size can be more tightly approximated by O(|V|).
Conclusion: each graph is mapped into a sparse vector in R^m, a
high-dimensional feature space. The number of non-zero features in
this space is linear in |V| (a sparse dot-product sketch follows).
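With this explicit representation, the kernel value between two graphs reduces to a sparse dot product. A minimal sketch, assuming each graph has already been mapped to a dict from feature id to count:

def sparse_dot(phi_g, phi_h):
    # Kernel value as a dot product of two sparse feature maps {feature_id: count};
    # iterate over the smaller map for efficiency.
    if len(phi_g) > len(phi_h):
        phi_g, phi_h = phi_h, phi_g
    return sum(count * phi_h.get(feature, 0) for feature, count in phi_g.items())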
Yakovlev Pavel Graph Clust 19/33
Alignment-free clustering
MinHash
Some definitions:
P - instances [graphs] set
x, z, p - any instance
xj - element of instance
Nk (x) - the set of the k-closest instances.
So, how do we cluster graphs using NSPDK? Comparing all possible pairs of
graphs is too expensive, so we compare only similar graphs.
$$D_k(x) = \frac{1}{k} \sum_{z \in N_k(x)} K_{r^*,d^*}(x, z)$$
The k-neighborhood density is a good metric for clustering (a small sketch follows).
To make this as fast as possible, we want to compute $N_k(x)$ very quickly - ideally
in constant time.
The MinHash algorithm helps with this task.
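A minimal sketch of this density, assuming a precomputed set of candidate neighbors (e.g. from the MinHash index on the next slides) and any kernel with the signature kernel(x, z):

def k_neighborhood_density(x, candidate_neighbors, kernel, k):
    # D_k(x): average kernel similarity between x and its k most similar candidates.
    nearest = sorted(candidate_neighbors, key=lambda z: kernel(x, z), reverse=True)[:k]
    return sum(kernel(x, z) for z in nearest) / k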
Yakovlev Pavel Graph Clust 20/33
Alignment-free clustering
MinHash
MinHash algorithm:
1 Binarize all instances $P = \{p_1, \dots, p_n\}$ from $\mathbb{R}^m$ to $\{0, 1\}^r$.
2 Choose random hash functions $f_i : \mathbb{N} \to \mathbb{N}$ such that
$\forall x_j \ne x_k,\; f_i(x_j) \ne f_i(x_k)$ and $\forall x_j \ne x_k,\; P(f_i(x_j) \le f_i(x_k)) = 1/2$.
3 Define the hash functions $h_i(x) = \operatorname{argmin}_{x_j \in x} f_i(x_j)$.¹ They return the first
feature index under a random permutation of the feature order.
Useful facts:
$P(h_i(x) = h_i(z)) = \frac{|x \cap z|}{|x \cup z|} = s(x, z)$ - the Jaccard similarity, i.e. the fraction of
features common to $x$ and $z$ over the total number of non-null features of $x$ and $z$.
Error rate $\gamma = 1/N^2$, where $N$ is the number of hash functions.
4 Build the tuple $\langle h_1(x), \dots, h_N(x) \rangle$.
5 Collect the set of instances for $h^* = h_i(x)$ as $Z_i(h^*) = \{z \in P : h_i(z) = h^*\}$.
6 $Z = \{Z_i\}_{i=1}^{N}$ - the approximate neighborhood, which can be used to find $N_k(x)$.
These steps can be performed in constant time ($N$ is fixed, $Z$ is independent of
the dataset size $|P|$).
1 These hash functions are called min-hash functions (Broder, 1997).
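Steps 5-6 amount to an inverted index over min-hash values. A minimal sketch, assuming instance signatures have already been computed (e.g. with the demo on the next slide):

from collections import defaultdict

def build_index(signatures):
    # signatures: {instance_name: [h_1, ..., h_N]}. For every hash function i,
    # bucket the instances by their i-th min-hash value (the sets Z_i(h*)).
    num_funcs = len(next(iter(signatures.values())))
    index = [defaultdict(set) for _ in range(num_funcs)]
    for name, sig in signatures.items():
        for i, h in enumerate(sig):
            index[i][h].add(name)
    return index

def approximate_neighborhood(sig, index):
    # Union of the buckets the query falls into: the candidate set used to find N_k(x).
    neighbors = set()
    for i, h in enumerate(sig):
        neighbors |= index[i].get(h, set())
    return neighbors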
Yakovlev Pavel Graph Clust 21/33
MinHash
Demo Time
import functools
import math
import random

def generic_hash(s, seed):
    # Simple multiplicative string hash, truncated to 32 bits.
    result = 1
    for char in s:
        result = int(seed * result + ord(char)) & 0xFFFFFFFF
    return result

def hashgen(size):
    # Fix a random seed to obtain one member of the hash-function family.
    seed = math.floor(random.random() * size) + 32
    return functools.partial(generic_hash, seed=seed)

def signature(items, functions):
    # Min-hash signature: the minimum hash value of the item set under each function.
    return [min(map(f, items)) for f in functions]

def similarity(sig_a, sig_b, num_funcs):
    # Fraction of matching signature positions estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_funcs

def minhash(max_error):
    # Use N = 1 / max_error^2 hash functions; return (signature, similarity) closures.
    num_funcs = int(1 / max_error ** 2)
    functions = [hashgen(i) for i in range(num_funcs)]
    return [functools.partial(signature, functions=functions),
            functools.partial(similarity, num_funcs=num_funcs)]
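A hypothetical usage of the demo on two small feature sets; the toy hash generator above can produce duplicate seeds, so the estimate is noisy.

make_signature, estimate_similarity = minhash(0.1)   # roughly 100 hash functions
a = {"f1", "f2", "f3", "f7"}
b = {"f2", "f3", "f7", "f9"}
print(estimate_similarity(make_signature(a), make_signature(b)))
# approximates the Jaccard similarity |a ∩ b| / |a ∪ b| = 3/5 = 0.6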
Yakovlev Pavel Graph Clust 22/33
GraphClust pipeline
Phase 1
Phase 1. Preprocessing
Get RNA sequences from RNA-seq or computational methods,
mask all repeats to avoid clusters made of genomic repeats.
Split long sequences into smaller fragments (windows)1.
Near-identical sequences are removed using blastclust2.
1
to enable detection of small signals
2
The program begins with pairwise matches and places a sequence in a
cluster if the sequence matches at least one sequence already in the cluster.
Megablast algorithm is used.
Yakovlev Pavel Graph Clust 23/33
GraphClust pipeline
Phase 2
Phase 2. Structure determination
Try a set of structures for each window with the RNAshapes tool.
A 150 nt sequence is processed into a set of disconnected graphs with
≈ 2500 total vertices.
O = 20%; W = 30, 150; l = 3
Yakovlev Pavel Graph Clust 24/33
GraphClust pipeline
Phase 3
Phase 3. Parallel encoding
Encode the set of structures into sparse vectors by computing $K_{r^*,d^*}$.
A 150 nt sequence is processed into a sparse vector with ≈ 8000
features.
r∗ = 2; d∗ = 4
Yakovlev Pavel Graph Clust 25/33
GraphClust pipeline
Phase 4
Phase 4. Candidate cluster
$c_i$ - candidate clusters, sorted so that $\forall i < j,\; D(c_i) > D(c_j)$
Building the union of candidate clusters:
$$C_j = \bigcup_{i=1}^{j} c_i,$$
greedily discarding $c_k$ if $|C_j \cap c_k| >$ threshold (a selection sketch follows).
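A minimal sketch of this greedy selection, with candidate clusters represented as sets of sequence ids and density a callable returning D(c_i) (assumed precomputed):

def select_candidates(candidates, density, overlap_threshold):
    # Rank candidate clusters by decreasing density D(c_i); greedily discard a
    # candidate c_k if it overlaps the union of already accepted clusters too much.
    accepted, union = [], set()
    for c in sorted(candidates, key=density, reverse=True):
        if len(union & c) > overlap_threshold:
            continue
        accepted.append(c)
        union |= c
    return accepted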
Yakovlev Pavel Graph Clust 26/33
GraphClust pipeline
Phase 5
Phase 5. Parallel cluster refinement
Candidate clusters contain sequences that are similar under the NSPDK
measure.
Each cluster is refined using the sequence-structure alignment tool
LocARNA (Will et al., 2007). Only the top-ranked RNAs in each cluster
are kept.
Yakovlev Pavel Graph Clust 27/33
GraphClust pipeline
Phase 6
Phase 6. Parallel cluster model
The selected top-ranked RNA candidates are realigned with
LocARNA-P (Will et al., 2012) to identify a highly trusted
alignment region and estimate the borders of the common local
motif.
Finally, we create a covariance model (CM) by applying the INFERNAL
tool (Nawrocki et al., 2009) to the identified subsequence. Every
cluster member gets its CM bitscore1.
1 A log-scaled version of the score.
Yakovlev Pavel Graph Clust 28/33
GraphClust pipeline
Phase 7
Phase 7. Model scanning
The covariance model of each cluster is used to find additional cluster
members in the full residual dataset of sequences by CM bitscore or
E-value1.
Some sequences can be assigned to multiple clusters.
1 In the authors' terms, the number of instances with a score equal to or
better than that of the current instance.
Yakovlev Pavel Graph Clust 29/33
GraphClust pipeline
Phase 8
Phase 8. Iteration and removal
All cluster members found in the previous phase are removed from
the dataset. A new iteration starts from Phase 4.
The termination conditions are:
predetermined max number of iterations
time limit
empty dataset
Yakovlev Pavel Graph Clust 30/33
GraphClust pipeline
Phase 9
Phase 9. Postprocessing
Merge clusters
For every pair of clusters, we compute the relative overlap and
merge them if it is more than 50% (a merging sketch follows).
In-cluster postprocessing
Sequences in a cluster are finally ranked by their CM bitscore.
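A minimal sketch of the merging step; taking "relative overlap" as shared members over the smaller cluster is an assumption of this sketch.

def merge_clusters(clusters, min_overlap=0.5):
    # Repeatedly merge any two clusters whose relative overlap exceeds min_overlap.
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if len(a & b) / min(len(a), len(b)) > min_overlap:
                    merged[i], changed = a | b, True
                    del merged[j]
                    break
            if changed:
                break
    return merged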
Yakovlev Pavel Graph Clust 31/33
Results
Species | Type | Method | Input | Time | Clusters
Benchmark
Bacteria | Small ncRNAs | Misc | 363 | 6.8 h | 39
Human | Predicted RNA elements | EvoFam | 699 | 0.3 h | 37
Misc | Small ncRNAs | Rfam | 3 900 | 36 h | 130
De novo discovery
Fugu | LincRNAs | RNA-seq | 5 877 | 10.3 h | 99
Fugu | Predicted RNA elements | RNAz | 11 287 | 13.3 h | 97
Fruit fly | Predicted RNA elements | RNAz | 17 765 | 20.4 h | 95
Human | LincRNAs | RNA-seq | 31 418 | 3.6 d | 95
Human | Predicted RNA elements | EvoFold | 37 258 | 5.7 d | 117
Human | 3’UTRs | RefSeq | 118 514 | 12.8 d | 106
Total | | | 227 081 | 25.7 d | 815
Clustering 3901 sequences using LocARNA takes ∼ 370 days.
Yakovlev Pavel Graph Clust 32/33
Q&A
Thank you for your attention!
Q&A
Yakovlev Pavel Graph Clust 33/33