2. Motivation
• Label the unlabeled DNA sequences using a model
built from the labeled DNA sequences, and gain
experience with real-world Machine Learning
problems.
3. Approaches
• K-mer based
Fixed length K-mer
K-mer with Mismatches
Using Regular Expression
• PWM based
MEME and MAST
• Combined Model
Unite both models
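As an illustration of the "K-mer with Mismatches" idea, here is a minimal sketch that counts the windows of a sequence lying within a given Hamming distance of a K-mer. The function names, the `max_mismatches` parameter and the toy sequence are assumptions made for this example; the slides do not say how mismatches were actually handled.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_with_mismatches(sequence, kmer, max_mismatches=1):
    """Count the windows of `sequence` within `max_mismatches` of `kmer`."""
    k = len(kmer)
    return sum(
        hamming(sequence[i:i + k], kmer) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


# Toy sequence (hypothetical, not taken from the TF data sets).
seq = "ACGTACGAACGT"
exact = count_with_mismatches(seq, "ACGT", max_mismatches=0)  # fixed-length K-mer count
fuzzy = count_with_mismatches(seq, "ACGT", max_mismatches=1)  # also accepts ACGA
```

With `max_mismatches=0` this reduces to the plain fixed-length K-mer count, so one function covers the first two K-mer variants in the list.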
4. K-mer Approach Based on Regular Expression
Motivation
2-mers appear most frequently in the sequences, so the
emphasis is mainly on 2-mers.
Strategy
- For any two 2-mers X and Y, generate the regular expressions
X(.*)Y and Y(.*)X.
- Use these regular expressions as candidate attributes.
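The strategy above can be sketched as follows. This is a minimal illustration, assuming that X and Y are distinct 2-mers and that each attribute is a binary flag (pattern found or not); the slides specify neither detail, and the function names are hypothetical.

```python
import re
from itertools import permutations


def gap_patterns(two_mers):
    """Return the regex X(.*)Y for every ordered pair of distinct 2-mers.

    Ordered pairs give both X(.*)Y and Y(.*)X for each unordered pair.
    """
    return [f"{x}(.*){y}" for x, y in permutations(two_mers, 2)]


def regex_attributes(sequence, patterns):
    """Binary attribute vector: 1 if the pattern occurs in the sequence, else 0."""
    return [1 if re.search(p, sequence) else 0 for p in patterns]


patterns = gap_patterns(["AC", "GT"])         # ["AC(.*)GT", "GT(.*)AC"]
features = regex_attributes("TTACGGGTAA", patterns)
```

In the toy sequence, AC is followed later by GT but never the reverse, so only the first attribute fires.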
5. Classifier Selection
Fig : 9 classifiers applied to TF data sets
Algorithms are numbered as follows -
(1) Logistic (2) SMO (3) NaiveBayes (4) BayesianLogisticRegression (5) KStar (6) Bagging
(7) LogitBoost (8) RandomForest (9) J48
Summary -
* 9 classifiers were applied to 10 data sets; 3 of the data sets are shown
* choosing a single best classifier is not a trivial task
* the same classifier behaves differently on different data sets
6. Change in Accuracy due to Different Classifiers
Fig : The performance of different types of classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_3 data set
Fig : The performance of different types of classifiers (Logistic, J48, RandomForest, NaiveBayes) on the TF_5 data set
Summary -
* the choice of classifier has a large effect on accuracy
* classifiers must be chosen with care
7. Change in Accuracy due to Different K-mer Length
Fig : The performance of different K-mer lengths (4-mer, 5-mer, 6-mer) on the TF_3 data set
Summary -
* K-mer length also affects accuracy
* finding the single best length is not trivial
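One reason the length matters is that the attribute space grows as 4^K over the DNA alphabet, while each sequence contains only a limited number of windows. The sketch below counts K-mers for the three lengths in the figure; the toy sequence and the names `kmer_counts` and `summary` are assumptions for illustration only.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping window of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence (hypothetical, not taken from the TF data sets).
seq = "ACGTACGTAC"

# For each K: (number of possible K-mers, number of distinct K-mers observed).
summary = {k: (4 ** k, len(kmer_counts(seq, k))) for k in (4, 5, 6)}
```

The first number in each pair (256, 1024, 4096) is the size of the full attribute space; the second shows how sparsely a single short sequence covers it.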
8. Attribute Space Selection
Fig : The performance of different numbers of selected K-mers on the TF_4 data set
Summary -
* the number of attributes considered also affects accuracy
* accuracy increases as more attributes are considered, but beyond a
saturation point it decreases.
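One way to vary the number of attributes is to rank candidate K-mers and keep only the top n. The sketch below is an assumption, not the method used in this work: it scores each K-mer by the absolute difference between the fraction of positive and negative sequences that contain it. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Set of distinct K-mers present in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def top_attributes(positives, negatives, k, n):
    """Keep the n K-mers whose presence differs most between the two classes."""
    pos = Counter(m for s in positives for m in kmers(s, k))
    neg = Counter(m for s in negatives for m in kmers(s, k))

    def score(m):
        # Difference in the fraction of sequences containing the K-mer.
        return abs(pos[m] / len(positives) - neg[m] / len(negatives))

    # Sort by descending score; break ties alphabetically for a stable order.
    ranked = sorted(set(pos) | set(neg), key=lambda m: (-score(m), m))
    return ranked[:n]


positives = ["ACGTAC", "TTACGT"]   # toy labeled sequences (hypothetical)
negatives = ["GGGCCC", "CCCGGG"]
selected = top_attributes(positives, negatives, k=3, n=2)
```

Sweeping n and re-evaluating the classifier at each value is how a saturation point like the one in the figure would show up.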
9. PWM based Analysis on Accuracy (TF_1 data set)
Fig : J48, minW 6 - maxW 15, no. of sites 10
Fig : J48, minW 6 - maxW 15, no. of motifs 5
Summary -
* accuracy increases with more motifs when the no. of sites is fixed
* accuracy increases with more sites when the no. of motifs is fixed
* what happens when we increase both?
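For readers unfamiliar with PWMs, the sketch below shows what a motif built from a fixed number of sites looks like and how it scores a sequence. It is not MEME or MAST: the aligned sites, the pseudocount of 1 and the uniform background of 0.25 are all assumptions for illustration, and input sequences are assumed to contain only A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Log-odds PWM from equal-length aligned sites, uniform background."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in ALPHABET
        })
    return pwm


def best_score(sequence, pwm):
    """Highest log-odds score over all windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    )


sites = ["ACGT", "ACGA", "ACGT", "TCGT"]   # 4 hypothetical sites, width 4
pwm = build_pwm(sites)
score_hit = best_score("GGACGTGG", pwm)    # contains the consensus ACGT
score_miss = best_score("GGGGGGGG", pwm)   # does not
```

A per-motif score like `best_score` could serve as one attribute per motif, so the no. of motifs sets the number of attributes and the no. of sites sets how much data supports each column.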
10. PWM based Analysis
Fig : Accuracy varies with the no. of motifs and the no. of sites
* 1st bar shows the no. of sites
* 2nd bar shows the no. of motifs
* 3rd bar shows the accuracy
* the point is that accuracy decreases when we increase both the no. of motifs and the no. of sites.
11. Extra Work for TF_20
Fig : Flow diagram of building the new model for TF_20 (boxes: K-mer model + PWM model; sequences identified by both models; sequences identified differently; biased 2-mer model; newly labeled sequences; the new model for TF_20)
Summary -
* we have done some extra work for TF_20
12. AUC based on the Feedback (bonus model)
Fig : AUC of the 10 data sets based on the last submission
* accuracy improved over the first submission
* the PWM based model did not give good results
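As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive sequence is scored above a randomly chosen negative one, counting ties as one half. The labels and scores are made-up values for illustration, not results from the submissions.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 1, 0]                # hypothetical true labels
scores = [0.9, 0.4, 0.5, 0.2, 0.8, 0.4]    # hypothetical classifier scores
value = auc(labels, scores)                # 7.5 of 9 pairs ranked correctly
```

The pairwise form is quadratic in the number of sequences, which is fine for a quick check; a rank-based formula gives the same value faster on large sets.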
13. Participation
Name | Background Study | Working with Tools | Working with Models | Parameter Tuning | Automation
Badri Sampath | DNA, RNA, protein, motif | AlignAce, MEME, MAST | PWM | K-mer | Arff writer, MAST output writer
Iffat Sharmin Chowdhury | Protein, motif, transcription | Weka, AlignAce, ScanAce | K-mer | PWM | Script for FASTA, Weka
Prosunjit Biswas | DNA, transcription | MEME, MAST | K-mer | PWM | Script for RE, for new K-mer model
Tahmina Ahmed | MEME, MAST, PWM | MEME, MAST, Weka | PWM | K-mer | Script for MEME, MAST