Research Article
Volume 1 Issue 2 - January 2017
Curr Trends Biomedical Eng & Biosci
Copyright © All rights are reserved by Damian R Mingle
Controlling Informative Features for Improved
Accuracy and Faster Predictions in Omentum Cancer
Models
Damian R Mingle*
WPC Healthcare, Nashville, USA
Submission: December 09, 2016; Published: January 03, 2017
*Corresponding author: Damian Mingle, Chief Data Scientist, WPC Healthcare, 1802 Williamson Ct, Brentwood, TN 37027, USA
Email:
Current Trends in Biomedical Engineering & Biosciences
Curr Trends Biomedical Eng & Biosci 1(2): CTBEB.MS.ID.555559 (2017)
How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr Trends Biomedical Eng & Biosci. 2017; 1(2): 555559.
Abstract
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current
machine learning approaches are either too complex or perform poorly. Here, a novel feature detection and engineering machine-learning
framework is presented to address this need. First, the Rip Curl process is applied, which generates a set of 10 additional features. Second, we
rank all features, including the Rip Curl features; the top-ranked features will most likely contain the most informative features for prediction of
the underlying biological classes. The top-ranked features are used in model building. This process creates more expressive features, which are
captured in models with an eye towards whether the model learns from an increasing sample amount and towards the accuracy/time results. The
performance of the proposed Rip Curl classification framework was tested on omentum cancer data. Rip Curl outperformed other, more sophisticated
classification methods in terms of prediction accuracy, minimum number of classification markers, and computational time.
Keywords: Omentum, Cancer, Data science, Machine learning, Biomarkers, Phenotype, Personalized medicine
Introduction
In recent years, the dawn of technologies like microarrays,
proteomics, and next-generation sequencing has transformed
life science. The data from these experimental approaches
deliver a comprehensive depiction of the complexity of biological
systems at different levels. A challenge within the “-omics” data
strata is finding the small amount of information that is
relevant to a particular question, such as biomarkers that can
accurately classify phenotypic outcomes [1]. This is certainly
true for the omentum, the fold of peritoneum connecting the
stomach with other abdominal organs. Numerous
machine learning techniques and methods have been proposed
to identify biomarkers that accurately classify these outcomes
by learning the elusive pattern latent in the data. To date, three
categories of methods assist in biomarker selection
and phenotypic classification:
A. Filters
B. Wrappers
C. Embedded methods
In practice, time-to-prediction and accuracy of prediction
matter a great deal.
Filtering methods are generally chosen to minimize
time-to-prediction and can be used to
decide which features are the most informative in relation
to the biological target [2]. Filtering computes each marker's degree of
correlation with a given phenotype and then ranks the markers
in a given dataset. Many researchers acknowledge the weaknesses
of such methods and watch carefully for the selection
of redundant biomarkers. In addition, filtering methods do not
account for interactions between biomarkers. An example of a
popular filtering method is Student’s t-test [3].
To optimize the predictive power of a classification
model, wrapper methods iteratively perform a combinatorial
biomarker search. Since this combinatorial optimization process
is computationally complex (an NP-hard problem), many heuristics
have been proposed to reduce the search space
and thus the computational burden of biomarker
selection [4].
Embedded methods are similar to wrapper methods, except
that they perform feature selection and classification
simultaneously. Recursive feature elimination support vector
machine (SVM-RFE) is a widely used technique for analysis of
microarray data [5,6]. The SVM-RFE procedure constructs a
classification model using all available features, with the least
informative features being pruned from the model. This process
continues iteratively until a model has learned the minimum
number of features that are useful. In the case of “-omics” data,
this process becomes impractical when considering a large
feature space.
Our research used a hybrid approach between user and
machine that dramatically reduced the computational time
required by similar approaches while increasing prediction
accuracy when compared with other state-of-the-art machine
learning techniques. Our proposed framework includes:
A. Ranking and pruning attributes using information gain
to extract the most informative features, thus greatly
reducing the number of data dimensions;
B. Having a user view histograms of attributes where the
information gain is 0.80 or higher and create new binary
features from continuous values;
C. Re-ranking both the original features and the newly
constructed features; and
D. Using the number of instances to determine how many
top-n informative features should be used in modeling.
The Rip Curl framework can be used to construct a high-
dimensional classification model that takes into account
dependencies among the attributes when analyzing complex
biological -omics datasets. The performance of the proposed
four-step classification framework was evaluated using
microarray datasets. The proposed framework was compared with
SVM-RFE in terms of area under the ROC curve (AUC) and the
number of informative biomarkers used for classification.
Results and Discussion
Using the omentum dataset, we conducted the Rip Curl
process: setting the target feature (in this case one-
versus-all), characterizing the target variable, loading and
preparing the omentum data, saving the target and partitioning
information, analyzing the omentum features, creating cross-
validation and hold-out partitions, and conducting exploratory
data analysis.
In an effort to increase performance and accuracy, we opted
for feature selection to reduce the number
of descriptive features in the omentum dataset to just the subset
that is most useful for prediction. Before discussing
approaches to feature selection, it is useful to distinguish
between different types of descriptive features (Table 1).
Table 1: Different types of descriptive features.
| Type | Description |
|---|---|
| Predictive | A predictive descriptive feature provides information that is useful in estimating the correct value of a target feature. |
| Interacting | By itself, an interacting descriptive feature is not informative about the value of the target feature. In conjunction with one or more other features, however, it becomes informative. |
| Redundant | A descriptive feature is redundant if it has a strong correlation with another descriptive feature. |
| Irrelevant | An irrelevant descriptive feature does not provide information that is useful in estimating the value of the target feature. |
The goal of any feature selection approach is to identify the
smallest subset of descriptive features that maintains overall
model performance. Ideally a feature selection approach will
return the subset of features that includes the predictive
and interacting features while excluding the irrelevant and
redundant features.
Using conventional methods, it is not efficient to find the
ideal subset of descriptive features used to train an omentum
model. Consider d features. There are $2^d$ different possible feature
subsets, which is far too many to evaluate unless d is very small.
For example, with the descriptive features represented in the
omentum dataset, there are $2^{10,960}$ possible feature subsets,
a 3,300-digit integer.
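The size of this search space can be checked directly with arbitrary-precision integers. The following is a minimal sketch; the function name `subset_count_digits` is ours and not part of the original study.

```python
def subset_count_digits(d: int) -> int:
    """Return the number of decimal digits in 2**d, the count of
    possible feature subsets for d descriptive features."""
    return len(str(2 ** d))


if __name__ == "__main__":
    # d = 10,960 is the feature count quoted in the text above.
    print(subset_count_digits(10_960))
```

Even a search that could score billions of subsets per second could not enumerate a space of this size, which is why ranking-based selection is used instead.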
Material and Methods
Datasets
The dataset used in the experiments was provided by the
Gene Expression Machine Learning Repository (GEMLeR) [7].
GEMLeR contains microarray data from 9 different tissue types:
colon, breast, endometrium, kidney, lung, omentum,
ovary, prostate, and uterus. Each microarray sample is classified
as tumor or normal. The data from this repository were collated
into 9 one-tissue-type versus all-other-types (OVA) datasets
where the second class is labeled as “other.” All GEMLeR
microarray datasets have been analyzed by SVM-RFE, the results
of which are available from the same resource.
Methods
Figure 1: The Rip Curl framework.
Figure 1 demonstrates the Rip Curl framework and its
dependencies on the prior stage.
In applying the Rip Curl framework, we initially ran the
omentum microarray data through an informative feature
test and then ranked those features from most to least
important. We confirmed the number of instances available in
the dataset and then applied 1% (1,545 instances × 0.01 ≈ 15
features) to determine how many top informative features we
would use in our framework. Where the features were
both in the top 1% and expressed informativeness at or above
80%, we created new features based on meaningful
thresholds that grouped biomarker data into bins of “0” or “1”.
Finally, we sent the enhanced data back through the informative
feature test to reduce the feature space to the top 1%,
removed all other features, and modeled this subset using Random
Forest (Entropy).
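The instance-based feature budget can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the helper names are hypothetical, rounding down is our assumption (it matches the 15 variables reported for the Top 1% model), and the importance values are a few rows taken from Table 3.

```python
from typing import Dict, List


def top_n_from_instances(n_instances: int, fraction: float = 0.01) -> int:
    """Feature budget: a fixed fraction of the instance count, rounded down
    (the rounding direction is an assumption)."""
    return int(n_instances * fraction)


def select_top_n(importance: Dict[str, float], n: int) -> List[str]:
    """Return the n feature names with the highest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    budget = top_n_from_instances(1545)  # 1,545 instances in the omentum data
    print(budget)
    scores = {"Binsum": 92.88, "1800_206067_s_at": 84.93,
              "Lensum": 55.36, "Min": 51.85}
    print(select_top_n(scores, 2))
```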
Our general approach to model selection was to run several
model types and to select the best performing based on the
highest AUC from the cross-validation results. Once those models
were selected, we confirmed that they were learning
models by training on samples of 16%, 32%, and 64% of the data;
the results are reported in Figure 2. If a model's performance
did not increase with each sample size, the model was discarded
as a non-learning model.
Figure 2: A demonstration of a proper learning model.
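The learning-model check described above can be expressed as a simple monotonicity test. This is a sketch only: the function name is ours, the AUC values in the example are hypothetical, and requiring a strict increase at every step is our reading of the criterion.

```python
from typing import Sequence


def is_learning_model(scores_by_sample_size: Sequence[float]) -> bool:
    """Return True if the score strictly increases from each sample size
    to the next (e.g. scores at 16%, 32%, and 64% of the data)."""
    pairs = zip(scores_by_sample_size, scores_by_sample_size[1:])
    return all(earlier < later for earlier, later in pairs)


if __name__ == "__main__":
    # Hypothetical cross-validation AUCs at 16%, 32%, and 64% of the data.
    print(is_learning_model([0.90, 0.92, 0.94]))  # kept
    print(is_learning_model([0.90, 0.89, 0.94]))  # discarded
```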
We chose area under the ROC curve because it is readily
understood and calculated it as follows:
$$\text{ROC index} = \sum_{i=2}^{|T|} \frac{\left(FPR(T[i]) - FPR(T[i-1])\right) \times \left(TPR(T[i]) + TPR(T[i-1])\right)}{2} \quad (1)$$
where T is a set of thresholds, |T| is the number of thresholds
tested, and TPR(T[i]) and FPR(T[i]) are the true positive and
false positive rates at threshold i, respectively. The ROC index is
quite robust in the presence of imbalanced data, which makes
it a common choice for practitioners, especially when multiple
modeling techniques are being compared to one another.
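One common way to compute the ROC index from per-threshold rates is the trapezoidal rule. The sketch below assumes the thresholds are ordered so that the false positive rate is non-decreasing; the function name and the example rates are illustrative and do not come from the study.

```python
from typing import Sequence


def roc_index(tpr: Sequence[float], fpr: Sequence[float]) -> float:
    """Trapezoidal area under the ROC curve.

    tpr[i] and fpr[i] are the true and false positive rates at threshold i,
    with thresholds ordered so that fpr is non-decreasing.
    """
    if len(tpr) != len(fpr):
        raise ValueError("tpr and fpr must have the same length")
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]            # step along the FPR axis
        mean_height = (tpr[i] + tpr[i - 1]) / 2  # average TPR over the step
        area += width * mean_height
    return area


if __name__ == "__main__":
    # Hypothetical operating points, from strictest to loosest threshold.
    print(roc_index(tpr=[0.0, 0.6, 0.9, 1.0], fpr=[0.0, 0.1, 0.4, 1.0]))
```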
In addition, we ran a second evaluation measure, Gini Norm,
which is calculated as follows:
$$\text{Gini coefficient} = (2 \times \text{ROC index}) - 1 \quad (2)$$
The Gini coefficient can take values in the range [0, 1], and
higher values indicate better model performance.
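As a quick check of this relationship, the sketch below converts an AUC into a Gini coefficient. The function name is ours; the input 0.9520 is the validation AUC of the raw-feature model in Table 2, whose reported Gini Norm is 0.9040.

```python
def gini_from_auc(auc: float) -> float:
    """Gini coefficient as a linear rescaling of the ROC index (AUC)."""
    return 2.0 * auc - 1.0


if __name__ == "__main__":
    # Validation AUC for the raw-feature model in Table 2.
    print(round(gini_from_auc(0.9520), 4))
```

Because the mapping is linear, ranking models by Gini Norm gives the same order as ranking them by AUC.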
In our experiment with the omentum microarray data, we
wanted to pay particular attention to reducing complexity and
thereby improving time-to-prediction. This is especially important
with an M × N dimensional dataset, where M is the number of
samples and N is the number of features, and more specifically
where N is orders of magnitude greater than M, as is the case in
our experiment.
We selected Random Forest to represent the general
technique of random decision forests, an ensemble learning
method for classification, regression, and other tasks. Random
Forest operates by constructing a multitude of decision trees
at training time and outputting the class that is the mode of
the classes (classification) or mean prediction (regression)
of the individual trees. Random decision forests correct for
decision trees’ habit of overfitting to their training set. Figure
3 demonstrates visually the increase in predictive accuracy as it
relates to the complexity of that prediction.
Figure 3: Comparison of Rip Curl vs other methods.
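The aggregation step of a random forest, as described above, can be sketched independently of any particular library. The function names and the example votes are hypothetical, and tree construction itself is out of scope here.

```python
from collections import Counter
from statistics import mean
from typing import Hashable, Sequence


def aggregate_classification(tree_votes: Sequence[Hashable]) -> Hashable:
    """Return the mode of the per-tree class predictions.
    On a tie, the class that was seen first wins."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions: Sequence[float]) -> float:
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


if __name__ == "__main__":
    print(aggregate_classification(["omentum", "other", "omentum"]))
    print(aggregate_regression([1.0, 2.0, 3.0]))
```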
Table 2: Comparative analysis of model performance.
| Data | Model | Number of Variables | AUC (Validation) | AUC (Cross-Validation) | AUC (Hold-Out) | Gini Norm (Validation) | Gini Norm (Cross-Validation) | Gini Norm (Hold-Out) | Time (milliseconds) |
|---|---|---|---|---|---|---|---|---|---|
| Raw Features | Random Forest (Entropy) | 10,935 | 0.9520 | 0.9427 | 0.9269 | 0.9040 | 0.8855 | 0.8537 | 10,859.25 |
| Univariate Features | Random Forest (Entropy) | 8,165 | 0.9592 | 0.9492 | 0.9232 | 0.9184 | 0.8984 | 0.8463 | 8,233.39 |
| Informative Features | Random Forest (Entropy) | 8,283 | 0.9520 | 0.9427 | 0.9269 | 0.9040 | 0.8855 | 0.8537 | 10,732.36 |
| Top 1% of Features | Random Forest (Entropy) | 15 | 0.9379 | 0.9201 | 0.9344 | 0.8757 | 0.8401 | 0.8689 | 3,374.21 |
Table 2 emphasizes the disparity of different results in
prediction and time by holding constant the model type, Random
Forest (Entropy).
We observed that Rip Curl (Top 1% of Features) made use
of the best parameters, which we found to be max_depth: None,
max_features: 0.2, max_leaf_nodes: 50, min_samples_leaf: 5, and
min_samples_split: 10. Rip Curl improved time-to-prediction
by a range of 59.02% to 68.93%, increased the hold-out AUC
by 0.81% to 1.21%, and increased the hold-out Gini Norm by a
range of 1.78% to 2.67%.
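These ranges are relative changes and can be reproduced from the Table 2 values. In the sketch below the helper names are ours, and the choice of the raw-feature and univariate-feature rows as the two baselines is inferred from the arithmetic, since those rows yield the quoted figures.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Relative decrease from baseline to new, in percent."""
    return (baseline - new) / baseline * 100.0


def percent_increase(baseline: float, new: float) -> float:
    """Relative increase from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Time (ms): Top 1% model vs. univariate and raw-feature models.
    print(round(percent_reduction(8233.39, 3374.21), 2))
    print(round(percent_reduction(10859.25, 3374.21), 2))
    # Hold-out AUC: Top 1% model vs. raw and univariate models.
    print(round(percent_increase(0.9269, 0.9344), 2))
    print(round(percent_increase(0.9232, 0.9344), 2))
    # Hold-out Gini Norm: Top 1% model vs. raw and univariate models.
    print(round(percent_increase(0.8537, 0.8689), 2))
    print(round(percent_increase(0.8463, 0.8689), 2))
```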
Our framework makes use of Claude Shannon’s entropy
formula which defines a computational measure of the impurity
of the elements in a set. Shannon’s idea of entropy is a weighted
sum of the logs of the probabilities of each possible outcome
when we make a random selection from a set. The weights used
in the sum are the probabilities of the outcomes themselves so
that outcomes with high probabilities contribute more to the
overall entropy of a set than outcomes with low probabilities.
Shannon’s entropy is defined as
$$H(t) = -\sum_{i=1}^{l} P(t=i) \times \log_s\left(P(t=i)\right)$$
where P(t=i) is the probability that the outcome of randomly
selecting an element t is the type i, l is the number of different
types of things in the set, and s is an arbitrary logarithmic base
(which we selected as 2) [12].
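Entropy and the information gain derived from it can be sketched in a few lines. This is the textbook formulation for a categorical feature and is offered as an assumption; the exact informativeness score used in the study, and any normalization behind the 80% cut-off, is not specified in the text. The function names are ours.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable], base: float = 2) -> float:
    """Shannon entropy of a collection of labels."""
    n = len(labels)
    probabilities = (count / n for count in Counter(labels).values())
    return -sum(p * math.log(p, base) for p in probabilities)


def information_gain(feature_values: Sequence[Hashable],
                     labels: Sequence[Hashable],
                     base: float = 2) -> float:
    """Entropy of the labels minus the weighted entropy that remains
    after partitioning the data by the feature's values."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset, base)
    return entropy(labels, base) - remainder


if __name__ == "__main__":
    target = ["omentum", "omentum", "other", "other"]
    print(entropy(target))                         # evenly mixed set
    print(information_gain([1, 1, 0, 0], target))  # feature separates classes
    print(information_gain([1, 0, 1, 0], target))  # feature is uninformative
```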
Once we established the informativeness of each feature, we
visually explored the histograms of each variable that expressed
an informative value of ≥80%. Below is an example of the
histogram for 206067_s_at, indicating where the target’s signal
was most concentrated within this biomarker (Figure 4).
Figure 4: Histogram of Gene 206067_s_at in raw expression.
In an effort to concentrate the omentum tissue signal, we
generated a thresholding rule on this biomarker, which rendered
a new feature with its own histogram (Figure 5), allowing us to
pass a different, possibly more understandable context to our
algorithm.
Figure 5: Demonstration of Rip Curl feature engineering using
visual thresholds.
We repeated this process, applying rules based on our
observation of the training data.
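A threshold rule of this kind can be sketched as a simple binarization. The rule itself is not reproduced in the text, so the cut-off of 1800 and the greater-than direction are assumptions inferred from the engineered feature name >1800_206067_s_at; the expression values and the function name are hypothetical.

```python
from typing import List, Sequence


def binarize(values: Sequence[float], threshold: float) -> List[int]:
    """Map each continuous expression value to 1 if it exceeds the
    threshold and to 0 otherwise."""
    return [1 if value > threshold else 0 for value in values]


if __name__ == "__main__":
    # Hypothetical raw expression values for probe 206067_s_at.
    expression = [120.5, 1800.0, 2500.0, 95.2]
    # Assumed cut-off, read off the feature name ">1800_206067_s_at".
    print(binarize(expression, threshold=1800))
```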
Some additional features were simple descriptive statistics
such as Min and Mode, while others were a bit unconventional,
such as
where X is a gene feature and i represents the placement
within the feature index.
Binsum was another engineered feature that was simply
$$\text{Binsum} = \sum_{i} \text{bin}_i$$
where bin is one of the generated binary features and i is the
index of the bin within the omentum training data. In an effort
to develop greater context for the omentum data and the new
features that were engineered, we analyzed key values and their
respective informativeness (or importance) [8] (Table 3):
Table 3: Rip Curl feature engineering statistics.
| Feature Name | Importance | Unique | Missing | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|---|---|
| Binsum | 92.88 | 9 | 0 | 1.46 | 2.07 | 1 | 0 | 8 |
| 1800_206067_s_at | 84.93 | 2 | 0 | 0.15 | 0.35 | 0 | 0 | 1 |
| 400_216953_s_at | 82.15 | 2 | 0 | 0.14 | 0.34 | 0 | 0 | 1 |
| 1100_219454_at | 82.09 | 2 | 0 | 0.22 | 0.42 | 0 | 0 | 1 |
| 1300_214844_s_at | 78.42 | 2 | 0 | 0.13 | 0.34 | 0 | 0 | 1 |
| 4000_227195_at | 77.04 | 2 | 0 | 0.21 | 0.41 | 0 | 0 | 1 |
| 1100_213518_at | 76.07 | 2 | 0 | 0.31 | 0.46 | 0 | 0 | 1 |
| 3500_204457_s_at | 75.71 | 2 | 0 | 0.21 | 0.40 | 0 | 0 | 1 |
| 900_219778_at | 71.29 | 2 | 0 | 0.09 | 0.29 | 0 | 0 | 1 |
| Lensum | 55.36 | 3 | 0 | 13.82 | 2.98 | 16 | 8 | 16 |
| Mode | 53.04 | 837 | 0 | 214 | 413 | 19.80 | 1.60 | 4152 |
| Min | 51.85 | 81 | 0 | 242 | 1.41 | 2.20 | 0.20 | 15.40 |
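The Binsum feature reported in Table 3 can be sketched as a row-wise count of the engineered binary features that fire for each sample. This reading is an assumption; it is consistent with Table 3, where eight binary features accompany a Binsum that ranges from 0 to 8. The function name and the example rows are hypothetical.

```python
from typing import List, Sequence


def binsum(binary_rows: Sequence[Sequence[int]]) -> List[int]:
    """For each sample (row), sum its engineered 0/1 features."""
    return [sum(row) for row in binary_rows]


if __name__ == "__main__":
    # Hypothetical samples, each with three engineered binary features.
    samples = [
        [1, 0, 1],
        [0, 0, 0],
        [1, 1, 1],
    ]
    print(binsum(samples))
```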
Figure 6: Variable importance rank generated from the Rip Curl
framework.
Figure 6 demonstrates visually the variable importance
of the final Rip Curl model. We observed that 20% of the top
informative features were generated through the Rip Curl
framework: Binsum, >1800_206067_s_at, and >400_216953_s_at,
with a range of importance between 27% and 93%.
GEMLeR provides an initial classification accuracy value for
the omentum dataset. In their experiments designed to generate
the state-of-the-art benchmark, all measurements were
performed using the WEKA machine learning environment. They
opted to make use of one of the most popular machine learning
methods for gene expression analysis, the Support Vector Machines
– Recursive Feature Elimination (SVM-RFE) feature selection
algorithm. Feature selection and classification were both
done inside a 10-fold cross-validation loop on the omentum
dataset to avoid so-called selection bias [9]; Figure 7
demonstrates their approach.
Figure 7: SVM-RFE Process.
Head-to-Head Comparison with SVM-RFE
Table 4 compares the SVM-RFE benchmark with the Rip Curl
framework [10]:
Table 4: Comparison results of international benchmark and Rip Curl.
| Model | AUC |
|---|---|
| SVM-RFE (Benchmark) | 0.703 |
| Rip Curl | 0.934 |
Rip Curl represents a 32.92% relative gain in prediction accuracy
(AUC) over the GEMLeR benchmark for the same omentum dataset,
computed from the hold-out AUC of 0.9344 in Table 2 [11].
Conclusion
The Rip Curl classification framework outperformed the
state-of-the-art benchmark (SVM-RFE) in the GEMLeR omentum
cancer experiment. Since the Rip Curl classification framework
utilizes entropy-based feature filtering and adds more context
through feature engineering, the complexity of this classification
framework is very low, permitting analysis of data with many
features. Future research should extend the comparison beyond
the omentum cancer data and explore the other one-versus-
all experiments for breast, colon, endometrium,
kidney, lung, ovary, prostate, and uterus [12-14].
Acknowledgment
We would like to acknowledge GEMLeR for making this
important dataset available to researchers and WPC Healthcare
for supporting our work. Finally, the authors would like to thank
the donors who participated in this study.
References
1. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y (2010) Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26(3): 392-398.
2. Mingle D (2015) A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column. International Journal of Biomedical Data Mining.
3. Leung Y, Hung Y (2010) A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification. IEEE/ACM Trans Comput Biol Bioinform 7(1): 108-117.
4. Mohammadi A, Saraee MH, Salehi M (2011) Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Medical Genomics 4(1): 1.
5. Ding Y, Wilkins D (2006) Improving the performance of SVM-RFE to select genes in microarray data. BMC Bioinformatics 7(Suppl 2): S12.
6. Balakrishnan S, Narayanaswamy R, Savarimuthu N, Samikannu R (2008) SVM ranking with backward search for feature selection in type II diabetes databases. IEEE pp. 2628-2633.
7. Stiglic G, Kokol P (2010) Stability of ranked gene lists in large microarray analysis studies. BioMed Research International 2010: ID 616358.
8. Breiman L (2001) Random forests. Machine Learning 45(1): 5-32.
9. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 99(10): 6562-6566.
10. Duan K, Rajapakse JC (2004) SVM-RFE peak selection for cancer classification with mass spectrometry data. In APBC pp. 191-200.
11. Hu ZZ, Huang H, Wu CH, Jung M, Dritschilo A, et al. (2011) Omics-based molecular target and biomarker identification. Methods Mol Biol 547-571.
12. Shannon CE (1948) A note on the concept of entropy. Bell System Tech J 27: 379-423.
13. Stiglic G, Rodriguez JJ, Kokol P (2010) Finding optimal classifiers for small feature sets in genomics and proteomics. Neurocomputing 73(13): 2346-2352.
14. Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, et al. (2009) Outcome prediction based on microarray analysis: a critical perspective on methods. BMC Bioinformatics 10: 53.
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
 
Classification AlgorithmBased Analysis of Breast Cancer Data
Classification AlgorithmBased Analysis of Breast Cancer DataClassification AlgorithmBased Analysis of Breast Cancer Data
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
 
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...ijsc
 
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
 
CSCI 6505 Machine Learning Project
CSCI 6505 Machine Learning ProjectCSCI 6505 Machine Learning Project
CSCI 6505 Machine Learning Projectbutest
 
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONSVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
 
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONSVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
 
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONSVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
 

Ă„hnlich wie Controlling informative features for improved accuracy and faster predictions in omentum cancer models (20)

Controlling informative features for improved accuracy and faster predictions in omentum cancer models

  • 1. Research Article
Volume 1 Issue 2 - January 2017
Curr Trends Biomedical Eng & Biosci 1(2): CTBEB.MS.ID.555559 (2017)
Copyright © All rights are reserved by Damian R Mingle
Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models
Damian R Mingle*
WPC Healthcare, Nashville, USA
Submission: December 09, 2016; Published: January 03, 2017
*Corresponding author: Damian Mingle, Chief Data Scientist, WPC Healthcare, 1802 Williamson Ct, Brentwood, TN 37027, USA
Email:

Abstract
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly. Here, a novel feature detection and engineering machine-learning framework is presented to address this need. First, the Rip Curl process is applied, which generates a set of 10 additional features. Second, we rank all features, including the Rip Curl features; the top-ranked features will most likely contain the most informative features for prediction of the underlying biological classes. The top-ranked features are used in model building. This process creates more expressive features, which are captured in models with an eye towards whether the model learns from increasing sample sizes and towards the accuracy/time results. The performance of the proposed Rip Curl classification framework was tested on omentum cancer data. Rip Curl outperformed other, more sophisticated classification methods in terms of prediction accuracy, minimum number of classification markers, and computational time.

Keywords: Omentum, Cancer, Data science, Machine learning, Biomarkers, Phenotype, Personalized medicine

Introduction
In recent years, the dawn of technologies like microarrays, proteomics, and next-generation sequencing has transformed life science. The data from these experimental approaches deliver a comprehensive depiction of the complexity of biological systems at different levels. A challenge within the “-omics” data strata is finding the small amount of information that is relevant to a particular question, such as biomarkers that can accurately classify phenotypic outcomes [1]. This is certainly true of the omentum, the fold of peritoneum connecting the stomach with other abdominal organs. Numerous machine learning techniques and methods have been proposed to identify biomarkers that accurately classify these outcomes by learning the elusive pattern latent in the data. To date, there have been three categories that assist in biomarker selection and phenotypic classification:
A. Filters
B. Wrappers
C. Embedding
In practice, time-to-prediction and accuracy of prediction matter a great deal.
Filtering methods are generally chosen to minimize time-to-prediction and can be used to decide which features are the most informative in relation to the biological target [2]. Filtering computes the degree of correlation of each marker with a given phenotype and then ranks the markers in a given dataset. Many researchers acknowledge the weakness of such methods and take careful note of the selection of redundant biomarkers. In addition, filtering methods do not allow for interactions between biomarkers. An example of a popular filtering method is Student’s t-test [3].
In order to optimize the predictive power of a classification model, wrapper methods iteratively perform a combinatorial biomarker search. Since this combinatorial optimization process is computationally complex (an NP-hard problem), many heuristics have been proposed, for example, to reduce the search space and thus reduce the computational burden of the biomarker selection [4].
Embedded methods are similar to wrapper methods, except that they perform feature selection and classification simultaneously. Recursive feature elimination support vector machine (SVM-RFE) is a widely used technique for analysis of microarray data [5,6]. The SVM-RFE procedure constructs a
  • 2. How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr Trends Biomedical Eng & Biosci. 2017; 1(2): 555559.
classification model using all available features, with the least informative features being pruned from the model. This process continues iteratively until a model has learned the minimum number of features that are useful. In the case of “-omics” data, this process becomes impractical when considering a large feature space.
Our research used a hybrid approach between user and machine that dramatically reduced the computational time required by similar approaches while increasing prediction accuracy compared with other state-of-the-art machine learning techniques. Our proposed framework includes:
A. Ranking and pruning attributes using information gain to extract the most informative features, thus greatly reducing the number of data dimensions;
B. Having a user view histograms of attributes whose information gain is 0.80 or higher and create new binary features from continuous values;
C. Re-ranking both the original features and the newly constructed features; and
D. Using the number of instances to determine how many top-n informative features should be used in modeling.
The Rip Curl framework can be used to construct a high-dimensional classification model that takes into account dependencies among the attributes, for analysis of complex biological -omics datasets whose features are interdependent. The performance of the proposed four-step classification framework was evaluated using microarray datasets. The proposed framework was compared with SVM-RFE in terms of area under the ROC curve (AUC) and the number of informative biomarkers used for classification.
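The four steps A to D can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names (`best_split`, `rip_curl_sketch`) and the toy data are hypothetical, information gain is computed here as the entropy reduction of the best single threshold on each feature, and the threshold in step B is picked automatically where the article has a user inspect histograms.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_split(values, labels):
    """Return (gain, threshold) for the best midpoint threshold (assumed scoring rule)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    if not candidates:
        return (0.0, None)
    return max((information_gain(values, labels, t), t) for t in candidates)


def rip_curl_sketch(features, labels, gain_cutoff=0.80, fraction=0.01):
    """Hypothetical sketch of steps A-D; `features` maps a name to a list of floats."""
    # Step D's rule: the number of instances decides how many features are kept.
    top_n = max(1, int(len(labels) * fraction))

    # Step A: rank by information gain and prune to the top-n features.
    scored = {name: best_split(vals, labels) for name, vals in features.items()}
    kept = sorted(scored, key=lambda name: (-scored[name][0], name))[:top_n]

    # Step B: binarize highly informative features (automated stand-in for the
    # user-driven histogram inspection described in the article).
    enhanced = {name: features[name] for name in kept}
    for name in kept:
        gain, threshold = scored[name]
        if gain >= gain_cutoff and threshold is not None:
            enhanced[name + "_bin"] = [1 if x > threshold else 0 for x in features[name]]

    # Step C: re-rank the original and the newly constructed features together.
    rescored = {name: best_split(vals, labels)[0] for name, vals in enhanced.items()}
    ranked = sorted(rescored, key=lambda name: (-rescored[name], name))

    # Step D: keep only the top-n features for modeling.
    return ranked[:top_n]


# Toy data (hypothetical): g1 separates the classes perfectly, g2 does not.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "g1": [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2],
    "g2": [0.5, 0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.7],
}
selected = rip_curl_sketch(toy_features, toy_labels, fraction=0.25)
```

The `fraction` argument is raised to 0.25 only so that the eight-sample toy set keeps more than one feature. The selected subset would then be passed to whichever classifier is being evaluated.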
Results and Discussion
Using the omentum dataset, we carried out the Rip Curl process: we set the target feature (in this case one-versus-all), characterized the target variable, loaded and prepared the omentum data, saved the target and partitioning information, analyzed the omentum features, created cross-validation and hold-out partitions, and conducted exploratory data analysis.
In an effort to increase performance and accuracy, we opted for feature selection to reduce the number of descriptive features in the omentum dataset to just the subset that is most useful for prediction. Before we begin our discussion of approaches to feature selection, it is useful to distinguish between different types of descriptive features (Table 1):

Table 1: Different types of descriptive features.
Type | Description
Predictive | A predictive descriptive feature provides information that is useful in estimating the correct value of a target feature.
Interacting | By itself, an interacting descriptive feature is not informative about the value of the target feature. In conjunction with one or more other features, however, it becomes informative.
Redundant | A descriptive feature is redundant if it has a strong correlation with another descriptive feature.
Irrelevant | An irrelevant descriptive feature does not provide information that is useful in estimating the value of the target feature.

The goal of any feature selection approach is to identify the smallest subset of descriptive features that maintains overall model performance. Ideally, a feature selection approach will return the subset of features that includes the predictive and interacting features while excluding the irrelevant and redundant features. Using conventional methods, it is not efficient to find the ideal subset of descriptive features with which to train an omentum model. Consider d features. There are $2^d$ different possible feature subsets, which is far too many to evaluate unless d is very small.
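To get a feel for how quickly $2^d$ grows, the number of decimal digits in the subset count can be computed without materializing the number itself. The sketch below is a small illustration; the helper name `subset_count_digits` is hypothetical, and d = 10,960 is the feature count the article reports for the omentum dataset.

```python
import math


def subset_count_digits(d):
    """Number of decimal digits of 2**d, the count of possible feature subsets."""
    # 2**d has floor(d * log10(2)) + 1 digits for any d >= 0.
    return math.floor(d * math.log10(2)) + 1


small = subset_count_digits(10)          # 2**10 = 1024
omentum = subset_count_digits(10_960)    # feature count reported in the article
```

Even at a billion subset evaluations per second, a number of this size rules out exhaustive search, which is why ranking-based pruning is used instead.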
For example, with the descriptive features represented in the omentum dataset, there are 2^10,960 possible feature subsets, a number with 3,300 digits.

Material and Methods
Datasets
The datasets used in the experiments were provided by the Gene Expression Machine Learning Repository (GEMLeR) [7]. GEMLeR contains microarray data from 9 different tissue types: colon, breast, endometrium, kidney, lung, omentum, ovary, prostate, and uterus. Each microarray sample is classified as tumor or normal. The data from this repository were collated into 9 one-tissue-type versus all-other-types (OVA) datasets, where the second class is labeled “other.” All GEMLeR microarray datasets have been analyzed by SVM-RFE, the results of which are available from the same resource.

Methods
Figure 1: The Rip Curl framework.
Figure 1 demonstrates the Rip Curl framework and the dependency of each stage on the prior stage. In applying the Rip Curl framework, we first ran the omentum microarray data through an informative feature test and ranked the features from most to least important. We confirmed the number of instances available in the dataset and then applied 1% (1,545 instances × 0.01 ≈ 15
features) to discover how many top informative features we would use in our framework. Where features were both in the top 1% and had an informativeness at or above 80%, we created new features based on meaningful thresholds that grouped biomarker data into bins of “0” or “1”. Finally, we sent the enhanced data back through the informative feature test to reduce the feature space to the top 1%, removed all other features, and modeled this subset using Random Forest (Entropy).

Our general approach to model selection was to run several model types and select the best performing one based on the highest AUC from the cross-validation results. Once those models were selected, we confirmed that they were learning models by training on samples of 16%, 32%, and 64% of the data; the results are reported in Figure 2. If a model's performance did not increase with each larger sample size, it was discarded as a non-learning model.

Figure 2: A demonstration of a proper learning model.

We chose area under the ROC curve because it is easy to interpret, and calculated it as follows:

$$\text{ROC index} = \sum_{i=2}^{|T|} \frac{\bigl(FPR(T[i]) - FPR(T[i-1])\bigr) \times \bigl(TPR(T[i]) + TPR(T[i-1])\bigr)}{2} \qquad (1)$$

where T is a set of thresholds, |T| is the number of thresholds tested, and TPR(T[i]) and FPR(T[i]) are the true positive and false positive rates at threshold i, respectively. The ROC index is quite robust in the presence of imbalanced data, which makes it a common choice for practitioners, especially when multiple modeling techniques are being compared to one another. In addition, we ran a second evaluation measure, Gini Norm, which is calculated as follows:

$$\text{Gini coefficient} = (2 \times \text{ROC index}) - 1 \qquad (2)$$

The Gini coefficient can take values in the range of (0,1), and higher values indicate better model performance.
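The two measures can be illustrated with a short sketch. It assumes the ROC index is a trapezoidal sum over (FPR, TPR) points taken at a descending list of thresholds, which matches the threshold-based description in the text. The function names, scores, and labels are invented for illustration.

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) at each threshold, with thresholds taken from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(thresholds, reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def roc_index(points):
    """Trapezoidal sum over consecutive (FPR, TPR) points."""
    return sum((f1 - f0) * (t1 + t0) / 2
               for (f0, t0), (f1, t1) in zip(points, points[1:]))


def gini(roc):
    """Gini coefficient computed from the ROC index."""
    return 2 * roc - 1


# Toy predictions (invented): 1 = omentum, 0 = other.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
# One threshold above the top score so the curve starts at (0, 0).
thresholds = [1.1] + scores

auc = roc_index(roc_points(scores, labels, thresholds))
print(auc, gini(auc))
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly, so the ROC index is 8/9 and the Gini coefficient is 7/9. Applying `gini` to the hold-out AUC of 0.9344 in Table 2 gives 0.8688, which agrees with the reported 0.8689 to within rounding of the AUC.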
In our experiment with the omentum microarray data, we paid particular attention to reducing complexity and thereby improving time-to-prediction. This matters especially for an M × N dataset, where M is the number of samples and N is the number of features, and more specifically where N is several times greater than M, as is the case in our experiment (10,935 raw features against 1,545 instances).

We selected Random Forest to represent the general technique of random decision forests, an ensemble learning method for classification, regression, and other tasks. Random Forest operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set. Figure 3 shows the increase in predictive accuracy as it relates to the complexity of that prediction.

Figure 3: Comparison of Rip Curl vs other methods.

Table 2: Comparative analysis of model performance. (Val = Validation, CV = Cross-Validation, HO = Hold-Out; Gini = Gini Norm; Time in milliseconds.)
Data                  Model                    Variables  AUC (Val)  AUC (CV)  AUC (HO)  Gini (Val)  Gini (CV)  Gini (HO)  Time (ms)
Raw Features          Random Forest (Entropy)  10,935     0.9520     0.9427    0.9269    0.9040      0.8855     0.8537     10,859.25
Univariate Features   Random Forest (Entropy)  8,165      0.9592     0.9492    0.9232    0.9184      0.8984     0.8463     8,233.39
Informative Features  Random Forest (Entropy)  8,283      0.9520     0.9427    0.9269    0.9040      0.8855     0.8537     10,732.36
Top 1% of Features    Random Forest (Entropy)  15         0.9379     0.9201    0.9344    0.8757      0.8401     0.8689     3,374.21

Table 2 emphasizes the disparity in prediction results and time while holding the model type, Random Forest (Entropy), constant. We observed that Rip Curl (Top 1% of Features) made use of the best parameters, which we found to be max_depth: None, max_features: 0.2, max_leaf_nodes: 50, min_samples_leaf: 5, and
min_samples_split: 10. Rip Curl improved time-to-prediction by 59.02% to 68.93%, increased the hold-out AUC by 0.81% to 1.21%, and increased the hold-out Gini Norm by 1.78% to 2.67%.

Our framework makes use of Claude Shannon’s entropy formula, which defines a computational measure of the impurity of the elements in a set. Shannon’s idea of entropy is a weighted sum of the logs of the probabilities of each possible outcome when we make a random selection from a set. The weights used in the sum are the probabilities of the outcomes themselves, so that outcomes with high probabilities contribute more to the overall entropy of a set than outcomes with low probabilities. Shannon’s entropy is defined as

$$H(t) = -\sum_{i=1}^{l} P(t=i) \times \log_s\bigl(P(t=i)\bigr)$$

where P(t=i) is the probability that the outcome of randomly selecting an element t is the type i, l is the number of different types of things in the set, and s is an arbitrary logarithmic base (which we selected as 2) (Shannon, 1948).

Once we established the informativeness of each feature, we visually explored the histograms of each variable that expressed an informative value of ≥80%. Below is an example of the histogram for 206067_s_at, indicating where the target’s signal was most concentrated within this biomarker (Figure 4).

Figure 4: Histogram of Gene 206067_s_at in raw expression.

To concentrate the omentum tissue signal, we generated a threshold rule on this biomarker, which rendered a new binary feature with its own histogram (Figure 5) and allowed us to pass a different, possibly more understandable context to our algorithm.

Figure 5: Demonstration of Rip Curl feature engineering using visual thresholds.
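The entropy measure and the thresholding step can be sketched together. The example below is a hypothetical illustration: the cut of 1800 is only inferred from the engineered feature name 1800_206067_s_at in Table 3, the expression values and tissue labels are invented, and the function names are not from the article.

```python
import math
from collections import Counter


def shannon_entropy(items, s=2):
    """Entropy of a collection, in base s.

    Uses p * log_s(1 / p), which equals -p * log_s(p) and keeps a pure
    set at exactly 0.0 rather than -0.0.
    """
    total = len(items)
    return sum((n / total) * math.log(total / n, s)
               for n in Counter(items).values())


def binarize(values, cut):
    """Return 1 where the value exceeds `cut`, else 0."""
    return [1 if v > cut else 0 for v in values]


CUT = 1800  # assumption: read off the feature name, not stated as a rule

# Invented raw expression values for one probe and the matching tissue labels.
expression = [2400.0, 1950.0, 310.0, 120.0, 2210.0, 95.0, 640.0, 1810.0]
tissue = ["omentum", "omentum", "other", "other",
          "omentum", "other", "other", "other"]

flag = binarize(expression, CUT)
above = [t for t, f in zip(tissue, flag) if f == 1]
below = [t for t, f in zip(tissue, flag) if f == 0]

print("flag:", flag)
print("entropy overall:", shannon_entropy(tissue))
print("entropy above cut:", shannon_entropy(above))
print("entropy below cut:", shannon_entropy(below))
```

In this toy data every sample below the cut is "other", so that bin has zero entropy, while the bin above the cut is dominated by omentum samples. That is the sense in which a binary feature can concentrate the target signal.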
We repeated this process, applying rules based on our observations of the training data. Some additional features were simple descriptive statistics such as Min and Mode, while others were a bit unconventional and were computed over the values X_i, where X is a gene feature and i represents the placement within the feature index. Binsum was another engineered feature that was simply

$$\text{Binsum} = \sum_{i} bin_i$$

where bin is one of the generated binary features and i is the index of the bin within the omentum training data. To develop greater context for the omentum data and the newly engineered features, we analyzed key values and their respective informativeness (or importance) [8] (Table 3):

Table 3: Rip Curl feature engineering statistics.
Feature Name      Importance  Unique  Missing  Mean   SD    Median  Min   Max
Binsum            92.88       9       0        1.46   2.07  1       0     8
1800_206067_s_at  84.93       2       0        0.15   0.35  0       0     1
400_216953_s_at   82.15       2       0        0.14   0.34  0       0     1
1100_219454_at    82.09       2       0        0.22   0.42  0       0     1
1300_214844_s_at  78.42       2       0        0.13   0.34  0       0     1
4000_227195_at    77.04       2       0        0.21   0.41  0       0     1
1100_213518_at    76.07       2       0        0.31   0.46  0       0     1
3500_204457_s_at  75.71       2       0        0.21   0.40  0       0     1
900_219778_at     71.29       2       0        0.09   0.29  0       0     1
Lensum            55.36       3       0        13.82  2.98  16      8     16
Mode              53.04       837     0        214    413   19.80   1.60  4152
Min               51.85       81      0        242    1.41  2.20    0.20  15.40

Figure 6: Variable importance rank generated from the Rip Curl framework.

Figure 6 shows the variable importance of the final Rip Curl model. We observed that 20% of the top informative features were generated through the Rip Curl framework: Binsum, >1800_206067_s_at, and >400_216953_s_at, with importance ranging between 27% and 93%.

GEMLeR provides an initial classification accuracy value for the omentum dataset. In the experiments designed to generate the state-of-the-art benchmark, all measurements were performed using the WEKA machine learning environment. The GEMLeR authors opted for one of the most popular machine learning methods for gene expression analysis, the Support Vector Machines – Recursive Feature Elimination (SVM-RFE) feature selection algorithm. Feature selection and classification were both performed inside a 10-fold cross-validation loop on the omentum dataset to avoid so-called selection bias [9]; Figure 7 illustrates their approach.

Figure 7: SVM-RFE Process.

Head-to-Head Comparison with SVM-RFE
Table 4 compares the SVM-RFE benchmark with the Rip Curl framework [10]:

Table 4: Comparison results of international benchmark and Rip Curl.
Model                AUC
SVM-RFE (Benchmark)  0.703
Rip Curl             0.934

Rip Curl represents a 32.92% gain in prediction accuracy over the GEMLeR benchmark for the same omentum dataset [11].
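The reported percentages can be reproduced as relative changes from the values in Tables 2 and 4. The sketch below is a reader's check, not the author's calculation. It assumes the 32.92% figure was computed from the unrounded hold-out AUC of 0.9344 in Table 2 rather than the rounded 0.934 in Table 4; the function name is illustrative.

```python
def percent_change(baseline, value):
    """Relative change from `baseline` to `value`, in percent."""
    return (value - baseline) / baseline * 100


# Gain over the SVM-RFE benchmark AUC (Table 4) using the hold-out AUC (Table 2).
benchmark_gain = percent_change(0.703, 0.9344)

# Time-to-prediction reduction versus univariate and raw features (Table 2, ms).
time_saving_low = -percent_change(8233.39, 3374.21)
time_saving_high = -percent_change(10859.25, 3374.21)

# Hold-out AUC gain versus raw and univariate features (Table 2).
auc_gain_low = percent_change(0.9269, 0.9344)
auc_gain_high = percent_change(0.9232, 0.9344)

# Hold-out Gini Norm gain versus raw and univariate features (Table 2).
gini_gain_low = percent_change(0.8537, 0.8689)
gini_gain_high = percent_change(0.8463, 0.8689)

for name, value in [
    ("benchmark AUC gain", benchmark_gain),
    ("time saving (low)", time_saving_low),
    ("time saving (high)", time_saving_high),
    ("AUC gain (low)", auc_gain_low),
    ("AUC gain (high)", auc_gain_high),
    ("Gini gain (low)", gini_gain_low),
    ("Gini gain (high)", gini_gain_high),
]:
    print(f"{name}: {value:.2f}%")
```

Note that these are relative gains; the absolute difference in AUC between Rip Curl and the benchmark is about 0.23.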
Conclusion
The Rip Curl classification framework outperformed the state-of-the-art benchmark (SVM-RFE) in the GEMLeR omentum cancer experiment. Because the framework uses entropy-based feature filtering and adds more context through feature engineering, its complexity is very low, permitting analysis of data with many features. Future research should extend the comparison beyond the omentum cancer data and explore other one-versus-all experiments for breast, colon, endometrium, kidney, lung, ovary, prostate, and uterus [12-14].

Acknowledgment
We would like to acknowledge GEMLeR for making this important dataset available to researchers and WPC Healthcare for supporting our work. Finally, the authors would like to thank the donors who participated in this study.

References
1. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y (2010) Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 26(3): 392-398.
2. Mingle D (2015) A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column. International Journal of Biomedical Data Mining.
3. Leung Y, Hung Y (2010) A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification. IEEE/ACM Trans Comput Biol Bioinform 7(1): 108-117.
4. Mohammadi A, Saraee MH, Salehi M (2011) Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Medical Genomics 4(1): 1.
5. Ding Y, Wilkins D (2006) Improving the performance of SVM-RFE to select genes in microarray data. BMC Bioinformatics 7(Suppl 2): S12.
6. Balakrishnan S, Narayanaswamy R, Savarimuthu N, Samikannu R (2008) SVM ranking with backward search for feature selection in type II diabetes databases. IEEE pp. 2628-2633.
7. Stiglic G, Kokol P (2010) Stability of ranked gene lists in large microarray analysis studies. BioMed Research International 2010: ID 616358.
8. Breiman L (2001) Random forests. Machine Learning 45(1): 5-32.
9. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 99(10): 6562-6566.
10. Duan K, Rajapakse JC (2004) SVM-RFE peak selection for cancer classification with mass spectrometry data. In APBC pp. 191-200.
11. Hu ZZ, Huang H, Wu CH, Jung M, Dritschilo A, et al. (2011) Omics-based molecular target and biomarker identification. Methods Mol Biol 547-571.
12. Shannon CE (1948) A mathematical theory of communication. Bell System Tech J 27: 379-423.
13. Stiglic G, Rodriguez JJ, Kokol P (2010) Finding optimal classifiers for small feature sets in genomics and proteomics. Neurocomputing 73(13): 2346-2352.
14. Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, et al. (2009) Outcome prediction based on microarray analysis: a critical perspective on methods. BMC Bioinformatics 10: 53.

How to cite this article: Damian R M. Controlling Informative Features for Improved Accuracy and Faster Predictions in Omentum Cancer Models. Curr Trends Biomedical Eng & Biosci. 2017; 1(2): 555559.