SlideShare ist ein Scribd-Unternehmen logo
1 von 10
BMI 214
Spring 2009

Assignment 2:
Machine learning for expression data and Genotype- phenotype
associations


Name:             _________________________

SUNet ID:         _________________________


Contents:

Part 1: Weka for machine learning in expression data
      a. Introduction
      b. Supervised learning
      c. Feature selection
      d. Unsupervised learning

Part 2: Genotype-phenotype associations
      a. Introduction
      b. Feature selection

Submission: Send your answers as a PDF document.

Note: Please show work when appropriate, and/or cite sources (e.g. for problem 20,
site source used to obtain your answer.)


Part 1: Weka for machine learning in expression data
Part 1a: Introduction
Download / familiarize yourself with Weka.

As discussed in class, Weka is a useful tool that has implemented most of the major
machine learning algorithms. In this part you will re-live Project 2, but now let Weka do
the work. It will be both a tutorial and problem set.

We will be using the Weka GUI in this assignment. There are also an excellent
command line interface and Java API, which will not be covered. Read more in the
documentation if interested.

Download and start the Weka GUI. Follow the instructions on the Weka site:
http://www.cs.waikato.ac.nz/ml/weka/

You will see four buttons; in this class we will only use the Explorer functionality.

There is much Weka documentation available. You may familiarize yourself with
Explorer as much as you like by reading the user guide and using their provided sample
datasets. Some good (optional) starting points:
Explorer guide:
http://easynews.dl.sourceforge.net/sourceforge/weka/ExplorerGuide-3.4.pdf
Weka Wiki:
http://weka.sourceforge.net/wekadoc/index.php/Main_Page
Part 1b: Supervised learning

We will be using the same leukemia data as in Project 2. Recall that that dataset
comprised 72 leukemia patients, 28 of which were AML and 44 of which were ALL.
Each patient had expression measurements for 7129 genes.

We have compiled these data into a Weka-compatible CSV file.
http://helix-web.stanford.edu/bmi214/assignment2/data/leukemia.csv

This file has the 72 leukemia patients (rows) and expression values for 150 genes
(columns). This data matrix is slightly altered from the one in Project 2: it is transposed,
and we chose a subset of 150 genes to make your life easier. The gene and experiment
names, as well as a link to the reference paper, are available in Project 2.

Open the leukemia.csv file in Weka.

Question 1. What is the mean value of expression of the gene labeled
“CD33 CD33 antigen (differentiation antigen)”?



Now we will run KNN. Go to the “classify” tab. Under “Classifier” click the “Choose”
button. Expand the “lazy” menu (yes, we use only the best!). Choose “IBk” and OK.
This is KNN: IBk stands for Instance-Based k. Now next to the button it should say
“IBk” with some parameters. Click on this text and a menu will pop up. For “KNN”, enter
5. Recall that this means the algorithm will use the five nearest neighbors to classify
each data point. Leave the rest of the values as default.
      FYI: KNN is a type of instance-based learning, also known as lazy learning, in
      which the function is only approximated locally and all computation is deferred until
      classification.
Under “Test options” choose the radio button “Cross-validation” and under “Folds” enter
5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This
means that the algorithm will classify “leukemia_type” (AML or ALL), using the
experiments as attributes.

Click the “Start’ button. The main window will show a variety of result summary
statistics, such as accuracy, true positives, false positives, and a confusion matrix.

Question 2a. What is the % of correctly classified instances?



ROC curves, as discussed in lecture, illustrate the tradeoffs between sensitivity and
specificity. Roc curves plot Sensitivity vs. 1 – Specificity, or TP (true positive) rate vs.
FP (false positive) rate. Recall that

TP rate = Sensitivity = TP / (TP + FN)
FP rate = 1 – Specificity = FP / (TN + FN)
Specificity = 1 – FP rate = TN / (FP + TN)

Where TP = number of true positives and FP = number of false positives.

Another way to think about it is:
TP rate = # true positives / # of positive instances
          (The true positive rate is the proportion of actual positive instances that were
          correctly reported as being positive.)
FP rate = # false positives / # of negative instances
          (The false positive rate is the proportion of negative instances that were
          erroneously reported as being positive.)


Question 2b. What are the TP and FP rates for ALL and AML?
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Optional: Right click on your result in the “Result list” on the left side of the screen.
Choose “visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP)
rate vs. false positive (FP) rate, which are the defaults. You can also view other types of
curves by clicking the dropdown menus. For example, precision-recall curves are an
alternative to ROC curves; precision and recall are options in the dropdown menu.

Recall that we can trivially achieve a TP rate of 1 for AML by classifying all of the
patients as positive (AML). That would correspond to a TP rate of 0 for ALL, since none
of the patients could then be classified ALL.

Question 3. Now we’ll try a different classification algorithm, ZeroR. Click
the Choose button under Classifier, and expand the “rules” folder. Choose
“ZeroR”. Again use cross- validation with Folds=5. Run it.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Explanation: ZeroR is a baseline classifier that simply identifies the class that is most
abundant (in this case, ALL with 44 patients), and predicts all variables (patients) to be
in that class. This illustrates that when the variables are unevenly assigned to classes, it
can be a problem to simply look at accuracy. For example, if you have a dataset of 100
variables (patients), where 90 were ALL, and 10 were AML, then ZeroR will predict them
all to be ALL, with 90% correctly classified (90% accuracy). Another classifier might
have 85% accuracy, which looks pretty good until you compare it with ZeroR. You can
get around the problem of uneven class sizes by weighting, a topic which we won’t cover
here. (But check out the “CostSensitiveClassifier” and “Cost Sensitive evaluation” if
you’re interested.)

Stop for a moment and think about the meanings of each of these terms (ALL TP rate,
ALL FP rate, AML TP rate, AML FP rate), and why you got the answers you wrote down,
when using ZeroR.

We’ll try a few more algorithms on this dataset.

Question 4. Under the “bayes” Classifier folder, choose “NaiveBayes” and
run.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Question 5. Under the “functions” Classifier folder, choose “SMO” (a
popular implementation of SVM) and run.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =


Question 6. Under the “trees” Classifier folder, choose “ADTree” (a
popular implementation of a decision tree) and run.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =


You can read about the various other algorithms in the documentation if interested.


Part 1c: Feature selection (aka attribute selection)

We will now use feature selection algorithms to extract the most “informative” genes for
classifying AML vs ALL. As mentioned in lecture, this is a useful real world exercise
because many times we are interested not in measuring all the gene expressions, but in
finding one or a few whose expression level can be used as a marker for the disease.
The general area of “Biomarkers” is economically important because it may lead to
diagnostic tests that can be marketed to clinical laboratories.

If interested, you can learn more about the theory behind feature selection in machine
learning textbooks and online resources. Wikipedia is always a good place to start.
http://en.wikipedia.org/wiki/Feature_selection

First we will choose a subset of five genes arbitrarily – the first five. To do this in Weka,
go back to the “Preprocess” tab and click the checkbox next to the first five genes (Zyxin
through RNS2). Scroll down to the bottom of the gene list and click “leukemia_type”,
which is the class label (AML or ALL). Check this box, too, so that there are 6 total
boxes checked. Click the “Invert” button above the list, so that the other 145 boxes
become checked. Click the “Remove” button below the list, so that you are left with 5
genes and leukemia_type.

Go to the “Classify” tab again. Classify using IBk and K = 5.

Question 7. What is the % correctly classified instances using just the
first five genes?



Re-load the dataset with “Open file…” to get all 150 genes back.

We’ll now find five genes that are “informative” according to some metric. Choose the
“Select attributes” tab at the top of the screen. Under ‘Attribute Evaluator”, choose the
algorithm called “InfoGainAttributeEval” and Search Method “Ranker” and Start. The
output shows the genes ranked by information gain, which is one possible metric for
measuring the “goodness” of a partition. The genes at the top of the list are the most
“informative” as attributes for classification.

Question 8. What are the top five genes in this output?




Back under “Preprocess”, remove all attributes except the top five genes from the
previous question, in the same manner as described above, so that you are left with the
five genes from Question 8 and the class attribute leukemia_type. Under “Classify”, run
the exact same algorithm as before (IBk, K=5).

Question 9. What is the % of correctly classified instances?




Part 1d: Unsupervised learning
K-means clustering

This part will use the same dataset that you used for Project 2 k-means clustering.
Recall that that dataset comprised expression profiles for 2467 genes across 79
experiments.

Download a Weka-compatible CSV file at
http://helix-web.stanford.edu/bmi214/assignment2/data/yeast.dat.csv

The rows are genes and the columns are experiments. The gene and experiment
names, as well as a link to the reference paper, is available on the Project 2 instructions.

In this particular file, we have added one additional column called “ribosomal” with value
Y or N, to indicate whether a gene is ribosomal.

Question 10. What is the mean value of expression of the experiment
labeled “alpha 21”?



Now we will run K-means clustering. Go to the Cluster tab. Under Clusterer choose
SimpleKMeans. Click the text “SimpleKMeans” and set numClusters to 2. “Cluster
mode” should be “Use training set”. Run it.

Question 11. How many genes in each of the two clusters?




An optional exercise for those interested -- If you want to see which genes were
assigned to which clusters, right-click on the result in the Result list and choose
“Visualize cluster assignments”. Click the Save button and choose a location. This will
save a file in ARFF format (the Weka format), where the last value on each line will be a
cluster assignment. (See the Weka documentation for more on ARFF format).


Question 12. Recall that K-means depends on a random starting seed.
Change this seed by clicking on the text “SimpleKMeans” and changing the
“seed” value to 15. How many genes are now in the two clusters?
Question 13. Try adjusting the seed value to various values (e.g. 1.2, 5,
20, 100). How much do the numbers of genes in each cluster change?
What might this say about the genes and their assignments to clusters?
(I.e. are the assignments robust or weak?)




Question 14. Now under “Cluster mode”, choose “Classes to clusters
evaluation” and run. How many of the 121 ribosomal genes are assigned
to the same cluster?



Question 15. How many other genes are assigned to that cluster?




(If you were interested in what those genes were, you might save the ARFF output file
as described above, extract the genes in that cluster, and enter them in the GO Term
Finder, to get a feel for their functions. That’s left as an optional exercise.)


Question 16. For completeness, we’ll now classify the yeast data. Under
the Classify tab, again choose the IBk algorithm, and use KNN=5. Use
cross- validation with Folds=5 and run.

% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =


Question 17. Run as in the previous question but using trivial classifier
ZeroR, which was discussed above.

% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =


Part 2:
Genotype/Phenotype
Part 2a: Introduction

Researching and answering these introductory questions will help you understand the
genotype-phenotype section below.

Question 18. What does diploid mean? How many copies of every
chromosome are there in one cell of a diploid organism? How many
copies of every gene?



Question 19. What is an “allele”, with respect to a single SNP (Single
Nucleotide Polymorphism)? How many alleles are theoretically possible
for one SNP locus? Usually, how many alternate forms actually arise in
life for one SNP locus?
(For reference you can refer to Genomes, 2 nd ed. Available for free online
at:
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes )




Counting the number of SNPs in the human genome depends on your definition for a
SNP. For example, there may be some bases where only a few people in the world
have a variant allele. For most purposes, it is not useful to call these bases SNPs.
Thus, to define SNPs, we refer to the “minor allele frequency”, which is the percent of
the population having the uncommon allele. The minor allele frequency can range from
0 to 50%.

Question 20. Approximately how many SNPs are there in the human
genome (i.e. how many bases of the ~3 billion base pairs in the human
genome are SNPs), assuming at least a 1% minor allele frequency?
Explain your answer.




Question 21. SNPs are one type of genetic variation – what other types
exist?




Question 22. What is penetrance, in terms of phenotype? What does 50%
penetrance mean? 100%? (three sentences or less)




Question 23. What is a haplotype? How is it related to linkage
disequilibrium? (three sentences or less)




Part 2b: Feature selection (aka attribute selection)

Background:

With genome sequencing cost drastically falling, we will soon be able to sequence
millions of people cheaply. With this wealth of new data, we will be able to associate
genotypes with phenotypes (e.g. diseases or drug responses). The goal is to find the
genes (or more specifically, the SNPs) that are found in conjunction with certain
phenotypes. We will be using a semi-synthetic dataset to explore this type of study.
The dataset can be found at
http://helix-web.stanford.edu/bmi214/assignment2/data/genotenureitus1.arff

Dataset description:

Many past genotype-phenotype association studies only examined a single gene
hypothesized to cause the phenotype (disease). In reality, there are complex
interactions of genes that cause disease. We will use machine learning to find such
complex associations, which is a very cutting-edge research area.

We will use Weka for this section. Open the file “genotenureitis1.arff”. Instead of CSV,
this file is in Weka’s ARFF format, which is described at
http://www.cs.waikato.ac.nz/~ml/weka/arff.html.

This file contains genotype/phenotype data for 557 subjects (people). Each row in the
data section represents one subject (person).

The (comma-delimited) columns are the attributes of the subjects. The first column is
the person’s identifier.

The next 188 columns represent genotypes, which in this case are SNPs (single
nucleotide polymorphisms). The SNPs were identified from three regions in the
genome. The SNPs are named
     RjSNPi
where j is the region (1,2, or 3), and i is the SNP number within that region. The values
each SNP can take are {11, 12, 22}.

Recall that humans are diploid. In this file, a value of “11” for a subject’s SNP means
that the both copies are allele 1 (where allele 1’s possible values are the nucleotides
{A,T,G,C}; the actual allele isn’t specified.). “12” means that one was allele 1, and the
other was allele 2. “22” means that both were allele 2. (Allele 2 would be a nucleotide
other than allele 1.)

The last 20 columns in the file represent (artificially created) phenotypes. Some of the
phenotypes, for example, are
   - “degree”, with possible values MD, MD/PhD, or PhD
   - “npubs” (number of publications), with possible values 4 through 12.
   - “gotgrants” (total grant dollars earned, in millions), with possible values 0 through
       19.

Note that the genotype data are real, but the phenotype data are artificial. A more
lengthy (and far more humorous) description of the phenotypes can be found here:
http://helix-web.stanford.edu/bmi214/assignment2/data/pgrn_2005.pdf
This is an optional read; it is not necessary for understanding this assignment. This
document and the genotenuritis data file were created for a pharmacogenetics
conference.


Data filtering

Back to Weka and genotenureitis1.arff. Scroll to the bottom of the attribute names.
Examine the distributions for the attribute “irep” by clicking on it.

Question 24. How many class labels exist for the attribute irep?
The attribute irep is not useful to us; we will remove it and other useless fields with a
filter. In the “Preprocess” tab, under “Filter”, choose unsupervised->attribute-
>RemoveUseless, and click Apply. This will remove the “irep” attribute and the “ID”
attribute(which actually appeared twice – attribute number 1 and 192). You can undo
operations like this one by clicking the Undo button at the top.

If you don’t apply this filter, you may get erroneous results in the following sections.


Genotype- genotype associations

First we’ll examine whether SNPs are associated with each other. Recall the discussion
in class about haplotypes and linkage disequilibrium.

Question 25. Under the “Associate” tab, run the default associator
(Apriori). Copy here the output under “Best rules found:” (This should be
nine SNP association pairs and one phenotype- SNP pair.)




Question 26.     Within all but one SNP- SNP pair, the two SNPs have
something in    common. What is it, and what does it mean biologically?
Hint: look at   the SNP names and the explanation of SNP names above.
(3 sentences    or less.)




Attribute selection

We now perform attribute selection to find which SNPs best predict some of the
phenotypes.

Phenotype 1: gotgrants
The first phenotype we’ll investigate is “gotgrants”. We want to know whether there are
any SNPs related to the amount of grant money earned.

We disclose that the first phenotype was created artificially as a simple linear function of
the penetrance of a single SNP. You will try to discover what this SNP was.

Choose “gotgrants” from the dropdown menu on the left. Under the “Select attributes”
tab, choose the GainRatioAttributeEval evaluator with Ranker Search Method. Use
cross-validation with Folds=5. Run.

Question 27. What is the top SNP (genotype ) in this feature selection
result?
(Paste the line from weka)
Question 28. Run multiple Attribute Evaluators with multiple Search
Methods; is the top gene the same in general?
(Note that some Evaluators require specific Search Methods; see the error log and/or
documentation if you get funny results.)



Question 29. Look at the top 80 or so SNPs. Do you see anything
interesting about them? What does this mean biologically? (3 sentences
or less.)




Phenotype 2: pctdrivel
The next phenotype is “pctdrivel”; again, we’ll investigate whether there are any SNPs
that are good predictors of this phenotype.

This phenotype was created artificially as a function of two other phenotypes. The
inclusion of other phenotypes represents the fact that environment can impact
phenotype, not just genetic makeup (cf. the nature vs. nurture debate).

Run multiple Attribute Evaluators with multiple Search Methods.

Question 30. What do you think are the two phenotypes affecting pctdrivel
and why? A well- justified answer will be accepted. (3 sentences or less.)

Weitere ähnliche Inhalte

Andere mochten auch

Curso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar Serrano
Curso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar SerranoCurso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar Serrano
Curso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar SerranoMiguel Aguilar
 
Estrategias competitivas básicas
Estrategias competitivas básicasEstrategias competitivas básicas
Estrategias competitivas básicasLarryJimenez
 
ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009
ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009
ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009aliloinomedle
 
Globalization by Dr. mani kansal
Globalization by Dr. mani kansalGlobalization by Dr. mani kansal
Globalization by Dr. mani kansalDrMani Kansal
 
Proyecto formativo
Proyecto formativoProyecto formativo
Proyecto formativojhonfospino
 
Reglamento interior derecho de huelga derecho laboral.
Reglamento interior derecho de huelga derecho laboral.Reglamento interior derecho de huelga derecho laboral.
Reglamento interior derecho de huelga derecho laboral.Vero Pacheco
 
Training and development
Training and developmentTraining and development
Training and developmentdanny688
 
Total Quality Service Management Book 1
Total Quality Service Management Book 1Total Quality Service Management Book 1
Total Quality Service Management Book 1aireen clores
 
Aprovechamiento rso uencolombia
Aprovechamiento rso uencolombiaAprovechamiento rso uencolombia
Aprovechamiento rso uencolombiaRaychi Rodriguez
 
Logistica y Cadena de Suministros
Logistica y Cadena de SuministrosLogistica y Cadena de Suministros
Logistica y Cadena de SuministrosGonzalo Lagunes
 
Guia de aprendizaje diagnostico
Guia de aprendizaje  diagnosticoGuia de aprendizaje  diagnostico
Guia de aprendizaje diagnosticosamy meza alvarez
 
Guia informaticabasica
Guia informaticabasicaGuia informaticabasica
Guia informaticabasicalinda31
 
Gfpi f-019 guia de aprendizaje 01 tda orientar fpi
Gfpi f-019 guia de aprendizaje 01 tda orientar fpiGfpi f-019 guia de aprendizaje 01 tda orientar fpi
Gfpi f-019 guia de aprendizaje 01 tda orientar fpilisbet bravo
 
Guia fund. teoricos Elaboracion Guias Aprendizaje
Guia fund. teoricos Elaboracion Guias AprendizajeGuia fund. teoricos Elaboracion Guias Aprendizaje
Guia fund. teoricos Elaboracion Guias AprendizajeEduardo Cuervo
 

Andere mochten auch (20)

Curso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar Serrano
Curso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar SerranoCurso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar Serrano
Curso Gestión de Procesos FEB.2014 - Dr. Miguel Aguilar Serrano
 
Estrategias competitivas básicas
Estrategias competitivas básicasEstrategias competitivas básicas
Estrategias competitivas básicas
 
Portafolio virtual
Portafolio virtual Portafolio virtual
Portafolio virtual
 
ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009
ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009
ANÁLISIS DEL ANALFABETISMO EN EL SALVADOR EN EL PERIODO 2000- 2009
 
Titulo catorce Derecho procesal de trabajo
Titulo catorce Derecho procesal de trabajoTitulo catorce Derecho procesal de trabajo
Titulo catorce Derecho procesal de trabajo
 
Globalization by Dr. mani kansal
Globalization by Dr. mani kansalGlobalization by Dr. mani kansal
Globalization by Dr. mani kansal
 
Proyecto formativo
Proyecto formativoProyecto formativo
Proyecto formativo
 
Reglamento interior derecho de huelga derecho laboral.
Reglamento interior derecho de huelga derecho laboral.Reglamento interior derecho de huelga derecho laboral.
Reglamento interior derecho de huelga derecho laboral.
 
Training and development
Training and developmentTraining and development
Training and development
 
Derecho alimentario en el peru
Derecho alimentario en el peruDerecho alimentario en el peru
Derecho alimentario en el peru
 
Total Quality Service Management Book 1
Total Quality Service Management Book 1Total Quality Service Management Book 1
Total Quality Service Management Book 1
 
Aprovechamiento rso uencolombia
Aprovechamiento rso uencolombiaAprovechamiento rso uencolombia
Aprovechamiento rso uencolombia
 
Sist. gestión de calidad
Sist. gestión de calidadSist. gestión de calidad
Sist. gestión de calidad
 
Logistica y Cadena de Suministros
Logistica y Cadena de SuministrosLogistica y Cadena de Suministros
Logistica y Cadena de Suministros
 
Fijación de precios
Fijación de preciosFijación de precios
Fijación de precios
 
Guia de aprendizaje diagnostico
Guia de aprendizaje  diagnosticoGuia de aprendizaje  diagnostico
Guia de aprendizaje diagnostico
 
Guia informaticabasica
Guia informaticabasicaGuia informaticabasica
Guia informaticabasica
 
Guia 01
Guia 01Guia 01
Guia 01
 
Gfpi f-019 guia de aprendizaje 01 tda orientar fpi
Gfpi f-019 guia de aprendizaje 01 tda orientar fpiGfpi f-019 guia de aprendizaje 01 tda orientar fpi
Gfpi f-019 guia de aprendizaje 01 tda orientar fpi
 
Guia fund. teoricos Elaboracion Guias Aprendizaje
Guia fund. teoricos Elaboracion Guias AprendizajeGuia fund. teoricos Elaboracion Guias Aprendizaje
Guia fund. teoricos Elaboracion Guias Aprendizaje
 

Ähnlich wie BMI214_Assignment2_S..

Eastman_MI530_FinalProjectReport
Eastman_MI530_FinalProjectReportEastman_MI530_FinalProjectReport
Eastman_MI530_FinalProjectReportNicholas Eastman
 
Lab 10.doc
Lab 10.docLab 10.doc
Lab 10.docbutest
 
Lab 10.doc
Lab 10.docLab 10.doc
Lab 10.docbutest
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfDatacademy.ai
 
Factor analysis using spss 2005
Factor analysis using spss 2005Factor analysis using spss 2005
Factor analysis using spss 2005jamescupello
 
1 way ANOVA(Analysis Of VAriance)
1 way ANOVA(Analysis Of VAriance)1 way ANOVA(Analysis Of VAriance)
1 way ANOVA(Analysis Of VAriance)musadoto
 
20080609 Loinc Workshop
20080609   Loinc Workshop20080609   Loinc Workshop
20080609 Loinc Workshopdvreeman
 
analysis part 02.pptx
analysis part 02.pptxanalysis part 02.pptx
analysis part 02.pptxefrembeyene4
 
DNA and Genes Lab ActivityComplete your answers in the spaces .docx
DNA and Genes Lab ActivityComplete your answers in the spaces .docxDNA and Genes Lab ActivityComplete your answers in the spaces .docx
DNA and Genes Lab ActivityComplete your answers in the spaces .docxjacksnathalie
 
Factor analysis using SPSS
Factor analysis using SPSSFactor analysis using SPSS
Factor analysis using SPSSRemas Mohamed
 
1. F A Using S P S S1 (Saq.Sav) Q Ti A
1.  F A Using  S P S S1 (Saq.Sav)   Q Ti A1.  F A Using  S P S S1 (Saq.Sav)   Q Ti A
1. F A Using S P S S1 (Saq.Sav) Q Ti AZoha Qureshi
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)Abhimanyu Dwivedi
 
How long should Offspring Lifespan be in order to obtain a proper exploration?
How long should Offspring Lifespan be in order to obtain a proper exploration?How long should Offspring Lifespan be in order to obtain a proper exploration?
How long should Offspring Lifespan be in order to obtain a proper exploration?Mario Pavone
 
Assessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's GuideAssessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's GuideMegan Verbakel
 
Part 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expressionPart 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expressionJoachim Jacob
 
Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)Abhimanyu Dwivedi
 
Advanced BLAST (BlastP, PSI-BLAST)
Advanced BLAST (BlastP, PSI-BLAST)Advanced BLAST (BlastP, PSI-BLAST)
Advanced BLAST (BlastP, PSI-BLAST)Syed Lokman
 

Ähnlich wie BMI214_Assignment2_S.. (20)

Eastman_MI530_FinalProjectReport
Eastman_MI530_FinalProjectReportEastman_MI530_FinalProjectReport
Eastman_MI530_FinalProjectReport
 
Lab 10.doc
Lab 10.docLab 10.doc
Lab 10.doc
 
Lab 10.doc
Lab 10.docLab 10.doc
Lab 10.doc
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Factor analysis using spss 2005
Factor analysis using spss 2005Factor analysis using spss 2005
Factor analysis using spss 2005
 
1 way ANOVA(Analysis Of VAriance)
1 way ANOVA(Analysis Of VAriance)1 way ANOVA(Analysis Of VAriance)
1 way ANOVA(Analysis Of VAriance)
 
20080609 Loinc Workshop
20080609   Loinc Workshop20080609   Loinc Workshop
20080609 Loinc Workshop
 
analysis part 02.pptx
analysis part 02.pptxanalysis part 02.pptx
analysis part 02.pptx
 
DNA and Genes Lab ActivityComplete your answers in the spaces .docx
DNA and Genes Lab ActivityComplete your answers in the spaces .docxDNA and Genes Lab ActivityComplete your answers in the spaces .docx
DNA and Genes Lab ActivityComplete your answers in the spaces .docx
 
Factor analysis using SPSS
Factor analysis using SPSSFactor analysis using SPSS
Factor analysis using SPSS
 
1. F A Using S P S S1 (Saq.Sav) Q Ti A
1.  F A Using  S P S S1 (Saq.Sav)   Q Ti A1.  F A Using  S P S S1 (Saq.Sav)   Q Ti A
1. F A Using S P S S1 (Saq.Sav) Q Ti A
 
Analyze Your Modules
Analyze Your ModulesAnalyze Your Modules
Analyze Your Modules
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
How long should Offspring Lifespan be in order to obtain a proper exploration?
How long should Offspring Lifespan be in order to obtain a proper exploration?How long should Offspring Lifespan be in order to obtain a proper exploration?
How long should Offspring Lifespan be in order to obtain a proper exploration?
 
Gadoc
GadocGadoc
Gadoc
 
Assessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's GuideAssessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's Guide
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Part 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expressionPart 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expression
 
Machine learning session5(logistic regression)
Machine learning   session5(logistic regression)Machine learning   session5(logistic regression)
Machine learning session5(logistic regression)
 
Advanced BLAST (BlastP, PSI-BLAST)
Advanced BLAST (BlastP, PSI-BLAST)Advanced BLAST (BlastP, PSI-BLAST)
Advanced BLAST (BlastP, PSI-BLAST)
 

Mehr von butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mehr von butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

BMI214_Assignment2_S..

  • 1. BMI 214 Spring 2009 Assignment 2: Machine learning for expression data and Genotype- phenotype associations Name: _________________________ SUNet ID: _________________________ Contents: Part 1: Weka for machine learning in expression data a. Introduction b. Supervised learning c. Feature selection d. Unsupervised learning Part 2: Genotype-phenotype associations a. Introduction b. Feature selection Submission: Send your answers as a PDF document. Note: Please show work when appropriate, and/or cite sources (e.g. for problem 20, site source used to obtain your answer.) Part 1: Weka for machine learning in expression data Part 1a: Introduction Download / familiarize yourself with Weka. As discussed in class, Weka is a useful tool that has implemented most of the major machine learning algorithms. In this part you will re-live Project 2, but now let Weka do the work. It will be both a tutorial and problem set. We will be using the Weka GUI in this assignment. There are also an excellent command line interface and Java API, which will not be covered. Read more in the documentation if interested. Download and start the Weka GUI. Follow the instructions on the Weka site: http://www.cs.waikato.ac.nz/ml/weka/ You will see four buttons; in this class we will only use the Explorer functionality. There is much Weka documentation available. You may familiarize yourself with Explorer as much as you like by reading the user guide and using their provided sample datasets. Some good (optional) starting points: Explorer guide: http://easynews.dl.sourceforge.net/sourceforge/weka/ExplorerGuide-3.4.pdf Weka Wiki: http://weka.sourceforge.net/wekadoc/index.php/Main_Page
  • 2. Part 1b: Supervised learning We will be using the same leukemia data as in Project 2. Recall that that dataset comprised 72 leukemia patients, 28 of which were AML and 44 of which were ALL. Each patient had expression measurements for 7129 genes. We have compiled these data into a Weka-compatible CSV file. http://helix-web.stanford.edu/bmi214/assignment2/data/leukemia.csv This file has the 72 leukemia patients (rows) and expression values for 150 genes (columns). This data matrix is slightly altered from the one in Project 2: it is transposed, and we chose a subset of 150 genes to make your life easier. The gene and experiment names, as well as a link to the reference paper, are available in Project 2. Open the leukemia.csv file in Weka. Question 1. What is the mean value of expression of the gene labeled “CD33 CD33 antigen (differentiation antigen)”? Now we will run KNN. Go to the “classify” tab. Under “Classifier” click the “Choose” button. Expand the “lazy” menu (yes, we use only the best!). Choose “IBk” and OK. This is KNN: IBk stands for Instance-Based k. Now next to the button it should say “IBk” with some parameters. Click on this text and a menu will pop up. For “KNN”, enter 5. Recall that this means the algorithm will use the five nearest neighbors to classify each data point. Leave the rest of the values as default. FYI: KNN is a type of instance-based learning, also known as lazy learning, in which the function is only approximated locally and all computation is deferred until classification. Under “Test options” choose the radio button “Cross-validation” and under “Folds” enter 5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This means that the algorithm will classify “leukemia_type” (AML or ALL), using the experiments as attributes. Click the “Start’ button. The main window will show a variety of result summary statistics, such as accuracy, true positives, false positives, and a confusion matrix. Question 2a. What is the % of correctly classified instances? ROC curves, as discussed in lecture, illustrate the tradeoffs between sensitivity and specificity. Roc curves plot Sensitivity vs. 1 – Specificity, or TP (true positive) rate vs. FP (false positive) rate. Recall that TP rate = Sensitivity = TP / (TP + FN) FP rate = 1 – Specificity = FP / (TN + FN) Specificity = 1 – FP rate = TN / (FP + TN) Where TP = number of true positives and FP = number of false positives. Another way to think about it is: TP rate = # true positives / # of positive instances (The true positive rate is the proportion of actual positive instances that were correctly reported as being positive.)
  • 3. FP rate = # false positives / # of negative instances (The false positive rate is the proportion of negative instances that were erroneously reported as being positive.) Question 2b. What are the TP and FP rates for ALL and AML? ALL TP Rate = ALL FP Rate = AML TP Rate = AML FP Rate = Optional: Right click on your result in the “Result list” on the left side of the screen. Choose “visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP) rate vs. false positive (FP) rate, which are the defaults. You can also view other types of curves by clicking the dropdown menus. For example, precision-recall curves are an alternative to ROC curves; precision and recall are options in the dropdown menu. Recall that we can trivially achieve a TP rate of 1 for AML by classifying all of the patients as positive (AML). That would correspond to a TP rate of 0 for ALL, since none of the patients could then be classified ALL. Question 3. Now we’ll try a different classification algorithm, ZeroR. Click the Choose button under Classifier, and expand the “rules” folder. Choose “ZeroR”. Again use cross- validation with Folds=5. Run it. % correctly classified instances = ALL TP Rate = ALL FP Rate = AML TP Rate = AML FP Rate = Explanation: ZeroR is a baseline classifier that simply identifies the class that is most abundant (in this case, ALL with 44 patients), and predicts all variables (patients) to be in that class. This illustrates that when the variables are unevenly assigned to classes, it can be a problem to simply look at accuracy. For example, if you have a dataset of 100 variables (patients), where 90 were ALL, and 10 were AML, then ZeroR will predict them all to be ALL, with 90% correctly classified (90% accuracy). Another classifier might have 85% accuracy, which looks pretty good until you compare it with ZeroR. You can get around the problem of uneven class sizes by weighting, a topic which we won’t cover here. (But check out the “CostSensitiveClassifier” and “Cost Sensitive evaluation” if you’re interested.) Stop for a moment and think about the meanings of each of these terms (ALL TP rate, ALL FP rate, AML TP rate, AML FP rate), and why you got the answers you wrote down, when using ZeroR. We’ll try a few more algorithms on this dataset. Question 4. Under the “bayes” Classifier folder, choose “NaiveBayes” and run. % correctly classified instances = ALL TP Rate = ALL FP Rate = AML TP Rate = AML FP Rate =
  • 4. Question 5. Under the “functions” Classifier folder, choose “SMO” (a popular implementation of SVM) and run. % correctly classified instances = ALL TP Rate = ALL FP Rate = AML TP Rate = AML FP Rate = Question 6. Under the “trees” Classifier folder, choose “ADTree” (a popular implementation of a decision tree) and run. % correctly classified instances = ALL TP Rate = ALL FP Rate = AML TP Rate = AML FP Rate = You can read about the various other algorithms in the documentation if interested. Part 1c: Feature selection (aka attribute selection) We will now use feature selection algorithms to extract the most “informative” genes for classifying AML vs ALL. As mentioned in lecture, this is a useful real world exercise because many times we are interested not in measuring all the gene expressions, but in finding one or a few whose expression level can be used as a marker for the disease. The general area of “Biomarkers” is economically important because it may lead to diagnostic tests that can be marketed to clinical laboratories. If interested, you can learn more about the theory behind feature selection in machine learning textbooks and online resources. Wikipedia is always a good place to start. http://en.wikipedia.org/wiki/Feature_selection First we will choose a subset of five genes arbitrarily – the first five. To do this in Weka, go back to the “Preprocess” tab and click the checkbox next to the first five genes (Zyxin through RNS2). Scroll down to the bottom of the gene list and click “leukemia_type”, which is the class label (AML or ALL). Check this box, too, so that there are 6 total boxes checked. Click the “Invert” button above the list, so that the other 145 boxes become checked. Click the “Remove” button below the list, so that you are left with 5 genes and leukemia_type. Go to the “Classify” tab again. Classify using IBk and K = 5. Question 7. What is the % correctly classified instances using just the first five genes? Re-load the dataset with “Open file…” to get all 150 genes back. We’ll now find five genes that are “informative” according to some metric. Choose the “Select attributes” tab at the top of the screen. Under ‘Attribute Evaluator”, choose the algorithm called “InfoGainAttributeEval” and Search Method “Ranker” and Start. The output shows the genes ranked by information gain, which is one possible metric for measuring the “goodness” of a partition. The genes at the top of the list are the most
  • 5. “informative” as attributes for classification. Question 8. What are the top five genes in this output? Back under “Preprocess”, remove all attributes except the top five genes from the previous question, in the same manner as described above, so that you are left with the five genes from Question 8 and the class attribute leukemia_type. Under “Classify”, run the exact same algorithm as before (IBk, K=5). Question 9. What is the % of correctly classified instances? Part 1d: Unsupervised learning K-means clustering This part will use the same dataset that you used for Project 2 k-means clustering. Recall that that dataset comprised expression profiles for 2467 genes across 79 experiments. Download a Weka-compatible CSV file at http://helix-web.stanford.edu/bmi214/assignment2/data/yeast.dat.csv The rows are genes and the columns are experiments. The gene and experiment names, as well as a link to the reference paper, is available on the Project 2 instructions. In this particular file, we have added one additional column called “ribosomal” with value Y or N, to indicate whether a gene is ribosomal. Question 10. What is the mean value of expression of the experiment labeled “alpha 21”? Now we will run K-means clustering. Go to the Cluster tab. Under Clusterer choose SimpleKMeans. Click the text “SimpleKMeans” and set numClusters to 2. “Cluster mode” should be “Use training set”. Run it. Question 11. How many genes in each of the two clusters? An optional exercise for those interested -- If you want to see which genes were assigned to which clusters, right-click on the result in the Result list and choose “Visualize cluster assignments”. Click the Save button and choose a location. This will save a file in ARFF format (the Weka format), where the last value on each line will be a cluster assignment. (See the Weka documentation for more on ARFF format). Question 12. Recall that K-means depends on a random starting seed. Change this seed by clicking on the text “SimpleKMeans” and changing the “seed” value to 15. How many genes are now in the two clusters?
  • 6. Question 13. Try adjusting the seed value to various values (e.g. 1.2, 5, 20, 100). How much do the numbers of genes in each cluster change? What might this say about the genes and their assignments to clusters? (I.e. are the assignments robust or weak?) Question 14. Now under “Cluster mode”, choose “Classes to clusters evaluation” and run. How many of the 121 ribosomal genes are assigned to the same cluster? Question 15. How many other genes are assigned to that cluster? (If you were interested in what those genes were, you might save the ARFF output file as described above, extract the genes in that cluster, and enter them in the GO Term Finder, to get a feel for their functions. That’s left as an optional exercise.) Question 16. For completeness, we’ll now classify the yeast data. Under the Classify tab, again choose the IBk algorithm, and use KNN=5. Use cross- validation with Folds=5 and run. % correctly classified instances = Class N TP Rate = Class N FP Rate = Class Y TP Rate = Class Y FP Rate = Question 17. Run as in the previous question but using trivial classifier ZeroR, which was discussed above. % correctly classified instances = Class N TP Rate = Class N FP Rate = Class Y TP Rate = Class Y FP Rate = Part 2: Genotype/Phenotype Part 2a: Introduction Researching and answering these introductory questions will help you understand the genotype-phenotype section below. Question 18. What does diploid mean? How many copies of every
  • 7. chromosome are there in one cell of a diploid organism? How many copies of every gene? Question 19. What is an “allele”, with respect to a single SNP (Single Nucleotide Polymorphism)? How many alleles are theoretically possible for one SNP locus? Usually, how many alternate forms actually arise in life for one SNP locus? (For reference you can refer to Genomes, 2 nd ed. Available for free online at: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes ) Counting the number of SNPs in the human genome depends on your definition for a SNP. For example, there may be some bases where only a few people in the world have a variant allele. For most purposes, it is not useful to call these bases SNPs. Thus, to define SNPs, we refer to the “minor allele frequency”, which is the percent of the population having the uncommon allele. The minor allele frequency can range from 0 to 50%. Question 20. Approximately how many SNPs are there in the human genome (i.e. how many bases of the ~3 billion base pairs in the human genome are SNPs), assuming at least a 1% minor allele frequency? Explain your answer. Question 21. SNPs are one type of genetic variation – what other types exist? Question 22. What is penetrance, in terms of phenotype? What does 50% penetrance mean? 100%? (three sentences or less) Question 23. What is a haplotype? How is it related to linkage disequilibrium? (three sentences or less) Part 2b: Feature selection (aka attribute selection) Background: With genome sequencing cost drastically falling, we will soon be able to sequence millions of people cheaply. With this wealth of new data, we will be able to associate
  • 8. genotypes with phenotypes (e.g. diseases or drug responses). The goal is to find the genes (or more specifically, the SNPs) that are found in conjunction with certain phenotypes. We will be using a semi-synthetic dataset to explore this type of study. The dataset can be found at http://helix-web.stanford.edu/bmi214/assignment2/data/genotenureitus1.arff Dataset description: Many past genotype-phenotype association studies only examined a single gene hypothesized to cause the phenotype (disease). In reality, there are complex interactions of genes that cause disease. We will use machine learning to find such complex associations, which is a very cutting-edge research area. We will use Weka for this section. Open the file “genotenureitis1.arff”. Instead of CSV, this file is in Weka’s ARFF format, which is described at http://www.cs.waikato.ac.nz/~ml/weka/arff.html. This file contains genotype/phenotype data for 557 subjects (people). Each row in the data section represents one subject (person). The (comma-delimited) columns are the attributes of the subjects. The first column is the person’s identifier. The next 188 columns represent genotypes, which in this case are SNPs (single nucleotide polymorphisms). The SNPs were identified from three regions in the genome. The SNPs are named RjSNPi where j is the region (1,2, or 3), and i is the SNP number within that region. The values each SNP can take are {11, 12, 22}. Recall that humans are diploid. In this file, a value of “11” for a subject’s SNP means that the both copies are allele 1 (where allele 1’s possible values are the nucleotides {A,T,G,C}; the actual allele isn’t specified.). “12” means that one was allele 1, and the other was allele 2. “22” means that both were allele 2. (Allele 2 would be a nucleotide other than allele 1.) The last 20 columns in the file represent (artificially created) phenotypes. Some of the phenotypes, for example, are - “degree”, with possible values MD, MD/PhD, or PhD - “npubs” (number of publications), with possible values 4 through 12. - “gotgrants” (total grant dollars earned, in millions), with possible values 0 through 19. Note that the genotype data are real, but the phenotype data are artificial. A more lengthy (and far more humorous) description of the phenotypes can be found here: http://helix-web.stanford.edu/bmi214/assignment2/data/pgrn_2005.pdf This is an optional read; it is not necessary for understanding this assignment. This document and the genotenuritis data file were created for a pharmacogenetics conference. Data filtering Back to Weka and genotenureitis1.arff. Scroll to the bottom of the attribute names. Examine the distributions for the attribute “irep” by clicking on it. Question 24. How many class labels exist for the attribute irep?
  • 9. The attribute irep is not useful to us; we will remove it and other useless fields with a filter. In the “Preprocess” tab, under “Filter”, choose unsupervised->attribute- >RemoveUseless, and click Apply. This will remove the “irep” attribute and the “ID” attribute(which actually appeared twice – attribute number 1 and 192). You can undo operations like this one by clicking the Undo button at the top. If you don’t apply this filter, you may get erroneous results in the following sections. Genotype- genotype associations First we’ll examine whether SNPs are associated with each other. Recall the discussion in class about haplotypes and linkage disequilibrium. Question 25. Under the “Associate” tab, run the default associator (Apriori). Copy here the output under “Best rules found:” (This should be nine SNP association pairs and one phenotype- SNP pair.) Question 26. Within all but one SNP- SNP pair, the two SNPs have something in common. What is it, and what does it mean biologically? Hint: look at the SNP names and the explanation of SNP names above. (3 sentences or less.) Attribute selection We now perform attribute selection to find which SNPs best predict some of the phenotypes. Phenotype 1: gotgrants The first phenotype we’ll investigate is “gotgrants”. We want to know whether there are any SNPs related to the amount of grant money earned. We disclose that the first phenotype was created artificially as a simple linear function of the penetrance of a single SNP. You will try to discover what this SNP was. Choose “gotgrants” from the dropdown menu on the left. Under the “Select attributes” tab, choose the GainRatioAttributeEval evaluator with Ranker Search Method. Use cross-validation with Folds=5. Run. Question 27. What is the top SNP (genotype ) in this feature selection result? (Paste the line from weka)
  • 10. Question 28. Run multiple Attribute Evaluators with multiple Search Methods; is the top gene the same in general? (Note that some Evaluators require specific Search Methods; see the error log and/or documentation if you get funny results.) Question 29. Look at the top 80 or so SNPs. Do you see anything interesting about them? What does this mean biologically? (3 sentences or less.) Phenotype 2: pctdrivel The next phenotype is “pctdrivel”; again, we’ll investigate whether there are any SNPs that are good predictors of this phenotype. This phenotype was created artificially as a function of two other phenotypes. The inclusion of other phenotypes represents the fact that environment can impact phenotype, not just genetic makeup (cf. the nature vs. nurture debate). Run multiple Attribute Evaluators with multiple Search Methods. Question 30. What do you think are the two phenotypes affecting pctdrivel and why? A well- justified answer will be accepted. (3 sentences or less.)