2. Introduction
● Searching for signatures of selection
● SFselect (Ronen, 2013)
● Multi-K (Whiteman, 2010)
● Introducing: SFselect-E
3. Contents
1) The selection classification problem
2) Overview of SVM classification with SFselect
3) Ensemble preprocessing with Multi-*
4) Generating model variance
5) Introducing SFselect-E
6) Experimental Results
7) Conclusion
4. Natural selection
● Population genetics
● Evolution: Descent with modification
● Selection
o Directional
Positive
Negative
o Neutral
5. Classifying natural selection
● Record of demographic history
● Increased LD, reduced variation
● Site frequency spectrum
o e.g., Tajima’s D
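The SFS summaries named here can be computed directly from an unfolded site frequency spectrum. Below is a minimal sketch of the two estimators whose difference drives Tajima’s D (the variance normalization in the denominator of D is omitted); the function names are illustrative, not from SFselect.

```python
from math import comb

def theta_watterson(sfs):
    """Watterson's estimator theta_W = S / a1 from an unfolded SFS, where
    sfs[i-1] = number of sites whose derived allele appears i times among
    n sampled chromosomes (so len(sfs) == n - 1)."""
    n = len(sfs) + 1
    s = sum(sfs)                              # segregating sites S
    a1 = sum(1.0 / i for i in range(1, n))    # harmonic number H_{n-1}
    return s / a1

def theta_pi(sfs):
    """Mean pairwise diversity pi = sum_i i*(n-i)*xi_i / C(n, 2)."""
    n = len(sfs) + 1
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)

# Tajima's D is (pi - theta_W) / sqrt(Var): directional selection skews the
# SFS toward rare variants, pushing pi below theta_W and D negative.
```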
6. Background: SFselect (Ronen, 2013)
● Scaled Site Frequency Spectrum
● Linear kernel Support Vector Machines
● Trained on extensive population simulations
o SFselect, SFselect-s, SFselect-XP
7. Background: Multi-K Clustering
● Bootstrap aggregation
o Random sampling
o Aggregation method
o Highly accurate, but computationally expensive
● Multi-K
o Iterative K-means clustering
o Classify new points based on centroid proximity
o Optimize K_end with cross-validation
● Multi-KX, Multi-SVD
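The slide only summarizes Multi-K, so the following is a loose sketch of the core idea using scikit-learn’s KMeans: fit clusterings for a range of K, give each centroid the majority class label of its cluster, and classify new points by voting over nearest centroids across the ensemble. The function names and the 0/1 labels are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_multik(X, y, ks=range(2, 6), seed=0):
    """Fit K-means for each K in ks; label each centroid by the
    majority class (0/1) of the points assigned to its cluster."""
    models = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        labels = np.array([np.bincount(y[km.labels_ == c]).argmax()
                           for c in range(k)])
        models.append((km, labels))
    return models

def predict_multik(models, X):
    """Classify each point by majority vote of its nearest centroid's
    label across all fitted clusterings."""
    votes = np.stack([labels[km.predict(X)] for km, labels in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```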
8. Generating ensemble diversity
● Strategies for generating diversity:
o Generalizers
o Specializers
● Applied to SFS classification:
o Improve overall classification accuracy?
o Produce classifiers robust to wide variations in genetic diversity?
10. Population simulations
● 1000 individuals
● s = [0.005, 0.01, 0.02, 0.04, 0.08]
● t = [0, 50, 150, 200, …, 3500, 4000]
● n = 500
● labels = [-1, 1] (neutral, selected)
11. Training the standard model
● Compute allele frequencies
● Scale, normalize, bin into vectors
● Train a linear-kernel SVM on the entire dataset
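The pipeline on this slide can be sketched as below, with toy beta-distributed allele frequencies standing in for simulation output. The bin count and the beta parameters are arbitrary assumptions, and the actual SFselect scaling/weighting of bins is not reproduced here.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bin_sfs(freqs, n_bins=20):
    """Bin per-site allele frequencies into a normalized fixed-length vector."""
    hist, _ = np.histogram(freqs, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

# Toy stand-in for simulation output: neutral-like vs selection-like
# frequency distributions (beta parameters are arbitrary choices).
rng = np.random.default_rng(0)
X = np.array([bin_sfs(rng.beta(0.5, 5, 500)) for _ in range(100)] +
             [bin_sfs(rng.beta(2.0, 5, 500)) for _ in range(100)])
y = np.array([-1] * 100 + [1] * 100)   # -1 neutral, +1 selected

clf = LinearSVC(max_iter=10000).fit(X, y)   # linear-kernel SVM
```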
12. Computational limits
● Very time-intensive:
o Population simulations
o Vectorization of SFS
o Training SVMs on SFS
● Simulations grouped/indexed by replicate
o Proved a major limitation on ensemble sampling
13. SFselect-E: Bagging approach
● Random sampling
o k = 100, n = 200
● Aggregation
o Majority voting
● Validation
o Cross validation
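The bagging parameters on this slide (k = 100 bags of n = 200 samples, majority voting) can be sketched as follows; LinearSVC stands in for the linear-kernel SVM, and the function names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_bagged(X, y, k=100, n=200, seed=0):
    """Train k linear SVMs, each on a bootstrap sample of n training points."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.choice(len(X), size=n, replace=True)
        models.append(LinearSVC(max_iter=10000).fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Aggregate by majority vote over {-1, +1} component predictions."""
    votes = sum(m.predict(X) for m in models)
    return np.where(votes >= 0, 1, -1)
```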
15. Experimental analysis: K-fold C.V.
How do we cross-validate an ensemble?
For each fold K_i, hold out K_i and train on D − K_i
Test the classifier on K_i
Report mean accuracy (fraction of correct classifications)
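The K-fold procedure described here, sketched with scikit-learn. A single model is cross-validated in this sketch; cross-validating the full ensemble would refit every component model inside the loop, which is part of why training time becomes an obstacle.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def kfold_accuracy(X, y, k=5, seed=0):
    """For each fold K_i: train on D - K_i, test on K_i; return mean accuracy."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=seed).split(X):
        clf = LinearSVC(max_iter=10000).fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```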
16. Experimental analysis: C.V. Results
Model                      Accuracy (%)
Standard SFselect SVM      74.28
Bagged SFselect-E SVM      73.86
Multi-K SFselect-E SVM     N/A
17. Experimental analysis: Time series
● For each t in [0, 4000], test D_t
o Neutral vs Selected
o Dependent (paired) t-test on per-time-point accuracies
p-value of 2.0136 × 10⁻²⁴
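The dependent (paired) t-test on per-time-point accuracies corresponds to `scipy.stats.ttest_rel`. The accuracy values below are made-up placeholders to show the call, not the experimental data behind the slide’s p-value.

```python
from scipy.stats import ttest_rel

# Hypothetical per-time-point accuracies for the two models (placeholders,
# not the actual experimental values).
acc_standard = [0.74, 0.71, 0.69, 0.75, 0.72, 0.70]
acc_ensemble = [0.73, 0.72, 0.68, 0.74, 0.71, 0.69]

# Paired samples: both models are scored on the same time points.
t_stat, p_value = ttest_rel(acc_standard, acc_ensemble)
```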
18. Conclusions
● SFselect-E consistent with SFselect
o No separation of specialized classifiers
o Smaller subsets?
● Limitations in the structure of the training data as implemented in SFselect
● Model variance is best obtained by separating training data by s and t
19. Conclusions
● Computation time for training is a major obstacle
● Multi-SVD preprocessing could reduce training time
● Refactoring required first
20. Future work
Refactor to treat populations independently
Bagging: random sampling across s, t
Multi-K: hierarchical clustering of training data
Multi-KX, Multi-SVD
SFselect-s as component models
21. Future work
Cross population: SFselect-XP, XP-SFS
Cross species: SFS + conserved regions
XS-SFS
Tune ensemble diversity to population genetic diversity