How are the environmental variables and marine evolution connected? Does astronomical forcing influence climate variation? Can we apply deep learning to classify index fossils?
Paleo environmental bio-diversity macro-evolutionary data mining and deep learning
1. Deep time Paleo-environmental and Bio-diversity data mining and deep learning classification
Abdullah Khan Zehady
PhD student @
Earth, Atmospheric & Planetary Science,
Purdue University
2. Research Projects
1. Macro- and micro-scale evolution of planktonic foraminifera and the potential drivers during the Cenozoic Era (66.04 Ma).
Hypothesis:
“Rates of evolution are correlated with rates of geochemical and sea-level change.”
- Can we find long-term (2-5 myr) astronomical cycles?
2. Periodicities and other cause-effect relationships among pulses of evolution since the Cambrian Period (541 Ma).
Hypotheses:
(a) “The Earth has had semi-periodic episodes of unusual surface/biological change.”
- Abundance of events over time; verification of catastrophe models (20 – 60 myr periods); causation by impact and other disastrous events.
(b) “Pulses of biological evolution occur simultaneously with global changes in sediment facies.”
3. Automated fossil image recognition by feature extraction using a deep neural network.
4. Effect of climate change on cultural turnover over the last 2000 years of human civilization.
Hypothesis:
“A major factor in the rise and fall of human civilization in different continents is climate cooling.”
3. My papers on evolutionary tree visualization algorithm
To be submitted (under review) at BMC Evolutionary Biology
To be submitted to Nature Methods
6. Mutual Learning using Integrated Tree
Comparison and learning between integrated tree and molecular tree
7. PaleoEnvironmental & Bio-Diversity Data
Dataset / Significance
1. Marine genera ranges and subsequent turnover timeseries data for the whole Phanerozoic / Marine biodiversity, turnover
2. Cenozoic planktonic foraminifera evolution and turnover timeseries (Zehady, Fordham 2018; Aze et al 2011) / Foraminifer diversity, turnover
3. Oxygen-18 (δ18O) curves and events (Cramer 2009) / Long-term cooling of the ocean interior
4. Carbon-13 (δ13C) curves and events (Cramer 2009) / Terrestrial climate proxy record (gradual sinking of the organic matter)
5. Strontium isotope record (Sr87/Sr86) / Tectonic evolution, continental spreading, origin and evolution of igneous rocks (magmatic style)
6. Sulphur isotope record (δ34S sulphate) / Link with LIP
7. Ages and volume extent of LIP (Large Igneous Province) / Large-volume gas release in the ocean-atmospheric systems, link with magmatism, mass extinction, extinction cyclicity (Melott 2012)
8. Sea-level synthesis curve (Haq et al 2014) / Plate motion, changes in continent mass distribution
9. Passive margin
9. Orbital forcing - Milankovitch Cycles
Amplitude modulation, 2.4 Ma eccentrici
10. Marine genera turnover data
Panels: Entire Phanerozoic (Prob); Cenozoic only (# turnover)
Raw turnover prob = Raw speciation prob + Raw extinction prob
To reduce stochastic noise: fit a Hidden Markov Model (HMM).
Markov property: the current state depends only on the previous state.
Parameter estimation uses the Baum-Welch algorithm.
AIC is used to estimate the speciation and extinction states.
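The turnover sum and the AIC step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probabilities and log-likelihoods are hypothetical numbers, the emissions are assumed to be one-dimensional Gaussians, and the Baum-Welch fit itself is not shown (the log-likelihoods stand in for its output).

```python
def raw_turnover(speciation_prob, extinction_prob):
    """Raw turnover probability per time bin = speciation + extinction."""
    return [s + e for s, e in zip(speciation_prob, extinction_prob)]

def gaussian_hmm_param_count(n_states):
    """Free parameters of an HMM with 1-D Gaussian emissions (assumed form):
    (K-1) initial probs + K*(K-1) transition probs + 2*K means/variances."""
    k = n_states
    return (k - 1) + k * (k - 1) + 2 * k

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical per-bin probabilities (not values from the slides).
speciation = [0.10, 0.25, 0.05]
extinction = [0.05, 0.30, 0.10]
turnover = raw_turnover(speciation, extinction)

# Hypothetical maximised log-likelihoods, e.g. from Baum-Welch fits
# with 2, 3 and 4 hidden states.
fits = {2: -120.0, 3: -112.5, 4: -111.8}
scores = {k: aic(ll, gaussian_hmm_param_count(k)) for k, ll in fits.items()}
best_k = min(scores, key=scores.get)
print(turnover, scores, best_k)
```

With these made-up numbers the three-state model has the lowest AIC: the four-state fit gains little likelihood for its extra parameters.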
11. Multi-taper spectrum – Whole Phanerozoic
* Number of significant F-test peaks identified = 9
ID / Frequency / Period / Harmonic_CL / Rednoise_CL
1 0.1504065 6.648649 98.00066 69.56518
2 0.2524021 3.961933 96.46992 98.53005
3 0.27051 3.696721 93.19279 84.04573
4 0.2775314 3.603196 94.98532 97.86011
5 0.2852919 3.505181 93.26527 59.50705
6 0.2908352 3.438374 90.65819 99.49677
7 0.3510717 2.848421 98.56435 80.32858
8 0.4035477 2.478022 97.29846 68.29225
9 0.4903917 2.039186 95.83189 93.04155
12. Multi-taper spectrum – Only Cenozoic (0-67 Ma)
* Number of significant F-test peaks identified = 1
ID / Frequency / Period / Harmonic_CL / Rednoise_CL
1 0.2787879 3.586957 99.95627 92.50397
13. Multi-taper spectrum – Only Mesozoic (67-252 Ma)
* Number of significant F-test peaks identified = 3
ID / Frequency / Period / Harmonic_CL / Rednoise_CL
1 0.01079914 92.6 99.45956 67.96132
2 0.3401728 2.939683 99.03912 56.54744
3 0.3704104 2.699708 94.96196 88.81204
14. Multi-taper spectrum – Only Paleozoic (252-541 Ma)
* Number of significant F-test peaks identified = 7
ID / Frequency / Period / Harmonic_CL / Rednoise_CL
1 0.004149378 241 93.49593 78.74621
2 0.01521438 65.72727 91.12898 93.60384
3 0.1500692 6.663594 91.39851 88.02085
4 0.2524205 3.961644 94.03579 98.97888
5 0.2904564 3.442857 93.06378 89.31391
6 0.3506224 2.852071 96.671 84.24726
7 0.4439834 2.252336 90.65441 92.18804
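In the tables above, each Period is the reciprocal of its Frequency (presumably cycles per myr, giving periods in myr). A minimal Python sketch using the Paleozoic rows checks that relation and lists the peaks whose harmonic and red-noise confidence levels both reach a cut-off. The 90% cut-off and the function names are assumptions for illustration, not a criterion stated on the slides.

```python
# (ID, frequency, period, harmonic CL %, red-noise CL %), Paleozoic table.
peaks = [
    (1, 0.004149378, 241.0,    93.49593, 78.74621),
    (2, 0.01521438,  65.72727, 91.12898, 93.60384),
    (3, 0.1500692,   6.663594, 91.39851, 88.02085),
    (4, 0.2524205,   3.961644, 94.03579, 98.97888),
    (5, 0.2904564,   3.442857, 93.06378, 89.31391),
    (6, 0.3506224,   2.852071, 96.671,   84.24726),
    (7, 0.4439834,   2.252336, 90.65441, 92.18804),
]

def period_from_frequency(freq):
    """Period is the reciprocal of frequency."""
    return 1.0 / freq

def robust_peaks(rows, threshold=90.0):
    """IDs of peaks whose harmonic AND red-noise CLs both reach the threshold.
    The 90% default is an assumed cut-off, chosen only for illustration."""
    return [r[0] for r in rows if r[3] >= threshold and r[4] >= threshold]

for pid, freq, period, _, _ in peaks:
    print(pid, round(period_from_frequency(freq), 4), period)

print(robust_peaks(peaks))
```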
15. Ocean-Atmosphere System
How are the marine and terrestrial environments connected?
MaGIC model – Geochemistry-based modeling
(Arvidson et al 2006)
f[1]: Organic Phosphorus (P)
f[21]: Terrestrial Organic matter(C) burial flux
f[50]: Sulfate(S) reduction
f[54]: Organic Carbon(C) sedimentation
f[71], f[72]: Carbonate precipitation
17. PaleoEnvironmental Data : Strontium + Passive Margin
Continental/Sea floor Spreading
Transition between oceanic and continental lithosphere via sedimentation on passive margin
23. Principal Component Analysis of Cenozoic data
PCA transforms correlated data into a new coordinate system such that the new variables are uncorrelated.
The goal of PCA is to find components Z = [Z_1, Z_2, …, Z_p], each a linear combination (with weight vector u = [u_1, u_2, …, u_p]') of the original variables X = [X_1, X_2, …, X_p], that achieve maximum variance.
For the Cenozoic, we have no missing values for any of the 12 variables/parameters.
X: Data matrix with dimension 69 x 12 (n = 69, p = 12)
Y: Normalized matrix of X
C: Covariance matrix, C = t(Y) %*% Y / (n - 1)
Eigenvalue decomposition in R: E = eigen(C)
Ev: Eigenvector matrix (E$vectors), where t(Ev) %*% Ev = I
EOF (Empirical Orthogonal Function): orthogonal basis function, basically the eigenvectors
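The PCA steps listed on this slide can be sketched as follows. The slide names R's eigen(); this sketch swaps in Python with NumPy, and the 69 x 12 matrix is synthetic random data standing in for the Cenozoic variables, so only the mechanics are meaningful. It forms the covariance from the normalized matrix Y and divides by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 69, 12  # dimensions from the slide; the data below are synthetic

# Synthetic correlated data (hypothetical stand-in for the real variables).
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# Y: normalize each column to zero mean and unit sample variance.
Y = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# C: covariance matrix of the normalized data.
C = Y.T @ Y / (n - 1)

# Eigen decomposition; eigh is meant for symmetric matrices and returns
# eigenvalues in ascending order, so re-sort to descending variance.
eigvals, Ev = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, Ev = eigvals[order], Ev[:, order]

# Z: principal component scores; columns of Ev are the EOFs.
Z = Y @ Ev
explained = eigvals / eigvals.sum()
print(np.round(explained[:3], 3))
```

The columns of Z are uncorrelated and their variances equal the eigenvalues, which is the property the slide describes.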
28. Automated Fossil Genus Classification
Examples: Hedbergella sigali; Globigerinoides altiaperturus
Binomial nomenclature system: Genus Species
Can we extract features from a species image to detect its genus?
29. Automated Fossil Genus Classification
Image Data / Number of Images / Accuracy with Best Model So Far
Training (transformed images) / 1947 / ~74%
Validation / 649 / ~55%
Test (previously seen species) / 236 / ~90%
Unseen Test (totally new species) / 37 / ~43%
Multi-class classification
How many different genus classes do we have? --> 79
Model: comparison of 3 CNN (VGG19) models; "cross-entropy" loss minimization with the Adam optimizer
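As a rough illustration of the cross-entropy loss for a 79-class problem, the sketch below computes softmax probabilities and the categorical cross-entropy for a single example in plain Python. The logits are hypothetical, and this is not the training code behind the accuracies above.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(logits, true_class):
    """Loss for one example: -log(probability assigned to the true class)."""
    probs = softmax(logits)
    return -math.log(probs[true_class])

N_CLASSES = 79  # number of genus classes on the slide

# An uninformed model (all logits equal) gets loss ln(79).
uniform_loss = categorical_cross_entropy([0.0] * N_CLASSES, true_class=0)

# Hypothetical logits strongly favouring class 5.
confident = [0.0] * N_CLASSES
confident[5] = 10.0
low_loss = categorical_cross_entropy(confident, true_class=5)   # right guess
high_loss = categorical_cross_entropy(confident, true_class=6)  # wrong guess
print(uniform_loss, low_loss, high_loss)
```

The uniform case is a useful baseline: random guessing over 79 classes is right about 1.3% of the time.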
30. Automated Fossil Genus Classification
Paragloborotalia predicted as Turborotalia
Most Likely cause of misclassification
Turborotalia Training Images
37. What is each CNN layer learning?
Layer2 : Block2_Conv2
Mostly directional
Layer3 : Block3_Conv2
10 random filters in each layer
38. What is each CNN layer learning?
Layer2 : Block2_Conv2
Mostly directional
Layer3 : Block3_Conv2
10 random filters in each layer
39. How does the VGG19 CNN model distinguish between genera?
What are the final abstraction/specific output categories for each of the 79 classes?
Unique features to classify, with categorical cross-entropy
Globigerina Globigerinoides Globorotalia Sigalia
40. Timeline
Month/Year and prioritized items:
Jan 2019 ~ April 2019
• Submission of Evolutionary tree visualization paper
• Submission of Species-Phenon Integrated tree paper
• Completion of culture paper
Feb 2019
• NSF proposal resubmission
Jan 2019 ~ May 2019
• Phanerozoic data extraction, visualization
• Correlation/causality and spectral analysis for Phanerozoic (500 myr) data
• Learning on cross phase, phase lag analysis, causality analysis
• Automatic fossil image detection project
May 2019 ~ Aug 2019
• Internship at Cisco Systems in their Data Center
Nov 2019
• Poster, paper presentation on machine-learning and artificial-intelligence application in the geosciences (GSA Annual Meeting)
• Other machine learning journals