UVA Data Science Institute MSDS students Caitlin Dreisbach ('18), Morgan Wall ('18), and Ali Zaidi ('18) presented a talk based on their capstone research project, part of the MSDS program, at the 2018 Tom Tom Applied Machine Learning Conference in Charlottesville, Va.
Learn more about the project at https://dsi.virginia.edu/projects/connecting-mind-and-body.
Joining Separate Paradigms: Text Mining & Deep Neural Networks to Characterize Cell Type in the Cortex and Hippocampus
1. Joining Separate Paradigms: Text Mining & Deep Neural Networks to Characterize Cell Type in the Cortex and Hippocampus
C. Dreisbach, M. Wall, A. Zaidi | Candidates for Master's in Data Science, Data Science Institute
A. Flower, PhD | Data Science Institute
C. Overall, PhD | Center for Brain Immunology & Glia (BIG)
8. Denoising Autoencoder (Tan, et al., 2016)
[Figure: cluster-level scRNA-seq data; adding random noise; encoding with high and low weights; decode nodes to reconstruct expression]
9.
Node | G1 | G2 | G3 | G4 | G5 | G6
Node1 | .07 | .83 | 0 | 0 | .40 | .21
GO:0034613: cellular protein localization
GO:0046907: intercellular transport
Topic 1 …
Topic 2 …
Topic 3 …
Term frequency-inverse document frequency (TF-IDF)
10. TF*IDF Algorithm
$\text{TF*IDF Score} = \text{TF}_{t,d} \times \log\left(N / \text{DF}_{t}\right)$
Document 1: Cat, Dog, The, The
Document 2: The, The, Cat, Cat, Cat, The
Word | Term Frequency in Document 1 | # Documents Appears In | IDF: log(Ndocs/NdocsTerm) | TF*IDF Score
Cat | 1 | 2 | 0 | 0
The | 2 | 2 | 0 | 0
Dog | 1 | 1 | 0.301 | 0.301
The word "dog" appears the most interesting, as its TF*IDF score is 0.301.
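The table on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the project's text-mining code: it assumes raw term counts for TF and a base-10 logarithm for IDF (which is what yields 0.301 for "dog"), and the split of words between the two documents is inferred from the counts in the table. The function name `tf_idf` is made up for illustration.

```python
import math
from collections import Counter

# Toy corpus matching the slide's table. The assignment of words to the
# two documents is an assumption inferred from the table's counts.
documents = {
    "Document 1": ["cat", "dog", "the", "the"],
    "Document 2": ["the", "the", "cat", "cat", "cat", "the"],
}


def tf_idf(term, doc_name, docs):
    """Raw term count in one document times log10(N / document frequency)."""
    tf = Counter(docs[doc_name])[term]
    df = sum(1 for words in docs.values() if term in words)
    if df == 0:
        return 0.0  # term never appears, so it carries no score
    return tf * math.log10(len(docs) / df)


scores = {
    term: round(tf_idf(term, "Document 1", documents), 3)
    for term in ("cat", "the", "dog")
}
print(scores)
```

"Cat" and "the" appear in both documents, so their IDF is log10(2/2) = 0 and their scores vanish no matter how often they occur. "Dog" appears in only one document, so it is the only word with a nonzero score.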
13. Selected References
Herculano-Houzel, S. (2009). The Human Brain in Numbers: A Linearly Scaled-Up Primate Brain. Frontiers in Human Neuroscience, 3, 31.
Lun, A., McCarthy, D., & Marioni, J. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data [version 1; referees: 5 approved with reservations]. F1000Research, 5, 2122. doi:10.12688/f1000research.9501.1
Tan, J., Hammond, J., Hogan, D., & Greene, C. (2016). ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions. mSystems, 1(1), e00025-15. doi:10.1128/mSystems.00025-15
Zeisel, A., Muñoz-Manchado, A. B., Codeluppi, S., et al. (2015). Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226), 1138–1142.
CAIT:
Hello fellow TomTom'ers! My name is Cait Dreisbach, and these are Morgan Wall and Ali Zaidi. We have been fortunate to spend the last year as members of the Data Science Institute where, in 37 short days, not that we are counting, we will graduate with a Master's in Data Science. As part of the degree, each student takes on a capstone project with two or three other students. Today we will be presenting our work titled "Joining Separate Paradigms: Text Mining and Deep Neural Networks to Characterize Cell Type in the Cortex and Hippocampus." We have been supported by our advising team, Dr. Abigail Flower from the DSI and Dr. Chris Overall from the Center for Brain Immunology and Glia.
CAIT:
Our group has been generously supported by several organizations, including:
The Data Science Institute at the University of Virginia, which I know is well represented in this audience today!
The laboratory of Dr. Joni Kipnis, which is credited with finding the physical connection between the brain and the immune system.
And Advanced Research Computing Services, for the use of Rivanna, UVA's high-performance computing cluster.
Finally, we'd like to thank TomTom for letting us spread our wings today and giving us a chance to discuss our work!
CAIT:
Now, I want you to look at this adorable mouse and put yourself back into your 7th grade classroom. Typically, when we give this presentation to data-oriented folks, the connection to biology leaves listeners confused. What we want to do today is leverage your 7th grade brain, specifically the time you learned about cells! We know that organisms are made up of tissues, which are just groups of cells that work together to perform a certain task. You've probably seen cells that look like this, or this… The goal of this research project was to label cells by type and identify their specific function.
CAIT:
Traditionally, we take an organism and extract a collection of cells. Procedures are used to collect a large number of cells, BUT revolutionary progress in biotechnology now allows us to separate out a single cell. From that single cell, we can unravel the DNA to better understand how specific genes are expressed. Expression of genes tells scientists what a cell is doing at any given time. Typical analytic methods include clustering, which helps us determine cell type. This is where we come in, with the goal of leveraging data science methods, including autoencoding and text mining, to refine the otherwise somewhat subjective process of identifying cell type and function.
MORGAN:
The initial part of our capstone was a substantial data processing workflow, starting with sourcing our candidate mouse-model dataset, published in Science in 2015. We downloaded the publicly available data from the Gene Expression Omnibus data repository run by NCBI. Next, we read the data into R using Bioconductor, the most popular bioinformatics software available, to perform quality control and normalization procedures. Normalization is the systematic removal of non-biological (technical) variation introduced by the experiments. After normalizing the entire dataset, we split the data into the 2 major regions of the brain represented by the data, the cortex and the hippocampus. At this point, you can think of the data as 2 separate CSV files, each with almost 20,000 gene columns and several thousand rows of single-cell observations.
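To make the "normalize, then split by region" step concrete, here is a rough Python sketch. It is not the Bioconductor workflow we used: it swaps in a much simpler normalization (counts per million followed by log2(x + 1)) purely for illustration, and the gene names, counts, and function names are all made up.

```python
import math

# Hypothetical toy data: each entry is one cell with a region label and raw
# counts for three made-up genes. The real data has almost 20,000 genes.
cells = [
    {"region": "cortex", "counts": {"GeneA": 10, "GeneB": 0, "GeneC": 90}},
    {"region": "hippocampus", "counts": {"GeneA": 5, "GeneB": 5, "GeneC": 40}},
    {"region": "cortex", "counts": {"GeneA": 1, "GeneB": 3, "GeneC": 6}},
]


def log_cpm(counts):
    """Scale one cell's counts to counts per million, then apply log2(x + 1)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log2(c / total * 1e6 + 1) for gene, c in counts.items()}


def split_by_region(rows):
    """Normalize every cell first, then group the cells by brain region."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(log_cpm(row["counts"]))
    return by_region


regions = split_by_region(cells)
print({name: len(rows) for name, rows in regions.items()})
```

The first and third cells were sequenced at very different depths (100 versus 10 total counts), but GeneA makes up 10% of each, so after scaling they get the same normalized value. Removing that kind of depth difference is the point of normalization.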
CAIT:
Our first approach was to use a denoising autoencoder (DAE) to isolate candidate genes in the cortex. A DAE is an algorithm that aims to discover more robust features, which prevents it from simply learning the identity function. We trained the autoencoder to reconstruct the input from a corrupted version of it. The corrupted version is created by adding random noise, and then the genes are weighted based on their gene expression in each cell. Genes whose mean expression level is more than 2.5 standard deviations above the mean are labeled as high weight. Using a non-linear sigmoid function, we were able to reduce the feature set from just under 20,000 genes to a matrix of fewer than 5,000.
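For anyone who wants to see the mechanics, here is a small NumPy sketch of a one-layer denoising autoencoder. It is an illustration under several assumptions, not our actual model: the data are synthetic, the noise is masking noise (randomly zeroing entries), the encoder and decoder share tied weights, the loss is squared error, and the 2.5 standard deviation cutoff is applied to each node's learned weights. The names `train_dae` and `high_weight_genes` and all hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_dae(x, n_hidden=3, noise=0.1, lr=0.5, epochs=200):
    """Train a one-layer denoising autoencoder with tied weights.

    x: (n_cells, n_genes) array scaled to [0, 1].
    Returns the (n_genes, n_hidden) weight matrix and the per-epoch loss.
    """
    n_cells, n_genes = x.shape
    w = rng.normal(0.0, 0.1, size=(n_genes, n_hidden))
    b_hidden = np.zeros(n_hidden)
    b_out = np.zeros(n_genes)
    losses = []
    for _ in range(epochs):
        # Corrupt the input by randomly zeroing a fraction of the entries.
        mask = rng.random(x.shape) >= noise
        x_noisy = x * mask
        # Encode, then decode with the transposed (tied) weights.
        h = sigmoid(x_noisy @ w + b_hidden)
        x_hat = sigmoid(h @ w.T + b_out)
        # Reconstruction error is measured against the clean input.
        err = x_hat - x
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the squared error through both sigmoid layers.
        d_out = err * x_hat * (1.0 - x_hat) / n_cells
        d_hidden = (d_out @ w) * h * (1.0 - h)
        w -= lr * (x_noisy.T @ d_hidden + d_out.T @ h)
        b_hidden -= lr * d_hidden.sum(axis=0)
        b_out -= lr * d_out.sum(axis=0)
    return w, losses


def high_weight_genes(w, n_sd=2.5):
    """For each hidden node, list the gene indices whose weight lies more
    than n_sd standard deviations from that node's mean weight."""
    outliers = np.abs(w - w.mean(axis=0)) > n_sd * w.std(axis=0)
    return [np.flatnonzero(outliers[:, k]).tolist() for k in range(w.shape[1])]


# Synthetic expression matrix: two cell groups, each with its own gene program.
n_cells, n_genes = 60, 40
x = np.full((n_cells, n_genes), 0.1)
x[:30, :10] = 0.9    # first group expresses genes 0-9
x[30:, 10:20] = 0.9  # second group expresses genes 10-19
x = np.clip(x + rng.normal(0.0, 0.02, x.shape), 0.0, 1.0)

weights, losses = train_dae(x)
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(high_weight_genes(weights))
```

Because the network only ever sees the corrupted input but is scored against the clean one, copying the input straight through is not a winning strategy. It has to learn which genes move together, and those are the genes that end up with the large weights on each node.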