Advances in cell biology, and the immense amounts of data they generate, are converging with advances in machine learning that can analyze this data. Biology is experiencing its AI moment, driving massive demand for the computation needed to understand biological mechanisms and design interventions. Learn how cutting-edge technologies such as Software Guard Extensions (SGX) in the latest Intel Xeon processors and Open Federated Learning (OpenFL), an open framework for federated learning developed by Intel, are helping advance AI in gene therapy, drug design, disease identification and more.
AI for All: Biology is eating the world & AI is eating Biology
1. Biology is eating the world & AI is eating Biology
Pradeep K Dubey
Intel Senior Fellow, IEEE Fellow
Director, Parallel Computing Labs
2. Intel All.AI 2021 @ Population Scale Virtual Summit 2
Machines: Crunch Numbers
Humans: Make Decisions
3. Division of Labor Between Man and Machine Is Getting Disrupted: Faster than Anyone Predicted!
Machines: Crunch Numbers
Humans: Make Decisions
Machines: Number Crunching AND Decision Making
4. FROM a World of Analytical Models TO a World of Data-Driven Models
FROM: analytical models (example: Computational Fluid Dynamics). Start with a mathematical model: Model, Simulate, Predict. Inside-Out.
TO: data-driven models (example: Event Detection from Social Media). Start with data: Initial State, Increment, Steer. Outside-In.
5. What makes AI effective in practice
• Effectiveness of AI relies on how well the model structure matches the underlying invariant (structure) of the high-dimensional task objective
• A good set of implicit or explicit inductive biases incorporating domain knowledge
• For example, CNNs for vision, attention networks for NLP, or emerging GNNs
• Training time: how well we manage exploitation versus exploration to reach the most generalizable (flatter) minima
• Avoiding the typical solver attraction to sharp minima
• Higher-order methods
6. Better understanding of interiors and evolution of red giant stars
Accurately extract seismic parameters from 1000 spectra in under 10 seconds
Measuring the frequency separation ∆ν and period separation ∆Π in red-giant stars using machine learning, under submission at Science Advances
Affiliations: Department of Astronomy and Astrophysics, Tata Institute of Fundamental Research; Center for Space Science, NYUAD Institute, New York University Abu Dhabi; Division of Solar and Plasma Astrophysics, NAOJ, Mitaka, Tokyo, Japan; Parallel Computing Lab, Intel Labs, Bangalore, India
7. Convergence of Revolutions
Advances in cell biology & creation of immense amounts of data
Advances in ML to analyze large-scale data and leverage it to make predictions
Daphne Koller*: https://www.youtube.com/watch?v=V6bSlPNwrKo&feature=youtu.be
8. AI is Eating Biology
Biology is experiencing its “AI moment”
Publications involving AI methods (e.g. deep learning, NLP, computer vision, RL) in biology are growing
21,000 papers in 2020 alone
> 50% YoY growth since 2019
Papers since 2019 = 25% of all output since 2000
https://pubs.acs.org/doi/10.1021/acs.jcim.1c01114
10. Understand Mechanisms, Design Interventions: Massive Compute Appetite
Big Data: Astronomical or Genomical
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Algorithmic, Computational & Data Management Requirements
>1000x growth in compute needed to match demand
100’s of TB/s memory BW at 100’s of GB capacity
Process 100’s of exabytes of multi-modal data
e.g., learning on large graphs, structure learning, regulatory networks, combinatorial optimizations…
Secure, privacy-preserving, federated
11. Accelerating Graph Neural Networks on Xeon
Supercomputing’21 - distGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks
Full-batch training: ~2-3.7x faster on 1s-CLX (1s) for GraphSAGE on OGB-Products & Reddit; ~83x for distributed training on 128 sockets on OGB-Papers
Cascade Lake Xeon: Intel® Xeon® Platinum 8280 Processor 38.5M Cache, 2.70 GHz, 28 cores
[arXiv’20, arXiv’21, SC’21]
Figure panels: GraphSAGE on Reddit (DGL v 0.5.3); GraphSAGE on OGB-Products (DGL v 0.5.3); Roofline: upper & lower bound
OGB-Papers: 100 Million Node Graph
12. LambdaZero
Search space 10^18 vs internet 10^9
Combinatorial optimization at scale
Uses ML and HPC to accelerate screening of drug-like molecules
@MILA with Prof. Yoshua Bengio
[Intel-MILA announcement]
14. Bao outperforms them all!
SIGMOD’21: Best Paper (Data Management)*
In collaboration with Prof. Tim Kraska @ MIT
* SIGMOD’21 Best Paper Announcement: https://2021.sigmod.org/sigmod_best_papers.shtml
15. BWA-MEM2*: An Accelerated Version of BWA-MEM
(BWA-MEM has 950K+ downloads, 70K users worldwide)
In collaboration with Dr Heng Li, author of BWA-MEM
Sequence alignment
Reference genome: GRCh38; Read dataset: 50x WGS ERR194147 (NA12878/HG001) from Illumina HiSeq 2000
Cascade Lake Xeon, CLX: Intel® Xeon® Platinum 8280 Processor 38.5M Cache, 2.70 GHz, 28 cores
Ice Lake Xeon, ICX: Intel® Xeon® Platinum 8380 Processor 60MB Cache, 2.40 GHz, 40 cores
Throughput in genomes/day for 50x WGS (higher is better):
2s CLX, BWA-MEM: 9.8
2s CLX, BWA-MEM2: 15.8
2s ICX, BWA-MEM2: 22.1
1 A100, Clara Parabricks BWA-MEM: 8.9
Speedup callouts on the chart: 2.25x, 2.5x
Source of Clara Parabricks results: https://at-cg.github.io/posts/ParaBricks-WGS/
Enabling Community Worldwide: https://github.com/bwa-mem2/bwa-mem2
horticulture, nutrition
In production use by Cancer, Ageing and Somatic Mutations, Wellcome Sanger Institute; tested on ~88 billion reads
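The speedup callouts on the BWA-MEM2 slide can be checked with simple arithmetic. The sketch below is illustrative only: the throughput values are the ones printed on the slide, but the pairing of each value with a hardware/software bar is an assumption read off the flattened chart labels, and the helper name `speedup` is hypothetical.

```python
# Throughput in genomes/day for 50x WGS, as printed on the slide.
# ASSUMPTION: the value-to-bar pairing below is inferred from label order.
throughput = {
    ("2s CLX", "BWA-MEM"): 9.8,
    ("2s CLX", "BWA-MEM2"): 15.8,
    ("2s ICX", "BWA-MEM2"): 22.1,
    ("1 A100", "Clara Parabricks BWA-MEM"): 8.9,
}


def speedup(fast, slow):
    """Ratio of two throughputs; > 1 means `fast` aligns more genomes/day."""
    return throughput[fast] / throughput[slow]


# BWA-MEM2 on two-socket Ice Lake versus the two baselines on the chart.
icx_vs_bwa_mem = speedup(("2s ICX", "BWA-MEM2"), ("2s CLX", "BWA-MEM"))
icx_vs_a100 = speedup(("2s ICX", "BWA-MEM2"),
                      ("1 A100", "Clara Parabricks BWA-MEM"))

print(f"vs BWA-MEM on 2s CLX: {icx_vs_bwa_mem:.2f}x")
print(f"vs Clara Parabricks on 1 A100: {icx_vs_a100:.2f}x")
```

Under this pairing the two ratios land close to the 2.25x and 2.5x callouts, which is the main reason to believe the pairing is the intended one.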
16. MM2-Fast Accelerates Minimap2 on Xeon by 3.1x
Cascade Lake Xeon, CLX: Intel® Xeon® Platinum 8280 Processor 38.5M Cache, 2.70 GHz, 28 cores
Ice Lake Xeon, ICX: Intel® Xeon® Platinum 8380 Processor 60MB Cache, 2.40 GHz, 40 cores
[bioRxiv’21]
MM2-Fast branch in Minimap2 repo
In collaboration with Dr Heng Li, author of Minimap2
Reference genome: GRCh38; Read dataset: ONT, PacBio HiFi and PacBio CLR datasets derived from human trio benchmark genomes HG002, HG003 and HG004 as given at https://precision.fda.gov/challenges/10/view and https://github.com/genome-in-a-bottle/giab_data_indexes
Minimap2 has > 100k downloads
17. 9x Speedup for Analysis of Single-Cell ATAC-Seq Data
Denoising and peak calling on noisy ATAC-Seq data
Cascade Lake Xeon, CLX: Intel® Xeon® Platinum 8280 Processor 38.5MB Cache, 2.70 GHz, 28 cores
Cooper Lake Xeon, CPX: Intel® Xeon® Platinum 8380H Processor 38.5MB Cache, 2.90 GHz, 28 cores
Ice Lake Xeon, ICX: Intel® Xeon® Platinum 8380 Processor 60MB Cache, 2.40 GHz, 40 cores
Higher is better
2.3x speedup over NVIDIA Clara Parabricks on DGX-1 box (8 card V100) with 16 sockets of Cooper Lake
1.8x speedup over NVIDIA Clara Parabricks on DGX-1 box (8 card V100) with 16 sockets of Ice Lake
Source of Clara Parabricks performance: [Nvidia, 2020] AtacWorks: A deep convolutional neural network toolkit for epigenomics
[arXiv’21, bioRxiv’21]
19. Brain tumor segmentation finds tumors from MRIs
Intel-UPenn Collaboration
How much better does each institution do when training on the full data vs. just their own data?
17% BETTER on their own validation data
2.6% BETTER on the hold-out BraTS data
Sheller, M.J., Edwards, B., Reina, G.A. et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 10, 12598 (2020).
Other names and brands may be claimed as the property of others
20. OpenFL
github.com/intel/openfl
openfl.readthedocs.io/
1. Privacy-preserving machine learning for data and model privacy / protection
2. Privacy/confidentiality preservation
3. Attestation and integrity
4. Federation deployment
5. Federated-node software stacks for time to market (TTM)
6. Curation tools and deployment automation
Enables greatest access to data
Any company can host a privacy-preserving federation
Complete software and platform offering for time-to-market deployment
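To make the federated idea on the preceding slides concrete, here is a minimal sketch of sample-weighted federated averaging (the FedAvg-style aggregation step). It is a generic illustration in plain Python and does not use or reproduce OpenFL's API; the function name `federated_average`, the weight vectors, and the sample counts are all hypothetical.

```python
from typing import List, Sequence


def federated_average(updates: Sequence[Sequence[float]],
                      num_samples: Sequence[int]) -> List[float]:
    """Sample-weighted mean of per-institution model weights.

    Each institution trains locally and shares only its weights;
    the raw patient data never leaves the institution.
    """
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need one sample count per update")
    total = sum(num_samples)
    dim = len(updates[0])
    averaged = [0.0] * dim
    for weights, n in zip(updates, num_samples):
        if len(weights) != dim:
            raise ValueError("all updates must have the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


# Hypothetical round with three institutions of different sizes.
local_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local_counts = [10, 30, 60]
global_weights = federated_average(local_weights, local_counts)
print(global_weights)
```

Weighting by sample count lets larger institutions contribute proportionally more to the shared model. A production federation would add the attestation, integrity, and confidentiality layers listed on the slide, which this sketch leaves out.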
21. GenomicsBench: a Benchmark Suite for Genomics
12 representative kernels spanning the major steps in short-read and long-read sequence analysis pipelines:
FM-index, Banded Smith-Waterman, de Bruijn graphs, Pair HMM, DP Chaining, SIMD Partial Order Alignment, Adaptive Banded Signal to Event Alignment, Genomic Relationship Matrix, neural-network-based basecalling, neural-network-based variant calling, k-mer counting, pileup counting
Many GenomicsBench benchmarks have abundant data parallelism, but significant irregularity makes it challenging to achieve good performance.
Open-sourced and under active development:
https://github.com/arun-sub/genomicsbench
Xeon-optimized implementations of kernels under active development at:
https://github.com/IntelLabs/Trans-Omics-Acceleration-Library
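As a small illustration of one kernel named on the GenomicsBench slide, the sketch below counts k-mers in a DNA string. It is a plain reference version written for clarity, not the GenomicsBench or Xeon-optimized implementation; the function name `count_kmers` and the example sequence are hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring made only of A, C, G, T."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    counts: Counter = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


# Hypothetical read with one ambiguous base.
kmer_counts = count_kmers("ACGTACGTNACG", 3)
for kmer, n in kmer_counts.most_common():
    print(kmer, n)
```

The per-window work is tiny and independent, which reflects the slide's point about abundant data parallelism; the irregularity shows up in the hash-table updates, whose memory access pattern depends on the data.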
AI-Driven HPC Research: A first-of-its-kind deep learning approach to learn the parameters that govern stellar evolution for red giant stars, achieving an average inference time of 5 ms/star on Intel® Xeon® Platinum 8280, much faster (>10000x) than current SOTA methods based on auto-correlation and MCMC.
The power spectra of red giant stars are studied to better understand the interiors and evolution of stars. The Kepler and TESS space missions have provided a vast set of red giant light-curve data, and such data sets are expected to grow exponentially with future missions such as PLATO. This data needs to be analyzed accurately and efficiently at scale to improve our understanding of stellar physics. To this end, working with a cross-geo group of scientists led by the Tata Institute of Fundamental Research in India, we have developed a deep learning approach that can learn various parameters that govern the complex behavior of such stellar evolution. We train the networks using simulated data on a single-node Intel® Xeon® Platinum 8280. Inference on a star takes 5 milliseconds on average, which is 10000x faster than auto-correlation-based methods and 1000000x faster than MCMC methods. To the best of our knowledge, we are the first to develop such an efficient machine learning approach to analyze red giant stars. We have been invited to submit the paper to the journal Science Advances (impact factor 14.4).
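The speedup factors quoted above imply rough per-star times for the older methods. The sketch below only restates the notes' own numbers (5 ms per star, 10000x, 1000000x); the derived baseline times are implied by those ratios and are not separately reported measurements, and the helper name is hypothetical.

```python
INFERENCE_SECONDS_PER_STAR = 5e-3   # 5 ms per star, from the notes
AUTOCORR_FACTOR = 10_000            # quoted speedup vs auto-correlation
MCMC_FACTOR = 1_000_000             # quoted speedup vs MCMC


def implied_baseline_seconds(factor: int) -> float:
    """Per-star time a baseline would need, given the quoted speedup."""
    return INFERENCE_SECONDS_PER_STAR * factor


autocorr_seconds = implied_baseline_seconds(AUTOCORR_FACTOR)
mcmc_seconds = implied_baseline_seconds(MCMC_FACTOR)
batch_seconds = 1000 * INFERENCE_SECONDS_PER_STAR  # 1000 spectra

print(f"auto-correlation: ~{autocorr_seconds:.0f} s per star")
print(f"MCMC: ~{mcmc_seconds / 60:.0f} min per star")
print(f"1000 spectra with the network: ~{batch_seconds:.0f} s")
```

The batch figure is consistent with the earlier slide's claim of extracting seismic parameters from 1000 spectra in under 10 seconds.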
Our network consists of six 1D convolution layers, followed by two LSTM layers and one dense layer. We train with categorical cross-entropy loss and the Adam optimizer. The network takes a normalized power spectrum as input and outputs, for each bin (range of values), the probability (confidence score) that a parameter falls in that bin. Currently, we focus on learning the marginal distribution of three seismic parameters, namely frequency separation ∆ν, period separation ∆Π, and peak frequency ν_max, using a separate network for each parameter. Training takes ~50 node-hours per seismic parameter on a single Cascade Lake Xeon node with 56 cores using TensorFlow.
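The bin-classification setup described above can be sketched without a deep learning framework. The example below shows only the output side: mapping a continuous parameter value to a bin, turning raw network outputs into bin probabilities with a softmax, and scoring them with categorical cross-entropy. The bin edges, the logits, and all function names are hypothetical; the convolution and LSTM layers themselves are not modeled here.

```python
import math
from typing import List, Sequence


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Index i of the half-open bin [edges[i], edges[i+1]) holding value."""
    if not edges[0] <= value < edges[-1]:
        raise ValueError("value lies outside the binned range")
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    raise AssertionError("unreachable when edges are sorted")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: one probability per bin."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs: Sequence[float], true_bin: int) -> float:
    """Loss for one sample: negative log-probability of the true bin."""
    return -math.log(probs[true_bin])


# Hypothetical bin edges for one seismic parameter and one set of logits.
edges = [0.0, 5.0, 10.0, 15.0, 20.0]
target_bin = bin_index(7.3, edges)
probs = softmax([0.2, 2.0, 0.1, -1.0])
loss = categorical_cross_entropy(probs, target_bin)

predicted_bin = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted_bin]
print(predicted_bin, round(confidence, 3), round(loss, 3))
```

Framing regression as classification over bins is what lets the network report a confidence score alongside each prediction, at the cost of resolution limited by the bin width.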
Our learned model accurately distinguishes red giants from noise when analyzing the spectra of real stars, with a precision of 87% and a recall of 86%. The false positives are dominated by non-solar-like pulsators. Additionally, our model can discover new potential red giants: after eliminating false positives by visual inspection, we detect ~25 new red giants (validated against various catalogues). Finally, our model can infer the relationships among the seismic parameters, e.g., the strong linear correlation between ∆ν and ∆Π (well established in physics) and the relationship between ∆ν and ν_max observed in other studies. First figure below: the red points are the predicted (∆ν, ν_max) and the green band maps the relation observed in other studies. Second figure: prediction results (along with confidence) of our model on real stars.
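For readers less familiar with the two metrics quoted above, the sketch below gives their standard definitions and combines the reported precision and recall into an F1 score. The F1 value is derived here from the two reported numbers and is not stated in the notes; the confusion counts in the toy example are hypothetical.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of stars flagged as red giants that really are red giants."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real red giants that the model flags."""
    return true_pos / (true_pos + false_neg)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Derived from the reported precision (87%) and recall (86%).
derived_f1 = f1_score(0.87, 0.86)

# Toy confusion counts, purely hypothetical, to show the definitions.
toy_precision = precision(true_pos=6, false_pos=2)
toy_recall = recall(true_pos=6, false_neg=3)

print(f"derived F1: {derived_f1:.3f}")
print(f"toy precision: {toy_precision:.2f}, toy recall: {toy_recall:.2f}")
```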
AI is inferring laws of physics and unravelling complex phenomena, giving humans super-human capabilities to see. Every time humans have been able to see more, the world has been transformed (think astronomy, microscopy).
Now that is happening to biology. With increased resolution and sense-making, we can begin to understand the mechanisms behind how biological systems work: how diseases happen and how different characteristics evolve.
Even after decades of work, we knew the structure of only ~4K proteins, and then overnight, with AI (AlphaFold), 20,000 human protein structures were decoded. Using data, AI is beginning to unravel complex phenomena.
Imagine that we can engineer biological systems and give ourselves capabilities and materials that biology would otherwise discover only over thousands or even millions of years of evolution.
Biological data is going to be the largest dataset on the planet, far larger than YouTube, with, for example, billions of genomes being sequenced routinely. We will need massive leaps in computational power.
State-of-the-art platforms today can process fewer than 10 whole genome sequences a day; we need a >1000x leap in computational power to do all kinds of omics rapidly and realize the vision of precision medicine.
Similarly, to design new materials or drugs, the search space is orders of magnitude greater than the number of web pages, which means a massive compute appetite.
Next frontier in AI: search & combinatorial optimization, e.g.
Search for novel molecules: > O(10^60)
Search space for protein design: O(10^130)
Number of webpages on the Internet: O(10^9)
CLX: Cascade Lake Xeon
CPX: Cooper Lake Xeon
1-D convolutions are especially important to digital biology because so much of its data is sequence data
Nvidia performance source for 1D convolutions: [Nvidia, 2020] AtacWorks: A deep convolutional neural network toolkit for epigenomics
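To illustrate why 1-D convolutions fit sequence data, the sketch below one-hot encodes a DNA string and slides a single filter along it. It is a plain-Python illustration, not the AtacWorks model or any optimized kernel; the hand-written filter that matches the motif TATA and the example sequence are hypothetical, whereas a trained network would learn its filters from data.

```python
from typing import List

BASES = "ACGT"


def one_hot(sequence: str) -> List[List[float]]:
    """Encode each base as a 4-channel indicator vector (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES]
            for base in sequence.upper()]


def conv1d(signal: List[List[float]],
           kernel: List[List[float]]) -> List[float]:
    """Stride-1, unpadded 1-D convolution over all channels.

    As in deep learning frameworks, the kernel is not flipped, so this is
    technically a cross-correlation.
    """
    width = len(kernel)
    scores = []
    for start in range(len(signal) - width + 1):
        acc = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                acc += signal[start + offset][channel] * kernel[offset][channel]
        scores.append(acc)
    return scores


# Hypothetical filter: the one-hot encoding of the motif itself, so the
# score at each position counts how many bases match the motif.
motif_filter = one_hot("TATA")
scores = conv1d(one_hot("GCTATAAG"), motif_filter)
best_position = scores.index(max(scores))
print(scores, best_position)
```

The same filter weights are reused at every position, so a motif is detected wherever it occurs in the sequence, which is the property that makes 1-D convolutions a natural building block for genomic and epigenomic signals.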