Role of data accessibility
during Pandemic
Vini Jaiswal
Data and Spark Evangelist @ Databricks
Isaac Lee
Researcher @ Carnegie Mellon University
Who are we?
Isaac Lee
● Chief director | DS4C (Data Science for Covid)
● Software development team lead | Mindslab
● BS in computer science | Carnegie Mellon University
Vini Jaiswal
● Customer Success Engineer @ Databricks
“Making Data People successful with their data and ML/AI use cases”
● Data Science Engineering Lead - Citi
● Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
Agenda
▪ Importance of data accessibility
▪ Overview of research conducted by DS4C
▪ Data collection process
▪ Uniqueness of DS4C dataset
▪ Challenges
▪ Research Value
▪ Value of open source community
Agenda
Importance of data accessibility
Overview of research conducted by Data Science for Covid-19 (DS4C)
Value of open source community
Dec 31, 2019
Pneumonia outbreak of unknown cause
Wuhan Municipal Health Commission made the first public announcement, confirming 27 cases
Early Jan 2020
Unknown pathogen confirmed as novel coronavirus
● Wuhan confirmed its first 41 COVID-19 cases, along with one reported death
● Spread to other Chinese provinces
Jan 20, 2020
Human-to-human transmission
● Human-to-human transmission was confirmed by the WHO and Chinese authorities
● Links to the Huanan Seafood Wholesale Market
● PPE strongly recommended
Jan 20 - Feb 11, 2020
Public emergency declared by WHO
● Jan 23: Wuhan goes into lockdown
● Outbreak spread by a factor of 100 to 200
● Jan 31: Italy confirmed its first cases, two tourists from China
● Feb 11: WHO officially named the disease COVID-19
● ICTV named the virus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2)
March 11, 2020
COVID-19 declared a pandemic
- Global cases reported
- Europe became a major hub by March 13, 2020
- Outbreak started spreading in the US from March 7, 2020
As of 28 April 2021, more than 149 million cases had been confirmed, with more than 3.14 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history.
Coronavirus disease (COVID-19) Evolution
Role of Data Availability
1. Surge in Covid-19 cases: It became crucial to know the origin, causes of spikes, affected regions and spread reasons
2. Lack of cohesive data: Sources of data were limited and data quality was a challenge
3. Data availability on open platforms: Level of details, including route info, policies, patient data, recovered vs death cases, etc.
4. Data Science for Covid (DS4C) dataset: DS4C is a non-profit organization founded by data analysts and ML researchers who wanted to contribute to fighting COVID-19.
The South Korea COVID-19 Dataset
Popularity on the world’s biggest data science forum
Top three datasets: The White House, Johns Hopkins University, DS4C dataset
Dataset | Category | Description
Case | Case | Data of COVID-19 infection cases in South Korea
PatientInfo | Patient | Epidemiological data of COVID-19 patients in South Korea
PatientRoute | Patient | Route data of COVID-19 patients in South Korea
Time | Time Series | Time series data of COVID-19 status in South Korea
TimeAge | Time Series | Time series data of COVID-19 status by age in South Korea
TimeGender | Time Series | Time series data of COVID-19 status by gender in South Korea
TimeProvince | Time Series | Time series data of COVID-19 status by province in South Korea
DS4C Tables
Source: https://www.kaggle.com/kimjihoo/coronavirusdataset
/databricks-datasets/COVID-19/coronavirusdataset
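As a rough illustration of how these tables might be read once downloaded, the sketch below loads each table from a local folder with Python's standard `csv` module. The one-file-per-table `<Dataset>.csv` naming and the helper name `load_tables` are assumptions made for this example, so adjust them to match the actual download.

```python
import csv
from pathlib import Path

# Table names taken from the "DS4C Tables" slide.
DS4C_TABLES = (
    "Case", "PatientInfo", "PatientRoute",
    "Time", "TimeAge", "TimeGender", "TimeProvince",
)


def load_tables(base_dir, names=DS4C_TABLES):
    """Load every table found under base_dir into a list of row dicts.

    Assumes (hypothetically) one file per table named '<Table>.csv'.
    Tables whose file is missing are skipped rather than raising.
    """
    tables = {}
    for name in names:
        path = Path(base_dir) / f"{name}.csv"
        if not path.is_file():
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables
```

Each value is a list of dictionaries keyed by the column headers, with all fields left as strings, so any numeric or date conversion is up to the caller.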
Schema of DS4C dataset
DS4C Dataset
Johns Hopkins Dataset
Patient Info Table
Patient Info Table
Mass outbreak cases data
S. Korea’s social distancing policy data
Patient Route Data
Source of Route Data
Cellphone GPS locations
Credit card history
Surveillance camera footage
Patient Route Data
Patient ID 187218746
GPS Location
Number of contacted people
Use of mask
Time of visit
Results of COVID-19 tests of the contacted people
Type of transportation used
Type of facility used (ex. Restaurant)
Patient Route Data combined with Patient Info
GPS Location: (35.9, 127.7)
Number of contacted people: 12
Use of mask: True
Time of visit: March 2nd, 14:32
Results of COVID-19 tests of the contacted people
Type of transportation used: Taxi -> subway -> walked
Type of facility used: Gangnam Station McDonald's
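To make the idea of combining route data with patient info concrete, here is a minimal sketch that pairs each route stop with the matching patient record by patient ID. The class name `RouteStop`, its field names, the helper `attach_patient_info`, and the sample age and sex values are all hypothetical. Only the route values themselves come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RouteStop:
    patient_id: int
    latitude: float
    longitude: float
    visited_at: str   # kept as text, since the slide gives no year
    contacts: int
    wore_mask: bool
    transport: str
    facility: str


def attach_patient_info(stops, patient_info):
    """Pair each route stop with its patient's info record (or None)."""
    return [(stop, patient_info.get(stop.patient_id)) for stop in stops]


stop = RouteStop(
    patient_id=187218746,
    latitude=35.9,
    longitude=127.7,
    visited_at="March 2nd, 14:32",
    contacts=12,
    wore_mask=True,
    transport="Taxi -> subway -> walked",
    facility="Gangnam station McDonalds",
)
# Hypothetical patient info row; the age and sex values are invented for the demo.
info = {187218746: {"age": "30s", "sex": "female"}}
combined = attach_patient_info([stop], info)
```

Keying the patient info by ID keeps the lookup cheap, and a stop with no matching patient still appears in the result with `None` attached.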
Uniqueness of DS4C dataset
▪ Patient status (age & sex)
▪ Deaths, new infections, and recoveries per day
▪ Infection routes, infection chain network, and dates of diagnosis, symptom onset, and cure
▪ Patient travel routes
▪ COVID-19 events timeline
▪ COVID-19 preventative policies
▪ Others: population flow, hospital beds and medical supplies, major infection stats, vaccine history (BCG, MMR, etc.), physical examination results
What is unique to the DS4C dataset
What everyone else also has
South Korea Covid-19 research analytics
How did the virus travel from the main hub to other countries?
Covid-19 spread in South Korea:
First case | Route | Spread reason
Spread reasons
Preventive Measures
Patient demographics
SIR (Susceptible, Infected, Recovered) model
• Cases infected from overseas
• Flow of SIR
• New variable o(t): number of people infected from overseas at time t
[Diagram: the standard S → I → R flow, and the same flow with an added Overseas (o) input]
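The modified SIR flow can be sketched as a simple day-by-day simulation. The slides do not give the equations, so everything here is an assumption: the rate names `beta` (transmission) and `gamma` (recovery), the discrete daily update, and the choice to add o(t) directly to the infected compartment.

```python
def simulate_sir(s0, i0, r0, beta, gamma, overseas):
    """Step a discrete SIR model once per entry in `overseas`.

    overseas[t] plays the role of o(t): people infected abroad who
    arrive on day t. Adding them straight to I is an assumption.
    The starting population s0 + i0 + r0 must be positive.
    """
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for o_t in overseas:
        n = s + i + r
        new_infections = beta * s * i / n
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries + o_t
        r += new_recoveries
        history.append((s, i, r))
    return history


# Example run with made-up parameters and arrivals.
trajectory = simulate_sir(990, 10, 0, beta=0.3, gamma=0.1, overseas=[2, 0, 5])
```

Because imported cases enter from outside, the total population grows by the sum of o(t) instead of staying constant as in the classic model.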
Overview of Research conducted by DS4C
1. Challenges
2. Data Engineering
3. Research Value
1. Decentralized publication
2. Absence of a unified format
3. Data embedded in natural language
Challenges of Data collection
01. Decentralized publication:
Over 100 counties and cities each publish the data on their own local district website
02. Absence of a unified format:
Each district’s website uses a different format, so crawling is infeasible
NOT machine comprehensible:
“04/17: Patient 136 visited barber shop near Gangnam station exit 3 (15:30), went back home, then went out to Seven-Eleven in front of his house”
Machine comprehensible:
Jack’s Barber Shop | (37.566, 126.978) | beauty_salon
Jong-ro Seven-Eleven | (37.616, 126.961) | store
03. Data embedded in natural language:
Our in-house data engineering tool enabled 15 engineers to process the exponentially growing volume of patient data
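To show why free-text entries need dedicated tooling, here is a small sketch that pulls the structured parts out of one entry with regular expressions. It is not the in-house tool mentioned above, whose design the slides do not describe. The entry layout ("MM/DD: Patient N ..."), the function name `parse_entry`, and the output keys are assumptions based on the single example sentence.

```python
import re

# Assumed layout: "<month>/<day>: Patient <id> <free text>"
ENTRY = re.compile(
    r"^(?P<month>\d{1,2})/(?P<day>\d{1,2}):\s*Patient\s+(?P<pid>\d+)\s+(?P<body>.+)$"
)
CLOCK = re.compile(r"\((\d{1,2}:\d{2})\)")


def parse_entry(line):
    """Return a dict of extracted fields, or None if the layout differs."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    body = match.group("body")
    return {
        "month": int(match.group("month")),
        "day": int(match.group("day")),
        "patient_id": int(match.group("pid")),
        "times": CLOCK.findall(body),
        # Splitting on commas is a crude stand-in for real sentence analysis.
        "segments": [part.strip() for part in body.split(",") if part.strip()],
    }


record = parse_entry(
    "04/17: Patient 136 visited barber shop near Gangnam station exit 3 (15:30), "
    "went back home, then went out to Seven-Eleven in front of his house"
)
```

The regex recovers the date, patient ID, and clock time, but turning "barber shop near Gangnam station exit 3" into a named place with coordinates still needs a geocoding step or human review, which this sketch leaves out.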
Research Value
Vini Jaiswal Denny Lee
Research mentorship
Guidance on data engineering
Collaboration help with other organizations
Value of Open Source Community
Not even the Korean government has a centralized dataset...
1. Deleted in 3 days
2. Updated sporadically
3. Distributed by over 100 municipal districts
4. No unifying format
5. Data in natural language
Todo: Data anonymization
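The slides list anonymization only as a to-do, so the following is purely an illustrative sketch of two common building blocks, not the DS4C approach: replacing patient IDs with a keyed hash and rounding coordinates. The function names, the 16-character truncation, and the rounding precision are all assumptions.

```python
import hashlib
import hmac


def pseudonymize_id(patient_id, secret_key):
    """Replace a patient ID with a keyed hash so raw IDs are not published.

    secret_key must be bytes and kept private by the publisher.
    """
    digest = hmac.new(secret_key, str(patient_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def coarsen_location(latitude, longitude, decimals=2):
    """Round coordinates so an exact address is harder to recover."""
    return (round(latitude, decimals), round(longitude, decimals))
```

The same ID always maps to the same pseudonym under one key, so routes can still be linked to patient records. Rounding alone may not be enough to prevent re-identification when visit times and facility types are also published.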
Isaac Lee @ CMU
joongkul@andrew.cmu.edu
Please reach out to us if you would like to take part in our
effort or collaborate with us
Contact us!
collaborate@databricks.com
Visit our hub!
https://databricks.com/databricks-covid-19-resource-hub
Closing remarks
Thank you!
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
1. Importance of data accessibility (Vini)
2. Why is DS4C unique (Isaac)
- take a look at the raw data very briefly
3. Use case 1: dashboard (Vini)
4. Use case 2: research papers (Isaac)
5. Data engineering: the 3 steps (Isaac)
6. Value of community (Isaac)
7. Closeout and feedback (Vini)
Data science for covid-19
Who we are:
- Non-profit organization founded by data analysts and machine learning researchers
- 16 Masters and PhD students from: Carnegie Mellon University, Seoul National University,
Hanyang University, Kyunghee University
1. Synthesized card history + phone GPS + closed-circuit cameras
2. Coordinate and time of every location visited
3. Number of people contacted
4. Wore mask or not
Mass outbreak cases data
Patient travel routes (unreleased)
Korea’s social distancing policy data
3rd most used COVID-19 dataset worldwide
Downloads: 70,000
Contributors: 300
Started out as a small project with a couple of friends and myself...
Now, we have over 20 volunteer data engineers from universities all over Korea
Over $100K in funding from the Korean government, Microsoft, and others
Dozens of world-class research institutions are conducting research with the DS4C dataset
The DS4C dataset was the foundational source for many research papers from world-class research institutions

Role of Data Accessibility During Pandemic

  • 1. Role of data accessibility during Pandemic Vini Jaiswal Data and Spark Evangelist @ Databricks Isaac Lee Researcher @ Carnegie Mellon University
  • 2. Who are we? Isaac Lee: ● Chief director | DS4C (Data Science for Covid) ● Software development team lead | Mindslab ● BS in computer science | Carnegie Mellon University. Vini Jaiswal: ● Customer Success Engineer @ Databricks, “Making Data People successful with their data and ML/AI use cases” ● Data Science Engineering Lead - Citi ● Data Intern - Southwest Airlines ● MS in Information Technology & Management - UTDallas
  • 3. Agenda ▪ Importance of data accessibility ▪ Overview of research conducted by DS4C ▪ Data collection process ▪ Uniqueness of DS4C dataset ▪ Challenges ▪ Research Value ▪ Value of open source community
  • 4. Agenda ▪ Importance of data accessibility ▪ Overview of research conducted by Data Science for Covid-19 (DS4C) ▪ Value of open source community
  • 5. Coronavirus disease (COVID-19) Evolution ● Dec 31, 2019: Pneumonia outbreak of unknown cause. Wuhan Municipal Health Commission made the first public announcement, confirming 27 cases ● Early Jan 2020: Unknown pathogen confirmed as a novel coronavirus. Wuhan reported its first 41 confirmed COVID-19 cases along with one death; spread to other Chinese provinces ● Jan 20, 2020: Human-to-human transmission confirmed by the WHO and Chinese authorities; links to the Huanan Seafood Wholesale Market; PPE strongly recommended ● Jan 20 - Feb 11, 2020: Public emergency declared by WHO. Jan 23: Wuhan goes into lockdown. Outbreak spread by a factor of 100 to 200 times. Italy had its first confirmed cases on 31 January 2020, two tourists from China. Feb 11: WHO officially named the disease COVID-19; ICTV named the virus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) ● 11 March 2020: COVID-19 declared a pandemic. Global cases reported; Europe became a major hub by March 13, 2020; outbreak started spreading in the US starting March 7, 2020 ● As of 28 April 2021, more than 149 million cases have been confirmed, with more than 3.14 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history.
  • 6. Role of Data Availability 1. Surge in Covid-19 cases: It became crucial to know the origin, causes of spikes, affected regions and spread reasons 2. Lack of cohesive data: Sources of data were limited and data quality was a challenge 3. Data availability on open platforms: Level of details, including route info, policies, patient data, recovered vs death cases, etc. 4. Data Science for Covid (DS4C) dataset: DS4C is a non-profit organization founded by data analysts and ML researchers who wanted to contribute to fighting COVID-19.
  • 7. The South Korea COVID-19 Dataset
  • 8. Popularity on World’s biggest data science forum (1st, 2nd and 3rd place podium) ● The White House ● Johns Hopkins University ● DS4C dataset
  • 9. DS4C Tables
      Dataset | Category | Description
      Case | Case | Data of COVID-19 infection cases in South Korea
      PatientInfo | Patient | Epidemiological data of COVID-19 patients in South Korea
      PatientRoute | Patient | Route data of COVID-19 patients in South Korea
      Time | Time Series | Time series data of COVID-19 status in South Korea
      TimeAge | Time Series | Time series data of COVID-19 status in terms of age in South Korea
      TimeGender | Time Series | Time series data of COVID-19 status in terms of gender in South Korea
      TimeProvince | Time Series | Time series data of COVID-19 status in terms of province in South Korea
      Source: https://www.kaggle.com/kimjihoo/coronavirusdataset
      /databricks-datasets/COVID-19/coronavirusdataset
  • 11. DS4C Dataset Johns Hopkins Dataset Patient Info Table
  • 14. S. Korea’s social distancing policy data
  • 16. Source of Route Data ● Cellphone GPS locations ● Credit card history ● Surveillance camera footage
  • 17. Patient Route Data (Patient ID 187218746) ● GPS location ● Number of contacted people ● Use of mask ● Time of visit ● Results of COVID-19 tests of the contacted people ● Type of transportation used ● Type of facility used (e.g. restaurant)
  • 18. Patient Route Data combined with Patient Info ● GPS location: (35.9, 127.7) ● Number of contacted people: 12 ● Use of mask: True ● Time of visit: March 2nd, 14:32 ● Results of COVID-19 tests of the contacted people ● Type of transportation used: Taxi -> subway -> walked ● Type of facility used: Gangnam station McDonalds
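A minimal sketch of how a route record like the one on slide 18 could be joined to patient info by patient ID. The class names, field names, and the age/sex values are hypothetical illustrations, not the DS4C schema; only the route values shown on the slide are taken from the deck.

```python
from dataclasses import dataclass


@dataclass
class PatientInfo:
    # Hypothetical fields; the real PatientInfo table may differ.
    patient_id: int
    age: str
    sex: str


@dataclass
class RouteStop:
    # Field names are assumptions that mirror the bullets on the slide.
    patient_id: int
    latitude: float
    longitude: float
    contacts: int
    wore_mask: bool
    visited_at: str
    transport: str
    facility: str


def join_routes(patients, stops):
    """Pair each route stop with its patient record, dropping unmatched stops."""
    by_id = {p.patient_id: p for p in patients}
    return [(by_id[s.patient_id], s) for s in stops if s.patient_id in by_id]


# Age and sex are invented for illustration; the route values come from the slide.
patients = [PatientInfo(187218746, "30s", "male")]
stops = [
    RouteStop(187218746, 35.9, 127.7, 12, True, "03-02 14:32",
              "taxi -> subway -> walked", "Gangnam station McDonalds"),
    RouteStop(999, 37.5, 127.0, 3, False, "03-03 09:10", "bus", "cafe"),
]
joined = join_routes(patients, stops)
for patient, stop in joined:
    print(patient.patient_id, patient.age, stop.facility, stop.contacts)
```

The second stop has no matching patient and is dropped, which is one possible policy; a real pipeline might instead keep unmatched stops for review.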
  • 19. Uniqueness of DS4C dataset ▪ Patient status (age & sex) ▪ Deaths, new infections, and recovered per day ▪ Infection routes, infection chain network, diagnosed, symptom onset, and cured dates ▪ Patient travel routes ▪ COVID-19 events timeline ▪ COVID-19 preventative policies ▪ Etc.: Population flow, hospital beds and medical supplies, major infection stats, vaccine history (BCG, MMR, etc.), physical examination results. Legend: what is unique to the DS4C dataset vs. what everyone else also has
  • 20. South Korea Covid-19 research analytics
  • 21. How did the virus travel from the main hub to other countries?
  • 22. Covid-19 spread in South Korea: First case | Route | Spread reason
  • 26. SIR (Susceptible, Infected, Recovered) model • Cases infected from overseas • Flow of SIR: S → I → R, extended with an Overseas source o • New variable: o(t): # of people infected from overseas at t
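A minimal sketch of a discrete-time SIR simulation with an overseas term o(t), to make the idea on slide 26 concrete. The slide does not give the equations, so everything here is an assumption: the rates `beta` and `gamma`, the daily time step, and the choice to add imported cases directly to the infected compartment (which grows the total population rather than drawing from S).

```python
def simulate_sir(s0, i0, r0, beta, gamma, overseas, days):
    """Simulate S, I, R day by day; overseas(t) is the number of imported cases on day t.

    Assumes 0 <= gamma <= 1 and a positive starting population.
    """
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for t in range(days):
        n = s + i + r
        # Local transmission, capped so S never goes negative.
        new_infections = min(beta * s * i / n, s)
        recoveries = gamma * i
        # Assumption: imported cases enter I directly and are not taken from S.
        imported = overseas(t)
        s -= new_infections
        i += new_infections - recoveries + imported
        r += recoveries
        history.append((s, i, r))
    return history


# One imported case per day for the first five days, then none (illustrative values).
history = simulate_sir(1000, 0, 0, beta=0.3, gamma=0.1,
                       overseas=lambda t: 1.0 if t < 5 else 0.0, days=30)
# Without any imported or initial cases, no outbreak starts.
baseline = simulate_sir(1000, 0, 0, beta=0.3, gamma=0.1,
                        overseas=lambda t: 0.0, days=10)
print(history[-1])
```

With no initial infections, the outbreak in this sketch is seeded entirely by the overseas term, which is the behavior the extra variable is meant to capture.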
  • 27. Overview of Research conducted by DS4C 1. Challenges 2. Data Engineering 3. Research Value
  • 28. Challenges of data collection 1. Decentralized publication 2. Absence of unified formatting 3. Data embedded in natural language
  • 29. 01. Decentralized publication: Over 100 counties and cities each publish the data on their own local district website
  • 30. 02. Absence of unified formatting: Each district’s website has a different format, so crawling is infeasible
  • 31. 03. Data embedded in natural language. NOT machine comprehensible: “04/17: Patient 136 visited barber shop near Gangnam station exit 3 (15:30), went back home, then went out to Seven-Eleven in front of his house”. Machine comprehensible: Jack’s Barber Shop (37.566, 126.978) beauty_salon; Jong-ro Seven-Eleven (37.616, 126.961) store
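A minimal sketch of the first step in turning a free-text announcement like the one on slide 31 into a structured record. This is not the DS4C in-house tool; the function name, the regular expressions, and the output fields are hypothetical. It only extracts the date, patient number, and per-step times; resolving each step to a named place, coordinates, and facility type would need a separate lookup that is not shown.

```python
import re

# Assumed announcement shape: "MM/DD: Patient N <comma-separated steps>".
HEADER = re.compile(r"^(?P<date>\d{2}/\d{2}): Patient (?P<pid>\d+) (?P<body>.*)$")
TIME = re.compile(r"\((\d{2}:\d{2})\)")


def parse_announcement(text):
    """Split one announcement into a patient record with a list of route steps."""
    match = HEADER.match(text.strip())
    if match is None:
        raise ValueError("unrecognized announcement format")
    steps = []
    for raw in match.group("body").split(", "):
        time_match = TIME.search(raw)
        steps.append({
            # Step description with the "(HH:MM)" marker removed.
            "description": TIME.sub("", raw).strip(),
            "time": time_match.group(1) if time_match else None,
        })
    return {
        "date": match.group("date"),
        "patient_id": int(match.group("pid")),
        "steps": steps,
    }


text = ("04/17: Patient 136 visited barber shop near Gangnam station exit 3 "
        "(15:30), went back home, then went out to Seven-Eleven in front of his house")
record = parse_announcement(text)
for step in record["steps"]:
    print(record["patient_id"], step["time"], step["description"])
```

A fixed pattern like this only works for one district's phrasing, which illustrates why the lack of a unified format made automated crawling hard.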
  • 32. Our in-house data engineering tool enabled 15 engineers to process exponentially growing patient data
  • 36. Vini Jaiswal, Denny Lee ● Research mentorship ● Guidance on data engineering ● Collaboration help with other organizations
  • 37. Value of Open Source Community
  • 38. Not even the Korean government has a centralized dataset... 1. Deleted in 3 days 2. Updated sporadically 3. Distributed by over 100 municipal districts 4. No unifying format 5. Data in natural language
  • 40. Isaac Lee @ CMU joongkul@andrew.cmu.edu Please reach out to us if you would like to take part in our effort or collaborate with us
  • 41. Closing remarks ● Contact us! collaborate@databricks.com ● Visit our hub! https://databricks.com/databricks-covid-19-resource-hub
  • 43. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 44. 1. Importance of data accessibility (Vini) 2. Why is DS4C unique (Isaac) - take a look at the raw data very briefly 3. Use case 1: dashboard (Vini) 4. Use case 2: research papers (Isaac) 5. Data engineering: the 3 steps (Isaac) 6. Value of community (Isaac) 7. Closeout and feedback (Vini)
  • 45. Data science for covid-19 Who we are: - Non-profit organization founded by data analysts and machine learning researchers - 16 Masters and PhD students from: Carnegie Mellon University, Seoul National University, Hanyang University, Kyunghee University
  • 46. ● Mass outbreak cases data ● Patient travel routes (unreleased): 1. Synthesized card history + phone GPS + closed-circuit cameras 2. Coordinate and time of every location visited 3. Number of people contacted 4. Wore mask or not ● Korea’s social distancing policy data
  • 48. 3rd most used COVID-19 dataset worldwide ● Downloads: 70,000 ● Contributors: 300
  • 49. Started out as a small project with a couple of friends and myself... Now, we have over 20 volunteer data engineers from universities all over Korea. Over $100K in funding from the Korean government, Microsoft, and others
  • 51. Dozens of world-class research institutions conducting research with DS4C dataset
  • 52. DS4C dataset was the foundational source for many research papers from world-class research institutions