This talk focuses on the importance of data access, and on how crucial granular, openly available data is for fueling the work of researchers and data teams.
We present the research conducted by the DS4C (Data Science for Covid-19) team, who made a large and highly detailed collection of South Korean Covid-19 data available to the wider community. The DS4C dataset was one of the most impactful datasets on Kaggle, with over fifty thousand cumulative downloads and 300 unique contributors. What makes the DS4C dataset so potent is the sheer amount of data collected for each patient. The Korean government has been collecting and releasing patient information at an unprecedented level of detail. The released data includes infected people’s travel routes, the public transport they took, and the medical institutions treating them. This extremely fine-grained detail is what makes the DS4C dataset valuable: it makes it easier for researchers and data scientists to identify trends, gather evidence for their hypotheses, track down causes, and gain additional insights. We will cover the data challenges and the impact that making this data available on a public forum had on the community, and conclude with an insightful visual representation.
Role of Data Accessibility During Pandemic
1. Role of data accessibility during Pandemic
Vini Jaiswal
Data and Spark Evangelist @ Databricks
Isaac Lee
Researcher @ Carnegie Mellon University
2. Who are we?
Isaac Lee
● Chief director | DS4C (Data Science for Covid)
● Software development team lead | Mindslab
● BS in computer science | Carnegie Mellon University
Vini Jaiswal
● Customer Success Engineer @ Databricks
“Making Data People successful with their data and ML/AI use cases”
● Data Science Engineering Lead - Citi
● Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
3. Agenda
▪ Importance of data accessibility
▪ Overview of research conducted by DS4C
▪ Data collection process
▪ Uniqueness of DS4C dataset
▪ Challenges
▪ Research Value
▪ Value of open source community
5. Coronavirus disease (COVID-19) Evolution
Dec 31, 2019
Pneumonia outbreak of unknown cause
Wuhan Municipal Health Commission made the first public announcement, confirming 27 cases
Early Jan 2020
Unknown pathogen confirmed as novel coronavirus
● Wuhan confirmed its first 41 COVID-19 cases, along with one reported death
● Spread to other Chinese provinces
Jan 20, 2020
Human-to-human transmission
● Human-to-human transmission was confirmed by the WHO and Chinese authorities
● Links to the Huanan Seafood Wholesale Market
● PPE strongly recommended
Jan 20 - Feb 11, 2020
Public emergency declared by WHO
● Jan 23: Wuhan goes into lockdown
● The outbreak grew by a factor of 100 to 200
● Italy had its first confirmed cases on 31 January 2020, two tourists from China
● Feb 11: WHO officially named the disease COVID-19
● ICTV named the virus SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2)
11 March 2020
Covid-19 declared a pandemic
- Global cases reported
- Europe became a major hub by March 13, 2020
- Outbreak started spreading in the US from March 7, 2020
As of 28 April 2021, more than 149 million cases have been confirmed, with more than 3.14 million deaths attributed to COVID-19, making it one of the deadliest pandemics in history.
6. Role of Data Availability
1 Surge in Covid-19 Cases
It became crucial to know the origin, causes of spikes, affected regions and spread reasons
2 Lack of cohesive Data
Sources of data were limited and data quality was a challenge
3 Data availability on open platforms
Level of details, including route info, policies, patient data, recovered vs death cases, etc.
4 Data Science for Covid (DS4C) dataset
DS4C is a non-profit organization founded by data analysts and ML researchers who wanted to contribute to fighting COVID-19.
8. Popularity on World’s biggest data science forum
Podium graphic (1st, 2nd, 3rd place) showing three datasets:
The White House
Johns Hopkins University
DS4C dataset
9. DS4C Tables
Dataset       Category     Description
Case          Case         Data of COVID-19 infection cases in South Korea
PatientInfo   Patient      Epidemiological data of COVID-19 patients in South Korea
PatientRoute  Patient      Route data of COVID-19 patients in South Korea
Time          Time Series  Time series data of COVID-19 status in South Korea
TimeAge       Time Series  Time series data of COVID-19 status in terms of age in South Korea
TimeGender    Time Series  Time series data of COVID-19 status in terms of gender in South Korea
TimeProvince  Time Series  Time series data of COVID-19 status in terms of province in South Korea
Source: https://www.kaggle.com/kimjihoo/coronavirusdataset
/databricks-datasets/COVID-19/coronavirusdataset
16. Source of Route Data
Cellphone GPS locations
Credit card history
Surveillance camera footage
17. Patient Route Data
Patient ID 187218746
GPS Location
Number of contacted people
Use of mask
Time of visit
Results of COVID-19 tests of the contacted people
Type of transportation used
Type of facility used (e.g., restaurant)
18. Patient Route Data combined with Patient Info
GPS Location: (35.9, 127.7)
Number of contacted people: 12
Use of mask: True
Time of visit: March 2, 14:32
Results of COVID-19 tests of the contacted people
Type of transportation used: Taxi -> subway -> walked
Type of facility used: Gangnam station McDonalds
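As a rough illustration of how route records might be combined with patient information, here is a minimal sketch of an inner join on the patient ID. The record types, field names, and the age and sex values are hypothetical and chosen only to mirror the fields listed on the slides; they are not the actual DS4C schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientInfo:
    # Hypothetical fields; the real PatientInfo table may differ.
    patient_id: int
    age: str
    sex: str


@dataclass(frozen=True)
class RouteStop:
    # Hypothetical fields mirroring the route attributes on the slide.
    patient_id: int
    latitude: float
    longitude: float
    visited_at: str
    contacts: int
    wore_mask: bool
    transport: str
    facility: str


def join_routes_with_info(patients, stops):
    """Pair each route stop with its patient record (inner join on patient_id)."""
    by_id = {p.patient_id: p for p in patients}
    return [(by_id[s.patient_id], s) for s in stops if s.patient_id in by_id]


# Illustrative data only: the age and sex values are made up.
patients = [PatientInfo(187218746, "30s", "male")]
stops = [
    RouteStop(187218746, 35.9, 127.7, "03-02 14:32", 12, True,
              "taxi, subway, walk", "Gangnam station McDonalds"),
    RouteStop(999, 37.5, 127.0, "03-03 09:00", 3, False, "bus", "cafe"),
]

joined = join_routes_with_info(patients, stops)
for patient, stop in joined:
    print(patient.patient_id, patient.age, stop.facility, stop.contacts)
```

Stops whose patient ID has no matching patient record are dropped, which is the usual inner-join behavior; a left join would keep them with an empty patient side.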
19. Uniqueness of DS4C dataset
▪ Patient status (age & sex)
▪ Deaths, new infections, and recovered per day
▪ Infection routes, infection chain network, diagnosed, symptom onset, and cured dates
▪ Patient travel routes
▪ COVID-19 events timeline
▪ COVID-19 preventative policies
▪ Etc.: Population flow, hospital beds and medical supplies, major infection stats, vaccine history (BCG, MMR, etc.), physical examination results
What is unique to the DS4C dataset
What everyone else also has
26. SIR (Susceptible, Infected, Recovered) model
• Cases infected from overseas
• Flow of SIR
• New variable o(t): number of people infected overseas at time t
Standard flow: S → I → R
Flow with overseas cases: S → I → R, plus Overseas → I (o)
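The slide does not give the model equations, so the following is only a minimal sketch of one way an overseas term could enter a discrete-time SIR model. It assumes that imported cases o(t) are added directly to the infected compartment each day, and that `beta` (transmission rate) and `gamma` (recovery rate) are constants; the function name, parameter names, and values are illustrative, not taken from the DS4C research.

```python
def simulate_sir_with_imports(s0, i0, r0, beta, gamma, imported):
    """Discrete-time SIR where imported[t] overseas cases join I at step t.

    Assumed update rule (not from the slides):
        S' = S - beta * S * I / N
        I' = I + beta * S * I / N - gamma * I + o(t)
        R' = R + gamma * I
    """
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for o_t in imported:
        n = s + i + r  # population grows as imported cases arrive
        new_infections = beta * s * i / n
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries + o_t
        r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative run: no local cases at first, 5 imported cases per day.
history = simulate_sir_with_imports(1000, 0, 0, beta=0.3, gamma=0.1,
                                    imported=[5] * 10)
for day, (s, i, r) in enumerate(history):
    print(f"day {day}: S={s:.1f} I={i:.1f} R={r:.1f}")
```

With no infected people at the start, the outbreak in this sketch is seeded entirely by the overseas term, which is the situation the extra variable is meant to capture.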
27. Overview of Research conducted by DS4C
1. Challenges
2. Data Engineering
3. Research Value
28. Challenges of Data collection
1. Decentralized publication
2. Absence of unified formatting
3. Data embedded in natural language
30. 02. Absence of unified formatting:
Each district’s website has a different format, so crawling is infeasible
31. 03. Data embedded in natural language:
NOT machine comprehensible:
“04/17: Patient 136 visited barber shop near Gangnam station exit 3 (15:30), went back home, then went out to Seven-Eleven in front of his house”
Machine comprehensible:
Jack’s Barber Shop, (37.566, 126.978), beauty_salon, Jong-ro
Seven-Eleven, (37.616, 126.961), store
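To make the idea of turning a free-text route announcement into structured records concrete, here is a toy sketch. It is not the DS4C in-house tool, which the slides do not describe; it assumes a fixed "MM/DD: Patient N ..." sentence shape and a small hypothetical gazetteer (`GAZETTEER`) that maps place keywords to the names, coordinates, and facility types shown on the slide.

```python
import re

# Hypothetical lookup table; a real pipeline would need geocoding and review.
GAZETTEER = {
    "barber shop": ("Jack's Barber Shop", 37.566, 126.978, "beauty_salon"),
    "seven-eleven": ("Seven-Eleven", 37.616, 126.961, "store"),
}

HEADER = re.compile(r"^(\d{2}/\d{2}): Patient (\d+) (.+)$")
TIME = re.compile(r"\((\d{1,2}:\d{2})\)")


def parse_route(sentence):
    """Extract one structured visit per clause that mentions a known place."""
    match = HEADER.match(sentence.strip())
    if match is None:
        raise ValueError("unrecognized route sentence")
    date, patient_id, body = match.groups()
    visits = []
    for clause in body.split(","):
        lowered = clause.lower()
        for keyword, (name, lat, lon, kind) in GAZETTEER.items():
            if keyword in lowered:
                time = TIME.search(clause)
                visits.append({
                    "patient_id": int(patient_id),
                    "date": date,
                    "time": time.group(1) if time else None,
                    "name": name,
                    "latitude": lat,
                    "longitude": lon,
                    "type": kind,
                })
    return visits


SENTENCE = ("04/17: Patient 136 visited barber shop near Gangnam station "
            "exit 3 (15:30), went back home, then went out to Seven-Eleven "
            "in front of his house")

visits = parse_route(SENTENCE)
for visit in visits:
    print(visit)
```

Clauses that mention no known place, such as "went back home", are skipped. Real announcements vary far more than this pattern allows, which is part of why the slides describe the data as hard to collect automatically.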
32. Our in-house data engineering tool enabled
15 engineers to process exponentially growing volumes of patient data
38. Not even the Korean government
has a centralized dataset...
1. Deleted in 3 days
2. Updated sporadically
3. Distributed by over 100 municipal districts
4. No unifying format
5. Data in natural language
44. 1. Importance of data accessibility (Vini)
2. Why DS4C is unique (Isaac)
- take a look at the raw data very briefly
3. Use case 1: dashboard (Vini)
4. Use case 2: research papers (Isaac)
5. Data engineering: the 3 steps (Isaac)
6. Value of community (Isaac)
7. Closeout and feedback (Vini)
45. Data science for covid-19
Who we are:
- Non-profit organization founded by data analysts and machine learning researchers
- 16 Masters and PhD students from: Carnegie Mellon University, Seoul National University,
Hanyang University, Kyunghee University
46. 1. Synthesized card history + phone GPS + closed-circuit cameras
2. Coordinate and time of every location visited
3. Number of people contacted
4. Wore mask or not
Mass outbreak cases data
Patient travel routes (unreleased)
Korea’s social distancing policy data
48. 3rd most used COVID-19 dataset worldwide
Downloads: 70,000
Contributors: 300
49. Started out as a small project with a couple of friends and myself...
Now, we have over 20 volunteer data engineers from universities all over Korea
Over $100K in funding from the Korean government, Microsoft, and others
52. DS4C dataset was the foundational source for
many research papers from world-class research institutions