Using Multiple Big Datasets and Machine Learning
to Produce a New Global Particulate Dataset
A Technology Challenge Case Study
David Lary
Hanson Center for Space Science
University of Texas at Dallas
What?
Why?
Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)).
x, few studies; xx, many studies; xxx, large number of studies; -, no entry.

                                            Short-term studies   Long-term studies
Health outcomes                             PM10   PM2.5  UFP    PM10   PM2.5  UFP
Mortality
    All causes                              xxx    xxx    x      xx     xx     x
    Cardiovascular                          xxx    xxx    x      xx     xx     x
    Pulmonary                               xxx    xxx    x      xx     xx     x
Pulmonary effects
    Lung function, e.g., PEF                xxx    xxx    xx     xxx    xxx    -
    Lung function growth                    -      -      -      xxx    xxx    -
Asthma and COPD exacerbation
    Acute respiratory symptoms              -      xx     x      xxx    xxx    -
    Medication use                          -      -      x      -      -      -
    Hospital admission                      xx     xxx    x      -      -      -
Lung cancer
    Cohort                                  -      -      -      xx     xx     x
    Hospital admission                      -      -      -      xx     xx     x
Cardiovascular effects
    Hospital admission                      xxx    xxx    -      x      x      -
ECG-related endpoints
    Autonomic nervous system                xxx    xxx    xx     -      -      -
    Myocardial substrate and vulnerability  -      xx     x      -      -      -
Vascular function
    Blood pressure                          xx     xxx    x      -      -      -
    Endothelial function                    x      xx     x      -      -      -
Blood markers
    Pro-inflammatory mediators              xx     xx     xx     -      -      -
    Coagulation blood markers               xx     xx     xx     -      -      -
    Diabetes                                x      xx     x      -      -      -
    Endothelial function                    x      x      xx     -      -      -
Reproduction
    Premature birth                         x      x      -      -      -      -
    Birth weight                            xx     x      -      -      -      -
    IUR/SGA                                 x      x      -      -      -      -
Fetal growth
    Birth defects                           x      -      -      -      -      -
    Infant mortality                        xx     x      -      -      -      -
    Sperm quality                           x      x      -      -      -      -
Neurotoxic effects
    Central nervous system                  -      x      xx     -      -      -

[Figure: particle size scale from 0.0001 μm to 1000 μm (1 mm), showing:
Types of biological material: Mold Spores, Cell, Pollen, House Dust Mite Allergens, Cat Allergens, Bacteria, Hair, Viruses
Types of dust: Heavy Dust, Settling Dust, Suspended Atmospheric Dust, Cement Dust, Fly Ash
Types of particulates: Oil Smoke, Smog, Tobacco Smoke, Soot, Gas Molecules
Other label: Pin
Size classes: PM10 particles, PM10-2.5 coarse fraction, PM2.5 particles, PM0.1 ultra fine particles
Health effects by particle size: Decreased Lung Function < 10 μm; Skin & Eye Disease < 2.5 μm; Tumors < 1 μm; Cardiovascular Disease < 0.1 μm]
Why?
How?
Used around 40 different BigData sets from satellites, meteorology,
demographics, scraped websites and social media to estimate PM2.5. The
accompanying plot shows the average of 5,935 days, from August 1, 1997 to the
present.
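The long-term average described above can be accumulated one day at a time, so that only one daily field has to be held in memory. A minimal Python sketch, assuming each day arrives as a flat list of grid-cell values with `None` marking cells that have no estimate for that day. The `RunningMean` class and this data layout are illustrative assumptions, not the author's implementation (which the slides say was written in Matlab).

```python
from typing import List, Optional


class RunningMean:
    """Per-cell mean over many daily fields, skipping missing cells."""

    def __init__(self, n_cells: int) -> None:
        self.total = [0.0] * n_cells   # sum of valid values per cell
        self.count = [0] * n_cells     # number of valid days per cell

    def add_day(self, field: List[Optional[float]]) -> None:
        if len(field) != len(self.total):
            raise ValueError("field has the wrong number of cells")
        for i, value in enumerate(field):
            if value is not None:      # None marks a gap for that day
                self.total[i] += value
                self.count[i] += 1

    def mean(self) -> List[Optional[float]]:
        # A cell that was never observed stays None rather than becoming 0.
        return [t / c if c else None for t, c in zip(self.total, self.count)]


acc = RunningMean(n_cells=3)
acc.add_day([10.0, None, 4.0])
acc.add_day([20.0, None, None])
print(acc.mean())  # [15.0, None, 4.0]
```

Keeping a separate count per cell matters here because the coverage differs from cell to cell, so dividing every cell by the total number of days would bias sparsely observed cells low.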
Which Platform?
Requirements:
1. Large persistent storage for multiple BigData sets, 100 TB+ (otherwise the
scratch-space time limit expires before there has been time to process the
massive datasets)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
4. High-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric,
non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate
workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes,
1 hour, 3 hours and 1 day)
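Requirement 7 amounts to a fixed-interval scheduler. A minimal Python sketch of the interval logic, assuming all intervals are whole minutes and that time is counted in minutes since the workflow started. The task names are hypothetical placeholders, since the slides do not say which task runs at which interval.

```python
from typing import Dict, List

# Intervals from the slide, in minutes. The task names are hypothetical.
INTERVALS_MIN: Dict[str, int] = {
    "task_5min": 5,
    "task_15min": 15,
    "task_hourly": 60,
    "task_3hourly": 180,
    "task_daily": 1440,
}


def due_tasks(minute: int, intervals: Dict[str, int] = INTERVALS_MIN) -> List[str]:
    """Return the tasks whose interval divides the elapsed minutes."""
    return [name for name, every in intervals.items() if minute % every == 0]


def runs_per_day(intervals: Dict[str, int] = INTERVALS_MIN) -> Dict[str, int]:
    """Number of executions of each task in one 24-hour day."""
    return {name: 1440 // every for name, every in intervals.items()}


print(due_tasks(180))  # ['task_5min', 'task_15min', 'task_hourly', 'task_3hourly']
print(runs_per_day()["task_5min"])  # 288
```

In practice a system scheduler such as cron would normally trigger the tasks; the sketch only shows how the five intervals nest inside one another.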
How?

Existing
• Social Media
• Socioeconomic, Census
• News feeds
• Environmental
• Weather
• Satellite
• Sensors
• Health
• Economic

New
• UAVs
• Smart Dust
• Autonomous Cars
• Sensors

Simulation
• Global Weather Models
• Economic Models
• Earthquake Models

Data → Machine Learning → Insight

The same approach is highly relevant to the validation and optimal
exploitation of the next generation of satellites, e.g. the upcoming
NASA Decadal Survey missions.
How?

California Children Example
Terra DeepBlue
Rank  Source                   Variable                                Type
1     -                        Population Density                      Input
2     Satellite Product        Tropospheric NO2 Column                 Input
3     Meteorological Analyses  Surface Specific Humidity               Input
4     Satellite Product        Solar Azimuth                           Input
5     Meteorological Analyses  Surface Wind Speed                      Input
6     Satellite Product        White-sky Albedo at 2,130 nm            Input
7     Satellite Product        White-sky Albedo at 555 nm              Input
8     Meteorological Analyses  Surface Air Temperature                 Input
9     Meteorological Analyses  Surface Layer Height                    Input
10    Meteorological Analyses  Surface Ventilation Velocity            Input
11    Meteorological Analyses  Total Precipitation                     Input
12    Satellite Product        Solar Zenith                            Input
13    Meteorological Analyses  Air Density at Surface                  Input
14    Satellite Product        Cloud Mask Qa                           Input
15    Satellite Product        Deep Blue Aerosol Optical Depth 470 nm  Input
16    Satellite Product        Sensor Zenith                           Input
17    Satellite Product        White-sky Albedo at 858 nm              Input
18    Meteorological Analyses  Surface Velocity Scale                  Input
19    Satellite Product        White-sky Albedo at 470 nm              Input
20    Satellite Product        Deep Blue Angstrom Exponent Land        Input
21    Satellite Product        White-sky Albedo at 1,240 nm            Input
22    Satellite Product        Scattering Angle                        Input
23    Satellite Product        Sensor Azimuth                          Input
24    Satellite Product        Deep Blue Surface Reflectance 412 nm    Input
25    Satellite Product        White-sky Albedo at 1,640 nm            Input
26    Satellite Product        Deep Blue Aerosol Optical Depth 660 nm  Input
27    Satellite Product        White-sky Albedo at 648 nm              Input
28    Satellite Product        Deep Blue Surface Reflectance 660 nm    Input
29    Satellite Product        Cloud Fraction Land                     Input
30    Satellite Product        Deep Blue Surface Reflectance 470 nm    Input
31    Satellite Product        Deep Blue Aerosol Optical Depth 550 nm  Input
32    Satellite Product        Deep Blue Aerosol Optical Depth 412 nm  Input
-     In-situ Observation      PM2.5                                   Target
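The slide ranks the 32 inputs but does not say how the ranking was computed. One common way to rank the inputs of a fitted model is permutation importance: shuffle one input column and measure how much the prediction error grows. The Python sketch below illustrates that idea on a toy model. The function names, the toy model and the data are all assumptions for illustration, not the author's method.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]


def mse(model: Callable[[Row], float], X: List[Row], y: List[float]) -> float:
    """Mean squared error of the model over the data set."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, seed=0):
    """Increase in mean squared error when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled copy.
        shuffled = [list(row[:j]) + [v] + list(row[j + 1:])
                    for row, v in zip(X, column)]
        scores.append(mse(model, shuffled, y) - baseline)
    return scores


# Toy data: the target depends on column 0 only; column 1 is irrelevant.
X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]
model = lambda row: 2.0 * row[0]   # stand-in for a fitted model

scores = permutation_importance(model, X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranking)  # column indices, most important first
```

A column the model never uses scores exactly zero, which is why the irrelevant column ends up last in the ranking.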
Hourly measurements from 53 countries from 1997-present

A lot of measurements,
but notice the large gaps!
Gaps are inevitable because of the
infrastructure and cost associated with
making the measurements.
Challenge 1: Obtaining the in-situ PM2.5 data
Real time data from:
1. EPA AirNow data for USA and Canada
2. EEA data for Europe
3. Tasmania and Australia
4. Israel
5. Russia
6. Asia and Latin America by scraping http://aqicn.org/map/
7. Harvesting social media (Twitter feeds from US embassies)

Relatively low bandwidth from multiple sites every 5 minutes
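Polling several feeds every 5 minutes means the same hourly reading is often seen more than once, sometimes from more than one source. A minimal Python sketch of de-duplicating such readings, assuming each reading is reduced to a (station, hour, value, source) record and that the first source to report a station-hour wins. The record layout, the station and feed names, and the first-wins rule are illustrative assumptions, not taken from the slides.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Reading(NamedTuple):
    station: str     # hypothetical station identifier
    hour: str        # UTC hour, e.g. "2010-01-01T00"
    pm25: float      # PM2.5 in micrograms per cubic metre
    source: str      # which feed the reading came from


def merge_readings(
    store: Dict[Tuple[str, str], Reading], batch: Iterable[Reading]
) -> int:
    """Add unseen (station, hour) readings to the store; return how many were new."""
    added = 0
    for r in batch:
        key = (r.station, r.hour)
        if key not in store:        # first source to report wins
            store[key] = r
            added += 1
    return added


store: Dict[Tuple[str, str], Reading] = {}
poll_1 = [Reading("A1", "2010-01-01T00", 12.5, "feed_x")]
poll_2 = [
    Reading("A1", "2010-01-01T00", 12.5, "feed_y"),   # duplicate station-hour
    Reading("B7", "2010-01-01T00", 30.0, "feed_y"),
]
new_1 = merge_readings(store, poll_1)
new_2 = merge_readings(store, poll_2)
print(new_1, new_2)  # 1 1
```

Keying on station and hour keeps the store idempotent, so a poll that is repeated or overlaps a previous one does not inflate the training set.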
Challenge 2: (Easier)
Obtaining the Satellite & Meteorological Data
Real time data from:
1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS NPP, etc.
2. Global Meteorological Analyses

High bandwidth from a few sites every 1 to 3 hours
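Before training, each satellite or meteorological sample has to be paired with an in-situ measurement close to it in time. A minimal Python sketch of nearest-in-time matching, assuming both series carry timestamps in minutes, that the in-situ times are sorted, and that a match is accepted only within a tolerance. The 20-minute tolerance and the function name are assumptions for illustration; the slides do not describe the matching rule.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(sorted_times: List[int], t: int, tolerance: int) -> Optional[int]:
    """Index of the entry in sorted_times closest to t, or None if too far away."""
    if not sorted_times:
        return None
    pos = bisect_left(sorted_times, t)
    # The closest entry is either just before or just after the insertion point.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(sorted_times)]
    best = min(candidates, key=lambda i: abs(sorted_times[i] - t))
    return best if abs(sorted_times[best] - t) <= tolerance else None


insitu_times = [0, 60, 120, 180]     # hourly in-situ samples, in minutes
overpass_times = [50, 130, 400]      # satellite overpasses, in minutes

matches = [nearest_index(insitu_times, t, tolerance=20) for t in overpass_times]
print(matches)  # [1, 2, None]
```

The binary search keeps each lookup logarithmic in the number of in-situ samples, which matters when a multi-year hourly record is matched against every overpass.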
Challenge 3:
Combine multiple BigData Sets with Machine Learning
A large-member machine learning ensemble, run with massively parallel
computing, produces the PM2.5 data product.
Algorithms capable of dealing with massive non-linear, non-parametric,
non-Gaussian multivariate datasets (13,000+ variables)
Development time was drastically reduced by using a high-level language
(Matlab) that can easily exploit parallel execution on multiple CPUs and GPUs.

Massively parallel every 3 hours
High-level language which can readily use CPUs and GPUs
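A large-member ensemble can be sketched as many small models, each trained on a bootstrap resample of the training data and evaluated in parallel, with the spread between members serving as a rough uncertainty estimate. The Python sketch below uses k-nearest-neighbour regression as a simple non-parametric stand-in and a thread pool as a stand-in for a cluster. The slides do not name the learning algorithm, and the author's system used Matlab on CPUs and GPUs, so every name and parameter here is an assumption.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Sample = Tuple[float, float]   # (input, target); a single input for brevity


def knn_predict(train: List[Sample], x: float, k: int = 3) -> float:
    """Mean target of the k training samples closest to x."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    return statistics.fmean(t for _, t in nearest)


def member_prediction(args: Tuple[List[Sample], float, int]) -> float:
    """Train one ensemble member on a bootstrap resample and predict at x."""
    data, x, seed = args
    rng = random.Random(seed)
    resample = [rng.choice(data) for _ in data]
    return knn_predict(resample, x)


def ensemble_predict(data: List[Sample], x: float, members: int = 50) -> Tuple[float, float]:
    """Return (ensemble mean, ensemble standard deviation) at x."""
    jobs = [(data, x, seed) for seed in range(members)]
    with ThreadPoolExecutor() as pool:
        preds = list(pool.map(member_prediction, jobs))
    return statistics.fmean(preds), statistics.pstdev(preds)


# Toy non-linear relationship: target = input squared.
data = [(i / 10, (i / 10) ** 2) for i in range(0, 51)]
mean, spread = ensemble_predict(data, x=2.0)
print(round(mean, 2), round(spread, 3))
```

Each member is independent of the others, which is what makes this kind of ensemble easy to spread across many cores or machines; the thread pool here only shows the structure and would not speed up pure-Python work.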
Challenge 4:
Continual Performance Improvement
Currently on around the 400th version of the system.
Continuous improvements have been made in:
1. Coverage of the in-situ training dataset
2. Inclusion of new satellite sensors
3. Additional BigData sets that help improve the fidelity of the non-linear,
non-parametric, non-Gaussian multivariate machine learning fits
4. Use of many alternative machine learning strategies
5. Estimation of uncertainties
This requires frequent reprocessing of the entire multi-year record, from
1997 to the present.

Persistent massive data storage, much more
than the usual scratch space at HPC centers
Fully Automated Workflow

Requires the ability to schedule automated tasks
Requires the ability to disseminate results in multiple formats, including
FTP and web and map services
Key System Requirements:
Not always available on current HPC systems
Requirements:
1. Large persistent storage for multiple BigData sets, 100 TB+ (otherwise the
scratch-space time limit expires before there has been time to process the
massive datasets)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
4. High-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric,
non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate
workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes,
1 hour, 3 hours and 1 day)

Thank you!

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

  • 1. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study. David Lary, Hanson Center for Space Science, University of Texas at Dallas
  • 3. Why? Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)). x, few studies; xx, many studies; xxx, large number of studies.
      Health effects by particle size: Decreased Lung Function < 10 μm; Skin & Eye Disease < 2.5 μm; Tumors < 1 μm; Cardiovascular Disease < 0.1 μm.
      Particle size chart (scale 0.0001 μm to 1000 μm):
        Types of biological material: Mold Spores, Cell, Pollen, House Dust Mite Allergens, Cat Allergens, Bacteria, Hair, Viruses
        Types of dust: Heavy Dust, Settling Dust, Suspended Atmospheric Dust, Cement Dust, Fly Ash
        Types of particulates: Oil Smoke, Smog, Tobacco Smoke, Soot
        Other labels: Pin, Gas Molecules
        Size classes: PM10 particles, PM10-2.5 coarse fraction, PM2.5 particles, PM0.1 ultrafine particles
      Health outcomes, short-term studies ("-" marks an empty cell):
        Outcome                                    PM10   PM2.5  UFP
        Mortality
          All causes                               xxx    xxx    x
          Cardiovascular                           xxx    xxx    x
          Pulmonary                                xxx    xxx    x
        Pulmonary effects
          Lung function, e.g., PEF                 xxx    xxx    xx
          Lung function growth                     -      -      -
        Asthma and COPD exacerbation
          Acute respiratory symptoms               -      xx     x
          Medication use                           -      -      x
          Hospital admission                       xx     xxx    x
        Lung cancer
          Cohort                                   -      -      -
          Hospital admission                       -      -      -
        Cardiovascular effects
          Hospital admission                       xxx    xxx    -
        ECG-related endpoints
          Autonomic nervous system                 xxx    xxx    xx
          Myocardial substrate and vulnerability   -      xx     x
        Vascular function
          Blood pressure                           xx     xxx    x
          Endothelial function                     x      xx     x
        Blood markers
          Pro inflammatory mediators               xx     xx     xx
          Coagulation blood markers                xx     xx     xx
          Diabetes                                 x      xx     x
          Endothelial function                     x      x      xx
        Reproduction
          Premature birth                          x      x      -
          Birth weight                             xx     x      -
          IUR/SGA                                  x      x      -
        Fetal growth
          Birth defects                            x      -      -
          Infant mortality                         xx     x      -
          Sperm quality                            x      x      -
        Neurotoxic effects
          Central nervous system                   -      x      xx
      [Long-term studies columns (PM10, PM2.5, UFP): row alignment not recoverable from the extracted text.]
  • 5. How? Around 40 different big datasets from satellites, meteorology, demographics, scraped websites, and social media were used to estimate PM2.5. The plot below shows the average of 5,935 days from August 1, 1997 to the present.
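A multi-day average like the one described on this slide has to cope with grid cells that have no estimate on some days. The Python sketch below is only an illustration of that bookkeeping, not the code used for the plot (the slides name Matlab as the working language); the tiny 2x2 grids, the use of `None` for missing cells, and the function name `mean_over_days` are all assumptions.

```python
from typing import List, Optional

# One day's PM2.5 field; None marks a cell with no estimate that day (assumed convention).
Grid = List[List[Optional[float]]]


def mean_over_days(days: List[Grid]) -> Grid:
    """Average each grid cell over the days on which it has a value."""
    rows, cols = len(days[0]), len(days[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    for grid in days:
        for i in range(rows):
            for j in range(cols):
                value = grid[i][j]
                if value is not None:
                    total[i][j] += value
                    count[i][j] += 1
    # Cells that never had a value stay None instead of dividing by zero.
    return [
        [total[i][j] / count[i][j] if count[i][j] else None for j in range(cols)]
        for i in range(rows)
    ]


# Two made-up daily grids, purely for illustration.
day1 = [[10.0, None], [4.0, 8.0]]
day2 = [[20.0, None], [None, 12.0]]
average = mean_over_days([day1, day2])
```

Keeping running sums and counts means each new day can be folded in without holding the whole multi-year record in memory.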
  • 13. Which Platform? Requirements:
      1. Large persistent storage for multiple big datasets, 100 TB+ (otherwise the scratch-space time limit expires before there has been time to process the massive datasets)
      2. High-bandwidth connections
      3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
      4. A high-level language with a wide range of optimized toolboxes (Matlab)
      5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
      6. Easy use of multiple GPUs and CPUs
      7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)
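Requirement 7 can be illustrated with a small sketch that computes when each recurring task is next due. This is Python rather than the Matlab named on the slide, and it is only one possible approach: the five intervals come from the slide, while the task names, the alignment of runs to UTC clock boundaries, and the helper `next_run` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Intervals are those listed on the slide; the task names are hypothetical labels.
TASKS = {
    "poll_in_situ_sites": timedelta(minutes=5),
    "refresh_web_scrapes": timedelta(minutes=15),
    "fetch_satellite_data": timedelta(hours=1),
    "run_ml_ensemble": timedelta(hours=3),
    "publish_daily_products": timedelta(days=1),
}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def next_run(now: datetime, interval: timedelta) -> datetime:
    """Return the first interval boundary strictly after `now` (UTC-aligned)."""
    periods_done = (now - EPOCH) // interval  # whole intervals elapsed since the epoch
    return EPOCH + (periods_done + 1) * interval


now = datetime(2014, 3, 15, 14, 7, 30, tzinfo=timezone.utc)
schedule = {name: next_run(now, step) for name, step in TASKS.items()}
for name, when in schedule.items():
    print(f"{name:24s} {when.isoformat()}")
```

Aligning runs to fixed clock boundaries, rather than sleeping for a fixed duration after each run, keeps the tasks from drifting when one execution takes longer than expected.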
  • 16. Terra DeepBlue
      Rank  Source                   Variable                                Type
      1     (none listed)            Population Density                      Input
      2     Satellite Product        Tropospheric NO2 Column                 Input
      3     Meteorological Analyses  Surface Specific Humidity               Input
      4     Satellite Product        Solar Azimuth                           Input
      5     Meteorological Analyses  Surface Wind Speed                      Input
      6     Satellite Product        White-sky Albedo at 2,130 nm            Input
      7     Satellite Product        White-sky Albedo at 555 nm              Input
      8     Meteorological Analyses  Surface Air Temperature                 Input
      9     Meteorological Analyses  Surface Layer Height                    Input
      10    Meteorological Analyses  Surface Ventilation Velocity            Input
      11    Meteorological Analyses  Total Precipitation                     Input
      12    Satellite Product        Solar Zenith                            Input
      13    Meteorological Analyses  Air Density at Surface                  Input
      14    Satellite Product        Cloud Mask Qa                           Input
      15    Satellite Product        Deep Blue Aerosol Optical Depth 470 nm  Input
      16    Satellite Product        Sensor Zenith                           Input
      17    Satellite Product        White-sky Albedo at 858 nm              Input
      18    Meteorological Analyses  Surface Velocity Scale                  Input
      19    Satellite Product        White-sky Albedo at 470 nm              Input
      20    Satellite Product        Deep Blue Angstrom Exponent Land        Input
      21    Satellite Product        White-sky Albedo at 1,240 nm            Input
      22    Satellite Product        Scattering Angle                        Input
      23    Satellite Product        Sensor Azimuth                          Input
      24    Satellite Product        Deep Blue Surface Reflectance 412 nm    Input
      25    Satellite Product        White-sky Albedo at 1,640 nm            Input
      26    Satellite Product        Deep Blue Aerosol Optical Depth 660 nm  Input
      27    Satellite Product        White-sky Albedo at 648 nm              Input
      28    Satellite Product        Deep Blue Surface Reflectance 660 nm    Input
      29    Satellite Product        Cloud Fraction Land                     Input
      30    Satellite Product        Deep Blue Surface Reflectance 470 nm    Input
      31    Satellite Product        Deep Blue Aerosol Optical Depth 550 nm  Input
      32    Satellite Product        Deep Blue Aerosol Optical Depth 412 nm  Input
            In-situ Observation      PM2.5                                   Target
  • 18. Hourly measurements from 53 countries, 1997 to present. A lot of measurements, but notice the large gaps!
  • 19. Gaps are inevitable because of the infrastructure and cost associated with making the measurements.
  • 20. Challenge 1: Obtaining the in-situ PM2.5 data. Real-time data from:
      1. EPA AirNow data for USA and Canada
      2. EEA data for Europe
      3. Tasmania and Australia
      4. Israel
      5. Russia
      6. Asia and Latin America by scraping http://aqicn.org/map/
      7. Harvesting social media (Twitter feeds from US Embassies)
      Relatively low bandwidth, from multiple sites, every 5 minutes.
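Harvesting readings from social media (item 7 above) largely comes down to turning short text messages into structured records. The sketch below shows one way to do that in Python. The message layout `MM-DD-YYYY HH:MM; PM2.5; <concentration>; <AQI>; <category>` is an assumption made for illustration and is not taken from the slides, as are the pattern and the function name `parse_reading`; a real feed would need its own pattern and handling of "no data" messages.

```python
import re
from typing import Dict, Optional

# Assumed layout: "MM-DD-YYYY HH:MM; PM2.5; <concentration>; <AQI>; <category>"
PATTERN = re.compile(
    r"(?P<date>\d{2}-\d{2}-\d{4}) (?P<time>\d{2}:\d{2}); PM2\.5; "
    r"(?P<conc>\d+(?:\.\d+)?); (?P<aqi>\d+); (?P<category>.+)"
)


def parse_reading(text: str) -> Optional[Dict[str, object]]:
    """Parse one message into a record, or return None if it does not match."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None  # e.g. a "no data" notice or an unrelated post
    return {
        "date": match["date"],
        "time": match["time"],
        "pm25_ug_m3": float(match["conc"]),
        "aqi": int(match["aqi"]),
        "category": match["category"].strip(),
    }


# Made-up example messages in the assumed layout.
reading = parse_reading("03-15-2014 14:00; PM2.5; 135.0; 193; Unhealthy")
skipped = parse_reading("03-15-2014 15:00; PM2.5; No Data")
```

Returning `None` for anything that does not match lets the 5-minute polling loop skip malformed posts without stopping.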
  • 21. Challenge 2: (Easier) Obtaining the satellite and meteorological data. Real-time data from:
      1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS NPP, etc.
      2. Global meteorological analyses
      High bandwidth, from a few sites, every 1 to 3 hours.
  • 22. Challenge 3: Combine multiple big datasets with machine learning. A large-member machine learning ensemble uses massively parallel computing to produce the PM2.5 data product, run every 3 hours. The algorithms must be capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables). Development time was drastically reduced by using a high-level language (Matlab) that can easily exploit parallel execution on both multiple CPUs and GPUs.
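The slide does not say which learners make up the ensemble, so the following is only a toy illustration of the general idea: many members trained on different bootstrap resamples, with the spread across members serving as a rough uncertainty indicator. It is written in Python rather than Matlab, uses a 1-nearest-neighbor regressor as a stand-in learner, and the training data, feature choice, and function names are all invented for the example.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (input features, observed PM2.5)


def nearest_neighbor_predict(train: List[Sample], x: Sequence[float]) -> float:
    """Stand-in learner: return the target of the closest training sample."""
    def squared_distance(sample: Sample) -> float:
        return sum((a - b) ** 2 for a, b in zip(sample[0], x))
    return min(train, key=squared_distance)[1]


def ensemble_predict(
    train: List[Sample], x: Sequence[float], members: int = 50, seed: int = 0
) -> Tuple[float, float]:
    """Bagging: train each member on a bootstrap resample, then pool predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(members):
        resample = [rng.choice(train) for _ in train]  # sample with replacement
        predictions.append(nearest_neighbor_predict(resample, x))
    # Ensemble mean as the estimate, member spread as a crude uncertainty.
    return mean(predictions), pstdev(predictions)


# Invented samples: (aerosol optical depth, surface wind speed) -> PM2.5
training = [
    ((0.10, 5.0), 6.0),
    ((0.25, 3.0), 14.0),
    ((0.40, 2.0), 25.0),
    ((0.60, 1.5), 41.0),
    ((0.80, 1.0), 60.0),
]
estimate, spread = ensemble_predict(training, (0.45, 2.0))
```

Because the members are independent of one another, the loop over members is the part that parallelizes naturally across CPUs or GPUs.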
  • 23. Challenge 4: Continual performance improvement. The system is currently on around its 400th version. Continuous improvements have been made in:
      1. Coverage of the in-situ training dataset
      2. Inclusion of new satellite sensors
      3. Additional big datasets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine learning fits
      4. Use of many alternative machine learning strategies
      5. Estimation of uncertainties
      This requires frequent reprocessing of the entire multi-year record from 1997 to present, and therefore persistent massive data storage, much more than the usual scratch space at HPC centers.
  • 24. Fully Automated Workflow Requires ability to schedule automated tasks
  • 25. Requires ability to disseminate results in multiple formats including ftp and as web and map services
  • 28. Key System Requirements: Not always available on current HPC systems. Requirements:
      1. Large persistent storage for multiple big datasets, 100 TB+ (otherwise the scratch-space time limit expires before there has been time to process the massive datasets)
      2. High-bandwidth connections
      3. Ability to harvest social media (e.g. Twitter) and scrape websites for data
      4. A high-level language with a wide range of optimized toolboxes (Matlab)
      5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
      6. Easy use of multiple GPUs and CPUs
      7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)
      Thank you!