The next generation of cloud computing systems will need to:
1. Handle multiple massive datasets (large storage)
2. Provide massive memory per node
3. Facilitate automation and scheduling of repetitive tasks
4. Include high-level technical languages (e.g., MATLAB)
Requirements for the Next Generation of Cloud Computing: A Case Study with Multiple Big Datasets and Machine Learning
1. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset
A Technology Challenge Case Study
David Lary
Hanson Center for Space Science
University of Texas at Dallas
5. How?
Around 40 different Big Data sets from satellites, meteorology, demographics, scraped websites, and social media were used to estimate PM2.5. The plot below shows the average of 5,935 days from August 1, 1997 to the present.
7-13. Which Platform?
Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g., Twitter) and scrape websites for data
4. A high-level language with a wide range of optimized toolboxes (e.g., MATLAB)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day); see the timer sketch below
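As a minimal illustration of requirement 7, the sketch below uses MATLAB's built-in timer object to run a task at a fixed 3-hour cadence. The function name and cadence are assumptions for illustration, not the deck's actual workflow code.

```matlab
% Minimal scheduling sketch (assumed function name, not the actual pipeline).
ingestTimer = timer( ...
    'ExecutionMode', 'fixedRate', ...   % fire repeatedly at a fixed cadence
    'Period',        3*60*60, ...       % every 3 hours, in seconds
    'BusyMode',      'queue', ...       % queue a run if the previous one is still going
    'TimerFcn',      @(~,~) ingest_satellite_and_met_data());  % hypothetical task

start(ingestTimer);   % timers for the 5-minute, 15-minute, hourly, and daily tasks look the same
```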
16. Terra Deep Blue

Rank  Source                   Variable                                 Type
1     —                        Population Density                       Input
2     Satellite Product        Tropospheric NO2 Column                  Input
3     Meteorological Analyses  Surface Specific Humidity                Input
4     Satellite Product        Solar Azimuth                            Input
5     Meteorological Analyses  Surface Wind Speed                       Input
6     Satellite Product        White-sky Albedo at 2,130 nm             Input
7     Satellite Product        White-sky Albedo at 555 nm               Input
8     Meteorological Analyses  Surface Air Temperature                  Input
9     Meteorological Analyses  Surface Layer Height                     Input
10    Meteorological Analyses  Surface Ventilation Velocity             Input
11    Meteorological Analyses  Total Precipitation                      Input
12    Satellite Product        Solar Zenith                             Input
13    Meteorological Analyses  Air Density at Surface                   Input
14    Satellite Product        Cloud Mask Qa                            Input
15    Satellite Product        Deep Blue Aerosol Optical Depth 470 nm   Input
16    Satellite Product        Sensor Zenith                            Input
17    Satellite Product        White-sky Albedo at 858 nm               Input
18    Meteorological Analyses  Surface Velocity Scale                   Input
19    Satellite Product        White-sky Albedo at 470 nm               Input
20    Satellite Product        Deep Blue Angstrom Exponent Land         Input
21    Satellite Product        White-sky Albedo at 1,240 nm             Input
22    Satellite Product        Scattering Angle                         Input
23    Satellite Product        Sensor Azimuth                           Input
24    Satellite Product        Deep Blue Surface Reflectance 412 nm     Input
25    Satellite Product        White-sky Albedo at 1,640 nm             Input
26    Satellite Product        Deep Blue Aerosol Optical Depth 660 nm   Input
27    Satellite Product        White-sky Albedo at 648 nm               Input
28    Satellite Product        Deep Blue Surface Reflectance 660 nm     Input
29    Satellite Product        Cloud Fraction Land                      Input
30    Satellite Product        Deep Blue Surface Reflectance 470 nm     Input
31    Satellite Product        Deep Blue Aerosol Optical Depth 550 nm   Input
32    Satellite Product        Deep Blue Aerosol Optical Depth 412 nm   Input
—     In-situ Observation      PM2.5                                    Target
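The table above is a predictor-importance ranking of the estimator inputs. A minimal sketch of one common way to produce such a ranking in MATLAB, using out-of-bag permutation importance from a bagged-tree ensemble (TreeBagger), is shown below; the deck does not say which importance measure was actually used, so treat this as an assumed illustration, with X and y as placeholders for the inputs and the co-located in-situ PM2.5.

```matlab
% Assumed illustration: rank predictors by out-of-bag permutation importance.
% X: [nObs x nVars] matrix of the inputs above; y: [nObs x 1] co-located PM2.5.
rng(0);                                           % reproducibility
forest = TreeBagger(200, X, y, ...
    'Method', 'regression', ...
    'OOBPredictorImportance', 'on');              % enables the importance measure

importance = forest.OOBPermutedPredictorDeltaError;   % one score per input variable
[sortedImp, order] = sort(importance, 'descend');     % most important first
ranking = table((1:numel(order))', order(:), sortedImp(:), ...
    'VariableNames', {'Rank', 'PredictorIndex', 'Importance'});
disp(ranking);
```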
18. Hourly measurements from 53 countries, 1997 to present
A lot of measurements, but notice the large gaps!
19. Gaps are inevitable because of the infrastructure and cost associated with making the measurements.
20. Challenge 1: Obtaining the in-situ PM2.5 data
Real-time data from:
1. EPA AirNow data for the USA and Canada
2. EEA data for Europe
3. Tasmania and Australia
4. Israel
5. Russia
6. Asia and Latin America, by scraping http://aqicn.org/map/
7. Harvesting social media (Twitter feeds from US embassies)
Relatively low bandwidth from multiple sites every 5 minutes (a minimal fetch sketch follows).
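A minimal sketch of the kind of low-bandwidth fetch this implies, using MATLAB's webread; the endpoint URL, query parameters, and JSON field names are placeholders, not the actual feeds used in the deck.

```matlab
% Minimal fetch sketch (placeholder endpoint and fields, not the actual feeds).
opts = weboptions('Timeout', 30, 'ContentType', 'json');
url  = 'https://example.org/air-quality/feed';          % hypothetical JSON endpoint

try
    feed = webread(url, 'city', 'dallas', opts);         % query parameters as name/value pairs
    pm25 = feed.data.pm25;                                % assumed field layout
    fprintf('PM2.5 = %.1f ug/m^3 at %s\n', pm25, datestr(now));
catch err
    warning('Fetch failed, will retry on the next 5-minute cycle: %s', err.message);
end
```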
21. Challenge 2 (easier): Obtaining the satellite & meteorological data
Real-time data from:
1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS on Suomi NPP, etc.
2. Global meteorological analyses
High bandwidth from a few sites every 1 to 3 hours (a minimal download-and-read sketch follows).
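As an assumed illustration of this higher-bandwidth path, the sketch below downloads one granule and reads a science dataset with MATLAB's websave and hdfread; the URL, file name, and dataset name are placeholders.

```matlab
% Assumed illustration: pull one granule and read one science dataset.
granuleUrl = 'https://example.org/archive/MOD04_L2.A2020001.0000.hdf';   % placeholder
localFile  = websave('granule.hdf', granuleUrl);          % download over HTTP(S)

% MODIS Level-2 products are HDF4-EOS; hdfread reads a named science dataset.
aod = hdfread(localFile, 'Deep_Blue_Aerosol_Optical_Depth_550_Land');    % placeholder name
fprintf('Read %d x %d AOD grid\n', size(aod, 1), size(aod, 2));
```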
22. Challenge 3: Combining multiple Big Data sets with machine learning
A large-member machine learning ensemble, run with massively parallel computing, produces the PM2.5 data product.
The algorithms must cope with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables).
Development time was drastically reduced by using a high-level language (MATLAB) that can easily exploit parallel execution on both multiple CPUs and GPUs.
Massively parallel runs every 3 hours; a high-level language that can readily use CPUs and GPUs (a minimal ensemble sketch follows).
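A minimal sketch of a bootstrap ensemble trained across CPU workers, with each member optionally trained on a GPU; the member count, network size, and variable names are assumptions, not the deck's actual configuration.

```matlab
% Minimal ensemble sketch (assumed sizes and names, not the actual configuration).
% X: [nObs x nVars] inputs from the ranked table; y: [nObs x 1] in-situ PM2.5.
% Xnew: [nNew x nVars] inputs at the prediction locations/times.
nMembers = 32;
nets = cell(nMembers, 1);

parpool('local');                                         % CPU workers for the ensemble members
parfor m = 1:nMembers
    idx = randsample(size(X, 1), size(X, 1), true);       % bootstrap resample of the training data
    net = feedforwardnet(20);                             % small regression network
    net = train(net, X(idx, :)', y(idx)', 'useGPU', 'yes');   % uses a GPU if one is available
    nets{m} = net;
end

% Ensemble prediction: average the members; their spread gives an uncertainty estimate.
preds = cell2mat(cellfun(@(n) n(Xnew')', nets, 'UniformOutput', false)');
pm25Estimate    = mean(preds, 2);
pm25Uncertainty = std(preds, 0, 2);
```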
23. Challenge 4: Continual performance improvement
Currently on around the 400th version of the system. Continuous improvements have been made in:
1. Coverage of the in-situ training data set
2. Inclusion of new satellite sensors
3. Additional Big Data sets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine learning fits
4. Use of many alternative machine learning strategies
5. Estimation of uncertainties
This requires frequent reprocessing of the entire multi-year record from 1997 to present (a reprocessing sketch follows), and hence persistent massive data storage, much more than the usual scratch space at HPC centers.
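A minimal sketch of what such a full-record reprocessing loop might look like, assuming a hypothetical process_day function and a persistent output directory; neither name comes from the deck.

```matlab
% Minimal reprocessing sketch (hypothetical function and paths).
days   = datetime(1997, 8, 1):caldays(1):datetime('today');   % entire multi-year record
outDir = '/persistent/pm25/v400';                              % persistent storage, not scratch

parfor d = 1:numel(days)
    % process_day is a hypothetical stand-in for the full retrieval for one day:
    % read that day's satellite + meteorological inputs, apply the ensemble, write output.
    process_day(days(d), outDir);
end
```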
25. Requires the ability to disseminate results in multiple formats, including FTP and as web and map services (a minimal output-writing sketch follows).
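As an assumed illustration of one such output format, the sketch below writes a daily global PM2.5 grid to a netCDF file with MATLAB's nccreate/ncwrite; the grid size, variable names, and file name are placeholders.

```matlab
% Assumed illustration: write one daily PM2.5 grid to netCDF for dissemination.
lat = -89.5:1:89.5;  lon = -179.5:1:179.5;            % placeholder 1-degree grid
pm25Grid = rand(numel(lat), numel(lon));               % stand-in for the day's estimate

outFile = 'pm25_daily.nc';                             % placeholder file name
nccreate(outFile, 'pm25', 'Dimensions', {'lat', numel(lat), 'lon', numel(lon)});
nccreate(outFile, 'lat',  'Dimensions', {'lat', numel(lat)});
nccreate(outFile, 'lon',  'Dimensions', {'lon', numel(lon)});

ncwrite(outFile, 'pm25', pm25Grid);
ncwrite(outFile, 'lat',  lat(:));
ncwrite(outFile, 'lon',  lon(:));
ncwriteatt(outFile, 'pm25', 'units', 'ug m-3');        % units attribute for downstream services
```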
28. Key System Requirements
Not always available on current HPC systems:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g., Twitter) and scrape websites for data
4. A high-level language with a wide range of optimized toolboxes (e.g., MATLAB)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case, tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)
Thank you!