1. Digital Data Handling with
Modern Cyberinfrastructure
Scott Teige
steige@indiana.edu
October 2009
2. Contents
• The trend toward "born digital" data
• The bad old days
• The new days
• Examples: the new, the old
3. Trends
• The US will produce 113 million medical images in the next year (CNN)
• CT and MRI scans are "born digital"
• Physics has a long tradition of digital data acquisition, which continues with, for example, the latest CERN experiments
• Chemistry, Biology, Geology, Communication and Culture, Anthropology and Economics are also producing increasing amounts of data
• Hard drives are down to $0.07 per gigabyte; 8 GB thumb drives are swag at conferences.
5. The bad old days, part 2
• Data written from the instrument to 8mm video tape (loss of ~5%)
• Tapes carried from DAQ computers to analysis computers
• Tapes carried (by courier) from the instrument building to the "storage" facility at BNL (Patty M.'s office bookshelves)
• 2nd-pass analysis on BNL mainframes (loss ~5%)
• Tapes copied to DLT (loss ~10%)
• … years pass …
• DLT copied to HPSS (loss ~5%)
6. Almost there …
• USArray: locations of the transportable seismographs.
7. Almost there …
• Data written to a hard drive on the seismometer
• Data uplinked via cell phone or satellite to a central location
• Researchers request specific portions of the data via a web interface
• Data sent to the researcher via e-mail (small requests) or hard drive (large requests)
• Once a year or so, someone goes to the seismographs and retrieves the hard drives…
9. A modern case
• Images are digitized by the instrument
• The digitized images are written directly to the Data Capacitor
• The Data Capacitor appears as a local file system on the researcher's desktop computer, Big Red, Quarry and some other TeraGrid systems
• The researcher does quality checks, tuning, optimization, etc. on his local workstation
• CPU-intensive analysis is done on the large systems provided by IU or the TeraGrid
• Data is archived daily to HPSS (via a high-bandwidth connection from the DC to HPSS)
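The daily archive step amounts to a one-way sync: anything on the Data Capacitor that is not yet in the archive gets copied over. A minimal Python sketch of that idea, using hypothetical local paths in place of the real DC and HPSS endpoints (the actual transfer runs over the dedicated DC-to-HPSS link, not a local copy):

```python
import os
import shutil

def archive_new_files(dc_root, archive_root):
    """Copy every file under dc_root that does not yet exist under archive_root.

    dc_root and archive_root are hypothetical stand-ins for the Data
    Capacitor and HPSS archive mount points.
    """
    copied = []
    for dirpath, _dirs, files in os.walk(dc_root):
        rel = os.path.relpath(dirpath, dc_root)
        dest_dir = os.path.join(archive_root, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_dir, name)
            if not os.path.exists(dst):  # archive only files not yet copied
                shutil.copy2(src, dst)   # copy2 preserves timestamps
                copied.append(dst)
    return copied
```

Run daily (e.g. from cron), the second and later invocations copy only the new images, which is the behavior the workflow above describes.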
12. Infrastructure, CPU Resources
Big Red [TeraGrid System]
30 TFLOPS IBM JS21 SuSE Cluster
768 blades/3072 cores: 2.5 GHz PPC 970MP
8GB Memory, 4 cores per blade
Myrinet 2000
LoadLeveler & Moab
Quarry [Future TeraGrid System]
7 TFLOPS IBM HS21 RHEL Cluster
140 blades/1120 cores: 2.0 GHz Intel Xeon 5335
8GB Memory, 8 cores per blade
1Gb Ethernet (upgrading to 10Gb)
PBS (Torque) & Moab
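The Big Red peak figure is consistent with a quick per-core check: cores x clock x flops per cycle. The 4 flops/cycle figure is an assumption here (two FPUs with fused multiply-add on the PPC 970MP), not something stated on the slide:

```python
# Back-of-the-envelope peak for Big Red.
cores = 3072            # 768 blades x 4 cores
clock_hz = 2.5e9        # 2.5 GHz PPC 970MP
flops_per_cycle = 4     # assumption: 2 FPUs x fused multiply-add

peak_tflops = cores * clock_hz * flops_per_cycle / 1e12
print(peak_tflops)  # 30.72, matching the quoted 30 TFLOPS
```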
13. Infrastructure, Network
• 10GigE to parts of campus, 1GigE to the entire system
• 4x10GigE from Big Red to the DC
• 48x1GigE from Quarry to the DC
• 15x10GigE from the DC to HPSS
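As a rough sanity check on what these aggregate link capacities buy, here is the ideal (line-rate, no protocol overhead) transfer time for a given volume of data; real throughput will be lower:

```python
def transfer_time_seconds(bytes_to_move, gigabits_per_second):
    """Ideal time to move data at a given aggregate line rate."""
    bits = bytes_to_move * 8
    return bits / (gigabits_per_second * 1e9)

TB = 1e12  # one terabyte
# 4x10GigE from Big Red to the DC = 40 Gb/s aggregate
print(transfer_time_seconds(1 * TB, 40))  # 200.0 seconds for 1 TB
```

At these rates, moving a terabyte between systems is a matter of minutes, which is what makes the "data appears everywhere" model on the next slide practical.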
15. What does this give you? FAQ
• How much data can I have?
• All of it, right now.
• Where is my data?
• Everywhere.
• Where can I analyze my data?
• Anywhere.
• How long can I keep my data?
• Forever.
• Is there a backup?
• Yes, two of them.
16. Acknowledgments
This material is based upon work supported by the National Science Foundation under
Grant Numbers 0116050 and 0521433. Any opinions, findings and conclusions or
recommendations expressed in this material are those of the author and do not
necessarily reflect the views of the National Science Foundation (NSF).
This work was supported in part by the Indiana Metabolomics and Cytomics Initiative
(METACyt). METACyt is supported in part by Lilly Endowment, Inc.
This work was supported in part by the Indiana Genomics Initiative. The Indiana
Genomics Initiative of Indiana University is supported in part by Lilly Endowment,
Inc.
This work was supported in part by Shared University Research grants from IBM, Inc.
to Indiana University.