A Framework for Geospatial Web Services for Public Health
by Leslie Lenert, MD, MS, FACMI, Director
National Center for Public Health Informatics, CCHIS, CDC
June 8 2009 URISA Public Health Conference
uploaded by Wansoo Im, Ph.D.
URISA Membership Committee Chair
http://www.gisinpublichealth.org
1. A Framework for Geospatial Web Services for Public Health June 8, 2009 URISA Public Health Conference Leslie Lenert, MD, MS, FACMI, Director National Center for Public Health Informatics, CCHIS, CDC
22. @Home Model Extended
- Grid application models protein folding and misfolding (1224 teraflops, as of 23 Sept 2007)
- Grid application models the way malaria spreads in Africa and the potential impact that new anti-malarial drugs may have on the region
- Grid application models the design of new anti-HIV drugs based on molecular structure (in silico)
24. Data Grid Examples
- caBIG: Cancer Research Datagrid
- GEON: Geosciences Network Datagrid
- EGEE / CERN: The world's largest particle physics laboratory... where the web was born (LHC, the Large Hadron Collider, May 2008)
- DataGrid: EU-funded resource of shared large-scale databases
- TeraGrid: Shares resources at San Diego Supercomputer Center, Indiana University, Oak Ridge National Laboratory, National Center for Supercomputing Applications, Pittsburgh Supercomputing Center, Purdue University, Texas Advanced Computing Center, University of Chicago/Argonne National Laboratory, and the National Center for Atmospheric Research
31. Examining PHGrid and Security: Leveraging the Expertise of Others…
Grid Authentication and Authorization with Reliably Distributed Services (GAARDS) is a series of tools developed by Ohio State to enhance the open source Globus Toolkit. It provides enterprise-level administrative tools for managing users, federated identities, trust, credential delegation, group management, access control policy, and integration between grid and non-grid-based security domains.
40. [Diagram labels: Clinical or Surveillance Database; GIPSE Data set cache; Periodic Extract of GIPSE data sets; GIPSE Delivery Service (Grid node); GIPSE Publication Service; Result #1, Result #2, Result #3; Firewall; GIPSE Registry; Quicksilver Viewer]
46. ArcIMS (10 years of this)
[Diagram labels: two databases; map server; LAN; ArcIMS consumers]
47. PH-DGInet
[Diagram labels: WAN (multiple PH-DGInet nodes); CDC, EPA, CA, and NIH PH-DGInet nodes, each with a PH-DGInet server on its own LAN; PH-DGInet consumers (Internet Explorer, PH-DGInet Explorer, ArcGIS Client)]
48. Aggregation Query: State accessing Veterans Admin Data at CDC
[Diagram labels: PH-DGINet South Carolina consumers (Web Client); South Carolina, CDC, and Other State PH-DGINet nodes, each with a DGInet server and an Event Database; SQL Query; Access Row Level Data; Aggregate Row Level Data; Aggregated VA Data]
49. Aggregation Query: State accessing its own hospital data and VA data at CDC
[Diagram labels: PH-DGINet South Carolina consumers (Web Client); South Carolina, CDC, and Other State PH-DGINet nodes, each with a DGInet server and an Event Database; SQL Query; Access Row Level Data and Aggregate Row Level Data (shown twice, once per queried node); Aggregated VA Data]
51. PH-DGINet: Increased Interoperability
July 23, 2009 Draft
Clients:
-OGC Map Viewers
-DGINet Map Viewer
-ArcGIS Desktop
-ArcGIS Explorer
-Commercial Map Viewers
-Google Earth
Data Service Providers:
-ArcGIS Server
-WMS Server
-GML Server
-WFS Server
-KML Server
-ArcIMS Server
DGINet Web Services:
-GetMetadataService
-LocatorService
-GetListOfProductNamesService
-GetProductService
-GetFeaturesService
-GetMapImageService
-GetAnnotationService
-GetExtractService
-DGINetSystemServiceBroadcastService
-OGCProvidersManagerService
-SystemMonitorService
-DGINetSystemServiceUDDIService
-DGINetSystemServiceScheduleService
-DGINetSystemServiceNotifyService
-LOS_GeoprocessingService
DGINet Tools:
-Bookmark
-Download/zip
-Annotation
-Query Builder
Data Management Service:
-DGINet Content Manager API
-Custom HTML Web Pages
-DGINet-CMS
60. Toolbox: NPDS Query
National Poison Control Data Service
Basemaps are coming from Redlands, California, and the poison control data is coming from Denver. The poison control data is only a data service, not a feature service, so PH-DGInet builds the spatial component on the fly from the geography listed in the data.
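A minimal sketch of what "building the spatial component on the fly" could look like: tabular records that only name a geography are joined to a lookup of coordinates to produce map features. The lookup table, the field names (`county`, `calls`), and the approximate centroid values are all illustrative assumptions, not the PH-DGInet implementation.

```python
# Hypothetical lookup: county name -> approximate (longitude, latitude) centroid.
COUNTY_CENTROIDS = {
    "Denver": (-104.99, 39.74),
    "Boulder": (-105.27, 40.01),
}


def to_features(records, centroids):
    """Attach a point geometry to each record that names a known geography.

    Returns (features, unmatched) so records with an unknown geography
    are reported rather than silently dropped.
    """
    features = []
    unmatched = []
    for rec in records:
        coords = centroids.get(rec["county"])
        if coords is None:
            unmatched.append(rec)
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": list(coords)},
            "properties": dict(rec),
        })
    return features, unmatched
```

Returning the unmatched records separately makes it easy to see which rows could not be placed on the map.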
Speaker's Notes:
Public health surveillance has traditionally followed a one-way data model: from practitioners to state and local health departments and onward to CDC. Past biosurveillance models have even circumvented that model, with data often bypassing the state and local health department. Regardless of the steps, the one-way flow has led to a system that is resource intensive for both the data providers and the CDC. Moreover, the current model has many non-technical hurdles that must be addressed, including:
- Politics of control of data has been the primary obstacle to formation of a national system
- Much existing data remains siloed at the local/state level, so accessibility and visualization are limited
- Building systems non-collaboratively leads to low adoption rates
Having recognized the limitations of the previous model, the public health community has started to explore the feasibility of a federated data architecture, where the work of public health surveillance and practice is distributed amongst the national public health community. In this model, operations like biosurveillance will leverage the existing investments of the state and local public health communities. These may include the expertise of the scientific community, existing data sets and standards, as well as the inclusion of industry and academic partners that can facilitate biosurveillance practice. These will be supported by distributed information technology frameworks, under the general heading of the public health grid. The goal is to create a shared services platform that will allow the public health community to leverage the investment of its partners and in the end serve the public more effectively. One framework that is being explored to support this model is the Public Health Distributed Geospatial Intelligence Framework.
-Amazon's Elastic Compute Cloud (EC2) provides a service interface to grid computing capability ($0.10/instance hour)
-Amazon's Simple Storage Service (S3) provides a service interface to remote storage ($0.15/GB/month storage)
-eBay's Trading Web Services provides a service interface to listing and managing auctions
COE: cross-enterprise body with multi-disciplinary representation from Centers, Institutes and Offices.
Responsibilities:
- Service portfolio plan
- Development priorities
- Reusability
- Funding
- Ownership
Pros:
- Consolidated decision making
- Optimal use of resources
Cons:
- Can slow adoption of new standards
- Requires discipline

- Which services to develop? Which are the clearest components of our IT infrastructure that can be reused the most by our applications?
- Which services to do first? Which services deliver the highest return?
- Is a potential service actually new and reusable? Or should we reuse or modify existing services?
- Who's going to pay for the development and maintenance of this service?
- Who owns the service? Does ownership change throughout development, operation and maintenance?

Registry/Repository:
- Service descriptions
- Service metadata
- Security metadata
Service Management System:
- Performance metrics
- Usage metrics
Executive Committee:
- Enforces decision making
- Decides on funding
Architecture Committee:
- Evaluates technology
So, to summarize, there are three main types of grids: computational, collaborative and data, or a dynamic combination of the three.
There are many misconceptions about grid computing. The most popular by far is the association of grid computing with the search for extraterrestrial intelligence, or SETI. SETI does utilize grid software to accomplish its task and has led to a unique public involvement in science, where anybody with a PC at home may sign up to allow their computer to process massive amounts of radio signal data collected by radio telescope.
The SETI model has been hugely popular and has branched off into other science domains such as Folding@home, Africa@home and FightAIDS@Home. Compute cycles from home computers across the planet are being offered by individuals to solve these huge computational problems at a fraction of the cost. Serious organizations are behind these efforts: National Science Foundation, National Institutes of Health, Google, Dell, Apple and Intel, to name a few.
Grid computing also enables people-to-people and organization-to-organization communication through collaboration grids. These grids combine resources used to support group-to-group interactions, large-scale distributed meetings, collaborative work sessions, seminars, lectures, tutorials, and training.
Besides enabling large-scale computational and collaboration networks, grid computing also enables access to data and databases distributed throughout the world. caBIG is an active research grid developed by the National Cancer Institute to interconnect cancer research centers. CERN is building the largest particle physics laboratory ever. It went online this past May. The European Community is spending billions of euros building the grid infrastructure necessary to support the data produced by the LHC.
To give you a sense of the commercial activity relative to grid computing, the following companies all have grid products on offer today.
Much is also happening in the open source world of grid computing. This is a small sample of the open source grid projects currently in operation globally. PRAGMA is Pacific Rim focused. EGEE is an operational grid currently handling over 100,000 transactions per day, and growing. The Globus Alliance, for example, is based at Argonne National Laboratory, the University of Southern California's Information Sciences Institute, the University of Chicago, the University of Edinburgh, the Swedish Center for Parallel Computers, and the National Center for Supercomputing Applications (NCSA). The Alliance produces open-source software that is central to science and engineering activities totalling nearly a half-billion dollars internationally and is the substrate for significant grid products offered by leading IT companies.
What is most interesting, to me at least, are the major companies that are currently using grid computing to support their infrastructure. Second Life is grid-based. Google is grid-based. Goodyear, Boeing, AMD, Adobe, the Department of Energy and Partners Healthcare all use grid computing in their product development lifecycle. And Amazon.com offers companies the ability to run their services on its grid infrastructure on a per-CPU, per-hour basis.
- A public health official at the state, regional or national level configures subscription services (defines the GIPSE sets to be computed) using the GIPSE Subscription Service.
- The GIPSE Subscription Service reports the creation of a new service to the GIPSE registry, along with metadata.
- The GIPSE Subscription Service sends specifications for data retrieval to the GIPSE Publication Service.
- The GIPSE Publication Service periodically computes the specified GIPSE objects and sends them to a data cache.
- The GIPSE object cache stores the objects for retrieval by the Population Summary Delivery Service.
- A user of the PH Grid data visualization tool wants to query summary data for a geo-region.
- The visualization tool uses the Grid Query Service to determine the appropriate regional sources of GIPSE data from the GIPSE registry.
- The Grid Query Service then uses Population Summary Delivery Services to retrieve the relevant GIPSE objects.
- The Population Summary Delivery Service retrieves the appropriate object or creates a new object by combining existing objects (for example, it might combine 30 one-day GIPSE objects into a 30-day object).
- The Population Summary Delivery Service returns the GIPSE object from a data source in response to the request from the Grid Query Service.
- The Grid Query Service combines reports from several different GIPSE services to produce an integrated GIPSE report.
- The visualization program receives the integrated GIPSE report.
- The visualization program uses other services to perform statistical testing.
- The visualization program displays the integrated GIPSE report using geographical display services.
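The step in which the delivery service combines existing objects (for example, 30 one-day objects into a 30-day object) can be sketched as follows. The `SummaryObject` class and its fields are hypothetical stand-ins for a GIPSE object, chosen only for illustration; this is not the actual GIPSE data format.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SummaryObject:
    """Hypothetical stand-in for a GIPSE object:
    counts keyed by (region, condition) over a date range."""
    start: date
    end: date
    counts: Counter = field(default_factory=Counter)


def combine(objects):
    """Merge summary objects into one covering the full span, summing counts."""
    if not objects:
        raise ValueError("nothing to combine")
    total = Counter()
    for obj in objects:
        total.update(obj.counts)  # Counter.update adds counts key by key
    return SummaryObject(
        start=min(o.start for o in objects),
        end=max(o.end for o in objects),
        counts=total,
    )
```

Because only aggregate counts are merged, no row-level data needs to leave the originating node. This sketch does not check for gaps or overlaps between the input date ranges; a real service would need to.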
Public Health DGINet is a pilot program that NCPHI has been exploring to test the federated model. PH-DGINet builds on an eight-year, DoD-certified program known as DGInet. This program is supported by a distributed data and service model that has ~30 nodes upon which several federal DoD and intelligence agencies share information and spatial imagery. In 2007 and 2008, NCPHI started exploring the viability of DGInet and its service-oriented architecture to understand if and how it could support the needs of the public health community and, in particular, biosurveillance.
DGInet is a service-oriented architecture (SOA) GIS enterprise solution for geospatial data services and geoprocessing services:
- Data Management Services: Provide services for auto-data loading/management of multi-terabyte databases
- Web Map Services: Allow for easy discovery, fusion and display of geospatial and geospatial intelligence data from multiple remote organizations via low-bandwidth web services
- Web Geoprocessing Services: Allow for easy discovery/utilization of server-side GIS-based analytical services from multiple remote organizations via web client
This example illustrates how an end user in South Carolina would access information for South Carolina that is held on the CDC node. In addition, it shows how the PH-DGINet Distributed Aggregation Query allows the end user to build a query based on information from the CDC node. When the user selects the Distributed Aggregation Query function, a dialog appears requesting the user to select the syndrome, geography, node, age group, gender, and date range. This process generates a SQL statement that is passed to the identified node. The node processes the SQL statement, identifies the row-level data that meets the specifications of the SQL statement, aggregates the data by the identified geography, and then sends it back to the end user. Once the information is received by the client-side application, a choropleth map and histogram are created and displayed. In this example, the South Carolina end user sees aggregate counts of flu by county in the VA medical centers in South Carolina from the CDC node.
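As a rough illustration of how dialog selections might become an aggregation SQL statement, here is a minimal sketch. The `events` table, its column names, and the use of SQLite are assumptions made so the example is runnable; the actual PH-DGINet schema and query generator are not described in these notes.

```python
import sqlite3

# Geography is a column name, which cannot be bound as a SQL parameter,
# so it is checked against a fixed whitelist (hypothetical column names).
ALLOWED_GEOGRAPHIES = {"county", "zip3", "state"}


def build_aggregation_query(syndrome, geography, age_group, gender, start, end):
    """Return (sql, params) counting matching rows grouped by the chosen geography."""
    if geography not in ALLOWED_GEOGRAPHIES:
        raise ValueError(f"unsupported geography: {geography}")
    sql = (
        f"SELECT {geography}, COUNT(*) AS n FROM events "
        "WHERE syndrome = ? AND age_group = ? AND gender = ? "
        "AND visit_date BETWEEN ? AND ? "
        f"GROUP BY {geography} ORDER BY {geography}"
    )
    return sql, (syndrome, age_group, gender, start, end)


def run_on_node(conn: sqlite3.Connection, sql, params):
    """Execute the query on one node's database; only aggregated rows are returned."""
    return conn.execute(sql, params).fetchall()
```

The node returns one (geography, count) row per area, which is what the client needs to draw the choropleth map and histogram.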
This example illustrates how an end user in South Carolina would access information for South Carolina on the South Carolina node together with information from the CDC node. In addition, it shows how the PH-DGINet Distributed Aggregation Query allows the end user to build a query based on information from both the South Carolina and CDC nodes. As in the previous example, the end user selects the Distributed Aggregation Query function, and a dialog appears requesting the user to select the syndrome, geography, node, age group, gender, and date range. In this example, however, the end user selects two nodes to query: the South Carolina node and the CDC node. This process generates a SQL statement that is passed to both nodes. Each node processes the SQL statement, identifies the row-level data that meets the specifications of the SQL statement, aggregates the data by the identified geography, and then sends it back to the end user. Once the information is received by the client-side application, it is aggregated across both nodes by the identified geography, and a choropleth map and histogram are created and displayed. This time the histogram shows the aggregate information from each node, and its height represents the total count from both nodes. In this example, the South Carolina end user sees aggregate counts of flu by county in the VA medical centers from the CDC node combined with hospital data from the South Carolina node.
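The client-side step, combining the aggregated results returned by each node, can be sketched as below. The input shape (a mapping from node name to per-geography counts) and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def merge_node_results(results_by_node):
    """Combine per-node {geography: count} maps.

    Returns (breakdown, totals):
      breakdown[geography] -> {node: count}, for per-node histogram segments
      totals[geography]    -> summed count, for choropleth shading and bar height
    """
    breakdown = defaultdict(dict)
    for node, counts in results_by_node.items():
        for geography, n in counts.items():
            breakdown[geography][node] = n
    totals = {geo: sum(parts.values()) for geo, parts in breakdown.items()}
    return dict(breakdown), totals
```

Keeping the per-node breakdown alongside the totals lets the histogram show each node's contribution while the overall bar height reflects the combined count.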