A Quick Survey of Open Source Software for PH Organizations
By Massimo Mirabito, MBA (US CDC) and Taha Kass-Hout, MD, MS, 2007
Unstructured Text
1. Lucene: Apache Lucene is a high-performance, full-featured text search engine library
written entirely in Java. This technology is suitable for nearly any application that requires
full-text search, especially cross-platform. Lucene itself is just an indexing and search
library and does not contain crawling and HTML parsing functionality. The Apache
project Nutch is based on Lucene and provides this functionality. Lucene provides
capabilities to index a variety of document formats.
2. Solr: Solr is an open source enterprise search server based on the Lucene Java search
library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching,
replication, and a web administration interface. Solr is a standalone server that
applications communicate with over XML and HTTP to index documents or execute
searches. Solr supports a rich schema specification that allows for a wide range of
flexibility in dealing with different document fields, and has an extensive search plugin
API for developing custom search behavior.
3. Nutch: Nutch is an effort to build an open source search engine based on Lucene Java
for the search and index component. The fetcher ("robot" or "web crawler") has been
written from scratch solely for this project. Nutch has a highly modular architecture
allowing developers to create plugins for the following activities: media-type parsing,
data retrieval, querying and clustering. As of June 2005, Nutch has graduated from the
Apache Incubator, and is now a subproject of Lucene. It is coded completely in the Java
programming language, but data is written in language-independent formats. In June
2003, there was a successful 100-million-page demo system. To meet the multi-machine
processing needs of the crawl and index tasks, the Nutch project has also implemented a
MapReduce facility and a distributed file system. These two facilities have been spun
out into their own subproject called Hadoop.
4. UIMA: UIMA stands for Unstructured Information Management Architecture. Developed by
IBM, it is a component software architecture for the development, discovery, composition,
and deployment of multi-modal analytics for the analysis of unstructured information and
its integration with search technologies. The source code for a reference implementation
of this framework was first made available on SourceForge and later on the Apache
Software Foundation website. UIMA provides a framework and SDK for developing
applications that analyze unstructured information. An example UIMA application might ingest plain text and identify
entities, such as persons, places, organizations; or relations, such as works-for or
located-at. UIMA enables such an application to be decomposed into components, for
example "language identification" -> "language specific segmentation" -> "sentence
boundary detection" -> "entity detection (person/place names etc.)". Each component
must implement interfaces defined by the framework and must provide self-describing
metadata via XML descriptor files. The framework manages these components and the
data flow between them. Components are written in Java or C++; the data that flows
between components is designed for efficient mapping between these languages. UIMA
additionally provides capabilities to wrap components as network services, and can scale
to very large volumes by replicating processing pipelines over a cluster of networked
nodes.
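The component decomposition described for UIMA can be sketched in a few lines of Python. This is only an illustration of the pipeline idea, not UIMA's API: the `Document` class, the three component functions, the `KNOWN_PLACES` gazetteer, and the toy heuristics inside them are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical gazetteer used by the toy entity detector below.
KNOWN_PLACES = {"Atlanta", "Geneva"}


@dataclass
class Document:
    """Carries the text plus the annotations each component adds."""
    text: str
    annotations: dict = field(default_factory=dict)


def identify_language(doc):
    # Toy heuristic (assumption): call the text English if it contains "the".
    words = doc.text.lower().split()
    doc.annotations["language"] = "en" if "the" in words else "unknown"
    return doc


def detect_sentences(doc):
    # Naive sentence boundary detection: split after '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", doc.text.strip())
    doc.annotations["sentences"] = [p for p in parts if p]
    return doc


def detect_entities(doc):
    # Naive entity detection: keep words found in the gazetteer.
    words = re.findall(r"[A-Za-z]+", doc.text)
    doc.annotations["places"] = [w for w in words if w in KNOWN_PLACES]
    return doc


# The order of components defines the data flow between them.
PIPELINE = [identify_language, detect_sentences, detect_entities]


def run_pipeline(text, components=PIPELINE):
    """Pass one document through every component in order."""
    doc = Document(text)
    for component in components:
        doc = component(doc)
    return doc
```

Calling `run_pipeline("The team met in Atlanta. Cases rose in Geneva!")` returns a `Document` whose `annotations` dictionary holds the language guess, the sentence list, and the detected place names. A real framework adds what this sketch leaves out: declared interfaces, descriptor metadata, and distribution across nodes.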
Alternative GIS
A Geographic Information System (GIS) is an equally critical component. A GIS provides a way of
capturing, storing, analyzing, and managing data and associated attributes that are spatially
referenced to the earth. For proper data analysis, time series should also be supported, because
they let researchers, first responders, and emergency personnel view data
spatially and over time. The most prominent inexpensive tools on the Internet are Google Maps,
Google Earth, Microsoft Virtual Earth, and Yahoo Maps. All of these tools are relatively easy to use,
configure, and distribute. Google Earth and Google Maps are the tools most widely used by
web developers. The Keyhole Markup Language (KML) is an XML-based format for describing
geospatial data; it can be used by both Google Earth and Google Maps.
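As a rough illustration of how KML carries both location and time, the following Python sketch builds a small KML document using only the standard library. The function name `build_kml` and the sample report data are hypothetical, and the namespace shown is the OGC KML 2.2 one, which is an assumption about the target version.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: OGC KML 2.2.
KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name):
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{KML_NS}}}{name}"


def build_kml(placemarks):
    """Return a KML string with one Placemark per (name, lon, lat, when) tuple."""
    # Serialize KML elements in the default namespace (no prefix).
    ET.register_namespace("", KML_NS)
    kml = ET.Element(_tag("kml"))
    document = ET.SubElement(kml, _tag("Document"))
    for name, lon, lat, when in placemarks:
        placemark = ET.SubElement(document, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = name
        # TimeStamp/when lets viewers filter or animate features over time.
        stamp = ET.SubElement(placemark, _tag("TimeStamp"))
        ET.SubElement(stamp, _tag("when")).text = when
        # KML coordinates are written as longitude,latitude.
        point = ET.SubElement(placemark, _tag("Point"))
        ET.SubElement(point, _tag("coordinates")).text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical sample data: one report with a location and a date.
sample = build_kml([("Clinic report", -84.39, 33.75, "2007-06-01")])
```

The resulting string can be saved with a `.kml` extension and opened in a viewer that understands KML.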
1. OpenLayers (http://www.openlayers.org): OpenLayers provides capabilities to embed
dynamic maps in any web page. It can display map tiles and markers loaded from a
variety of sources. MetaCarta developed the initial version of OpenLayers and gave it to
the public to further the use of geographic information of all kinds. OpenLayers is
completely free, Open Source JavaScript, released under the BSD License.
2. MapServer (http://mapserver.gis.umn.edu): MapServer is an open source development
environment for building spatially-enabled internet applications. MapServer supports
Open Geospatial Consortium (OGC) standards, including Web Map Service (WMS) and
Web Feature Service (WFS). MapServer works with PostgreSQL and its PostGIS
extension, and supports proprietary GIS formats including ESRI's Shapefile format.
MapServer uses OGR and GDAL libraries to translate files from one file format to
another. MapServer supports PHP, Python, Perl, Ruby, Java, and C# for scripting and
customization.
3. GeoServer (http://geoserver.org): GeoServer is an Open Source server that connects
information to the Geospatial Web including publishing and editing data using open
standards. It is a fully functional geospatial web service implementing the WMS 1.1.1
and WFS 1.0 implementation specifications from OGC. Information is made available in
a large variety of formats as maps/images or actual geospatial data. GeoServer's
transactional capabilities offer robust support for shared editing. GeoServer's focus is
ease of use and support for standards, in order to serve as 'glue' for the geospatial web,
connecting from legacy databases to many diverse clients.
4. GeoTools (http://geotools.codehaus.org): GeoTools is an open source (LGPL) Java code
library which provides standards compliant methods for the manipulation of geospatial
data, for example to implement Geographic Information Systems (GIS). The GeoTools
library implements Open Geospatial Consortium (OGC) specifications as they are
developed, in close collaboration with the GeoAPI and GeoWidgets projects.
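Because MapServer and GeoServer both implement the OGC Web Map Service, a client can request a rendered map with an ordinary HTTP GET. The sketch below assembles a WMS 1.1.1 GetMap URL in Python; the server address and the layer name `clinics` are hypothetical, and the helper `wms_getmap_url` is written for this example rather than taken from either product.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value asks for the layer's default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    # Assumes base_url has no query string of its own.
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, for illustration only.
url = wms_getmap_url("http://example.org/wms", "clinics", (-85.0, 33.0, -84.0, 34.0))
```

Since the request is defined by the standard rather than by the server, the same URL pattern should work against any compliant WMS endpoint, which is the interoperability benefit the OGC specifications aim for.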
Enterprise Service Bus (ESB)
Application integration is one of the most challenging aspects of building a platform. An ESB
is a middleware infrastructure that connects multiple systems via standard protocols, exposes
services for consumption, provides messaging, transformation, and routing capabilities, and
leverages existing IT assets. There are several open source ESB products:
1. ServiceMix: ServiceMix is an Open Source ESB combining functionality of a Service
Oriented Architecture (SOA) and an Event Driven Architecture (EDA) to create an
agile, enterprise ESB. Apache ServiceMix is an open source distributed ESB built from
the ground up on the Java Business Integration (JBI) specification JSR 208 and released
under the Apache license. The goal of JBI is to allow components and services to be
integrated in a vendor independent way, allowing users and vendors to plug and play.
ServiceMix is lightweight and easily embeddable, has integrated Spring support and can
be run at the edge of the network (inside a client or server), as a standalone ESB
provider or as a service within another ESB.
2. Mule: Mule is a light-weight messaging framework. It is a highly distributable object
broker that can seamlessly handle interactions with other applications using disparate
technologies, transports and protocols. The Mule framework provides a highly scalable
environment in which you can deploy your business components. Mule manages all the
interactions between components transparently whether they exist in the same VM or
over the internet and regardless of the underlying transport used. Common scenarios
for using Mule include integration projects where two or more existing systems need to
communicate with each other, applications that need to be totally decoupled from their
surrounding environment, and systems where the ability to scale one or more components
is needed.
3. FUSE ESB: FUSE ESB is an open source product based on Apache ServiceMix and offered by
IONA Technologies. FUSE ESB provides a standardized methodology, server, and tools
to deploy integration components, freeing architects from the dependencies that have
traditionally locked enterprises into proprietary middleware stacks. FUSE ESB enables
organizations to achieve their service-oriented architecture (SOA) objectives with a
proven open source solution for enterprise integration.
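The routing and transformation roles of an ESB can be illustrated with a toy in-process message bus. This Python sketch is not the API of ServiceMix, Mule, or FUSE ESB; the `MiniBus` class, the topic name `lab.result`, and the message fields are all hypothetical.

```python
from collections import defaultdict


class MiniBus:
    """Toy in-process stand-in for an ESB: routes messages by topic."""

    def __init__(self):
        # topic -> list of (transform, handler) pairs
        self._routes = defaultdict(list)

    def subscribe(self, topic, handler, transform=None):
        """Register a handler; transform adapts the message to its format."""
        self._routes[topic].append((transform, handler))

    def publish(self, topic, message):
        """Deliver message to every subscriber of topic; return the count."""
        delivered = 0
        for transform, handler in self._routes.get(topic, []):
            payload = transform(message) if transform else message
            handler(payload)
            delivered += 1
        return delivered


bus = MiniBus()
surveillance_inbox = []
archive_inbox = []

# One consumer wants a different field name and upper-case codes.
bus.subscribe(
    "lab.result",
    surveillance_inbox.append,
    transform=lambda m: {"code": m["test"].upper()},
)
# Another consumer takes the message unchanged.
bus.subscribe("lab.result", archive_inbox.append)

bus.publish("lab.result", {"test": "flu"})
```

The publisher never refers to its consumers directly, which is the decoupling an ESB provides. A production bus adds network transports, persistence, and error handling that this sketch omits.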
Scalability
Scalability is important when deploying solutions that need to perform adequately under high
volume. Scalability is the ability to maintain availability, reliability, and performance as the
number of concurrent connections and the load progressively increase. A system can be scaled
in two ways:
• Scale vertically: To scale vertically (or scale up) implies adding resources to a single
server, typically involving the addition of CPUs or memory. This could also mean
expanding the number of running processes.
• Scale horizontally: To scale horizontally (or scale out) means to add more servers to a
system, such as adding a new computer to a distributed software application. An
example might be scaling out from 1 web server to 3.
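To make the scale-out example concrete, the following Python sketch spreads requests over a pool of servers in round-robin order and counts the load each one receives. Round-robin is only one possible distribution policy, and the server names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(requests, servers):
    """Pair each request with a server, cycling through the pool in order."""
    if not servers:
        raise ValueError("at least one server is required")
    pool = cycle(servers)
    return [(request, next(pool)) for request in requests]


def load_per_server(requests, servers):
    """Count how many requests each server would handle."""
    return Counter(server for _, server in assign_round_robin(requests, servers))


# Nine requests on one web server versus three.
before = load_per_server(range(9), ["web1"])
after = load_per_server(range(9), ["web1", "web2", "web3"])
```

With a single server, all nine requests land on `web1`; with three servers, each handles three. Scaling up, by contrast, would leave the pool at one server and increase what that server can handle.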
The following products can deliver high availability and clustered solutions:
1. Open Terracotta: Open Terracotta is Open Source JVM-level clustering software for
Java, delivering clustering as a runtime infrastructure service, simplifying the task of
clustering a Java application. The capability is provided by clustering the JVM
underneath the application, instead of clustering the application itself.
2. GridGain: GridGain is a computational grid framework. Its goal is to improve general
performance of processing intensive applications by splitting and parallelizing the
workload. In many cases GridGain is used to achieve better overall throughput, better
scalability or availability of services. GridGain supports the following out of the box:
JBoss, Spring, Spring AOP, JBoss AOP, AspectJ, JGroups, Weblogic, Websphere,
Oracle Coherence, Mule, JXInsight, and GigaSpaces.
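The idea of splitting and parallelizing a workload can be sketched with Python's standard library. This is not GridGain's API; the functions `split` and `parallel_sum_of_squares` are written for this example, and a local thread pool stands in for the grid nodes only to show the split, process, and aggregate structure.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Cut items into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The unit of work each worker performs on its chunk."""
    return sum(n * n for n in chunk)


def parallel_sum_of_squares(numbers, workers=4):
    """Split the input, process the chunks concurrently, then aggregate."""
    chunks = split(list(numbers), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


total = parallel_sum_of_squares(range(1, 11))
```

In a grid framework the chunks would be shipped to separate machines rather than to threads in one process, but the shape of the computation is the same.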