Assisting Migration and Evolution of Relational Legacy Databases

by

G.N. Wikramanayake

Department of Computer Science,
University of Wales Cardiff,
Cardiff

September 1996
Abstract

The research work reported here is concerned with enhancing and preparing databases with limited
DBMS capability for migration to keep up with current database technology. In particular, we have
addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in
a migration process. Special attention has been paid to the case where the legacy database service
lacks the specification, representation and enforcement of integrity constraints. We have shown how
constraint knowledge reflecting modern DBMS capabilities can be incorporated into these systems to
ensure that, once migrated, they can benefit from current database technology.

To this end, we have developed a prototype conceptual constraint visualisation and enhancement
system (CCVES) to automate as efficiently as possible the process of re-engineering for a
heterogeneous distributed database environment, thereby assisting the global system user in
preparing their heterogeneous database systems for a graceful migration. Our prototype system has
been developed using a knowledge based approach to support the representation and manipulation of
structural and semantic information about schemas that the re-engineering and migration process
requires. It has a graphical user interface, including graphical visualisation of schemas with
constraints using user-preferred modelling techniques for the convenience of the user. The system
has been implemented using meta-programming technology because of the proven power and
flexibility that this technology offers to this type of research application.

The important contributions resulting from our research include extending the benefits of meta-
programming technology to the very important application area of evolution and migration of
heterogeneous legacy databases. In addition, we have provided an extension to various relational
database systems to enable them to overcome their limitations in the representation of meta-data.
These extensions contribute towards the automation of the reverse-engineering process of legacy
databases, while allowing the user to analyse them using extended database modelling concepts.




CHAPTER 1

                                          Introduction

This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the
research undertaken. Section 1.2 presents the broad goals of the research. The original achievements
which have resulted from the research are summarised in Section 1.3. Finally, the overall
organisation of the thesis is described in Section 1.4.

1.1 Background and Motivations of the Research

         Over the years rapid technological changes have taken place in all fields of computing. Most
of these changes have been due to the advances in data communications, computer hardware and
software [CAM89], which together have provided a reliable and powerful networking environment
(i.e. standard local and wide area networks) that allows the management of data stored in computing
facilities at many nodes of the network [BLI92]. These changes have shifted hardware technology
from centralised mainframes to networked file-server and client-server architectures
[KHO92] which support various ways to use and share data. Modern computers are much more
powerful than the previous generations and perform business tasks at a much faster rate by using
their increased processing power [CAM88, CAM89]. Simultaneous developments in the software
industry have produced techniques (e.g. for system design and development) and products capable of
utilising the new hardware resources (e.g. multi-user environments with GUIs). These new
developments are being used for a wide variety of applications, including modern distributed
information processing applications, such as office automation where users can create and use
databases with forms and reports with minimal effort, compared to the development efforts using
3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology
[ELM94, DAT95] as this field too has advanced by allowing users to represent and manipulate
advanced forms of data and their functionalities. Due to the program-data independence feature of
DBMSs, the maintenance of database application programs has become easier, as functionalities that
were traditionally performed by procedural application routines are now supported declaratively
using database concepts such as constraints and rules.
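
For instance, a minimal sketch (using hypothetical employee and department tables, not an example from the thesis itself): a rule that every employee must belong to an existing department, which would traditionally be checked by procedural routines in every application program, can instead be declared once in the schema and enforced by the DBMS:

    -- Hypothetical tables: the rules are declared in the schema, so the DBMS
    -- enforces them for every application program that updates the data.
    CREATE TABLE department (
        dept_no   INTEGER     NOT NULL PRIMARY KEY,
        dept_name VARCHAR(30) NOT NULL
    );

    CREATE TABLE employee (
        emp_no    INTEGER      NOT NULL PRIMARY KEY,
        emp_name  VARCHAR(30)  NOT NULL,
        salary    DECIMAL(8,2) CHECK (salary >= 0),            -- declarative rule
        dept_no   INTEGER      REFERENCES department (dept_no) -- declarative link
    );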

       In the field of databases, the recent advances resulting from technological transformation
include many areas such as the use of distributed database technology [OZS91, BEL92], object-
oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems
[MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older
technology was dealing with files and primitive database systems which now appear inflexible, as
the technology itself limits them from being adapted to meet the current changing business needs
catalysed by newer technologies. The older systems, which have been developed using 3GLs and
have been in operation for many years, often suffer from failures, inappropriate functionality, lack of
documentation and poor performance, and are referred to as legacy information systems [BRO93,
COMS94, IEE94, BRO95, IEEE95].

       The current technology is much more flexible as it supports methods to evolve (e.g. 4GLs,
CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]), and can share
resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
reflects the changing business needs. However, modern systems need to be properly designed and
implemented to benefit from this technology, which may still be unable to prevent such systems
themselves being considered to be legacy information systems in the near future due to the advent of
the next generation of technology with its own special features. The only salvation would appear to
be building in evolution paths in the current systems.

        The increasing power of computers and their software has meant they have already taken
over many day to day functions and are taking over more of these tasks as time passes. Thus
computers are managing a larger volume of information in a more efficient manner. Over the years
most enterprises have adopted the computerisation option to enable them to efficiently perform their
business tasks and to be able to compete with their counterparts. As the performance of computers
has increased, enterprises still using early computer technology face serious problems due to the
difficulties that are inherent in their legacy systems.

        This means that new enterprises using systems purely based on the latest technology have an
advantage over those which need to continue to use legacy information systems (ISs), as modern ISs
have been developed using current technology which provides not only better performance, but also
utilises the benefits of improved functionality. Hence, managers of legacy IS enterprises want to
retire their legacy code and use modern database management systems (DBMSs) in the latest
environment to gain the full benefits from this newer technology. However they want to use this
technology on the information and data they already hold as well as on data yet to be captured. They
also want to ensure that any attempts to incorporate the modern technology will not adversely affect
the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and
migrated to a modern environment in such a way that the migration is transparent to the current
users. The theme of this thesis is how we can support this form of system evolution.

         1.1.1 The Barriers to Legacy Information System Migration

       Legacy ISs are usually those systems that have stood the test of time and have become a core
service component for a business’s information needs. These systems are a mix of hardware and
software, sometimes proprietary, often out of date, and built to earlier styles of design,
implementation and operation. Although they were productive and fulfilled their original
performance criteria and their requirements, these systems lack the ability to change and evolve. The
following can be seen as barriers to evolution in legacy IS [IEE94].

    • The technology used to build and maintain the legacy IS is obsolete,
    • The system is unable to reflect changes in the business world and to support new needs,
    • The system cannot integrate with other sub-systems,
    • The cost, time and risk involved in producing new alternative systems to the legacy IS.

        The risk factor is that a new system may not provide the full functionality of the current
system for a period because of teething problems. Due to these barriers, large organisations [PHI94]
prefer to write independent sub-systems to perform new tasks using modern technology which will
run alongside the existing systems, rather than attempt to achieve this by adapting existing code or
by writing a new system that replaces the old and has new facilities as well. We see the following
immediate advantages of this low risk approach.



• The performance, reliability and functionality of the existing system is not affected,
    • New applications can take advantage of the latest technology,
    • There is no need to retrain those staff who only need the facilities of the old system.

       However with this approach, as business requirements evolve with time, more and more new
needs arise, resulting in the development and regular use of many diverse systems within the same
organisation. Hence, in the long term the above advantages are overshadowed by the more serious
disadvantages of this approach, such as:

    • The existing systems continue to exist as legacy ISs running on increasingly outdated
      technology,
    • The need to maintain many different systems to perform similar tasks increases the
      maintenance and support costs of the organisation,
    • Data becomes duplicated in different systems which implies the maintenance of redundant data
      with its associated increased risk of inconsistency between the data copies if updating occurs,
    • The overall maintenance cost for hardware, software and support personnel increases as many
      platforms are being supported,
    • The performance of the integrated information functions of the organisation decreases due to
      the need to interface many disparate systems.

        To address the above issues, legacy ISs need to be evolved and migrated to new computing
environments, when their owning organisation upgrades. This migration should occur within a
reasonable time after the upgrade occurs. This means that it is necessary to migrate legacy ISs to
new target environments in order to allow the organisation to dispose of the technology which is
becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this
problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400
emulators for IBM S/360 and ICL’s DME emulators for 1900 and System 4 users). An alternative
strategy is achieved by translating [SHA93, PHI94, SHE94, BRO95] the software to run in new
environments (i.e. code-to-code level translation). The emulator approach perpetuates all the
software deficiencies of the legacy ISs, although it successfully removes the old-fashioned hardware
technology and so does enjoy the increased processing power of the new hardware. The translation
approach takes advantage of some of the modern technological benefits in the target environment as
the conversions - such as IBM’s JCL and ICL’s SCL code to Unix shell scripts, Assembler to
COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS
tables - are also done as part of the translation process. This approach, although a step forward, still
carries over most of the legacy code as legacy systems are not evolved by this process. For example,
the basic design is not changed. Hence the barrier to change and/or integration to a common sub-
system still remains, and the translated systems were not designed for the environment they are now
running in, so they may not be compatible with it.

       There are other approaches to overcoming this problem which have been used by enterprises
[SHA93, BRO95]. These include re-implementing systems under the new environment and/or
upgrading existing systems to achieve performance improvements. As computer technology
continues to evolve at an ever quicker pace, the need to migrate arises more rapidly. This means
most small organisations and individuals are left behind and are forced to work in a technologically
obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or
upgrading existing software, as this process involves time and manpower which cost money. The
gap between the older and newer system users will very soon create a barrier to information sharing
unless some tools are developed to assist the older technology users’ migration to new technology
environments. This assistance for the older technology users may take many forms, including tools
for: analysing and understanding existing systems; enhancing and modifying existing systems;
migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to
consider these requirements and many other aspects, as recently identified by Brodie and
Stonebraker in [BRO95]. Our work was primarily motivated by these business oriented legacy
database issues and by work in the area of extending relational database technology to enable it to
represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second
consideration is an important aspect of legacy system migration, since if a graceful migration is to be
achieved we must be able to enhance a legacy relational database with such knowledge to take full
advantage of the new system environment.

         1.1.2 Heterogeneous Distributed Environments

        As well as the problem of having to use legacy ISs, most large enterprises are faced with the
problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises
due to the increased use of different computer systems and software tools for information processing
within an organisation as time passes. The development of networking capabilities to manage and
share information stored over a network has made interoperability a requirement, and the broad
acceptance of local area networks in business enterprises has increased the need to perform this task
within organisations. Network file servers, client-server technology and the use of distributed
databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is
currently being used to create and process information held in heterogeneous databases, which
involves linking different databases in an interoperable environment. An aspect of this work is
legacy database interoperation, since as time passes these databases will have been built using
different generations of software.

        In recent years, the demand for distributed database capabilities has been fuelled mostly by
the decentralisation of business functions in large organisations to address customer needs, and by
mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a
strong requirement among enterprises for the ability to cross-correlate data stored in different
existing heterogeneous databases. This has led to the development of products referred to as
gateways, to enable users to link different databases together, e.g. Microsoft’s Open Database
Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases
together [COL94, RIC94]. There are similar products for other database vendors, such as Oracle (for
IBM's DB2, UNISYS's DMS and DEC RMS) [HOL93] and others (for INGRES, SYBASE, Informix
and other popular SQL DBMSs) [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted
cross-platform compatibility via SQL access protocols to support interoperability in a heterogeneous
environment. As heterogeneity in distributed systems may occur in various forms, ranging from
different hardware platforms, operating systems, networking protocols and local database systems,
cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed
database access. The biggest challenge comes in addressing heterogeneity due to differences in local
databases [OZS91, BEL92]. This challenge is also addressed in the design and development of our
system.

        Distributed DBMSs have become increasingly popular in organisations as they offer the
ability to interconnect existing databases, as well as having many other advantages [OZS91,
BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely:
homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent
nodes run the same DBMS and the databases can be designed in harmony with each other. This
simplifies both the processing of queries at different nodes and the passing of data between nodes. In
heterogeneous systems the situation is more complex, as each node can be running a different
DBMS and the constituent databases can be designed independently. This is the normal situation
when we are linking legacy databases, as the DBMS and the databases used are more likely to be
heterogeneous since they are usually implemented for different platforms during different
technological eras. In such a distributed database environment, heterogeneity may occur in various
forms, at different levels [OZS91, BEL92], namely:

    • The logical level (i.e. involving different database designs),
    • The data management level (i.e. involving different data models),
    • The physical level (i.e. involving different hardware, operating systems and network
      protocols), and
    • At all three or any pair of these levels.

         1.1.3 The Problems and Search for a Solution

        The concept of heterogeneity itself is valuable as it allows designers a freedom of choice
between different systems and design approaches, thus enabling them to identify those most suitable
for different applications. The exploitation of this freedom over the years in many organisations has
resulted in the creation of multiple local and remote information systems which now need to be
made interoperable to provide an efficient and effective information service to the enterprise
managers. Open Database Connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed
to support interoperability among databases managed by different DBMSs. Database vendors such
as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and
connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These
products allow limited data transfer and query facilities among databases to support interoperability
among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous
database access, still do not provide a solution to legacy IS where a primary concern is to evolve and
migrate the system to a target environment so that obsolete support systems can be retired.
Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable
of accessing older generation DBMSs, and, if they are, are unlikely to be able to enhance them to
take advantage of the newer technologies. Hence there is a need to create tools that will allow ODBC
equivalent functionality for older generation DBMSs. Our work provides such functionality for all
the DBMSs we have chosen for this research. It also provides the ability to enhance and evolve
legacy databases.



In order to evolve an information system, one needs to understand the existing system’s
structure and code. Most legacy information systems are not properly documented and hence
understanding such systems is a complex process. This means that changing any legacy code
involves a high risk as it could result in unexpected system behaviour. Therefore one needs to
analyse and understand existing system code before performing any changes to the system.

        Database system design and implementation tools have appeared recently which have the
aim of helping new information system development. Reverse and re-engineering tools are also
appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some
of these tools allow the examination of databases built using certain types of DBMSs; however, the
enhancements they allow are done within the limitations of that system. Due to continuous ongoing
technology changes, most current commercial DBMSs do not support the most recent software
modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features).
Hence a system built using current software tools is guaranteed to become a legacy system in the
near future (i.e. when new products with newer techniques and features begin to appear in the
commercial market place).

        Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an
existing database and hence they are an ideal starting point when trying to gain a comprehensive
understanding of the information held in the database and its current state, as they create a visual
picture of that state. However, in legacy systems the schemas are basic, since most of the
information used to compose a conceptual model is not available in these databases. Information
such as constraints that show links between entities is usually embedded in the legacy application
code and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues
while assisting in overcoming this barrier within the knowledge representation limitations of existing
DBMSs.

         1.1.4 Primary and Secondary Motivations

        The research reported in this thesis was therefore primarily prompted by the need to provide,
for a logically heterogeneous distributed database environment, a design tool that allows users not
only to understand their existing systems but also to enhance and visualise an existing database’s
structure using new techniques that are either not yet present in existing systems or not supported by
the existing software environment. It was also motivated by:

a) Its direct applicability in the business world, as the new technique can be applied to incrementally
    enhance existing systems and prepare them to be easily migrated to new target environments,
    hence avoiding continued use of legacy information systems in the organisation.

        Although previous work and some design tools address the issue of legacy information
system analysis, evolution and migration, these are mainly concerned with 3GL languages such as
COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new
issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model
[CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or
extended relational environments in a graceful migration from a relational system. There has been
some work in the related areas of identifying extended entity relationship structures in relational
schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and
   evolving them by enhancing them with modern concepts supported in more recent releases of
   software.

         Most design tools [COMP90, SHA93] which have been developed to assist in Entity-
Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling
[RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in
developing new systems. However, relatively few tools attempt to support a bottom-up approach
(i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT
diagrams. Among these tools only a very few allow enhancement of the pre-existing database
schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those
which do permit this action to some extent always operate on a single database management system
and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools
that permit only the bottom-up approach are referred to as reverse-engineering tools and those which
support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is
primarily concerned with creating re-engineering tools that assist legacy database migration.

        The commercially available re-engineering tools are customised for particular DBMSs and
are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-
engineering tools means that a substantial adaptation and reprogramming effort (costing time and
money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious
example that reflects this limitation arises in a heterogeneous distributed database environment
where there may be a need to visualise each participant database’s schema. In such an environment
if the heterogeneity occurs at the database management level (where each node uses a different
DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we
have to use two different re-engineering tools to display these schemas. This situation is exacerbated
for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy
databases are migrated to different DBMS environments as newer versions and better database
products have appeared since the original release of their DBMS. This means that a re-engineering
tool that assists legacy database migration must work in a heterogeneous environment so that its
use will not be restricted to particular types of ISs.

        Existing re-engineering tools provide a single target graphical data model (usually the E-R
model or a variant of it), which may differ in presentation style between tools and therefore inhibits
the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed
database environment. This limitation means that users may need to use different tools to provide the
required uniformity of display in such an environment. The ability to visualise the conceptual model
of an information system using a user-preferred graphical data model is important as it ensures that
no inaccurate enhancements are made to the system due to any misinterpretation of graphical
notations used.

c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent
    legacy data, as preparation for migration or as an enhancement of the database’s quality.



The inability to define and apply rules and constraints in early database systems, due to
system limitations, meant that constraints were not used to increase the accuracy and consistency of
the data held by these systems. This limitation is now a barrier to information system migration as a
new target DBMS is unable to enforce constraints on a migrated database until all violations are
investigated and resolved either by omitting the violating data or by cleaning it. This investigation
may also show that a constraint has to be adjusted as the violating data is needed by the organisation.
The enhancement of such a system by rules and constraints provides knowledge that is usable to
determine possible data violations. The process of detecting constraint violations may be done by
applying queries that are generated from these enhanced constraints. Similar methods have been
used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional
answers [FON92, MOT89]. This is essential as constraints may have been implemented at the
application coding level and that can lead to their inconsistent application.
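
As a sketch of this idea (reusing the hypothetical employee and department tables introduced earlier in this chapter, not the thesis's actual test databases), a referential constraint that an early DBMS cannot enforce can be turned mechanically into a query whose answer is exactly the violating data:

    -- Intended (but unenforced) constraint: every employee.dept_no must
    -- match an existing department.dept_no.
    -- Generated violation-detection query: any rows returned are inconsistent
    -- data to be cleaned, omitted or used to adjust the constraint before migration.
    SELECT e.emp_no, e.dept_no
    FROM   employee e
    WHERE  e.dept_no IS NOT NULL
      AND  NOT EXISTS (SELECT 1
                       FROM   department d
                       WHERE  d.dept_no = e.dept_no);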

d) An awareness of the potential contribution that knowledge-based systems and meta-programming
   technologies, in association with extended relational database technology, have to offer in coping
   with semantic heterogeneity.

        The successful production of a conceptual model is highly dependent on the semantic
information available, and on the ability to reason about these semantics. A knowledge-based system
can be used to assist in this task, as the process of generalising the effective exploitation of semantic
information from pre-existing heterogeneous databases involves three sub-processes, namely:
knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts
the existing knowledge from a database’s data dictionaries. This knowledge may include subsequent
enhancements made by the user, as the use of a database to store such knowledge will provide easy
access to this information along with its original knowledge. The knowledge representation process
represents existing and enhanced knowledge. The knowledge manipulation process is concerned
with deriving new knowledge and ensuring consistency of existing knowledge. These stages are
addressable using specific processes. For instance, the reverse-engineering process used to produce a
conceptual model can be used to perform the knowledge acquisition task. Then the derived and
enhanced knowledge can be stored in the same database by adopting a process that will allow us to
distinguish this knowledge from its original meta-data. Finally, knowledge manipulation can be done
with the assistance of a Prolog based system [GRA88], while data and knowledge consistency can be
verified using the query language of the database.
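
As an illustration of the knowledge acquisition step (assuming an Oracle 7 source database and its standard data dictionary views; the meta-data access actually used for our test DBMSs is described in Chapter 6), the existing schema knowledge can be extracted with ordinary queries against the dictionary:

    -- Attribute names and data types of every user table
    SELECT table_name, column_name, data_type, nullable
    FROM   user_tab_columns
    ORDER  BY table_name, column_id;

    -- Declared primary (P) and foreign (R) key constraints, where the system
    -- records them; these seed the conceptual model before user enhancement
    SELECT table_name, constraint_name, constraint_type, r_constraint_name
    FROM   user_constraints
    WHERE  constraint_type IN ('P', 'R');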

1.2 Goals of the Research

         The broad goals of the research reported in this thesis are highlighted here, with detailed aims
and objectives presented in section 2.4. These goals are to investigate interoperability problems,
schema enhancement and migration in a heterogeneous distributed database environment, with
particular emphasis on extended relational systems. This should provide a basis for the design and
implementation of a prototype software system that brings together new techniques from the areas of
knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of
facilitating schema enhancement, by means of generalising the efficient representation of constraints
using the current standards. Such a system is a tool that would be a valuable asset in a logically
heterogeneous distributed extended relational database environment as it would make it possible for
global users to incrementally enhance legacy information systems. This offers the potential for users
in this type of environment to work in terms of such a global schema, through which they can
prepare their legacy systems to easily migrate to target environments and so gain the benefits of
modern computer technology.

1.3 Original Achievements of the Research

        The importance of this research lies in establishing the feasibility of enhancing, cleaning and
migrating heterogeneous legacy databases using meta-programming technology, knowledge-based
system technology, database system technology and O-O conceptual data modelling concepts, to
create a comprehensive set of techniques and methods that form an efficient and useful generalised
database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring
are also demonstrated and assessed.

       A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES)
[WIK95a] has been developed as a result of the research. To be more specific, our work has made
four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very
   important application area of enhancing and evolving heterogeneous distributed legacy databases
   to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their
   visual presentation and hence provide a better understanding of existing applications [WIK95b].
   This process is applicable to any relational database application, including those which are unable
   to naturally support the specification and enforcement of constraints. More importantly, this
   process does not affect the performance of an existing application.

3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for
   knowledge representation in our research. This project provides an extension to the
   representation of the relational data model to cope with automated reuse of knowledge in the re-
   engineering process. In order to cope with technological changes that result from the emergence
   of new systems or new versions of existing DBMSs, we also propose a series of extended
   relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs
   [WIK95b] (a sketch of this idea is given at the end of this section).

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an
   easy and convenient method of detecting any constraint violating data in existing systems. The
   application of this technique in the context of a heterogeneous environment for legacy
   information systems is a significant step towards detecting and cleaning inconsistent data in
   legacy systems prior to their migration. This is essential if a graceful migration is to be effected
   [WIK95c].
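
Returning to point 3: a minimal sketch of the extended-system-table idea follows. The table and column names here are illustrative only and do not reproduce the actual extended system tables defined later in the thesis; the principle is that constraint meta-data is stored in ordinary relations, shaped along the lines of SQL-3 style catalog tables, inside a DBMS whose own catalog cannot record it:

    -- Illustrative user-level "system" tables holding enhanced constraint
    -- meta-data for a legacy relational DBMS.
    CREATE TABLE x_table_constraints (
        constraint_name VARCHAR(32) NOT NULL PRIMARY KEY,
        table_name      VARCHAR(32) NOT NULL,
        constraint_type VARCHAR(12) NOT NULL  -- e.g. PRIMARY KEY, FOREIGN KEY, CHECK
    );

    CREATE TABLE x_referential_constraints (
        constraint_name       VARCHAR(32) NOT NULL PRIMARY KEY,
        referenced_table      VARCHAR(32) NOT NULL,
        referenced_constraint VARCHAR(32) NOT NULL
    );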

1.4 Organisation of the Thesis




The thesis is organised into 8 chapters. This first chapter has given an introduction to the
research done, covering background and motivations, and outlining original achievements. The rest
of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and
objectives for the work undertaken. It begins by identifying the scope of the work in terms of
research constraints and development technologies. This is followed by an overview of the research
undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous
distributed database environment is given. Finally, detailed aims and objectives are drawn together
to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents
its development along with its terminology, features and query languages. This is followed by a
discussion of conceptual data models with special emphasis on the data models and symbols used in
our project. Finally, we pay attention to key concepts related to our project, mainly the notion of
semantic integrity constraints and extensions to the relational model. Here, we present important
integrity constraint extensions to the relational model and its support using different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences
with an introduction to legacy and our target information systems. This is followed by migration
strategies and methods for such ISs. Finally, we conclude by referring to current techniques and
identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used
for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational
legacy database is described next. This is followed by a process for detecting possible keys and
structures of legacy databases. Our schema enhancement and knowledge representation techniques
are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due
to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy
heterogeneous distributed database environment and its access processes. Initially, we present the
design of our test databases, the selection of our test DBMSs and the prototype system environment.
This is followed by the application of our re-engineering approach to our test databases. Finally, the
organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our conceptual constraint
visualisation and enhancement system (CCVES) in terms of the design, structure and operation of its
interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from
INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in
detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could
be made to the system. A discussion of potential applications is presented. Finally we conclude the
chapter by drawing conclusions about the research project as a whole.




CHAPTER 2

                   Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been
undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the
scope of the project. Secondly, an overview of the research approach we have adopted in dealing
with heterogeneous distributed legacy database evolution and migration is given in section 2.2.
Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a
heterogeneous distributed database environment using the existing meta-programming technology
developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in
section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the
stated aims.

2.1 Scope of the Project

       We identify the scope of the work in terms of research constraints and the limitations of
current development technologies. An overview of the problem is presented along with the
drawbacks and limitations of database software development technology in addressing the
problem. This will assist in identifying our interests and focussing the issues to be addressed.

       2.1.1 Overview of the Problem

        In most database designs, a conceptual design and modelling technique is used in
developing the specifications at the user requirements and analysis stage of the design. This stage
usually describes the real world in terms of object/entity types that are related to one another in
various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray
the current information content of existing databases, as the original designs are usually either
lost, or inappropriate because the database has evolved from its original design. The resulting
pictorial representation of a database can be used for database maintenance, for database re-
design, for database enhancement, for database integration or for database migration, as it gives its
users a sound understanding of an existing database’s architecture and contents.

        Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture
and presentation of database definitions from an existing database, and the analysis and display of
this information at a higher level of abstraction. Furthermore, these tools are either restricted to
accessing a specific database management system’s databases or permit modelling with only a
single given display formalism, usually a variant of the EER [COMP90]. Consequently, there is a
need to cater for multiple database platforms and different user needs, allowing access to the set of
databases comprising a heterogeneous database and providing a facility to visualise each database
using a preferred conceptual modelling technique that is familiar to the different user
communities of the heterogeneous system.

        The fundamental modelling constructs of current reverse and re-engineering tools are
entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules
and constraints in modern DBMSs provides their users with a better understanding of the
underlying database as its data conforms to these constraints. This may not necessarily be true for
legacy systems, which may have constraints defined that were not enforced. The ability to
visualise rules and constraints as part of the conceptual model increases user understanding of a
database. Users could also exploit this information to formulate queries that more effectively
utilise the information held in a database. Having these features in mind, we concentrated on
providing a tool that permits specification and visualisation of constraints as part of the graphical
display of the conceptual model of a database. With modern technology increasing the number of
legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95],
the availability of such a visualisation tool will be more important in future as it will let users see
the full definition of the contents of their databases in a familiar format.

         Three types of abstraction mechanism, namely: classification, aggregation and
generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not
maintain sufficient meta-data information to assist in identifying all these abstraction mechanisms
within their data models. This means that reverse and re-engineering tools are semi-automated, in
that they extract information, but users have to guide them and decide what information to look
for [WAT94]. This requires interactions with the database designer in order to obtain missing
information and to resolve possible conflicts. Such additional information is supplied by the tool
users when performing the reverse-engineering process. As this additional information is not
retained in the database, it must be re-entered every time a reverse engineering process is
undertaken if the full representation is to be achieved. To overcome this problem, knowledge
bases are being used to retain this information when it is supplied. However, this approach
restricts the use of this knowledge by other tools which may exist in the database’s environment.
The ability to hold this knowledge in the database itself would enhance an existing database with
information that can be widely used. This would be particularly useful in the context of legacy
databases as it would enrich their semantics. One of the issues considered in this thesis is how this
can be achieved.

        Most existing relational database applications record only entities and their properties (i.e.
attribute names and data types) as system meta-data. This is because these systems conformed to
early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5
and Oracle version 5). However, more recent relational systems record additional information
such as constraint and rule definitions, as they conform to the SQL/92 standards [ANSI92] (e.g.
Oracle version 7). This additional information includes, for example, primary and foreign key
specifications, and can be used to identify classification and aggregation abstractions used in a
conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the
full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies. This
means that early relational database applications are now legacy systems as they fail to naturally
represent additional information such as constraint and rule definitions. Such legacy database
systems are being migrated to modern database systems not only to gain the benefits of the current
technology but also to be compatible with new applications built with the modern technology. The
SQL standards are currently subject to review to permit the representation of extra knowledge
(e.g. object-oriented features), and we have anticipated some of these proposals in our work - i.e.
SQL-3 [ISO94] (during the life-time of this project the SQL-3 standards moved from a preliminary
draft, through several modifications, before being finalised in 1995) will be adopted by commercial
systems and thus the current modern DBMSs will become legacy databases in the near future or
already may be considered to be legacy
databases in that their data model type will have to be mapped onto the newer version. Having
experienced the development process of recent DBMSs it is inevitable that most current databases
will have to be migrated, either to a newer version of the existing DBMS or to a completely
different newer technology DBMS for a variety of reasons. Thus the migration of legacy
databases is perceived to be a continuing requirement, in any organisation, as technology
advances continue to be made.
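
To make the difference concrete (a hypothetical pair of definitions, not taken from our test databases): under an SQL/86-era system only attribute names and types reach the catalog, whereas the SQL/92 definition of the same table also records the key and referential meta-data from which classification and aggregation abstractions can be recovered:

    -- SQL/86-era definition: the catalog records attributes only
    CREATE TABLE enrolment (
        student_no INTEGER,
        course_no  INTEGER,
        grade      CHAR(2)
    );

    -- SQL/92 definition of the same table: the declared keys let a
    -- reverse-engineering tool infer the links to student and course
    CREATE TABLE enrolment (
        student_no INTEGER NOT NULL REFERENCES student (student_no),
        course_no  INTEGER NOT NULL REFERENCES course (course_no),
        grade      CHAR(2),
        PRIMARY KEY (student_no, course_no)
    );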

        Most migrations currently being undertaken are based on code-to-code level translations of
the applications and associated databases to enable the older system to be functional in the target
environment. Minimal structural changes are made to the original system and database, thus the
design structures of these systems are still old-fashioned, although they are running in a modern
computing environment. This means that such systems are inflexible and cannot be easily
enhanced with new functions or integrated with other applications in their new environment. We
have also observed that more recent database systems have often failed to benefit from modern
database technology due to inherent design faults that have resulted in the use of unnormalised
structures, which lead to the omission of integrity constraint enforcement even where the DBMS
supports it. The ability to create and use databases without the benefit of a database design course is
one reason for such design faults. Hence there is a need to assist existing systems to be evolved,
not only to perform new tasks but also to improve their structure so that these systems can
maximise the gains they receive from their current technology environment and any environment
they migrate to in the future.
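
A small hypothetical example of such a design fault (not one of our test cases): an unnormalised order table that repeats customer details in every row leaves no way to declare the customer as a separately keyed entity, so the corresponding integrity constraints are simply omitted; restructuring the table allows them to be stated and enforced:

    -- Unnormalised: customer details repeated per order, no enforceable keys
    CREATE TABLE order_unf (
        order_no      INTEGER,
        customer_no   INTEGER,
        customer_name VARCHAR(30),
        customer_addr VARCHAR(60),
        order_date    DATE
    );

    -- Restructured: the dependency becomes a declared, enforceable constraint
    CREATE TABLE customer (
        customer_no   INTEGER     NOT NULL PRIMARY KEY,
        customer_name VARCHAR(30) NOT NULL,
        customer_addr VARCHAR(60)
    );

    CREATE TABLE customer_order (
        order_no    INTEGER NOT NULL PRIMARY KEY,
        customer_no INTEGER NOT NULL REFERENCES customer (customer_no),
        order_date  DATE
    );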


       2.1.2 Narrowing Down the Problem

        Technological advances in both hardware and software have improved the performance
and maintenance functionality of all information systems (ISs), and as a result, older ISs suffer
from comparatively poor performance and inappropriate functionality when compared with more
modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been
around for many years, and run on old-fashioned mainframes. Problems associated with legacy
systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95].
These systems basically have three functional components, namely: interface, application and a
database service, which are sometimes inter-related, depending on how they were used during the
design and implementation stages of the IS development. This means that the
complexity of a legacy IS depends on what occurred during the design and implementation of the
system. These systems may range from a simple single user database application using separate
interfaces and applications, to a complex multi-purpose unstructured application. Due to the
complex nature of the problem area we do not address this issue as a whole, but focus only on
problems associated with one sub-component of such legacy information systems, namely the
database service. This in itself is a wide field, and we have further restricted ourselves to legacy
ISs using a specific DBMS for their database service. We considered data models ranging from
original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs.
From these data models we have chosen the traditional relational model for the following reasons.

      • The relational model is currently the most widely used database model.
      • During the last two decades the relational model has been the most popular model;
     therefore it has been used to develop many database applications and most of these are now
     legacy systems.
     • There have been many extensions and variations of the relational model, which has
     resulted in many heterogeneous relational database systems being used in organisations.
     • The relational model can be enhanced to represent additional semantics currently
     supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

        As most business requirements change with time, the need to enhance and migrate legacy
information systems exists for almost every organisation. We address problems faced by these
users while seeking a solution that prevents new systems from becoming legacy systems in the near
future. The selection of the relational model as our database service to demonstrate how one could
meet these needs means that we shall be addressing only relational legacy database systems and
not looking at any other type of legacy information systems.

         This decision means we are not considering the many common legacy IS migration
problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-
file structures or hierarchical databases into modern extended relational databases; migration of
legacy applications with millions of lines of code written in some COBOL-like language into a
modern 4GL/GUI environment). However, as shown later, addressing the problems associated
with relational legacy databases has enabled us to identify and solve problems associated with
more recent DBMSs, and it also assists in identifying precautions which, if implemented by
designers of new systems, will minimise the chance of similar problems being faced by these
systems as IS developments occur in the future.

2.2 Overview of the Research Approach

       Having presented an overview and narrowing down of our problem, we identify the
following as the main functionalities that should be provided to fulfil our research goal:

     • Reverse-engineering of a relational legacy database to fully portray its current information
     content.
     • Enhancing a legacy database with new knowledge to identify modelling concepts that
     should be available to the database concerned or to applications using that database.
     • Determining the extent to which the legacy database conforms to its existing and enhanced
     descriptions.
     • Ensuring that the migrated IS will not become a legacy IS in the future.

        We need to consider the heterogeneity issue in order to be able to reverse-engineer any
given relational legacy database. Three levels of heterogeneity are present for a particular data
model, namely: at a physical, logical and data management level. The physical level of
heterogeneity usually arises due to different data model implementation techniques, use of
different computer platforms and use of different DBMSs. The physical / logical data
independence of DBMSs hides implementation differences from users; hence we need only
address how to access databases that are built using different DBMSs, running on different
computer platforms.


Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the
different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a
particular database query language (e.g. SQL or QUEL) and different relational data model
features (e.g. handling of integrity constraints and availability of object-oriented features). To
tackle heterogeneity at the logical level, we need to be aware of different standards, and to model
ISs supporting different features and query languages.

        Heterogeneity at the data management level arises due to the physical limitations of a
DBMS, differences in the logical design and inconsistencies that occurred when populating the
database. Logical differences in different database schemas have to be resolved only if we are
going to integrate them. The schema integration process is concerned with merging different
related database applications. Such a facility can assist the migration of heterogeneous database
systems. However any attempt to integrate legacy database schemas prior to the migration process
complicates the entire process as it is similar to attempting to provide new functionalities within
the system which is being migrated. Such attempts increase the chance of failure of the overall
migration process. Hence we consider any integration or enhancements in the form of new
functionalities only after successfully migrating the original legacy IS. However, the physical
limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand
to ensure a successful migration.

       Our work addresses the heterogeneity issues associated with database migration by
adopting an approach that allows users to incrementally increase the number of DBMSs the
system can handle without having to reprogram its main application modules. Here, the user needs
to supply specific knowledge about the new DBMS's schema and query language constructs. This
is held together with the knowledge of the DBMSs already supported and has no effect on the
application's main processing modules.

       2.2.1 Meta-Programming

        Meta-programming technology allows the meta-data (schema information) of a database to
be held and processed independently of its source specification language. This allows us to work
in a database-language-independent environment and hence overcome many logical heterogeneity
issues. Prolog based meta-programming technology has been used in previous research at Cardiff
in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation
of database query languages [HOW87] and database schemas [RAM91] has been performed. This
work has shown how the heterogeneity issues of different DBMSs can be addressed without
having to reprogram the same functionality for each and every DBMS. We use meta-programming
technology for our legacy database migration approach as we need to be able to start with a legacy
source database and end with a modern target database where the respective database schema and
query languages may be different from each other. In this approach the source database schema or
query language is mapped on input into an internal canonical form. All the required processing is
then done using the information held in this internal form. This information is finally mapped to
the target schema or query language to produce the desired output. The advantage of this approach
is that processing is not affected by heterogeneity as it is always performed on data held in the
canonical form. This canonical form is an enriched collection of semantic data modelling features.
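
To illustrate the kind of logical heterogeneity this canonical form absorbs (a hypothetical request against the employee table from the earlier illustration in Chapter 1; the QUEL form is shown as a comment for comparison), the same retrieval is expressed differently in each source language but is reduced on input to a single internal representation, so all later processing is unaffected by the source DBMS:

    -- SQL form of the request (e.g. for an Oracle node)
    SELECT emp_name
    FROM   employee
    WHERE  dept_no = 10;

    -- QUEL form of the same request (e.g. for an INGRES version 5 node):
    --   range of e is employee
    --   retrieve (e.emp_name) where e.dept_no = 10
    -- Both are mapped into the canonical form before any further processing.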


       2.2.2 Application

        We view our migration approach as consisting of a series of stages, with the final stage
being the actual migration and earlier stages being preparatory. At stage 1, the data definition of
the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2
of figure 2.1). However, in legacy systems much of the information needed to present the database
schema in this way is not available as part of the database meta-data and hence these links which
are present in the database cannot be shown in this conceptual model. In modern systems such
links can be identified using constraint specifications. Thus, if the database does not have any
explicit constraints, or it does but these are incomplete, new knowledge about the database needs
to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced
schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will
identify new links that should be present for the database concerned. These new database
constraints can next be applied experimentally to the legacy database to determine the extent to
which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1).
The user can then decide whether these constraints should be enforced to improve the quality of
the legacy database prior to its migration. At this point the three preparatory stages in the
application of our approach are complete. The actual migration process is then performed. All
stages are further described below to enable us to identify the main processing components of our
proposed system as well as to explain how we deal with different levels of heterogeneity.

       Stage 1: Reverse Engineering

        In stage 1, the data definition of the selected database is reverse-engineered to produce a
graphical display of the database. To perform this task, the database’s meta-data must be extracted
(cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database.
The accessed meta-data needs to be represented using our internal form. This is achieved through
a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos
[RAM91]. The meta-data in our internal formalism then needs to be processed to derive the
graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These
constructs are in the form of entity types and the relationships and their derivation process is the
main processing component in stage 1. The identified graphical constructs are mapped to a display
description language to produce a graphical display of the database.




        [Figure 2.1: Information flow in the 3 stages of our approach prior to migration. The
diagram links the heterogeneous databases, through an internal processing layer, to the schema
visualisation (EER or OMT) with constraints (paths A-1, A-2), to the enhanced constraints entered
by the user (paths B-1, B-2, B-3) and to the enforced constraints (paths C-1, C-2), corresponding to
Stage 1 (Reverse Engineering), Stage 2 (Knowledge Augmentation) and Stage 3 (Constraint
Enforcement).]


       a) Database connectivity for heterogeneous database access

        Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which
addressed heterogeneity at the logical and data management levels, our system looks at the
physical level as well. While these previous systems processed schemas in textual form and did
not access actual databases to extract their DDL specification, our system addresses physical
heterogeneity by accessing databases running on different hardware / software platforms (e.g.
computer systems, operating systems, DBMSs and network protocols). Our aim is to directly
access the meta-data of a given database application by specifying its name, the name and version
of the host DBMS, and the address of the host machine4. If this database access process can
produce a description of the database in DDL formalism, then this textual file is used as the
starting point for the meta-translation process as in previous Cardiff systems [RAM91, QUT92].
We found that it is not essential to produce such a textual file, as the required intermediate
representation can be directly produced by the database access process. This means that we could
also by-pass the meta-translation process that performs the analysis of the DDL text to translate it
into the intermediate representation5. However the DDL formalism of the schema can be used for
optional textual viewing and could also serve as the starting point for other tools6 developed at
Cardiff for meta-programming database applications.

       The initial functionality of the Stage 1 database connectivity process is to access a
heterogeneous database and supply the accessed meta-data as input to our schema meta-translator

   4 We assume that access privileges for this host machine and DBMS have been granted.
   5 A list of tokens ready for syntactic analysis in the parsing phase is produced and processed based on the BNF syntax specification of the DDL [QUT92].
   6 e.g. The Schema Meta-Integration System (SMIS) of Qutaishat [QUT92].


(SMTS). This module needs to deal with heterogeneity at the physical and data management
levels. We achieve this by using DML commands of the specific DBMS to extract the required
meta-data held in its data dictionary, whose system tables can be queried in the same way as
user-defined tables.
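
        For illustration, dictionary access of this kind might take the following form; the catalogue
and column names shown (all_tab_columns for an Oracle-style dictionary, iicolumns for an
INGRES-style one) are indicative only and are not necessarily the exact statements issued by our
system:

            -- Oracle-style data dictionary: column meta-data for one application table
            SELECT table_name, column_name, data_type, nullable
            FROM   all_tab_columns
            WHERE  table_name = 'STUDENT';

            -- INGRES-style standard catalogue: the equivalent request against iicolumns
            SELECT table_name, column_name, column_datatype
            FROM   iicolumns
            WHERE  table_name = 'student';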

       Relatively recently, the functionalities of a heterogeneous database access process have
been provided by means of drivers such as ODBC [RIC94]. Use of such drivers will allow access
to any database supported by them and hence obviate the need to develop specialised tools for
each database type as happened in our case. These driver products were not available when we
undertook this stage of our work.

       b) Schema meta-translation

        The schema meta-translation process [RAM91] accepts input of any database schema
irrespective of its DDL and features. The information captured during this process is represented
internally to enable it to be mapped from one database schema to another or to further process and
supply information to other modules such as the schema meta-visualisation system (SMVS)
[QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal
canonical form for meta representation has successfully accommodated heterogeneity at the data
management and logical levels.

       c) Schema meta-visualisation

        Schema visualisation using graphical notation and diagrams has proved to be an important
step in a number of applications, e.g. the initial stages of database design, database maintenance,
re-design, enhancement, integration and migration, as it gives users a sound understanding of an
existing database's structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual
picture of their database structure instead of textual descriptions of the defining schema as it is
easier for them to comprehend a picture. This has led to the production of graphical
representations of schema information, effected by a reverse engineering process. Graphical data
models of schemas employ a set of data modelling concepts and a language-independent graphical
notation (e.g. the Entity Relationship (E-R) model [CHE76], Extended/Enhanced Entity
Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a
heterogeneous environment different users may prefer different graphical models, and may require an
understanding of the database structure and architecture beyond that given by the traditional
entities and their properties. Therefore, there is a need to produce graphical models of a database's
schema using different graphical notations such as either E-R/EER or OMT, and to accompany
them with additional information such as a display of the integrity constraints in force in the
database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-
object constraints and gain a better understanding of domain restrictions applicable to particular
entities. Current reverse engineering tools do not support this type of display.

        The generated graphical constructs are held internally in a similar form to the meta-data of
the database schema. Hence using a schema meta visualisation process (SMVS) it is possible to
map the internally held graphical constructs into appropriate graphical symbols and coordinates
for the graphical display of the schema. This approach has a similarity to the SMTS, the main
difference being that the output is graphical rather than textual.

       Stage 2: Knowledge Augmentation

        In a heterogeneous distributed database environment, evolution is expected, especially in
legacy databases. This evolution can affect the schema description and in particular schema
constraints that are not reflected in the stage 1 (path A-2) graphical display as they may be
implicit in applications. Thus our system is designed to accept new constraint specifications (cf.
path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that
these hidden constraints become explicit.

        The new knowledge accepted at this point is used to enhance the schema and is retained in
the database using a database augmentation process (cf. path B-3 of figure 2.1). The new
information is stored in a form that conforms with the enhanced target DBMS’s methods of
storing such information. This assists the subsequent migration stage.

       a) Schema enhancement

        Our system needs to permit a database schema to be enhanced by specifying new
constraints applicable to the database. This process is performed via the graphical display. These
constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check
constraints) and structural components (e.g. inheritance hierarchies, entity modifications) are
specified using a GUI. When they are entered they will appear in the graphical display.
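
        For example, the enhancements captured through the GUI correspond to constraint
definitions of the following kind; the table and column names are illustrative only and do not come
from a particular test database:

            -- Illustrative constraint enhancements expressed in SQL
            ALTER TABLE student
              ADD CONSTRAINT student_pk PRIMARY KEY (sno);

            ALTER TABLE enrolment
              ADD CONSTRAINT enrolment_fk FOREIGN KEY (sno) REFERENCES student (sno);

            ALTER TABLE student
              ADD CONSTRAINT sno_pattern CHECK (sno LIKE 'S%');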

       b) Database augmentation

        The input data to enhance a schema provides new knowledge about a database. It is
essential to retain this knowledge within the database itself, if it is to be readily available for any
further processing. Typically, this information is retained in the knowledge base of the tool used
to capture the input data, so that it can be reused by the same tool. This approach restricts the use
of this knowledge by other tools and hence it must be re-entered every time the re-engineering
process is applied to that database. This makes it harder for the user to gain a consistent
understanding of an application, as different constraints may be specified during two separate re-
engineering processes. To overcome this problem, we augment the database itself using the
techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3
structures we store the information in our own augmented table format which is a natural
extension of the SQL-3 approach.
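
        A minimal sketch of such an augmented table is given below; the table name and columns
are illustrative assumptions rather than the exact format used by our system:

            -- Hypothetical augmented table recording constraints that the host DBMS
            -- cannot itself store or enforce
            CREATE TABLE ccves_constraints (
                constraint_name  CHAR(32)  NOT NULL,  -- identifier of the constraint
                table_name       CHAR(32)  NOT NULL,  -- table the constraint applies to
                constraint_type  CHAR(12)  NOT NULL,  -- e.g. PRIMARY KEY, FOREIGN KEY, CHECK
                definition       CHAR(240) NOT NULL   -- SQL-3 style textual definition
            );

            INSERT INTO ccves_constraints
            VALUES ('sno_pattern', 'student', 'CHECK', 'sno LIKE ''S%''');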

        When a database is augmented using this method, the new knowledge is available in the
database itself. Hence, any further re-engineering processes need not make requests for the same
additional knowledge. The augmented tables are created and maintained in a similar way to user-
defined tables, but have a special identification to distinguish them. Their structure is in line with
the international standards and the newer versions of commercial DBMSs, so that the enhanced
database can be easily migrated to either a newer version of the host DBMS or to a different
DBMS supporting the latest SQL standards. Migration should then mean that the newer system
can enforce the constraints. Our approach should also mean that it is easy to map our tables for
holding this information into the representation used by the target DBMS even if it is different, as
we are mapping from a well defined structure.

       Legacy databases that do not support explicit constraints can be enhanced by using the
above knowledge augmentation method. This requirement is less likely to occur for databases
managed by more recent DBMSs as they already hold some constraint specification information
in their system tables. The direction taken by Oracle version 6 was a step towards our
augmentation approach, as it allowed the database administrator to specify integrity constraints
such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of
Oracle, i.e. version 7, implemented this constraint enforcement process.


       Stage 3: Constraint Enforcement

        The enhanced schema can be held in the database, but the DBMS can only enforce these
constraints if it has the capability to do so. This will not normally be the case in legacy systems. In
this situation, the new constraints may be enforced via a newer version of the DBMS or by
migrating the database to another DBMS supporting constraint enforcement. However, the data
being held in the database may not conform to the new constraints, and hence existing data may
be rejected by the target DBMS in the migration, thus losing data and / or delaying the migration
process. To address this problem and to assist the migration process, we provide an optional
constraint enforcement process module which can be applied to a database before it is migrated.
The objective of this process is to give users the facility to ensure that the database conforms to all
the enhanced constraints before migration occurs. This process is optional so that the user can
decide whether these constraints should be enforced to improve the quality of the legacy data prior
to its migration, whether it is best left as it stands, or whether the new constraints are too severe.

       The constraint definitions in the augmented schema are employed to perform this task. As
all constraints held have already been internally represented in the form of logical expressions,
these can be used to produce data manipulation statements suitable for the host DBMS. Once
these statements are produced, they are executed against the current database to identify the
existence of data violating a constraint.
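
        For instance, a proposed primary key, foreign key or check constraint can be tested with
queries of the following kind, which retrieve the offending rows; the table and column names are
again illustrative:

            -- Rows violating a proposed primary key (duplicate SNO values)
            SELECT sno, COUNT(*)
            FROM   student
            GROUP BY sno
            HAVING COUNT(*) > 1;

            -- Rows violating a proposed foreign key from enrolment.sno to student.sno
            SELECT *
            FROM   enrolment e
            WHERE  NOT EXISTS (SELECT * FROM student s WHERE s.sno = e.sno);

            -- Rows violating the pattern constraint on SNO
            SELECT *
            FROM   student
            WHERE  sno NOT LIKE 'S%';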

       Stage 4: Migration Process

        The migration process itself is incrementally performed by initially creating the target
database and then copying the legacy data over to it. The schema meta-translation (SMTS)
technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can
be copied using the import / export tools of source and target DBMS or DML statements of the
respective DBMSs. During this process, the legacy applications must continue to function until
they too are migrated. To achieve this an interface can be used to capture and process all database
queries of the legacy applications during migration. This interface can decide how to process
database queries against the current state of the migration and redirect those that now relate to data
held in the target database. The query meta-translation (QMTS) technique of Howells [HOW87] can be used
to convert these queries to the target DML. This approach will facilitate transparent migration for
legacy databases. Our work does not involve the development of an interface to capture and
process all database queries, as interaction with the query interface of the legacy IS is embedded
in the legacy application code. However, we demonstrate how to create and populate a legacy
database schema in the desired target environment while showing the role of SMTS and QMTS in
such a process.
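
        A minimal sketch of the create-and-copy step is shown below, assuming both source and
target DBMSs accept SQL and that the legacy table is visible to the target system, for example
through an export/import facility; the table and column names are illustrative:

            -- Create the target table from the schema produced by SMTS
            CREATE TABLE student (
                sno      CHAR(6)  NOT NULL PRIMARY KEY,
                name     CHAR(30),
                address  CHAR(40)
            );

            -- Copy the legacy data; in practice such statements are generated via QMTS
            -- and issued through the DBC process, or replaced by export/import tools
            INSERT INTO student (sno, name, address)
            SELECT sno, name, address
            FROM   legacy_student;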

2.3 The Role of CCVES in the Context of Heterogeneous Distributed Databases

       Our approach described in section 2.2 is based on preparing a legacy database schema for
graceful migration. This involves visualisation of database schemas with constraints and
enhancing them with constraints to capture more knowledge. Hence we call our system the
Conceptualised Constraint Visualisation and Enhancement System (CCVES).

        CCVES has been developed to fit in with the previously developed schema (SMTS)
[RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta-
visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of
CCVES, SMTS, QMTS and SMVS during Heterogeneous Distributed Database access in a
uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and
promotes interoperability in a heterogeneous environment at logical, physical and data
management levels.

        Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous
distributed databases. It outlines in general terms the process of accessing a remote (legacy)
database to perform various database tasks, such as querying, visualisation, enhancement,
migration and integration.

        There are seven sub-processes: the schema mapping process [RAM91], query mapping
process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93],
database connectivity process, database enhancement process and database migration process. The
first two processes together have been called the Integrated Translation Support Environment
[FID92], and the first four processes together have been called the Meta-Integration/Translation
Support Environment [QUT92]. The last three processes were introduced as CCVES to perform
database enhancement and migration in such an environment.

        The schema mapping process, referred to as SMTS, translates the definition of a source
schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The
query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an
SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles
heterogeneity at the logical level in a distributed environment containing multiple database
schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates
the local schemas to create the global schema. The meta-visualisation process, referred to as
SMVS, generates a graphical representation of a schema. The remaining three processes, namely:
database connectivity, enhancement and migration with their associated processes, namely:
SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES
(centre section of figure 2.2).

        The database connectivity process (DBC) queries meta-data from a remote database (route
A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping
process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation
which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for
further processing (route A-3 in figure 2.2) which results in the production of a graphical view of
the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to
identify entity and relationship types to be used in the graphical model. Meta-knowledge
enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in
figure 2.2), which allows the definition of new constraints and changes to the existing schema.
These enhancements are reflected in the graphical view (route B-2 and B-3 in figure 2.2) and may
be used to augment the database (route B-4 to B-8 in figure 2.2). This approach to augmentation
makes use of the query mapping process, referred to as QMTS, to generate the required queries to
update the database via the DBC process. At this stage any existing or enhanced constraints may
be applied to the database to determine the extent to which it conforms to the new enhancements.
Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS
due to possible violations. Finally, the database migration process, referred to as DBMI, assists
migration by incrementally migrating the database to the target environment (route C-1 to C-6 in
figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and
DDL statements are issued to the target DBMS to create the new database schema. The data for
these migrated tables are extracted by instructing the source DBMS to export the source data to
the target database via QMTS. Here too, the queries which implement this export are issued to the
DBMS via the DBC process.

2.4 Research Aims and Objectives

       Our relational database enhancement and augmentation approach is important in three
respects, namely:

    1) by holding the additional defining information in the database itself, this information is
      usable by any design tool in addition to assisting the full automation of any future re-
      engineering of the same database;
    2) it allows better user understanding of database applications, as the associated constraints
      are shown in addition to the traditional entities and attributes at the conceptual level;




     3) the process which assists a database administrator to clean inconsistent legacy data ensures a
       safe migration. Performing this latter task in a real-world situation without an automated support
       tool is a very difficult, tedious, time-consuming and error-prone task.

        Therefore the main aim of this project has been the design and development of a tool to
assist database enhancement and migration in a heterogeneous distributed relational database
environment. Such a system is concerned with enhancing the constituent databases in this type of
environment to exploit potential knowledge both to automate the re-engineering process and to
assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data
and/or delays in the migration process. To this end, the following detailed aims and objectives
have been pursued in our research:

1. Investigation of the problems inherent in schema enhancement and migration for a
heterogeneous distributed relational legacy database environment, in order to fully understand
these processes.

2. Identification of the conceptual foundation on which to successfully base the design and
development of a tool for this purpose. This foundation includes:

    • A framework to establish meta-data representation and manipulation.
    • A real world data modelling framework that facilitates the enhancement of existing working
      systems and which supports applications during migration.
    • A framework to retain the enhanced knowledge for future use which is in line with current
      international standards and techniques used in newer versions of relational DBMSs.
    • Exploiting existing databases in new ways, particularly linking them with data held in other
      legacy systems or more modern systems.
    • Displaying the structure of databases in a graphical form to make it easy for users to
      comprehend their contents.
    • The provision of an interactive graphical response when enhancements are made to a
      database.
    • A higher level of data abstraction for tasks associated with visualising the contents,
      relationships and behavioural properties of entities and constraints.
    • Determining the constraints on the information held and the extent to which the data
      conforms to these constraints.
    • Integrating with other tools to maximise the benefits of the new tool to the user community.

3. Development of a prototype tool to automate the re-engineering process and the migration
assisting tasks as far as possible. The following development aims have been chosen for this
system:

    • It should provide a realistic solution to the schema enhancement and migration assistance
      process.
    • It should be able to access and perform this task for legacy database systems.
    • It should be suitable for the data model at which it is targeted.
    • It should be as generic as possible so that it can be easily customised for other data models.
    • It should be able to retain the enhanced knowledge for future analysis by itself and other tools.
    • It should logically support a model using modern data modelling techniques irrespective of
      whether it is supported by the DBMS in use.
    • It should make extensive use of modern graphical user interface facilities for all graphical
      displays of the database schema.
    • Graphical displays should also be as generic as possible so that they can be easily enhanced or
      customised for other display methods.




CHAPTER 3
                        Database Technology, Relational Model,
                     Conceptual Modelling and Integrity Constraints

The origins and historical development of database technology are initially presented here to trace
the evolution of ISs and the emergence of database models. The relational data model is identified as
currently the most commonly used database model, and some terminology for this data model, along
with its features, including query languages, is then presented. A discussion of conceptual data
models with special emphasis on EER and OMT is provided to introduce these data models and the
symbols used in our project. Finally, we pay attention to crucial concepts relating to our work,
namely the notion of semantic integrity constraints, with special emphasis on those used in semantic
extensions to the relational model. The relational database language SQL is also discussed,
identifying how and when it supports the implementation of these semantic integrity constraints.

3.1 Origins and Historical Developments

        The origin of data management goes back to the 1950's and hence this section is subdivided
into two parts: the first part describes database technology prior to the relational data model, and the
second part describes developments since. This division was chosen as the relational model is
currently the most dominant database model for information management [DAT90].

       3.1.1 Database Technology Prior to the Relational Data Model

        Database technology emerged from the need to manipulate large collections of data for
frequently used data queries and reports. The first major step in mechanisation of information
systems came with the advent of punched card machines which worked sequentially on fixed-length
fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems
were used to perform these tasks with an increase in user efficiency. These systems used sequential
processing of files in batch mode, which was adequate until peripheral storage with random access
capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to
support real-time processing in computer systems.

        Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM)
[BRA82, MCF91] were used to assist with the storage and location of physical records in stored
files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage
application files, making the application program dependent on the organisation of the file. This
technique caused data redundancy as several files were used in systems to hold the same data (e.g.
emp_name and address in a payroll file; insured_name and address in an insurance file; and
depositors_name and address in a bank file). These stored data files used in the applications of the
1960's are now referred to as conventional file systems, and they were maintained using third
generation programming languages such as COBOL and PL/1. This evolution of mechanised
information systems was influenced by the hardware and software developments which occurred in
the 1950’s and early 1960’s. Most long existing legacy ISs are based on this technology. Our work
does not address this type of IS as they do not use a DBMS for their data management.

       The evolution of databases and database management systems [CHA76, FRY76, SIB76,
SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the
main deficiencies in the use of files, i.e. by reducing data redundancy and making application
programs less dependent on file organisation. An important factor in this evolution was the
development of data definition languages which allowed the description of a database to be
separated from its application programs. This facility allowed the data definition (often called a
schema) to be shared and integrated to provide a wide variety of information to the users. The
repository of all data definitions (meta data) is called data dictionaries and their use allows data
definitions to be shared and widely available to the user community.

        In the late 1960's applications began to share their data files using an integrated layer of
stored data descriptions, making the first true database, e.g. the IMS hierarchical database [MCG77,
DAT90]. This type of database was navigational in nature and applications explicitly followed the
physical organisation of records in files to locate data using commands such as GNP - get next under
parent. These databases provided centralised storage management, transaction management,
recovery facilities in the event of failure and system maintained access paths. These were the typical
characteristics of early DBMSs.

        Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This
resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal
introduction of the network model along with its data manipulation commands [DBTG71]. The
relational model was proposed during the same period [COD70], followed by the 3 level
ANSI/SPARC architecture [ANSI75] which made databases more independent of applications, and
became a standard for the organisation of DBMSs. Three popular types of commercial database
systems7 classified by their underlying data model emerged during the 70s [DAT90, ELM94],
namely:

         • hierarchical
         • network
         • relational

and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s.

         3.1.2 Database Technology Since the Relational Data Model

        At the same time as the relational data model appeared, database systems introduced another
layer of data description on top of the navigational functionality of the early hierarchical and
network models to bring extra logical data independence8. The relational model also introduced the
use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many
relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle,
were in use. Owing to their growing maturity in the mid 80s, and to the complexity of programming,
navigating and changing data structures in the older DBMS data models, the relational data model
was able to take over the commercial database market, with the result that it is now dominant.



   7 Other types, such as flat file and inverted file systems, were also used.
   8 This allows changes to the logical structure of data without changing the application programs.


       The advent of inexpensive and reliable communication between computer systems, through
the development of national and international networks, has brought further changes in the design of
these systems. These developments led to the introduction of distributed databases, where a
processor uses data at several locations and links it as though it were at a single site. This technology
has led to distributed DBMSs and the need for interoperability among different database systems
[OZS91, BEL92].

        Several shortcomings of the relational model have been identified, including its inability to
support compute-intensive applications such as simulation efficiently, to cope with computer-aided
design (CAD) and programming language environments, and to represent and manipulate effectively
concepts such as [KIM90]:

       • Complex nested entities (e.g. design and engineering objects),
       • Unstructured data (e.g. images, textual documents),
       • Generalisation and aggregation within a data structure,
       • The notion of time and versioning of objects and schemas,
       • Long duration transactions.

        The notion of a conceptual schema for application-independent modelling introduced by the
ANSI/SPARC architecture led to another data model, namely: the semantic model. One of the most
successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include
entities, relationships, value sets and attributes. These concepts are used in traditional database
design as they are application-independent. Many modelling concepts based on variants/extensions
to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship
model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the
Object Modelling Technique (OMT) [RUM91] are the most popular of these.

        The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also
semantic models. They capture a richer set of semantic relationships among real-world entities in a
database than the E-R based models. Semantic relationships such as generalisation / specialisation
between a superclass and its subclass, the aggregation relationship between a class and its attributes,
the instance-of relationship between an instance and its class, the part-of relationship between
objects forming a composite object, and the version-of relationship between abstracted versioned
objects are semantic extensions supported in these models. The object-oriented data model with its
notions of class hierarchy, class-composition hierarchy (for nested objects) and methods could be
regarded as a subset of this type of semantic data model in terms of its modelling power, except for
the fact that the semantic data model lacks the notion of methods [KIM90] which is an important
aspect of the object-oriented model.

       The relational model of data and the relational query language have been extended [ROW87]
to allow modelling and manipulation of additional semantic relationships and database facilities.
These extensions include data abstraction, encapsulation, object identity, composite objects, class
hierarchies, rules and procedures. However, these extended relational systems are still being
evolved to fully incorporate features such as implementation of domain and extended data types,
enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate
rows in tables and views, handling missing information by supporting four-valued predicate logic
(i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet
available as commercial products.

        The early 1990's saw the emergence of new database systems by a natural evolution of
database technology, with many relational database systems being extended and other data models
(e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened
opportunities to use databases for a greater diversity of applications which had not been previously
exploited as they were not perceived as tractable by a database approach (e.g. image, medical,
document management, engineering design and multi-media information, used in complex
information processing applications such as office automation (OA), computer-aided design (CAD),
computer-aided manufacturing (CAM) and hyper media [KIM90, ZDO90, CAT94]). The object-
oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result
two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented
DBMS, and the extended relational DBMS.

        There are two styles of O-O DBMS, depending on whether they have evolved from
extensions to an O-O programming language or from extensions to a database model. Extensions have been
created for two database models, namely: the relational and the functional models. The extensions to
existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have
O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have
produced PROBE and OODAPLEX. The approach of extending O-O programming language
systems with database management features has resulted in many systems (e.g. Smalltalk into
GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2).
References to these systems with additional information and references can be found in [CAT94].

       Research is currently taking place into other kinds of database such as active, deductive and
expert database systems [DAT90]. This thesis focuses on the relational model and possible
extensions to it which can represent semantics in existing relational database information systems in
such a way that these systems can be viewed in new ways and easily prepared for migration to more
modern database environments.

3.2 Relational Data Model

        In this section we introduce some of the commonly used terminology of the relational model.
This is followed by a selective description of the features and query languages of this model. Further
details of this data model can be found in most introductory database text books, e.g. [MCF91,
ROB93, ELM94, DAT95].

        A relation is represented as a table (entity) in which each row represents a tuple (record), the
number of columns being the degree of the relation and the number of rows being its cardinality. An
example of this representation is shown in figure 3.1, which shows a relation holding Student details,
with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique
identity for a table column of a given schema is achieved via its table name and column name. The
columns of a table are called attributes (fields) each having its own domain (data type) representing
its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to
define the domains of attributes. Constraints may be enforced to further restrict the pool of legal



Page 31
Chapter 3              Database Technology, Relational Model, Conceptual Modelling and Integrity
Constraints
values for an attribute. Tables which actually hold data are called base tables to distinguish them
from view tables which can be used for viewing data associated with one or more base tables. A
view table can also be an abstraction from a single base table which is used to control access to parts
of the data.
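
        For instance, a view giving access to only part of the Student relation of figure 3.1 might be
defined as follows; this is an illustrative definition only:

            -- A view exposing only the students based in Cardiff
            CREATE VIEW cardiff_students (sno, name) AS
            SELECT sno, name
            FROM   student
            WHERE  address = 'Cardiff';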

        A column or set of columns whose values uniquely identify a row of a relation is called a
candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a
primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key
attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation.
Additional constraints may be imposed on an attribute to further restrict its legal values. In such
cases, there should be a common set of legal values satisfying all the constraints of that attribute,
ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first
character of SNO is ‘S’ further restricts the possible values of SNO - see figure 3.1. Many other
concepts and constraints are associated with the relational model, although most of them are not
supported by early relational systems, nor indeed by some of the more recent relational systems (e.g. a
value set constraint for the Address field as shown in figure 3.1).

        [Figure 3.1: The Student relation - a table of degree 3 (attributes SNO, Name and Address)
and cardinality 5 (tuples S1 Jones Cardiff, S2 Smith Bristol, S3 Gray Swansea, S4 Brown Cardiff,
S5 Jones Newport). The annotations mark SNO as the primary key (unique values) drawn from a
character domain, a pattern constraint requiring all SNO values to begin with ‘S’, and a value set
constraint on the Address field.]
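
        A possible SQL definition of this relation, expressing the primary key, pattern and value set
constraints of figure 3.1, is sketched below; it assumes a DBMS supporting SQL-3 style check
constraints, and the value set shown is purely illustrative:

            CREATE TABLE student (
                sno      CHAR(6)  NOT NULL PRIMARY KEY,            -- primary key: unique values
                name     CHAR(30),
                address  CHAR(40),
                CONSTRAINT sno_pattern CHECK (sno LIKE 'S%'),      -- pattern constraint
                CONSTRAINT address_set CHECK (address IN
                    ('Cardiff', 'Bristol', 'Swansea', 'Newport'))  -- value set constraint
            );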


       3.2.1 Requisite Features of the Relational Model

        During the early stages of the development of relational database systems, many requisite
features were identified which a comprehensive relational system should have [KIM79, DAT90].
We shall now examine these features to illustrate the kind of features expected from early relational
database management systems. They included support for:

      • Recovery from both soft and hard crashes,
      • A report generator for formatted display of the results of queries,
      • An efficient optimiser to meet the response-time requirements of users,
      • User views of the stored database,


Page 32
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases
Assisting Migration and Evolution of Relational Legacy Databases

More Related Content

What's hot

Topic1 Understanding Distributed Information Systems
Topic1 Understanding Distributed Information SystemsTopic1 Understanding Distributed Information Systems
Topic1 Understanding Distributed Information Systemssanjoysanyal
 
16 & 2 marks in i unit for PG PAWSN
16 & 2 marks in i unit for PG PAWSN16 & 2 marks in i unit for PG PAWSN
16 & 2 marks in i unit for PG PAWSNDhaya kanthavel
 
Case Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile DatabasesCase Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile DatabasesG. Habib Uddin Khan
 
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...Tal Lavian Ph.D.
 
Case4 lego embracing change by combining bi with flexible information system 2
Case4  lego embracing change by combining bi with flexible  information system 2Case4  lego embracing change by combining bi with flexible  information system 2
Case4 lego embracing change by combining bi with flexible information system 2dyadelm
 
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEMLEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEMmyteratak
 
Pmit 6102-14-lec1-intro
Pmit 6102-14-lec1-introPmit 6102-14-lec1-intro
Pmit 6102-14-lec1-introJesmin Rahaman
 
0321210255 ch01
0321210255 ch010321210255 ch01
0321210255 ch01MsKamala
 
Toward Cloud Computing: Security and Performance
Toward Cloud Computing: Security and PerformanceToward Cloud Computing: Security and Performance
Toward Cloud Computing: Security and Performanceijccsa
 
A database management system
A database management systemA database management system
A database management systemghulam120
 
Lesson - 02 Network Design and Management
Lesson - 02 Network Design and ManagementLesson - 02 Network Design and Management
Lesson - 02 Network Design and ManagementAngel G Diaz
 
Distributed Systems - Information Technology
Distributed Systems - Information TechnologyDistributed Systems - Information Technology
Distributed Systems - Information TechnologySagar Mehta
 
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTUREA HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTUREijccsa
 

What's hot (15)

Topic1 Understanding Distributed Information Systems
Topic1 Understanding Distributed Information SystemsTopic1 Understanding Distributed Information Systems
Topic1 Understanding Distributed Information Systems
 
16 & 2 marks in i unit for PG PAWSN
16 & 2 marks in i unit for PG PAWSN16 & 2 marks in i unit for PG PAWSN
16 & 2 marks in i unit for PG PAWSN
 
Case Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile DatabasesCase Study: Synchroniztion Issues in Mobile Databases
Case Study: Synchroniztion Issues in Mobile Databases
 
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
DWDM-RAM: a data intensive Grid service architecture enabled by dynamic optic...
 
Case4 lego embracing change by combining bi with flexible information system 2
Case4  lego embracing change by combining bi with flexible  information system 2Case4  lego embracing change by combining bi with flexible  information system 2
Case4 lego embracing change by combining bi with flexible information system 2
 
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEMLEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
LEGO EMBRACING CHANGE BY COMBINING BI WITH FLEXIBLE INFORMATION SYSTEM
 
Pmit 6102-14-lec1-intro
Pmit 6102-14-lec1-introPmit 6102-14-lec1-intro
Pmit 6102-14-lec1-intro
 
0321210255 ch01
0321210255 ch010321210255 ch01
0321210255 ch01
 
Toward Cloud Computing: Security and Performance
Toward Cloud Computing: Security and PerformanceToward Cloud Computing: Security and Performance
Toward Cloud Computing: Security and Performance
 
A database management system
A database management systemA database management system
A database management system
 
Fs2510501055
Fs2510501055Fs2510501055
Fs2510501055
 
Lesson - 02 Network Design and Management
Lesson - 02 Network Design and ManagementLesson - 02 Network Design and Management
Lesson - 02 Network Design and Management
 
Distributed Systems - Information Technology
Distributed Systems - Information TechnologyDistributed Systems - Information Technology
Distributed Systems - Information Technology
 
Case study 9
Case study 9Case study 9
Case study 9
 
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTUREA HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
A HEALTH RESEARCH COLLABORATION CLOUD ARCHITECTURE
 

Assisting Migration and Evolution of Relational Legacy Databases

  • 1. Assisting Migration and Evolution of Relational Legacy Databases by G.N. Wikramanayake Department of Computer Science, University of Wales Cardiff, Cardiff September 1996
  • 3. Abstract The research work reported here is concerned with enhancing and preparing databases with limited DBMS capability for migration to keep up with current database technology. In particular, we have addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in a migration process. Special attention has been paid to the case where the legacy database service lacks the specification, representation and enforcement of integrity constraints. We have shown how knowledge constraints of modern DBMS capabilities can be incorporated into these systems to ensure that when migrated they can benefit from the current database technology. To this end, we have developed a prototype conceptual constraint visualisation and enhancement system (CCVES) to automate as efficiently as possible the process of re-engineering for a heterogeneous distributed database environment, thereby assisting the global system user in preparing their heterogeneous database systems for a graceful migration. Our prototype system has been developed using a knowledge based approach to support the representation and manipulation of structural and semantic information about schemas that the re-engineering and migration process requires. It has a graphical user interface, including graphical visualisation of schemas with constraints using user preferred modelling techniques for the convenience of the user. The system has been implemented using meta-programming technology because of the proven power and flexibility that this technology offers to this type of research applications. The important contributions resulting from our research includes extending the benefits of meta- programming technology to the very important application area of evolution and migration of heterogeneous legacy databases. In addition, we have provided an extension to various relational database systems to enable them to overcome their limitations in the representation of meta-data. These extensions contribute towards the automation of the reverse-engineering process of legacy databases, while allowing the user to analyse them using extended database modelling concepts. Page v
  • 4. CHAPTER 1 Introduction This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the research undertaken. Section 1.2 presents the broad goals of the research. The original achievements which have resulted from the research are summarised in Section 1.3. Finally, the overall organisation of the thesis is described in Section 1.4. 1.1 Background and Motivations of the Research Over the years rapid technological changes have taken place in all fields of computing. Most of these changes have been due to the advances in data communications, computer hardware and software [CAM89] which together have provided a reliable and powerful networking environment (i.e. standard local and wide area networks) that allow the management of data stored in computing facilities at many nodes of the network [BLI92]. These changes have turned round the hardware technology from centralised mainframes to networked file-server and client-server architectures [KHO92] which support various ways to use and share data. Modern computers are much more powerful than the previous generations and perform business tasks at a much faster rate by using their increased processing power [CAM88, CAM89]. Simultaneous developments in the software industry have produced techniques (e.g. for system design and development) and products capable of utilising the new hardware resources (e.g. multi-user environments with GUIs). These new developments are being used for a wide variety of applications, including modern distributed information processing applications, such as office automation where users can create and use databases with forms and reports with minimal effort, compared to the development efforts using 3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology [ELM94, DAT95] as this field too has advanced by allowing users to represent and manipulate advanced forms of data and their functionalities. Due to the program data independence feature of DBMSs the maintenance of database application programs has become easier as functionalities that were traditionally performed by procedural application routines are now supported declaratively using database concepts such as constraints and rules. In the field of databases, the recent advances resulting from technological transformation include many areas such as the use of distributed database technology [OZS91, BEL92], object- oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems [MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older technology was dealing with files and primitive database systems which now appear inflexible, as the technology itself limits them from being adapted to meet the current changing business needs catalysed by newer technologies. The older systems which have been developed using 3GLs and in operation for many years, often suffer from failures, inappropriate functionality, lack of documentation, poor performance and are referred to as legacy information systems [BRO93, COMS94, IEE94, BRO95, IEEE95]. The current technology is much more flexible as it supports methods to evolve (e.g. 4GLs, CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]), and can share resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]). This evolution
  • 5. reflects the changing business needs. However, modern systems need to be properly designed and implemented to benefit from this technology, which may still be unable to prevent such systems themselves being considered to be legacy information systems in the near future due to the advent of the next generation of technology with its own special features. The only salvation would appear to be building in evolution paths in the current systems. The increasing power of computers and their software has meant they have already taken over many day to day functions and are taking over more of these tasks as time passes. Thus computers are managing a larger volume of information in a more efficient manner. Over the years most enterprises have adopted the computerisation option to enable them to efficiently perform their business tasks and to be able to compete with their counterparts. As the performance ability of computers has increased, the enterprises still using early computer technology face serious problems due to the difficulties that are inherent in their legacy systems. This means that new enterprises using systems purely based on the latest technology have an advantage over those which need to continue to use legacy information systems (ISs), as modern ISs have been developed using current technology which provides not only better performance, but also utilises the benefits of improved functionality. Hence, managers of legacy IS enterprises want to retire their legacy code and use modern database management systems (DBMSs) in the latest environment to gain the full benefits from this newer technology. However they want to use this technology on the information and data they already hold as well as on data yet to be captured. They also want to ensure that any attempts to incorporate the modern technology will not adversely affect the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and migrated to a modern environment in such a way that the migration is transparent to the current users. The theme of this thesis is how we can support this form of system evolution. 1.1.1 The Barriers to Legacy Information System Migration Legacy ISs are usually those systems that have stood the test of time and have become a core service component for a business’s information needs. These systems are a mix of hardware and software, sometimes proprietary, often out of date, and built to earlier styles of design, implementation and operation. Although they were productive and fulfilled their original performance criteria and their requirements, these systems lack the ability to change and evolve. The following can be seen as barriers to evolution in legacy IS [IEE94]. • The technology used to build and maintain the legacy IS is obsolete, • The system is unable to reflect changes in the business world and to support new needs, • The system cannot integrate with other sub-systems, • The cost, time and risk involved in producing new alternative systems to the legacy IS. The risk factor is that a new system may not provide the full functionality of the current system for a period because of teething problems. Due to these barriers, large organisations [PHI94] prefer to write independent sub-systems to perform new tasks using modern technology which will run alongside the existing systems, rather than attempt to achieve this by adapting existing code or by writing a new system that replaces the old and has new facilities as well. 
We see the following immediate advantages of this low-risk approach. Page 4
  • 6. • The performance, reliability and functionality of the existing system is not affected, • New applications can take advantage of the latest technology, • There is no need to retrain those staff who only need the facilities of the old system. However with this approach, as business requirements evolve with time, more and more new needs arise, resulting in the development and regular use of many diverse systems within the same organisation. Hence, in the long term the above advantages are overshadowed by the more serious disadvantages of this approach, such as: • The existing systems continue to exist and are legacy IS running on older and older technology, • The need to maintain many different systems to perform similar tasks increases the maintenance and support costs of the organisation, • Data becomes duplicated in different systems which implies the maintenance of redundant data with its associated increased risk of inconsistency between the data copies if updating occurs, • The overall maintenance cost for hardware, software and support personnel increases as many platforms are being supported, • The performance of the integrated information functions of the organisation decreases due to the need to interface many disparate systems. To address the above issues, legacy ISs need to be evolved and migrated to new computing environments, when their owning organisation upgrades. This migration should occur within a reasonable time after the upgrade occurs. This means that it is necessary to migrate legacy ISs to new target environments in order to allow the organisation to dispose of the technology which is becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400 emulators for IBM S/360 and ICL’s DME emulators for 1900 and System 4 users). An alternative strategy is achieved by translating [SHA93, PHI94, SHE94, BRO95] the software to run in new environments (i.e. code-to-code level translation). The emulator approach perpetuates all the software deficiencies of the legacy ISs although successfully removing the old-fashioned hardware technology and so it does enjoy the increased processing power of the new hardware. The translation approach takes advantage of some of the modern technological benefits in the target environment as the conversions - such as IBM’s JCL and ICL’s SCL code to Unix shell scripts, Assembler to COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS tables - are also done as part of the translation process. This approach, although a step forward, still carries over most of the legacy code as legacy systems are not evolved by this process. For example, the basic design is not changed. Hence the barrier to change and/or integration to a common sub- system still remains, and the translated systems were not designed for the environment they are now running in, so they may not be compatible with it. There are other approaches to overcoming this problem which have been used by enterprises [SHA93, BRO95]. These include re-implementing systems under the new environment and/or upgrading existing systems to achieve performance improvements. As computer technology continues to evolve at an ever quicker pace the need to migrate arises more rapidly. This means, most small organisations and individuals are left behind and are forced to work in a technologically Page 5
  • 7. obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or upgrading existing software, as this process involves time and manpower which cost money. The gap between the older and newer system users will very soon create a barrier to information sharing unless some tools are developed to assist the older technology users’ migration to new technology environments. This assistance for the older technology users may take many forms, including tools for: analysing and understanding existing systems; enhancing and modifying existing systems; migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to consider these requirements and many other aspects, as recently identified by Brodie and Stonebraker in [BRO95]. Our work was primarily motivated by these business oriented legacy database issues and by work in the area of extending relational database technology to enable it to represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second consideration is an important aspect of legacy system migration, since if a graceful migration is to be achieved we must be able to enhance a legacy relational database with such knowledge to take full advantage of the new system environment. 1.1.2 Heterogeneous Distributed Environments As well as the problem of having to use legacy ISs, most large enterprises are faced with the problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises due to the increased use of different computer systems and software tools for information processing within an organisation as time passes. The development of networking capabilities to manage and share information stored over a network has made interoperability a requirement and local area networks finding broad acceptance in business enterprises has enhanced the need to perform this task within organisations. Network file servers, client-server technology and the use of distributed databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is currently being used to create and process information held in heterogeneous databases, which involves linking different databases in an interoperable environment. An aspect of this work is legacy database interoperation, since as time passes these databases will have been built using different generations of software. In recent years, the demand for distributed database capabilities has been fuelled mostly by the decentralisation of business functions in large organisations to address customer needs, and by mergers and acquisitions that have taken place in the corporate world. As a consequence, there is a strong requirement among enterprises for the ability to cross-correlate data stored in different existing heterogeneous databases. This has led to the development of products referred to as gateways, to enable users to link different databases together, e.g. Microsoft’s Open Database Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases together [COL94, RIC94]. There are similar products for other database vendors, such as Oracle1 [HOL93] and others2 [PUR93, SME93, RIC94, BRO95]. Database vendors have targetted cross- platform compatibility via SQL access protocols to support interoperability in a heterogeneous environment. As heterogeneity in distributed systems may occur in various forms ranging from 1 For IBM’s DB2, UNISYS’s DMS, DEC RMS. 
2 For INGRES, SYBASE, Informix and other popular SQL DBMSs. 3 During the life-time of this project the SQL-3 standards moved from a preliminary draft, through several modifications before being finalised in 1995. Page 6
  • 8. different hardware platforms, operating systems, networking protocols and local database systems, cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed database access. The biggest challenge comes in addressing heterogeneity due to differences in local databases [OZS91, BEL92]. This challenge is also addressed in the design and development of our system. Distributed DBMSs have become increasingly popular in organisations as they offer the ability to interconnect existing databases, as well as having many other advantages [OZS91, BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely: homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent nodes run the same DBMS and the databases can be designed in harmony with each other. This simplifies both the processing of queries at different nodes and the passing of data between nodes. In heterogeneous systems the situation is more complex, as each node can be running a different DBMS and the constituent databases can be designed independently. This is the normal situation when we are linking legacy databases, as the DBMS and the databases used are more likely to be heterogeneous since they are usually implemented for different platforms during different technological eras. In such a distributed database environment, heterogeneity may occur in various forms, at different levels [OZS91, BEL92], namely:
• The logical level (i.e. involving different database designs),
• The data management level (i.e. involving different data models),
• The physical level (i.e. involving different hardware, operating systems and network protocols), and
• At all three or any pair of these levels.
1.1.3 The Problems and Search for a Solution
The concept of heterogeneity itself is valuable as it allows designers a freedom of choice between different systems and design approaches, thus enabling them to identify those most suitable for different applications. The exploitation of this freedom over the years in many organisations has resulted in the creation of multiple local and remote information systems which now need to be made interoperable to provide an efficient and effective information service to the enterprise managers. Open database connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed to support interoperability among databases managed by different DBMSs. Database vendors such as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These products allow limited data transfer and query facilities among databases to support interoperability among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous database access, still do not provide a solution for legacy ISs where a primary concern is to evolve and migrate the system to a target environment so that obsolete support systems can be retired. Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable of accessing older generation DBMSs, and, if they are, are unlikely to be able to enhance them to take advantage of the newer technologies. Hence there is a need to create tools that will allow ODBC-equivalent functionality for older generation DBMSs. Our work provides such functionality for all the DBMSs we have chosen for this research.
It also provides the ability to enhance and evolve legacy databases. Page 7
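To make this form of heterogeneity concrete, the following sketch shows the same request expressed for two of the DBMS families discussed above: INGRES's QUEL and an SQL-based system such as Oracle. It is purely illustrative; the table and column names (employee, dept, name) are hypothetical and are not drawn from the test databases described later in the thesis. Bridging exactly this kind of difference is one of the internal mappings (QUEL to SQL and vice-versa) handled by CCVES, as described in Chapter 7.

    -- The same retrieval under two heterogeneous DBMSs (illustrative names only)
    --
    -- INGRES QUEL form:
    --   range of e is employee
    --   retrieve (e.name) where e.dept = "Sales"
    --
    -- Equivalent SQL form for an SQL-based DBMS such as Oracle:
    SELECT name
    FROM   employee
    WHERE  dept = 'Sales';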
  • 9. In order to evolve an information system, one needs to understand the existing system’s structure and code. Most legacy information systems are not properly documented and hence understanding such systems is a complex process. This means that changing any legacy code involves a high risk as it could result in unexpected system behaviour. Therefore one needs to analyse and understand existing system code before performing any changes to the system. Database system design and implementation tools have appeared recently which have the aim of helping new information system development. Reverse and re-engineering tools are also appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some of these tools allow the examination of databases built using certain types of DBMSs, however, the enhancements they allow are done within the limitation of that system. Due to continuous ongoing technology changes, most current commercial DBMSs do not support the most recent software modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features). Hence a system built using current software tools is guaranteed to become a legacy system in the near future (i.e. when new products with newer techniques and features begin to appear in the commercial market place). Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an existing database and hence they are an ideal starting point when trying to gain a comprehensive understanding of the information held in the database and its current state, as they create a visual picture of that state. However, in legacy systems the schemas are basic, since most of the information used to compose a conceptual model is not available in these databases. Information such as constraints that show links between entities is usually embedded in the legacy application code and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues while assisting in overcoming this barrier within the knowledge representation limitations of existing DBMSs. 1.1.4 Primary and Secondary Motivations The research reported in this thesis therefore was primarily promoted by the need to provide, for a logically heterogeneous distributed database environment, a design tool that allows users not only to understand their existing systems but also to enhance and visualise an existing database’s structure using new techniques that are either not yet present in existing systems or not supported by the existing software environment. It was also motivated by: a) Its direct applicability in the business world, as the new technique can be applied to incrementally enhance existing systems and prepare them to be easily migrated to new target environments, hence avoiding continued use of legacy information systems in the organisation. Although previous work and some design tools address the issue of legacy information system analysis, evolution and migration, these are mainly concerned with 3GL languages such as COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model [CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or extended relational environments in a graceful migration from a relational system. There has been Page 8
  • 10. some work in the related areas of identifying extended entity relationship structures in relational schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94]. b) The lack of previous research in visualising pre-existing heterogeneous database schemas and evolving them by enhancing them with modern concepts supported in more recent releases of software. Most design tools [COMP90, SHA93] which have been developed to assist in Entity- Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling [RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in developing new systems. However, relatively few tools attempt to support a bottom-up approach (i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT diagrams. Among these tools only a very few allow enhancement of the pre-existing database schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those which do permit this action to some extent, always operate on a single database management system and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools that permit only the bottom-up approach are referred to as reverse-engineering tools and those which support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is primarily concerned with creating re-engineering tools that assist legacy database migration. The commercially available re-engineering tools are customised for particular DBMSs and are not easily usable in a heterogeneous environment. This barrier against widespread usability of re- engineering tools means that a substantial adaptation and reprogramming effort (costing time and money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious example that reflects this limitation arises in a heterogeneous distributed database environment where there may be a need to visualise each participant database’s schema. In such an environment if the heterogeneity occurs at the database management level (where each node uses a different DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we have to use two different re-engineering tools to display these schemas. This situation is exacerbated for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy databases are migrated to different DBMS environments as newer versions and better database products have appeared since the original release of their DBMS. This means that a re-engineering tool that assists legacy database migration must work in an heterogeneous environment so that its use will not be restricted to particular types of ISs. Existing re-engineering tools provide a single target graphical data model (usually the E-R model or a variant of it), which may differ in presentation style between tools and therefore inhibits the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed database environment. This limitation means that users may need to use different tools to provide the required uniformity of display in such an environment. The ability to visualise the conceptual model of an information system using a user-preferred graphical data model is important as it ensures that no inaccurate enhancements are made to the system due to any misinterpretation of graphical notations used. 
c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent legacy data, as preparation for migration or as an enhancement of the database’s quality. Page 9
  • 11. The inability to define and apply rules and constraints on early database systems due to system limitations resulted in them not using constraints to increase the accuracy and consistency of the data held by these systems. This limitation is now a barrier to information system migration as a new target DBMS is unable to enforce constraints on a migrated database until all violations are investigated and resolved either by omitting the violating data or by cleaning it. This investigation may also show that a constraint has to be adjusted as the violating data is needed by the organisation. The enhancement of such a system by rules and constraints provides knowledge that is usable to determine possible data violations. The process of detecting constraint violations may be done by applying queries that are generated from these enhanced constraints. Similar methods have been used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional answers [FON92, MOT89]. This is essential as constraints may have been implemented at the application coding level and that can lead to their inconsistent application.
d) An awareness of the potential contribution that knowledge-based systems and meta-programming technologies, in association with extended relational database technology, have to offer in coping with semantic heterogeneity.
The successful production of a conceptual model is highly dependent on the semantic information available, and on the ability to reason about these semantics. A knowledge-based system can be used to assist in this task, as the process to generalise effective exploitation of semantic information for pre-existing heterogeneous databases needs to undergo three sub-processes, namely: knowledge acquisition, representation and manipulation. The knowledge acquisition process extracts the existing knowledge from a database’s data dictionaries. This knowledge may include subsequent enhancements made by the user, as the use of a database to store such knowledge will provide easy access to this information along with its original knowledge. The knowledge representation process represents existing and enhanced knowledge. The knowledge manipulation process is concerned with deriving new knowledge and ensuring consistency of existing knowledge. These stages are addressable using specific processes. For instance, the reverse-engineering process used to produce a conceptual model can be used to perform the knowledge acquisition task. Then the derived and enhanced knowledge can be stored in the same database by adopting a process that will allow us to distinguish this knowledge from its original meta-data. Finally, knowledge manipulation can be done with the assistance of a Prolog based system [GRA88], while data and knowledge consistency can be verified using the query language of the database.
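As a purely illustrative sketch of the two points above, an enhanced referential constraint might be recorded as ordinary meta-data in the database being re-engineered, and a violation-detection query might then be generated mechanically from that record. The table and column names used (x_enhanced_constraints, enrolment, student, student_id) are hypothetical; they do not reproduce the SQL-3 conforming layout or the query generation process defined later in the thesis.

    -- 1. Hold the enhanced knowledge alongside, but distinguishable from, the
    --    DBMS's own data dictionary (illustrative layout only).
    CREATE TABLE x_enhanced_constraints (
        constraint_name  VARCHAR(32),
        child_table      VARCHAR(32),
        child_column     VARCHAR(32),
        parent_table     VARCHAR(32),
        parent_column    VARCHAR(32)
    );
    INSERT INTO x_enhanced_constraints
    VALUES ('fk_enrolment_student', 'enrolment', 'student_id', 'student', 'student_id');

    -- 2. Query generated from that specification to detect violating legacy data:
    --    enrolment rows that refer to no existing student.
    SELECT e.*
    FROM   enrolment e
    WHERE  e.student_id IS NOT NULL
      AND  NOT EXISTS (SELECT 1
                       FROM   student s
                       WHERE  s.student_id = e.student_id);

Rows returned by such a query must either be cleaned or the constraint itself adjusted before the constraint can be enforced in the target DBMS.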
1.2 Goals of the Research
The broad goals of the research reported in this thesis are highlighted here, with detailed aims and objectives presented in section 2.4. These goals are to investigate interoperability problems, schema enhancement and migration in a heterogeneous distributed database environment, with particular emphasis on extended relational systems. This should provide a basis for the design and implementation of a prototype software system that brings together new techniques from the areas of knowledge-based systems, meta-programming and O-O conceptual data modelling with the aim of facilitating schema enhancement, by means of generalising the efficient representation of constraints using the current standards. Such a system is a tool that would be a valuable asset in a logically heterogeneous distributed extended relational database environment as it would make it possible for Page 10
  • 12. global users to incrementally enhance legacy information systems. This offers the potential for users in this type of environment to work in terms of such a global schema, through which they can prepare their legacy systems to easily migrate to target environments and so gain the benefits of modern computer technology. 1.3 Original Achievements of the Research The importance of this research lies in establishing the feasibility of enhancing, cleaning and migrating heterogeneous legacy databases using meta-programming technology, knowledge-based system technology, database system technology and O-O conceptual data modelling concepts, to create a comprehensive set of techniques and methods that form an efficient and useful generalised database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring are also demonstrated and assessed. A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES) [WIK95a] has been developed as a result of the research. To be more specific, our work has made four important contributions to progress in the database topic area of Computer Science: 1) CCVES is the first system to bring the benefits of meta-programming technology to the very important application area of enhancing and evolving heterogeneous distributed legacy databases to assist the legacy database migration process [GRA94, WIK95c]. 2) CCVES is also the first system to enhance existing databases with constraints to improve their visual presentation and hence provide a better understanding of existing applications [WIK95b]. This process is applicable to any relational database application, including those which are unable to naturally support the specification and enforcement of constraints. More importantly, this process does not affect the performance of an existing application. 3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for knowledge representation in our research. This project provides an extension to the representation of the relational data model to cope with automated reuse of knowledge in the re- engineering process. In order to cope with technological changes that result from the emergence of new systems or new versions of existing DBMSs, we also propose a series of extended relational system tables conforming to SQL-3 standards to enhance existing relational DBMSs [WIK95b]. 4) The generation of queries using the constraint specifications of the enhanced legacy systems is an easy and convenient method of detecting any constraint violating data in existing systems. The application of this technique in the context of a heterogeneous environment for legacy information systems is a significant step towards detecting and cleaning inconsistent data in legacy systems prior to their migration. This is essential if a graceful migration is to be effected [WIK95c]. 1.4 Organisation of the Thesis Page 11
The thesis is organised into 8 chapters. This first chapter has given an introduction to the research done, covering background and motivations, and outlining original achievements. The rest of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and objectives for the work undertaken. It begins by identifying the scope of the work in terms of research constraints and development technologies. This is followed by an overview of the research undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous distributed database environment is given. Finally, detailed aims and objectives are drawn together to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents its development along with its terminology, features and query languages. This is followed by a discussion of conceptual data models with special emphasis on the data models and symbols used in our project. Finally, we pay attention to key concepts related to our project, mainly the notion of semantic integrity constraints and extensions to the relational model. Here, we present important integrity constraint extensions to the relational model and its support using different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences with an introduction to legacy and our target information systems. This is followed by migration strategies and methods for such ISs. Finally, we conclude by referring to current techniques and identify the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational legacy database, is described next. This is followed by a process for detecting possible keys and structures of legacy databases. Our schema enhancement and knowledge representation techniques are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy heterogeneous distributed database environment and its access processes. Initially, we present the design of our test databases, the selection of our test DBMSs and the prototype system environment. This is followed by the application of our re-engineering approach to our test databases. Finally, the organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our conceptual constraint visualisation and enhancement system (CCVES) in terms of the design, structure and operation of its interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from INGRES QUEL to SQL and vice-versa, and internal database migration processes are presented in detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could be made to the system. A discussion of potential applications is presented. Finally we conclude the
chapter by drawing conclusions about the research project as a whole.
CHAPTER 2

Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the scope of the project. Secondly, an overview of the research approach we have adopted in dealing with heterogeneous distributed legacy database evolution and migration is given in section 2.2. Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a heterogeneous distributed database environment using the existing meta-programming technology developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in section 2.4, illustrating what we intend to achieve, and the benefits expected from achieving the stated aims.

2.1 Scope of the Project

We identify the scope of the work in terms of research constraints and the limitations of current development technologies. An overview of the problem is presented along with the drawbacks and limitations of database software development technology in addressing the problem. This will assist in identifying our interests and focussing the issues to be addressed.

2.1.1 Overview of the Problem

In most database designs, a conceptual design and modelling technique is used in developing the specifications at the user requirements and analysis stage of the design. This stage usually describes the real world in terms of object/entity types that are related to one another in various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray the current information content of existing databases, as the original designs are usually either lost, or inappropriate because the database has evolved from its original design. The resulting pictorial representation of a database can be used for database maintenance, for database re-design, for database enhancement, for database integration or for database migration, as it gives its users a sound understanding of an existing database's architecture and contents.

Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture and presentation of database definitions from an existing database, and the analysis and display of this information at a higher level of abstraction. Furthermore, these tools are either restricted to accessing a specific database management system's databases or permit modelling with only a single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a need to cater for multiple database platforms and different user needs, allowing access to the set of databases comprising a heterogeneous database and providing a facility to visualise databases using a preferred conceptual modelling technique which is familiar to the different user communities of the heterogeneous system.

The fundamental modelling constructs of current reverse and re-engineering tools are entities, relationships and associated attributes. These constructs are useful for database design at
a high level of abstraction. However, the semantic information now available in the form of rules and constraints in modern DBMSs provides their users with a better understanding of the underlying database, as its data conforms to these constraints. This may not necessarily be true for legacy systems, which may have constraints defined that were not enforced. The ability to visualise rules and constraints as part of the conceptual model increases user understanding of a database. Users could also exploit this information to formulate queries that more effectively utilise the information held in a database. Having these features in mind, we concentrated on providing a tool that permits specification and visualisation of constraints as part of the graphical display of the conceptual model of a database. With modern technology increasing the number of legacy systems and with increasing awareness of the need to use legacy data [BRO95, IEEE95], the availability of such a visualisation tool will be more important in future, as it will let users see the full definition of the contents of their databases in a familiar format.

Three types of abstraction mechanism, namely: classification, aggregation and generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not maintain sufficient meta-data information to assist in identifying all these abstraction mechanisms within their data models. This means that reverse and re-engineering tools are semi-automated, in that they extract information, but users have to guide them and decide what information to look for [WAT94]. This requires interactions with the database designer in order to obtain missing information and to resolve possible conflicts. Such additional information is supplied by the tool users when performing the reverse-engineering process. As this additional information is not retained in the database, it must be re-entered every time a reverse engineering process is undertaken if the full representation is to be achieved. To overcome this problem, knowledge bases are being used to retain this information when it is supplied. However, this approach restricts the use of this knowledge by other tools which may exist in the database's environment. The ability to hold this knowledge in the database itself would enhance an existing database with information that can be widely used. This would be particularly useful in the context of legacy databases as it would enrich their semantics. One of the issues considered in this thesis is how this can be achieved.

Most existing relational database applications record only entities and their properties (i.e. attribute names and data types) as system meta-data. This is because these systems conformed to early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5 and Oracle version 5). However, more recent relational systems record additional information such as constraint and rule definitions, as they conform to the SQL/92 standards [ANSI92] (e.g. Oracle version 7). This additional information includes, for example, primary and foreign key specifications, and can be used to identify classification and aggregation abstractions used in a conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies.
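To make this concrete, the sketch below shows the kind of SQL/92 declarations a more recent system can record in its catalogue; the table and column names are illustrative only. A reverse-engineering tool can read the primary keys as evidence of classification (entity types) and the foreign key as evidence of an aggregation (relationship) between them, whereas nothing in this syntax records that, say, one table is a specialisation of another.

```sql
-- Illustrative SQL/92 definitions (hypothetical tables).
CREATE TABLE department (
    dno   CHAR(4)     NOT NULL,
    dname VARCHAR(30) NOT NULL,
    PRIMARY KEY (dno)
);

CREATE TABLE employee (
    eno   CHAR(6)     NOT NULL,
    ename VARCHAR(30) NOT NULL,
    dno   CHAR(4),
    PRIMARY KEY (eno),
    FOREIGN KEY (dno) REFERENCES department (dno)
);
-- The FOREIGN KEY clause lets a tool infer an employee-department
-- relationship; a generalisation such as "manager IS-A employee"
-- cannot be declared in SQL/92 and must be supplied as extra knowledge.
```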
This means that early relational database applications are now legacy systems as they fail to naturally represent additional information such as constraint and rule definitions. Such legacy database systems are being migrated to modern database systems not only to gain the benefits of the current technology but also to be compatible with new applications built with the modern technology. The SQL standards are currently subject to review to permit the representation of extra knowledge (e.g. object-oriented features), and we have anticipated some of these proposals in our work - i.e. SQL-3 [ISO94] will be adopted by commercial systems and thus the current modern DBMSs
will become legacy databases in the near future, or may already be considered to be legacy databases in that their data model type will have to be mapped onto the newer version. Having observed the development of recent DBMSs, we regard it as inevitable that most current databases will have to be migrated, either to a newer version of the existing DBMS or to a completely different newer technology DBMS, for a variety of reasons. Thus the migration of legacy databases is perceived to be a continuing requirement, in any organisation, as technology advances continue to be made.

Most migrations currently being undertaken are based on code-to-code level translations of the applications and associated databases to enable the older system to be functional in the target environment. Minimal structural changes are made to the original system and database, thus the design structures of these systems are still old-fashioned, although they are running in a modern computing environment. This means that such systems are inflexible and cannot be easily enhanced with new functions or integrated with other applications in their new environment. We have also observed that more recent database systems have often failed to benefit from modern database technology due to inherent design faults that have resulted in the use of unnormalised structures, which cause omission of the features enforcing integrity constraints even when this is possible. The ability to create and use databases without the benefit of a database design course is one reason for such design faults. Hence there is a need to assist existing systems to evolve, not only to perform new tasks but also to improve their structure, so that these systems can maximise the gains they receive from their current technology environment and any environment they migrate to in the future.

2.1.2 Narrowing Down the Problem

Technological advances in both hardware and software have improved the performance and maintenance functionality of all information systems (ISs), and as a result, older ISs suffer from comparatively poor performance and inappropriate functionality when compared with more modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been around for many years, and run on old-fashioned mainframes. Problems associated with legacy systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95]. These systems basically have three functional components, namely: interface, application and a database service, which are sometimes inter-related to each other, depending on how they were used during the design and implementation stages of the IS development. This means that the complexity of a legacy IS depends on what occurred during the design and implementation of the system. These systems may range from a simple single-user database application using separate interfaces and applications, to a complex multi-purpose unstructured application.

Due to the complex nature of the problem area we do not address this issue as a whole, but focus only on problems associated with one sub-component of such legacy information systems, namely the database service. This in itself is a wide field, and we have further restricted ourselves to legacy ISs using a specific DBMS for their database service. We considered data models ranging from original flat file and relational systems, to modern relational DBMSs and object-oriented DBMSs.
From these data models we have chosen the traditional relational model for the following reasons.
• The relational model is currently the most widely used database model.
• During the last two decades the relational model has been the most popular model; therefore it has been used to develop many database applications and most of these are now legacy systems.
• There have been many extensions and variations of the relational model, which has resulted in many heterogeneous relational database systems being used in organisations.
• The relational model can be enhanced to represent additional semantics currently supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

As most business requirements change with time, the need to enhance and migrate legacy information systems exists for almost every organisation. We address problems faced by these users while seeking a solution that prevents new systems becoming legacy systems in the near future. The selection of the relational model as our database service to demonstrate how one could achieve these needs means that we shall be addressing only relational legacy database systems and not looking at any other type of legacy information system. This decision means we are not considering the many common legacy IS migration problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-file structures or hierarchical databases into modern extended relational databases; migration of legacy applications with millions of lines of code written in some COBOL-like language into a modern 4GL/GUI environment). However, as shown later, addressing the problems associated with relational legacy databases has enabled us to identify and solve problems associated with more recent DBMSs, and it also assists in identifying precautions which, if implemented by designers of new systems, will minimise the chance of similar problems being faced by these systems as IS developments occur in the future.

2.2 Overview of the Research Approach

Having presented an overview of our problem and narrowed it down, we identify the following as the main functionalities that should be provided to fulfil our research goal:
• Reverse-engineering of a relational legacy database to fully portray its current information content.
• Enhancing a legacy database with new knowledge to identify modelling concepts that should be available to the database concerned or to applications using that database.
• Determining the extent to which the legacy database conforms to its existing and enhanced descriptions.
• Ensuring that the migrated IS will not become a legacy IS in the future.

We need to consider the heterogeneity issue in order to be able to reverse-engineer any given relational legacy database. Three levels of heterogeneity are present for a particular data model, namely: the physical, logical and data management levels. The physical level of heterogeneity usually arises due to different data model implementation techniques, use of different computer platforms and use of different DBMSs. The physical / logical data independence of DBMSs hides implementation differences from users, hence we need only address how to access databases that are built using different DBMSs, running on different computer platforms.
Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the different DBMSs conform to a particular standard (e.g. SQL/86 or SQL/92), which supports a particular database query language (e.g. SQL or QUEL) and different relational data model features (e.g. handling of integrity constraints and availability of object-oriented features). To tackle heterogeneity at the logical level, we need to be aware of different standards, and to model ISs supporting different features and query languages. Heterogeneity at the data management level arises due to the physical limitations of a DBMS, differences in the logical design and inconsistencies that occurred when populating the database.

Logical differences in different database schemas have to be resolved only if we are going to integrate them. The schema integration process is concerned with merging different related database applications. Such a facility can assist the migration of heterogeneous database systems. However, any attempt to integrate legacy database schemas prior to the migration process complicates the entire process, as it is similar to attempting to provide new functionalities within the system which is being migrated. Such attempts increase the chance of failure of the overall migration process. Hence we consider any integration or enhancements in the form of new functionalities only after successfully migrating the original legacy IS. However, the physical limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand to ensure a successful migration.

Our work addresses the heterogeneity issues associated with database migration by adopting an approach that allows its users to incrementally increase the number of DBMSs it can handle without having to reprogram its main application modules. Here, the user needs to supply specific knowledge about DBMS schema and query language constructs. This is held together with the knowledge of the DBMSs already supported and has no effect on the application's main processing modules.

2.2.1 Meta-Programming

Meta-programming technology allows the meta-data (schema information) of a database to be held and processed independently of its source specification language. This allows us to work in a database-language-independent environment and hence overcome many logical heterogeneity issues. Prolog based meta-programming technology has been used in previous research at Cardiff in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation of database query languages [HOW87] and database schemas [RAM91] has been performed. This work has shown how the heterogeneity issues of different DBMSs can be addressed without having to reprogram the same functionality for each and every DBMS.

We use meta-programming technology for our legacy database migration approach as we need to be able to start with a legacy source database and end with a modern target database, where the respective database schema and query languages may be different from each other. In this approach the source database schema or query language is mapped on input into an internal canonical form. All the required processing is then done using the information held in this internal form. This information is finally mapped to the target schema or query language to produce the desired output. The advantage of this approach is that processing is not affected by heterogeneity, as it is always performed on data held in the canonical form.
This canonical form is an enriched collection of semantic data modelling features.
2.2.2 Application

We view our migration approach as consisting of a series of stages, with the final stage being the actual migration and earlier stages being preparatory. At stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2 of figure 2.1). However, in legacy systems much of the information needed to present the database schema in this way is not available as part of the database meta-data, and hence these links which are present in the database cannot be shown in this conceptual model. In modern systems such links can be identified using constraint specifications. Thus, if the database does not have any explicit constraints, or it does but these are incomplete, new knowledge about the database needs to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will identify new links that should be present for the database concerned. These new database constraints can next be applied experimentally to the legacy database to determine the extent to which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1). The user can then decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration. At this point the three preparatory stages in the application of our approach are complete. The actual migration process is then performed. All stages are further described below to enable us to identify the main processing components of our proposed system, as well as to explain how we deal with different levels of heterogeneity.

Stage 1: Reverse Engineering

In stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display of the database. To perform this task, the database's meta-data must be extracted (cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database. The accessed meta-data needs to be represented using our internal form. This is achieved through a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos [RAM91]. The meta-data in our internal formalism then needs to be processed to derive the graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These constructs are in the form of entity types and relationships, and their derivation process is the main processing component in stage 1. The identified graphical constructs are mapped to a display description language to produce a graphical display of the database.
[Figure 2.1: Information flow in the 3 stages of our approach prior to migration. The figure shows information flowing between the heterogeneous databases, an internal processing component and a schema visualisation with constraints (EER or OMT), via routes A-1/A-2 for Stage 1 (Reverse Engineering), routes B-1/B-2/B-3 for Stage 2 (Knowledge Augmentation, adding enhanced constraints) and routes C-1/C-2 for Stage 3 (Constraint Enforcement).]

a) Database connectivity for heterogeneous database access

Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which addressed heterogeneity at the logical and data management levels, our system looks at the physical level as well. While these previous systems processed schemas in textual form and did not access actual databases to extract their DDL specification, our system addresses physical heterogeneity by accessing databases running on different hardware / software platforms (e.g. computer systems, operating systems, DBMSs and network protocols). Our aim is to directly access the meta-data of a given database application by specifying its name, the name and version of the host DBMS, and the address of the host machine (we assume that access privileges for this host machine and DBMS have been granted). If this database access process can produce a description of the database in DDL formalism, then this textual file is used as the starting point for the meta-translation process as in previous Cardiff systems [RAM91, QUT92]. We found that it is not essential to produce such a textual file, as the required intermediate representation can be directly produced by the database access process. This means that we could also by-pass the meta-translation process that performs the analysis of the DDL text to translate it into the intermediate representation (a list of tokens ready for syntactic analysis in the parsing phase is produced and processed based on the BNF syntax specification of the DDL [QUT92]). However, the DDL formalism of the schema can be used for optional textual viewing and could also serve as the starting point for other tools developed at Cardiff for meta-programming database applications, e.g. the Schema Meta-Integration System (SMIS) of Qutaishat [QUT92].

The initial functionality of the Stage 1 database connectivity process is to access a heterogeneous database and supply the accessed meta-data as input to our schema meta-translator (SMTS). This module needs to deal with heterogeneity at the physical and data management levels. We achieve this by using DML commands of the specific DBMS to extract the required meta-data held in database data dictionaries, which are treated like user defined tables.
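For instance, in an Oracle 7 database the data dictionary can be read with ordinary SELECT statements, in the same way as user tables. The queries below are illustrative of this style of catalogue access rather than the exact statements issued by our system.

```sql
-- Column definitions recorded for one application table.
SELECT column_name, data_type, nullable
FROM   user_tab_columns
WHERE  table_name = 'STUDENT';

-- Declared constraints on the same table
-- ('P' = primary key, 'R' = referential / foreign key, 'C' = check).
SELECT constraint_name, constraint_type
FROM   user_constraints
WHERE  table_name = 'STUDENT';
```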
Relatively recently, the functionalities of a heterogeneous database access process have been provided by means of drivers such as ODBC [RIC94]. Use of such drivers will allow access to any database supported by them and hence obviate the need to develop specialised tools for each database type, as happened in our case. These driver products were not available when we undertook this stage of our work.

b) Schema meta-translation

The schema meta-translation process [RAM91] accepts input of any database schema irrespective of its DDL and features. The information captured during this process is represented internally to enable it to be mapped from one database schema to another, or to be further processed and supplied to other modules such as the schema meta-visualisation system (SMVS) [QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal canonical form for meta representation has successfully accommodated heterogeneity at the data management and logical levels.

c) Schema meta-visualisation

Schema visualisation using graphical notation and diagrams has proved to be an important step in a number of applications, e.g. during the initial stages of the database design process; for database maintenance; for database re-design; for database enhancement; for database integration; or for database migration; as it gives users a sound understanding of an existing database's structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual picture of their database structure instead of textual descriptions of the defining schema, as it is easier for them to comprehend a picture. This has led to the production of graphical representations of schema information, effected by a reverse engineering process. Graphical data models of schemas employ a set of data modelling concepts and a language-independent graphical notation (e.g. the Entity Relationship (E-R) model [CHE76], the Extended/Enhanced Entity Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a heterogeneous environment different users may prefer different graphical models, and an understanding of the database structure and architecture beyond that given by the traditional entities and their properties. Therefore, there is a need to produce graphical models of a database's schema using different graphical notations, such as either E-R/EER or OMT, and to accompany them with additional information such as a display of the integrity constraints in force in the database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-object constraints and gain a better understanding of domain restrictions applicable to particular entities. Current reverse engineering tools do not support this type of display.

The generated graphical constructs are held internally in a similar form to the meta-data of the database schema. Hence, using a schema meta-visualisation process (SMVS), it is possible to map the internally held graphical constructs into appropriate graphical symbols and coordinates for the graphical display of the schema. This approach has a similarity to the SMTS, the main
difference being that the output is graphical rather than textual.

Stage 2: Knowledge Augmentation

In a heterogeneous distributed database environment, evolution is expected, especially in legacy databases. This evolution can affect the schema description and in particular schema constraints that are not reflected in the stage 1 (path A-2) graphical display, as they may be implicit in applications. Thus our system is designed to accept new constraint specifications (cf. path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that these hidden constraints become explicit. The new knowledge accepted at this point is used to enhance the schema and is retained in the database using a database augmentation process (cf. path B-3 of figure 2.1). The new information is stored in a form that conforms with the enhanced target DBMS's methods of storing such information. This assists the subsequent migration stage.

a) Schema enhancement

Our system needs to permit a database schema to be enhanced by specifying new constraints applicable to the database. This process is performed via the graphical display. These constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural components (e.g. inheritance hierarchies, entity modifications), are specified using a GUI. When they are entered they will appear in the graphical display.

b) Database augmentation

The input data to enhance a schema provides new knowledge about a database. It is essential to retain this knowledge within the database itself, if it is to be readily available for any further processing. Typically, this information is retained in the knowledge base of the tool used to capture the input data, so that it can be reused by the same tool. This approach restricts the use of this knowledge by other tools and hence it must be re-entered every time the re-engineering process is applied to that database. This makes it harder for the user to gain a consistent understanding of an application, as different constraints may be specified during two separate re-engineering processes. To overcome this problem, we augment the database itself using the techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3 structures we store the information in our own augmented table format, which is a natural extension of the SQL-3 approach (a sketch of this idea is given at the end of this subsection). When a database is augmented using this method, the new knowledge is available in the database itself. Hence, any further re-engineering processes need not make requests for the same additional knowledge. The augmented tables are created and maintained in a similar way to user-defined tables, but have a special identification to distinguish them. Their structure is in line with the international standards and the newer versions of commercial DBMSs, so that the enhanced database can be easily migrated to either a newer version of the host DBMS or to a different DBMS supporting the latest SQL standards. Migration should then mean that the newer system can enforce the constraints. Our approach should also mean that it is easy to map our tables for holding this information into the representation used by the target DBMS even if it is different, as we are mapping from a well defined structure.

Legacy databases that do not support explicit constraints can be enhanced by using the above knowledge augmentation method. This requirement is less likely to occur for databases managed by more recent DBMSs, as they already hold some constraint specification information in their system tables. The direction taken by Oracle version 6 was a step towards our augmentation approach, as it allowed the database administrator to specify integrity constraints such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of Oracle, i.e. version 7, implemented this constraint enforcement process.
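The following sketch conveys the flavour of such an augmented table for a legacy DBMS that cannot store constraint definitions in its own catalogue. The table name, its columns and the recorded constraint are invented for illustration; they do not reproduce the exact augmented-table layout defined later in the thesis.

```sql
-- Hypothetical augmented table, created alongside the user tables and
-- specially identified, recording enhanced constraints that the legacy
-- DBMS cannot yet represent or enforce.
CREATE TABLE ccves_constraints (
    table_name      CHAR(32)  NOT NULL,
    constraint_name CHAR(32)  NOT NULL,
    constraint_type CHAR(12)  NOT NULL,   -- e.g. PRIMARY KEY, FOREIGN KEY, CHECK
    definition      CHAR(240) NOT NULL    -- SQL-3 style text of the constraint
);

INSERT INTO ccves_constraints
VALUES ('ENROLMENT', 'ENROL_FK1', 'FOREIGN KEY',
        'FOREIGN KEY (sno) REFERENCES student (sno)');
```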
Stage 3: Constraint Enforcement

The enhanced schema can be held in the database, but the DBMS can only enforce these constraints if it has the capability to do so. This will not normally be the case in legacy systems. In this situation, the new constraints may be enforced via a newer version of the DBMS or by migrating the database to another DBMS supporting constraint enforcement. However, the data being held in the database may not conform to the new constraints, and hence existing data may be rejected by the target DBMS in the migration, thus losing data and / or delaying the migration process. To address this problem and to assist the migration process, we provide an optional constraint enforcement process module which can be applied to a database before it is migrated. The objective of this process is to give users the facility to ensure that the database conforms to all the enhanced constraints before migration occurs. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy data prior to its migration, whether it is best left as it stands, or whether the new constraints are too severe. The constraint definitions in the augmented schema are employed to perform this task. As all constraints held have already been internally represented in the form of logical expressions, these can be used to produce data manipulation statements suitable for the host DBMS. Once these statements are produced, they are executed against the current database to identify the existence of data violating a constraint.
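To illustrate the kind of data manipulation statements this module produces, the queries below check two enhanced constraints against hypothetical tables; they are sketches of the approach rather than the literal statements generated by CCVES.

```sql
-- Rows that would violate a proposed primary key on student(sno):
SELECT sno, COUNT(*)
FROM   student
GROUP  BY sno
HAVING COUNT(*) > 1;

-- Rows that would violate a proposed check constraint
-- CHECK (mark BETWEEN 0 AND 100) on an enrolment table:
SELECT *
FROM   enrolment
WHERE  NOT (mark BETWEEN 0 AND 100);
```

Any rows returned identify data that must be cleaned, or constraints that must be relaxed, before the constraints are enforced in the target DBMS.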
Stage 4: Migration Process

The migration process itself is incrementally performed by initially creating the target database and then copying the legacy data over to it. The schema meta-translation (SMTS) technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can be copied using the import / export tools of the source and target DBMSs or DML statements of the respective DBMSs. During this process, the legacy applications must continue to function until they too are migrated. To achieve this an interface can be used to capture and process all database queries of the legacy applications during migration. This interface can decide how to process database queries against the current state of the migration and re-direct those newly related to the target database. The query meta-translation (QMTS) technique of Howells [HOW87] can be used to convert these queries to the target DML. This approach will facilitate transparent migration for legacy databases. Our work does not involve the development of an interface to capture and process all database queries, as interaction with the query interface of the legacy IS is embedded in the legacy application code. However, we demonstrate how to create and populate a legacy database schema in the desired target environment while showing the role of SMTS and QMTS in such a process.

2.3 The Role of CCVES in Context of Heterogeneous Distributed Databases

Our approach described in section 2.2 is based on preparing a legacy database schema for graceful migration. This involves visualisation of database schemas with constraints and enhancing them with constraints to capture more knowledge. Hence we call our system the Conceptual Constraint Visualisation and Enhancement System (CCVES). CCVES has been developed to fit in with the previously developed schema (SMTS) [RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta-visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of CCVES, SMTS, QMTS and SMVS during heterogeneous distributed database access in a uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and promotes interoperability in a heterogeneous environment at the logical, physical and data management levels.

Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous distributed databases. It outlines in general terms the process of accessing a remote (legacy) database to perform various database tasks, such as querying, visualisation, enhancement, migration and integration. There are seven sub-processes: the schema mapping process [RAM91], query mapping process [HOW87], schema integration process [QUT92], schema visualisation process [QUT93], database connectivity process, database enhancement process and database migration process. The first two processes together have been called the Integrated Translation Support Environment [FID92], and the first four processes together have been called the Meta-Integration/Translation Support Environment [QUT92]. The last three processes were introduced as CCVES to perform database enhancement and migration in such an environment.

The schema mapping process, referred to as SMTS, translates the definition of a source schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles heterogeneity at the logical level in a distributed environment containing multiple database schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates the local schemas to create the global schema. The meta-visualisation process, referred to as SMVS, generates a graphical representation of a schema. The remaining three processes, namely database connectivity, enhancement and migration, together with their associated processes SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES (centre section of figure 2.2).

The database connectivity process (DBC) queries meta-data from a remote database (route
A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for further processing (route A-3 in figure 2.2), which results in the production of a graphical view of the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to identify entity and relationship types to be used in the graphical model. Meta-knowledge enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in figure 2.2), which allows the definition of new constraints and changes to the existing schema. These enhancements are reflected in the graphical view (route B-2 and B-3 in figure 2.2) and may be used to augment the database (route B-4 to B-8 in figure 2.2). This approach to augmentation makes use of the query mapping process, referred to as QMTS, to generate the required queries to update the database via the DBC process. At this stage any existing or enhanced constraints may be applied to the database to determine the extent to which it conforms to the new enhancements. Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS due to possible violations. Finally, the database migration process, referred to as DBMI, assists migration by incrementally migrating the database to the target environment (route C-1 to C-6 in figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and DDL statements are issued to the target DBMS to create the new database schema. The data for these migrated tables are extracted by instructing the source DBMS to export the source data to the target database via QMTS. Here too, the queries which implement this export are issued to the DBMS via the DBC process.

2.4 Research Aims and Objectives

Our relational database enhancement and augmentation approach is important in three respects, namely:

1) by holding the additional defining information in the database itself, this information is usable by any design tool, in addition to assisting the full automation of any future re-engineering of the same database;

2) it allows better user understanding of database applications, as the associated constraints are shown in addition to the traditional entities and attributes at the conceptual level;
3) the process which assists a database administrator to clean inconsistent legacy data ensures a safe migration.

Performing this latter task in a real world situation without an automated support tool is very difficult, tedious, time consuming and error prone. Therefore the main aim of this project has been the design and development of a tool to assist database enhancement and migration in a heterogeneous distributed relational database environment. Such a system is concerned with enhancing the constituent databases in this type of environment to exploit potential knowledge, both to automate the re-engineering process and to assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data and/or delays in the migration process. To this end, the following detailed aims and objectives have been pursued in our research:

1. Investigation of the problems inherent in schema enhancement and migration for a heterogeneous distributed relational legacy database environment, in order to fully understand these processes.

2. Identification of the conceptual foundation on which to successfully base the design and development of a tool for this purpose. This foundation includes:
• A framework to establish meta-data representation and manipulation.
• A real world data modelling framework that facilitates the enhancement of existing working systems and which supports applications during migration.
• A framework to retain the enhanced knowledge for future use which is in line with current international standards and techniques used in newer versions of relational DBMSs.
• Exploiting existing databases in new ways, particularly linking them with data held in other legacy systems or more modern systems.
• Displaying the structure of databases in a graphical form to make it easy for users to comprehend their contents.
• The provision of an interactive graphical response when enhancements are made to a database.
• A higher level of data abstraction for tasks associated with visualising the contents, relationships and behavioural properties of entities and constraints.
• Determining the constraints on the information held and the extent to which the data conforms to these constraints.
• Integrating with other tools to maximise the benefits of the new tool to the user community.

3. Development of a prototype tool to automate the re-engineering process and the migration assisting tasks as far as possible. The following development aims have been chosen for this system:
• It should provide a realistic solution to the schema enhancement and migration assistance process.
• It should be able to access and perform this task for legacy database systems.
• It should be suitable for the data model at which it is targeted.
• It should be as generic as possible so that it can be easily customised for other data models.
• It should be able to retain the enhanced knowledge for future analysis by itself and other
tools.
• It should logically support a model using modern data modelling techniques irrespective of whether it is supported by the DBMS in use.
• It should make extensive use of modern graphical user interface facilities for all graphical displays of the database schema.
• Graphical displays should also be as generic as possible so that they can be easily enhanced or customised for other display methods.
CHAPTER 3

Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

The origins and historical development of database technology are initially presented here to set in context the evolution of ISs and the emergence of database models. The relational data model is identified as currently the most commonly used database model, and some terminology for this data model, along with its features including query languages, is then presented. A discussion of conceptual data models with special emphasis on EER and OMT is provided to introduce these data models and the symbols used in our project. Finally, we pay attention to crucial concepts relating to our work, namely the notion of semantic integrity constraints, with special emphasis on those used in semantic extensions to the relational model. The relational database language SQL is also discussed, identifying how and when it supports the implementation of these semantic integrity constraints.

3.1 Origins and Historical Developments

The origin of data management goes back to the 1950's and hence this section is subdivided into two parts: the first part describes database technology prior to the relational data model, and the second part describes developments since. This division was chosen as the relational model is currently the most dominant database model for information management [DAT90].

3.1.1 Database Technology Prior to the Relational Data Model

Database technology emerged from the need to manipulate large collections of data for frequently used data queries and reports. The first major step in the mechanisation of information systems came with the advent of punched card machines which worked sequentially on fixed-length fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems were used to perform these tasks with an increase in user efficiency. These systems used sequential processing of files in batch mode, which was adequate until peripheral storage with random access capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to support real-time processing in computer systems. Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM) [BRA82, MCF91] were used to assist with the storage and location of physical records in stored files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage application files, making the application program dependent on the organisation of the file. This technique caused data redundancy, as several files were used in systems to hold the same data (e.g. emp_name and address in a payroll file; insured_name and address in an insurance file; and depositors_name and address in a bank file). These stored data files used in the applications of the 1960's are now referred to as conventional file systems, and they were maintained using third generation programming languages such as COBOL and PL/1. This evolution of mechanised information systems was influenced by the hardware and software developments which occurred in the 1950's and early 1960's. Most long existing legacy ISs are based on this technology. Our work does not address this type of IS as they do not use a DBMS for their data management.

The evolution of databases and database management systems [CHA76, FRY76, SIB76,
SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the main deficiencies in the use of files, i.e. by reducing data redundancy and making application programs less dependent on file organisation. An important factor in this evolution was the development of data definition languages, which allowed the description of a database to be separated from its application programs. This facility allowed the data definition (often called a schema) to be shared and integrated to provide a wide variety of information to the users. The repository of all data definitions (meta-data) is called a data dictionary, and its use allows data definitions to be shared and made widely available to the user community.

In the late 1960's applications began to share their data files using an integrated layer of stored data descriptions, making the first true database, e.g. the IMS hierarchical database [MCG77, DAT90]. This type of database was navigational in nature and applications explicitly followed the physical organisation of records in files to locate data, using commands such as GNP - get next under parent. These databases provided centralised storage management, transaction management, recovery facilities in the event of failure and system maintained access paths. These were the typical characteristics of early DBMSs. Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal introduction of the network model along with its data manipulation commands [DBTG71]. The relational model was proposed during the same period [COD70], followed by the 3 level ANSI/SPARC architecture [ANSI75], which made databases more independent of applications and became a standard for the organisation of DBMSs. Three popular types of commercial database systems, classified by their underlying data model, emerged during the 70s [DAT90, ELM94] (other types, such as flat file and inverted file systems, were also used), namely:
• hierarchical
• network
• relational
and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s.

3.1.2 Database Technology Since the Relational Data Model

At the same time as the relational data model appeared, database systems introduced another layer of data description on top of the navigational functionality of the early hierarchical and network models to bring extra logical data independence, i.e. allowing changes to the logical structure of data without changing the application programs. The relational model also introduced the use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle, were in use and, due to their growing maturity in the mid 80s and the complexity of programming, navigating, and changing data structures in the older DBMS data models, the relational data model was able to take over the commercial database market, with the result that it is now dominant.
The advent of inexpensive and reliable communication between computer systems, through the development of national and international networks, has brought further changes in the design of these systems. These developments led to the introduction of distributed databases, where a processor uses data at several locations and links it as though it were at a single site. This technology has led to distributed DBMSs and the need for interoperability among different database systems [OZS91, BEL92].

Several shortcomings of the relational model have been identified, including its inability to perform efficiently compute-intensive applications such as simulation, to cope with computer-aided design (CAD) and programming language environments, and to represent and manipulate effectively concepts such as [KIM90]:
• Complex nested entities (e.g. design and engineering objects),
• Unstructured data (e.g. images, textual documents),
• Generalisation and aggregation within a data structure,
• The notion of time and versioning of objects and schemas,
• Long duration transactions.

The notion of a conceptual schema for application-independent modelling introduced by the ANSI/SPARC architecture led to another data model, namely: the semantic model. One of the most successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include entities, relationships, value sets and attributes. These concepts are used in traditional database design as they are application-independent. Many modelling concepts based on variants/extensions to the E-R model have appeared since Chen's paper. The enhanced/extended entity-relationship model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the Object Modelling Technique (OMT) [RUM91] are the most popular of these. The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also semantic models. They capture a richer set of semantic relationships among real-world entities in a database than the E-R based models. Semantic relationships such as generalisation / specialisation between a superclass and its subclass, the aggregation relationship between a class and its attributes, the instance-of relationship between an instance and its class, the part-of relationship between objects forming a composite object, and the version-of relationship between abstracted versioned objects are semantic extensions supported in these models. The object-oriented data model, with its notions of class hierarchy, class-composition hierarchy (for nested objects) and methods, could be regarded as a subset of this type of semantic data model in terms of its modelling power, except for the fact that the semantic data model lacks the notion of methods [KIM90], which is an important aspect of the object-oriented model.

The relational model of data and the relational query language have been extended [ROW87] to allow modelling and manipulation of additional semantic relationships and database facilities. These extensions include data abstraction, encapsulation, object identity, composite objects, class hierarchies, rules and procedures.
However, these extended relational systems are still being evolved to fully incorporate features such as implementation of domain and extended data types, enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate rows in tables and views, handling missing information by supporting four-valued predicate logic
(i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet available as commercial products. The early 1990's saw the emergence of new database systems by a natural evolution of database technology, with many relational database systems being extended and other data models (e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened opportunities to use databases for a greater diversity of applications which had not been previously exploited, as they were not perceived as tractable by a database approach (e.g. image, medical, document management, engineering design and multi-media information, used in complex information processing applications such as office automation (OA), computer-aided design (CAD), computer-aided manufacturing (CAM) and hypermedia [KIM90, ZDO90, CAT94]). The object-oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented DBMS, and the extended relational DBMS.

There are two styles of O-O DBMS, depending on whether they have evolved from extensions to an O-O programming language or by evolving a database model. Extensions have been created for two database models, namely: the relational and the functional models. The extensions to existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have produced PROBE and OODAPLEX. The approach of extending O-O programming language systems with database management features has resulted in many systems (e.g. Smalltalk into GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2). References to these systems, with additional information, can be found in [CAT94]. Research is currently taking place into other kinds of database such as active, deductive and expert database systems [DAT90].

This thesis focuses on the relational model and possible extensions to it which can represent semantics in existing relational database information systems in such a way that these systems can be viewed in new ways and easily prepared for migration to more modern database environments.

3.2 Relational Data Model

In this section we introduce some of the commonly used terminology of the relational model. This is followed by a selective description of the features and query languages of this model. Further details of this data model can be found in most introductory database text books, e.g. [MCF91, ROB93, ELM94, DAT95].

A relation is represented as a table (entity) in which each row represents a tuple (record), the number of columns being the degree of the relation and the number of rows being its cardinality. An example of this representation is shown in figure 3.1, which shows a relation holding Student details, with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique identity for a table column of a given schema is achieved via its table name and column name. The columns of a table are called attributes (fields), each having its own domain (data type) representing its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to define the domains of attributes.
Constraints may be enforced to further restrict the pool of legal values for an attribute. Tables which actually hold data are called base tables, to distinguish them from view tables which can be used for viewing data associated with one or more base tables. A view table can also be an abstraction from a single base table which is used to control access to parts of the data. A column or set of columns whose values uniquely identify a row of a relation is called a candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation. Additional constraints may be imposed on an attribute to further restrict its legal values. In such cases, there should be a common set of legal values satisfying all the constraints of that attribute, ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first character of SNO is 'S' further restricts the possible values of SNO - see figure 3.1. Many other concepts and constraints are associated with the relational model, although most of them are not supported by early relational systems, nor indeed by some of the more recent relational systems (e.g. a value set constraint for the Address field as shown in figure 3.1).

    Student
    SNO   Name    Address
    S1    Jones   Cardiff
    S2    Smith   Bristol
    S3    Gray    Swansea
    S4    Brown   Cardiff
    S5    Jones   Newport

Figure 3.1: The Student relation. SNO is the primary key (unique values), drawn from a character domain and subject to a pattern constraint (all values begin with 'S'); the Address field is subject to a value set constraint. The three attributes give the relation degree 3, and its five tuples give cardinality 5.
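A modern relational DBMS would let the constraints annotated in figure 3.1 be declared directly in the table definition. The sketch below shows one way this could be written in SQL/92-style syntax; the particular value set chosen for Address is invented for illustration.

```sql
-- Illustrative declaration of the Student relation of figure 3.1.
CREATE TABLE student (
    sno     CHAR(3)     NOT NULL,
    name    VARCHAR(20) NOT NULL,
    address VARCHAR(20),
    PRIMARY KEY (sno),                           -- unique values
    CHECK (sno LIKE 'S%'),                       -- pattern constraint
    CHECK (address IN ('Cardiff', 'Bristol',
                       'Swansea', 'Newport'))    -- value set constraint
);
```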
3.2.1 Requisite Features of the Relational Model

During the early stages of the development of relational database systems there were many requisite features identified which a comprehensive relational system should have [KIM79, DAT90]. We shall now examine these features to illustrate the kind of features expected from early relational database management systems. They included support for:
• Recovery from both soft and hard crashes,
• A report generator for formatted display of the results of queries,
• An efficient optimiser to meet the response-time requirements of users,
• User views of the stored database,