



Information Systems 30 (2005) 492–525
www.elsevier.com/locate/infosys




A generic and customizable framework for the design of ETL scenarios

Panos Vassiliadis (a), Alkis Simitsis (b), Panos Georgantas (b), Manolis Terrovitis (b), Spiros Skiadopoulos (b)

(a) Department of Computer Science, University of Ioannina, Ioannina, Greece
(b) Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece




Abstract

   Extraction–transformation–loading (ETL) tools are pieces of software responsible for the extraction of data from
several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into the
logical design of ETL scenarios and provide a generic and customizable framework in order to support the DW
designer in his task. First, we present a metamodel particularly customized for the definition of ETL activities. We
follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a
subsequent activity. Also, we employ a declarative database programming language, LDL, to define the semantics of
each activity. The metamodel is generic enough to capture any possible ETL activity. Nevertheless, in the pursuit of
higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently
used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this
library of built-ins, we have to deal with specific language issues. Therefore, we also discuss the mechanics of template
instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II,
which is also presented.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Data warehousing; ETL



E-mail addresses: pvassil@cs.uoi.gr (P. Vassiliadis), asimi@dbnet.ece.ntua.gr (A. Simitsis), pgeor@dbnet.ece.ntua.gr (P. Georgantas), mter@dbnet.ece.ntua.gr (M. Terrovitis), spiros@dbnet.ece.ntua.gr (S. Skiadopoulos).

0306-4379/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.is.2004.11.002

1. Introduction

   Data warehouse operational processes normally compose a labor-intensive workflow, involving data extraction, transformation, integration, cleaning and transport. To deal with this workflow, specialized tools are already available in the market [1–4], under the general title Extraction–Transformation–Loading (ETL) tools. To give a general idea of the functionality of these tools we mention their most prominent tasks, which include (a) the identification of relevant information at the source side, (b) the extraction of this information,

(c) the customization and integration of the information coming from multiple sources into a common format, (d) the cleaning of the resulting data set on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

Fig. 1. Different perspectives for an ETL workflow. (Logical perspective: the execution plan, comprising the execution sequence, the execution schedule and the recovery plan, plus the relationship with data, comprising the primary data flow and the data flow for logical exceptions. Physical perspective: the resource layer and the operational layer. Administration plan: monitoring & logging, security & access rights management.)

   If we treat an ETL scenario as a composite workflow, in a traditional way, its designer is obliged to define several of its parameters (Fig. 1). Here, we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled manner. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 1, respectively. At the top of Fig. 1, we are mainly concerned with the static design artifacts for a workflow environment. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 1, and the physical perspective on the right-hand side. At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far). On the right-hand side of Fig. 1, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design entities of the logical perspective in the real world.

In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

   At the lower part of Fig. 1, we are dealing with the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan should be specified, involving the notification of the administrator either on-line (monitoring) or off-line (logging) for the status of an executed activity, as well as the security and authentication management for the ETL environment.

   We find that research has not dealt with the definition of data-centric workflows to the entirety of its extent. In the ETL case, for example, due to the data-centric nature of the process, the designer must deal with the relationship of the involved activities with the underlying data. This involves the definition of a primary data flow that describes the route of data from the sources towards their final destination in the data warehouse, as they pass through the activities of the scenario. Also, due to possible quality problems of the processed data, the designer is obliged to define a data flow for logical exceptions, i.e., a flow for the problematic data, i.e., the rows that violate integrity or business rules. It is the combination of the execution sequence and the data flow that generates the semantics of the ETL workflow: the data flow defines what each activity does and the execution plan defines in which order and combination.

   In this paper, we work on the internals of the data flow of ETL scenarios. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Moreover, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity; nevertheless, reusability and ease-of-use dictate that we can do better in aiding the data warehouse designer in his task. In this pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues: thus, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.

   Our contributions can be listed as follows:

• First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schema, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.
• Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved through a set of ''built-in'' specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as the template layer, and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.
• Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios, based on our model.

   This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work. In Section 6, we make a

general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

2. Generic model of ETL activities

   The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform, graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, is also covered.

2.1. Graphical notation and motivating example

   Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

   Motivating example: To motivate our discussion, we will present an example involving the propagation of data from a certain source S1, towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)¹ DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type. The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.²
2. In the DSA, we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any row that is not newly inserted is rejected and, thus, propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1. (An LDL sketch of the semantics of this and the two subsequent activities is given right after step 5.)

¹ In data warehousing terminology, a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.
² The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operation to a logical model. In terms of logical modelling, this is a simple passing of data from one site to another.



• Data types: black ellipsoids
• Function types: black rectangles
• Constants: black circles
• Attributes: unshaded ellipsoids
• RecordSets: cylinders
• Functions: gray rectangles
• Parameters: white rectangles
• Activities: triangles
• Part-of relationships: simple lines with diamond edges*
• Instance-of relationships: dotted arrows (from the instance towards the type)
• Regulator relationships: dotted lines
• Provider relationships: bold solid arrows (from provider to consumer)
• Derived provider relationships: bold dotted arrows (from provider to consumer)

* We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 2. Graphical notation for the architecture graph.




Fig. 3. Bird's-eye view of the motivating example: the snapshot of S1.PARTSUPP is transferred (FTP_PS1) to DS.PS1_NEW in the DSA; Diff_PS1 compares DS.PS1_NEW with DS.PS1_OLD (rejected rows go to Diff_PS1_REJ); NotNull1 checks the attribute COST (rejected rows go to NotNull1_REJ); Add_Attr1 adds the flag SOURCE = 1; the result is stored in DS.PS1, from which SK1 assigns surrogate keys via the LOOKUP table and populates DW.PARTSUPP in the data warehouse.




3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.
4. Although we consider the data flow for only one source, namely S1, the data warehouse can

clearly have more sources for part supplies. In order to keep track of the source of each row entering into the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1 (steps 2–4 are sketched in LDL right after this list). We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).
5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is a common tactic to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus, they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.
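   As an indication of the intended semantics of steps 2–4, consider the following minimal LDL sketch, written in the style of Fig. 6. The predicate names, the auxiliary view old_pkey and the treatment of nulls as an ordinary constant 'null' are simplifying assumptions made for illustration; these are not the exact rules produced by the template mechanism of Section 3.

    % Step 2: keep the rows of DS.PS1_NEW whose PKEY does not appear
    % in DS.PS1_OLD; route the remaining rows to the rejection output.
    old_pkey(PKEY) <-
        ds_ps1_old(PKEY, DATE, QTY, COST).
    diffPS1_out(PKEY, DATE, QTY, COST) <-
        ds_ps1_new(PKEY, DATE, QTY, COST),
        ~old_pkey(PKEY).
    diffPS1_rej(PKEY, DATE, QTY, COST) <-
        ds_ps1_new(PKEY, DATE, QTY, COST),
        old_pkey(PKEY).

    % Step 3: reject rows whose COST is null.
    notNull1_out(PKEY, DATE, QTY, COST) <-
        diffPS1_out(PKEY, DATE, QTY, COST),
        COST ~= 'null'.
    notNull1_rej(PKEY, DATE, QTY, COST) <-
        diffPS1_out(PKEY, DATE, QTY, COST),
        COST = 'null'.

    % Step 4: add the flag attribute SOURCE (1 stands for source S1).
    addAttr1_out(PKEY, SOURCE, DATE, QTY, COST) <-
        notNull1_out(PKEY, DATE, QTY, COST),
        SOURCE = 1.

Note how each rejection rule is literally the complement of the respective output rule; as we discuss in Section 2.3, this is the default relationship between output and rejection semantics.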
                                                              that is characterized by one or more schemata will
   The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets, with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to ignore them; thus, in this case we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.

   Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

   In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

   Elementary entities: We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

   We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

   A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

   Recordsets: We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset provided that there are ways to logically

restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak), and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai).

   In the rest of this paper we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.
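   As a minimal illustration of the definitions above, in the LDL encoding used throughout this paper (cf. Fig. 6) a relational recordset corresponds to a predicate and its extension to a finite set of facts; the values below are fabricated for the sketch:

    % the recordset DS.PS1_NEW over the schema [PKEY, DATE, QTY, COST];
    % each fact is one record of its extension
    % (dates abbreviated to integers for the sketch)
    ds_ps1_new(101, 20040312, 5, 230).
    ds_ps1_new(102, 20040312, 8, 175).
    ds_ps1_new(103, 20040312, 1, 320).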
   Functions: We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).
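   To sketch the type/instance distinction in the same fact-based style (the bookkeeping predicates functionType, functionTypeParam and functionInstance are illustrative assumptions, not part of the formal model):

    % a function type: its name and return data type, plus one fact
    % per parameter position and its data type
    functionType(surrogate_lookup, integer).
    functionTypeParam(surrogate_lookup, 1, integer).   % production key
    functionTypeParam(surrogate_lookup, 2, integer).   % source
    % a function: an instance of the above function type
    functionInstance(sk1_lookup, surrogate_lookup).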
2.3. Activities

   Activities are the backbone of the structure of any information system. We adopt the WfMC terminology [9] for processes/programs and we will call them activities in the sequel. An activity is an amount of ''work which is processed by a combination of resource and computer applications'' [9]. In our framework, activities are logical abstractions representing parts or full modules of code.

   The execution of an activity is performed by a particular program. Normally, ETL activities will either be performed in a black-box manner by a dedicated tool, or they will be expressed in some language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We employ an abstraction of the source code of an activity, in the form of an LDL statement. Using LDL, we avoid dealing with the peculiarities of a particular programming language. Once again, we want to stress that the presented LDL description is intended to capture the semantics of each activity, instead of the way these activities are actually implemented.

   An elementary activity is formally described by the following elements:

• Name: A unique identifier for the activity.
• Input schemata: A finite set of one or more input schemata that receive data from the data providers of the activity.
• Output schema: A schema that describes the placeholder for the rows that pass the check performed by the elementary activity.
• Rejections schema: A schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.
• Parameter list: A set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.
• Output operational semantics: An LDL statement describing the content passed to the output of the operation, with respect to its input. This LDL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.
• Rejection operational semantics: An LDL statement describing the rejected records, in a sense similar to the output operational semantics. This statement is by default considered to be the complement of the output operational semantics, except if explicitly defined differently.
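   As a concrete instance of this list of elements, the surrogate key activity SK1 of the motivating example can be described as follows (an indicative, informal listing with assumed comment syntax; the exact template-generated form is discussed in Section 3):

    % Elementary activity SK1, element by element:
    %   Name:              SK1
    %   Input schemata:    IN1 = [PKEY, DATE, QTY, COST, SOURCE]
    %   Output schema:     OUT = [PKEY, DATE, QTY, COST, SOURCE, SKEY]
    %   Rejections schema: not annotated in the scenario,
    %                      so a NULL schema is implied
    %   Parameter list:    (PKEY, IN1.PKEY), (SOURCE, IN1.SOURCE),
    %                      (LPKEY, LOOKUP.PKEY), (LSOURCE, LOOKUP.SOURCE),
    %                      (LSKEY, LOOKUP.SKEY)
    %   Output semantics:  the addSkey_out rule of Fig. 6
    %   Rejection semantics: its complement (the default)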
   There are two issues that we would like to elaborate on here:

• NULL schemata: Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema

(involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity, but they will simply be ignored.
• Language issues: Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic-programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

   In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

• Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.
• Instance-of relationships. These relationships are defined among a data/function type and its instances.
• Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.
• Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.
• Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

   In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5 where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.

   Data types and instance-of relationships: To capture typing information on attributes and

Fig. 4. Instance-of relationships of the architecture graph.

functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

   Attributes and part-of relationships: The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR or REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

   Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are also incorporated in the architecture graph. Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer


Fig. 5. Part-of, regulator and provider relationships of the architecture graph.

value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), and (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

   Parameters and regulator relationships: Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

   In the example of Fig. 5 we can also observe how the parameters of activity SK1 are populated through regulator relationships. The parameters in and out are mapped to the respective terms through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.
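   As an indication of how this looks in graph terms, the regulator edges around SK1 could be recorded as facts of the following kind (a hedged sketch; rr/2 and the attribute node names are illustrative bookkeeping, not part of the model's formal vocabulary):

    % parameters of SK1 populated from its input schema
    rr(sk1_in1_pkey,   sk1_par_pkey).
    rr(sk1_in1_source, sk1_par_source).
    % parameters populated from the lookup table
    rr(lookup_pkey,    sk1_par_lpkey).
    rr(lookup_source,  sk1_par_lsource).
    rr(lookup_skey,    sk1_par_lskey).
    % LSKEY is additionally used to populate the output attribute SKEY
    rr(sk1_par_lskey,  sk1_out_skey).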
   Provider relationships: The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store, or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

   Formally, a provider relationship is defined by the following elements:

• Name: A unique identifier for the provider relationship.
• Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider and the second part is an attribute acting as the consumer.

   The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

   In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.
2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.
3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

   The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.
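   For the first and third groups, a couple of the provider edges around SK1 would look as follows (a sketch with illustrative node names; the full set of edges is depicted in Fig. 5):

    % group 1: recordset DS.PS1 feeds the input schema of SK1
    pr(ds_ps1_pkey,  sk1_in1_pkey).
    pr(ds_ps1_cost,  sk1_in1_cost).
    % group 3: the output schema of SK1 populates DW.PARTSUPP;
    % note that DW.PARTSUPP.PKEY is fed by the new attribute SKEY
    pr(sk1_out_skey, dw_partsupp_pkey).
    pr(sk1_out_cost, dw_partsupp_cost).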

   Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5, where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

   In Fig. 6, we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

   Derived provider relationships: As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

   Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
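   Since derived provider relationships can be deduced from a simple rule, the deduction itself fits naturally in LDL; a minimal sketch over illustrative bookkeeping predicates (rr/2 for regulator edges, param/2 linking an activity to its parameters, outAttr/2 for its output attributes, and term/1 for terms) is:

    % a derived provider edge dr(S, T) exists whenever a term S feeds
    % some parameter X of an activity A, and some parameter Y of the
    % same activity feeds the output attribute T of A
    dr(S, T) <-
        term(S),
        outAttr(A, T),
        param(A, X),
        param(A, Y),
        rr(S, X),
        rr(Y, T).

Applied, for instance, to the regulator facts sketched earlier for SK1, this rule yields the five derived edges into SK1.OUT.SKEY shown on the right of Fig. 7.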

      addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE) <-
          ds_ps1(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE),
          A_OUT_PKEY=A_IN1_PKEY,
          A_OUT_DATE=A_IN1_DATE,
          A_OUT_QTY=A_IN1_QTY,
          A_OUT_COST=A_IN1_COST,
          A_OUT_SOURCE=A_IN1_SOURCE.

      addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY) <-
          addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE),
          lookup(A_IN1_SOURCE,A_IN1_PKEY,A_OUT_SKEY),
          A_OUT_PKEY=A_IN1_PKEY,
          A_OUT_DATE=A_IN1_DATE,
          A_OUT_QTY=A_IN1_QTY,
          A_OUT_COST=A_IN1_COST,
          A_OUT_SOURCE=A_IN1_SOURCE.

      dw_partsupp(PKEY,DATE,QTY,COST,SOURCE) <-
          addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY),
          DATE=A_OUT_DATE,
          QTY=A_OUT_QTY,
          COST=A_OUT_COST,
          SOURCE=A_OUT_SOURCE,
          PKEY=A_OUT_SKEY.

      NOTE: For reasons of readability we do not replace the A_IN attribute names with
      the activity name, i.e., A_OUT_PKEY should be diffPS1_OUT_PKEY.

Fig. 6. LDL specification of the motivating example.



Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right.



Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships, such as (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).
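As a minimal illustration of this rule, the derived provider edges of a parameterized activity can be produced by pairing every attribute that populates a parameter with the computed output attribute. The following sketch is ours, in Python, and the exact parameter attribute list is a reading of Fig. 7's legend, not a formal part of the model:

    # Sketch: one derived provider edge per (parameter attribute, computed output).
    def derived_provider_edges(parameter_attrs, output_attr):
        return [(p, output_attr) for p in parameter_attrs]

    # For SK1 in Fig. 7, SK1.OUT.SKEY depends on all attributes populating
    # the activity's parameters, not only on LOOKUP.SKEY:
    edges = derived_provider_edges(
        ["SK1.IN.PKEY", "SK1.IN.SOURCE",
         "LOOKUP.PKEY", "LOOKUP.SOURCE", "LOOKUP.SKEY"],
        "SK1.OUT.SKEY")
    assert len(edges) == 5  # the five relationships discussed above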
2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

- Name: A unique identifier for the scenario.
- Activities: A finite list of activities. Note that by employing a list (instead of, e.g., a set) of activities, we impose a total ordering on the execution of the scenario.
- Recordsets: A finite set of recordsets.
- Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).
- Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.
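For concreteness, a scenario can be pictured as a record mirroring the five elements above. This is a sketch of ours in Python; the class and field names are not from the paper:

    from dataclasses import dataclass, field

    @dataclass
    class Scenario:
        # A unique identifier for the scenario.
        name: str
        # A list (not a set), so the position encodes the total execution order.
        activities: list = field(default_factory=list)
        recordsets: set = field(default_factory=set)
        # Targets are a subset of the recordsets (the final destinations).
        targets: set = field(default_factory=set)
        provider_relationships: list = field(default_factory=list)

        def execution_priority(self, activity):
            # Each activity has a discrete execution priority: its list position.
            return self.activities.index(activity)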

Entity                              Model-specific    Scenario-specific

Built-in:
  Data Types                        D^I               D
  Function Types                    F^I               F
  Constants                         C^I               C

User-provided:
  Attributes                        Ω^I               Ω
  Functions                         Φ^I               Φ
  Schemata                          S^I               S
  RecordSets                        RS^I              RS
  Activities                        A^I               A
  Provider Relationships            Pr^I              Pr
  Part-Of Relationships             Po^I              Po
  Instance-Of Relationships         Io^I              Io
  Regulator Relationships           Rr^I              Rr
  Derived Provider Relationships    Dr^I              Dr

Fig. 8. Formal definition of domains and notation.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of the formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of the column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in the column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided).

Formally, the architecture graph of an ETL scenario is a graph G(V,E) defined as follows:

    V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
    E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design, and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.
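The definition transliterates directly into code. The following sketch is ours, with Omega and Phi standing for Ω and Φ; it simply restates the two unions above:

    def architecture_graph(D, F, C, Omega, Phi, S, RS, A, Pr, Po, Io, Rr, Dr):
        node_sets = [D, F, C, Omega, Phi, S, RS, A]
        V = set().union(*node_sets)
        # The name sets are mutually disjoint (unique name assumption),
        # so no element may be lost in the union.
        assert sum(len(s) for s in node_sets) == len(V)
        E = set().union(Pr, Po, Io, Rr, Dr)
        return V, E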

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:
- All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
- All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:
- All the attributes of the input schema(ta) of an activity should have a provider.
- Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
- All the attributes of the schemata of the target recordsets should have a data provider.
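The data flow constraints, in particular, lend themselves to mechanical checking. A sketch of ours follows; the accessors (input_attributes, parameter_attributes, providers_of, precedes) are hypothetical names, not part of the model:

    def check_data_flow_constraints(scenario):
        for activity in scenario.activities:
            # Every attribute of the input schema(ta) must have a provider.
            for attr in activity.input_attributes():
                assert scenario.providers_of(attr), f"{attr} has no provider"
            # The container of every parameter attribute must precede the activity.
            for attr in activity.parameter_attributes():
                assert scenario.precedes(attr.container, activity)
        # Every attribute of a target recordset must have a data provider.
        for target in scenario.targets:
            for attr in target.attributes():
                assert scenario.providers_of(attr), f"{attr} lacks a data provider"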
                                                                                                of the overall process. A richer ‘‘language’’ should
                                                                                                be available, in order to describe the structure of
3. Templates for ETL activities                                                                 the process and facilitate its construction. To this
                                                                                                end, we provide a palette of template activities,
   In this section, we present the mechanism for                                                which are specializations of the generic metamodel
exploiting template definitions of frequently used                                               class.
ETL activities. The general framework for the                                                      Observe Fig. 9 for a further explanation of our
exploitation of these templates is accompanied                                                  framework. The lower layer of Fig. 9, namely
with the presentation of the language-related                                                   schema layer, involves a specific ETL scenario.
issues for template management and appropriate                                                  All the entities of the schema layer are instances of
examples.                                                                                       the classes Data Type, Function Type,



[Figure 9 omitted: it depicts the metamodel layer (classes Datatypes, Functions, Elementary Activity, RecordSet and Relationships), the template layer (subclasses such as Domain Mismatch, NotNull, SK Assignment, Source Table, Fact Table and Provider Relationship, connected to the metamodel classes through IsA links) and the schema layer (S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP, connected to the upper layers through InstanceOf links).]

Fig. 9. The metamodel for the logical entities of the ETL environment.

Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

In the example of Fig. 9, the concept DW.PARTSUPP must be populated from a certain source, S1.PARTSUPP. Several operations must intervene during the propagation: for instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of the class RecordSet (belonging to the metamodel layer) and, specifically, of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of the class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to the class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or an RDBMS table, or whether it is a source or a target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).
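In programming-language terms, the IsA links of Fig. 9 correspond to subclassing and the InstanceOf links to object creation. A compressed sketch of ours, with class names following the figure:

    # Metamodel layer: generic classes.
    class ElementaryActivity: pass
    class RecordSet: pass

    # Template layer: ETL-specific specializations (IsA).
    class NotNull(ElementaryActivity): pass
    class DomainMismatch(ElementaryActivity): pass
    class SKAssignment(ElementaryActivity): pass
    class SourceTable(RecordSet): pass
    class FactTable(RecordSet): pass

    # Schema layer: the concrete scenario of Fig. 9 (InstanceOf).
    s1_partsupp = SourceTable()                     # S1.PARTSUPP
    nn, dm1, sk1 = NotNull(), DomainMismatch(), SKAssignment()
    dw_partsupp = FactTable()                       # DW.PARTSUPP
    assert isinstance(sk1, ElementaryActivity)      # instance of subclass and metaclass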



Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM)

Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN)

Binary operations: Union (U), Join (⋈), Diff (Δ), Update detection (ΔUPD)

File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort)

Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr)

Fig. 10. Template activities, along with their graphical notation symbols, grouped by category.

Following the same framework, the class Elementary Activity is further specialized to an extensible set of recurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of the class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the checks for null values, primary or foreign key violations, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations, along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Apart from the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII conversion, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of "built-in" specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this "built-in", ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns' not included in the palette of the template layer occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism by the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

- Name: A unique identifier for the template activity.
- Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mappings at instantiation time, etc.
- Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.
- Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.

The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.
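Before turning to the notation itself, a registered template can be thought of as a record with the four elements defined above. This is a sketch of ours; the not-null expression is an illustrative LDL-like string, not one of the templates of [13]:

    template_activity = {
        "name": "not_null",
        # Parameters regulate the semantics, e.g. which attribute is checked.
        "parameters": ["@FIELD"],
        # A declarative (LDL) statement over the generic schemata a_in1/a_out.
        "expression": "a_out(OUTPUT_SCHEMA) <- a_in1(INPUT_SCHEMA), "
                      "@FIELD ~= 'null', DEFAULT_MAPPING",
        # Default input-to-output bindings, refinable at instantiation time.
        "mapping": {"A_OUT_$i$": "A_IN1_$i$"},
    }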

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for the dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros, which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable-size schemata).

Variables: We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions: We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

    [<simple constraint>] {<loop body>}

where a simple constraint has the form

    <lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. The upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are +, -, /, *, and valid comparison operators are <, >, =, all with their usual semantics. If the lower bound is omitted, 1 is assumed. During each iteration, the loop body will be reproduced and, at the same time, all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.

Keywords: Keywords are used in order to refer to input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no two different predicates with the same name appear in the same program, and no two different attributes with the same name appear in the same rule. Keywords are recognized even if they are parts of another string, without a special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation time, with all of them having unique names in the LDL program scope. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names will be unique throughout the scenario program. In Fig. 11, we depict the way the renaming is performed at instantiation time.
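To see the variable and loop mechanisms concretely, here is a sketch of ours, in Python, of how a loop production such as [i<arityOf(a_out)]{A_OUT_$i$,} could be expanded once the arity is known; the helper name is hypothetical:

    def expand_loop(body, iterator, lower, upper):
        # Reproduce the loop body once per iteration, substituting the
        # $-marked appearances of the iterator by its current value.
        return "".join(body.replace(f"${iterator}$", str(i))
                       for i in range(lower, upper + 1))

    arity = 3  # stands for arityOf(a_out), already evaluated
    attrs = (expand_loop("A_OUT_$i$,", "i", 1, arity - 1)
             + expand_loop("A_OUT_$i$", "i", arity, arity))
    assert attrs == "A_OUT_1,A_OUT_2,A_OUT_3"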

Keyword: a_out / a_in
Usage: A unique name for the output/input schema of the activity. The predicate that is produced when this template is instantiated has the form: unique_pred_name_out (or _in, respectively).
Example: difference3_out / difference3_in

Keyword: A_OUT / A_IN
Usage: Used for constructing the names of the a_out/a_in attributes. The names produced have the form: predicate unique name in upper case_OUT (or _IN, respectively).
Example: DIFFERENCE3_OUT / DIFFERENCE3_IN

Fig. 11. Keywords for templates.
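The renaming of Fig. 11 amounts to a plain textual substitution, since keywords are recognized even inside longer strings. A sketch of ours, for an activity whose unique name is dm1:

    def rename_keywords(code, unique_name):
        # Plain substring replacement mirrors the mechanism described above.
        for keyword, replacement in (
                ("a_out", f"{unique_name}_out"),
                ("a_in1", f"{unique_name}_in1"),
                ("A_OUT", f"{unique_name.upper()}_OUT"),
                ("A_IN1", f"{unique_name.upper()}_IN1")):
            code = code.replace(keyword, replacement)
        return code

    assert rename_keywords("a_out(A_OUT_1)", "dm1") == "dm1_out(DM1_OUT_1)"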



Macros: To make the definition of templates easier and to improve their readability, we introduce a macro mechanism to facilitate attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity. Clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like the following

    name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

    [iterator<maxLimit]{name_theme_$iterator$,}
    [iterator=maxLimit]{name_theme_$iterator$}

Obviously, this results in making the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

    a_out([i<arityOf(a_out)]{A_OUT_$i$,}
          [i=arityOf(a_out)]{A_OUT_$i$}) <-
        a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}
              [i=arityOf(a_in1)]{A_IN1_$i$}),
        expr([i<arityOf(@PARAM)]{@PARAM[$i$],}
             [i=arityOf(@PARAM)]{@PARAM[$i$]}),
        [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned in the syntax for loops, the expression

    [i<arityOf(a_out)]{A_OUT_$i$,}
    [i=arityOf(a_out)]{A_OUT_$i$}

defining the attributes of the output schema a_out simply lists a variable number of attributes that will be fixed at instantiation time. Exactly the same tactics apply for the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4. We can simplify the definition of the template by allowing the designer to define certain macros that simplify the management of variable-length attribute lists.

We employ the following macros:

    DEFINE INPUT_SCHEMA AS
      [i<arityOf(a_in1)]{A_IN1_$i$,}
      [i=arityOf(a_in1)]{A_IN1_$i$}

    DEFINE OUTPUT_SCHEMA AS
      [i<arityOf(a_out)]{A_OUT_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$}

    DEFINE PARAM_SCHEMA AS
      [i<arityOf(@PARAM)]{@PARAM[$i$],}
      [i=arityOf(@PARAM)]{@PARAM[$i$]}

    DEFINE DEFAULT_MAPPING AS
      [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
      [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

Then, the template definition is as follows:

    a_out(OUTPUT_SCHEMA) <-
      a_in1(INPUT_SCHEMA),
      expr(PARAM_SCHEMA),
      DEFAULT_MAPPING
                                                              schema of the activity, fa12_out, is the head of
                                                              the LDL rule that specifies the activity. The body
3.2.2. Instantiation
                                                              of the rule says that the output records are
   Template instantiation is the process where the
                                                              specified by the conjunction of the following
user chooses a certain template and creates a
                                                              clauses: (a) the input schema myFunc_in, (b)
concrete activity out of it. This procedure requires
                                                              the application of function subtract over the
that the user specifies the schemata of the activity
                                                              attributes COST_IN, PRICE_IN and the produc-
and gives concrete values to the template para-
                                                              tion of a value PROFIT, and (c) the mapping of
meters. Then, the process of producing the
respective LDL description of the activity is easily          the input to the respective output attributes as
                                                              specified in the last three conjuncts of the rule.
automated. Instantiation order is important in our
                                                                 The first row, template, shows the initial
template creation mechanism, since, as it can easily
                                                              template as it has been registered by the designer.
been seen from the notation definitions, different
                                                              @FUNCTION holds the name of the function to be
orders can lead to different results. The instantia-
                                                              used, subtract in our case, and the @PARAM[ ]
tion order is as follows:
                                                              holds the inputs of the function, which in our case
1. Replacement of macro definitions with their                 are the two attributes of the input schema. The
   expansions.                                                problem we have to face is that all input, output
2. arityOf() functions and parameter variables                and function schemata have a variable number of
   appearing in loop boundaries are calculated                parameters. To abstract from the complexity of
   first.                                                      this problem, we define four macro definitions, one
3. Loop productions are performed by instantiat-              for each schema (INPUT_SCHEMA, OUTPUT_
   ing the appearances of the iterators. This leads           SCHEMA, FUNCTION_INPUT) along with a macro
   to intermediate results without any loops.                 for the mapping of input to output attributes

[Figure 12 omitted: it lists the six rows of the instantiation procedure, namely template, macro expansion, parameter instantiation, loop production, variable instantiation and keyword renaming.]

Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated in the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression [i<arityOf(a_in)+1]{A_OUT_$i$,}OUTFIELD as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in the list is different (to avoid repeating an erroneous comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done. As can easily be seen, these expansions must be done before the @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords are renamed. Keyword instantiation
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document
Etl design document

Weitere ähnliche Inhalte

Was ist angesagt?

Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Azure data factory
Azure data factoryAzure data factory
Azure data factoryBizTalk360
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMark Kromer
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data FactoryHARIHARAN R
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Introduction to Azure Data Factory
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data FactorySlava Kokaev
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lakeMykola Zerniuk
 
Oracle Sql Developer Data Modeler 3 3 new features
Oracle Sql Developer Data Modeler 3 3 new featuresOracle Sql Developer Data Modeler 3 3 new features
Oracle Sql Developer Data Modeler 3 3 new featuresPhilip Stoyanov
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsMatei Zaharia
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with linksChris Testa-O'Neill
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data LakeCalum Miller
 

Was ist angesagt? (20)

Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Azure data factory
Azure data factoryAzure data factory
Azure data factory
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Microsoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview SlidesMicrosoft Azure Data Factory Hands-On Lab Overview Slides
Microsoft Azure Data Factory Hands-On Lab Overview Slides
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
ETL QA
ETL QAETL QA
ETL QA
 
Azure Data Factory
Azure Data FactoryAzure Data Factory
Azure Data Factory
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Introduction to Azure Data Factory
Introduction to Azure Data FactoryIntroduction to Azure Data Factory
Introduction to Azure Data Factory
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Oracle Sql Developer Data Modeler 3 3 new features
Oracle Sql Developer Data Modeler 3 3 new featuresOracle Sql Developer Data Modeler 3 3 new features
Oracle Sql Developer Data Modeler 3 3 new features
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with links
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
 

Andere mochten auch

Gathering Business Requirements for Data Warehouses
Gathering Business Requirements for Data WarehousesGathering Business Requirements for Data Warehouses
Gathering Business Requirements for Data WarehousesDavid Walker
 
What is Employee Spend Management
What is Employee Spend ManagementWhat is Employee Spend Management
What is Employee Spend ManagementSean Goldie
 
Writing Effective Policies & Procedures2
Writing Effective  Policies & Procedures2Writing Effective  Policies & Procedures2
Writing Effective Policies & Procedures2noha1309
 
Blood bags and its anticoagulants
Blood bags and its anticoagulantsBlood bags and its anticoagulants
Blood bags and its anticoagulantsSowmya Srinivas
 
Cheap auto insurance in kansas
Cheap auto insurance in kansasCheap auto insurance in kansas
Cheap auto insurance in kansasautousinsurance
 
Concept and characteristics of biodiversity
Concept and characteristics of biodiversityConcept and characteristics of biodiversity
Concept and characteristics of biodiversityAnil kumar
 
Anatomy of a cyber attack
Anatomy of a cyber attackAnatomy of a cyber attack
Anatomy of a cyber attackMark Silver
 
Executive director recommendation letter
Executive director recommendation letterExecutive director recommendation letter
Executive director recommendation lettertumr220
 
86 executive interview questions and answers
86 executive interview questions and answers86 executive interview questions and answers
86 executive interview questions and answersbradleylindsey345
 
B Plan On Consumer Electronics
B Plan On Consumer ElectronicsB Plan On Consumer Electronics
B Plan On Consumer ElectronicsAshish Bansal
 
Agile Testing Process
Agile Testing ProcessAgile Testing Process
Agile Testing ProcessIntetics
 
Diarrhea:Myths and facts, Precaution
Diarrhea:Myths and facts, Precaution Diarrhea:Myths and facts, Precaution
Diarrhea:Myths and facts, Precaution Wuzna Haroon
 

Andere mochten auch (20)

Gathering Business Requirements for Data Warehouses
Gathering Business Requirements for Data WarehousesGathering Business Requirements for Data Warehouses
Gathering Business Requirements for Data Warehouses
 
What is Employee Spend Management
What is Employee Spend ManagementWhat is Employee Spend Management
What is Employee Spend Management
 
Writing Effective Policies & Procedures2
Writing Effective  Policies & Procedures2Writing Effective  Policies & Procedures2
Writing Effective Policies & Procedures2
 
Blood bags and its anticoagulants
Blood bags and its anticoagulantsBlood bags and its anticoagulants
Blood bags and its anticoagulants
 
Managing Cash Flow
Managing Cash FlowManaging Cash Flow
Managing Cash Flow
 
2015 Quality Management System Vendor Benchmark
2015 Quality Management System Vendor Benchmark2015 Quality Management System Vendor Benchmark
2015 Quality Management System Vendor Benchmark
 
Uses of computer networks
Uses of computer networksUses of computer networks
Uses of computer networks
 
Cheap auto insurance in kansas
Cheap auto insurance in kansasCheap auto insurance in kansas
Cheap auto insurance in kansas
 
Concept and characteristics of biodiversity
Concept and characteristics of biodiversityConcept and characteristics of biodiversity
Concept and characteristics of biodiversity
 
Energy Drinks
Energy DrinksEnergy Drinks
Energy Drinks
 
Anatomy of a cyber attack
Anatomy of a cyber attackAnatomy of a cyber attack
Anatomy of a cyber attack
 
Dental cements and cementation procedures
Dental cements and cementation proceduresDental cements and cementation procedures
Dental cements and cementation procedures
 
ERP And Enterprise Architecture
ERP And Enterprise ArchitectureERP And Enterprise Architecture
ERP And Enterprise Architecture
 
Channel Partner Management
Channel Partner ManagementChannel Partner Management
Channel Partner Management
 
Executive director recommendation letter
Executive director recommendation letterExecutive director recommendation letter
Executive director recommendation letter
 
86 executive interview questions and answers
86 executive interview questions and answers86 executive interview questions and answers
86 executive interview questions and answers
 
B Plan On Consumer Electronics
B Plan On Consumer ElectronicsB Plan On Consumer Electronics
B Plan On Consumer Electronics
 
Agile Testing Process
Agile Testing ProcessAgile Testing Process
Agile Testing Process
 
Basics of Coding in Pediatrics Medical Billing
Basics of Coding in Pediatrics Medical BillingBasics of Coding in Pediatrics Medical Billing
Basics of Coding in Pediatrics Medical Billing
 
Diarrhea:Myths and facts, Precaution
Diarrhea:Myths and facts, Precaution Diarrhea:Myths and facts, Precaution
Diarrhea:Myths and facts, Precaution
 

Ähnlich wie Etl design document

An Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETLAn Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETLidescitation
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Intelligent agents in ontology-based applications
Intelligent agents in ontology-based applicationsIntelligent agents in ontology-based applications
Intelligent agents in ontology-based applicationsinfopapers
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolThierry Badard
 
Chapter 1 - Introduction to System Integration and Architecture.pdf
Chapter 1 - Introduction to System Integration and Architecture.pdfChapter 1 - Introduction to System Integration and Architecture.pdf
Chapter 1 - Introduction to System Integration and Architecture.pdfKhairul Anwar Sedek
 
Dimitri Papadimitriou (Alcatel-Lucent Bell) - Future Internet Architecture (F...
Dimitri Papadimitriou (Alcatel-Lucent Bell) - Future Internet Architecture (F...Dimitri Papadimitriou (Alcatel-Lucent Bell) - Future Internet Architecture (F...
Dimitri Papadimitriou (Alcatel-Lucent Bell) - Future Internet Architecture (F...FIA2010
 
Configuration management for lyee software
Configuration management for lyee softwareConfiguration management for lyee software
Configuration management for lyee softwaremariase324
 
Software_Engineering_Presentation (1).pptx
Software_Engineering_Presentation (1).pptxSoftware_Engineering_Presentation (1).pptx
Software_Engineering_Presentation (1).pptxArifaMehreen1
 
Running head LITERATURE REVIEW AND PROPOSAL 12LITERATU.docx
Running head LITERATURE REVIEW AND PROPOSAL    12LITERATU.docxRunning head LITERATURE REVIEW AND PROPOSAL    12LITERATU.docx
Running head LITERATURE REVIEW AND PROPOSAL 12LITERATU.docxcowinhelen
 
(c) the customization and integration of the information coming from multiple sources into a common format, (d) the cleaning of the resulting data set on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

[Fig. 1. Different perspectives for an ETL workflow: the logical perspective (execution plan: execution sequence, execution schedule, recovery plan; relationship with data: primary data flow, data flow for logical exceptions) and the physical perspective (resource layer, operational layer), together with the administration plan (monitoring & logging, security & access rights management).]

If we treat an ETL scenario as a composite workflow, in a traditional way, its designer is obliged to define several of its parameters (Fig. 1). Here, we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled approach. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 1, respectively.

At the top of Fig. 1, we are mainly concerned with the static design artifacts for a workflow environment. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 1 and the physical perspective on the right-hand side.

At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far).

On the right-hand side of Fig. 1, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design entities of the logical perspective in the real world.
In other words, the activities defined at the logical layer (in an abstract way) are materialized and executed through the specific software modules of the physical perspective.

At the lower part of Fig. 1, we are dealing with the tasks that concern the administration of the workflow environment and their dynamic behavior at runtime. First, an administration plan should be specified, involving the notification of the administrator either on-line (monitoring) or off-line (logging) for the status of an executed activity, as well as the security and authentication management for the ETL environment.

We find that research has not dealt with the definition of data-centric workflows to the entirety of its extent. In the ETL case, for example, due to the data centric nature of the process, the designer must deal with the relationship of the involved activities with the underlying data. This involves the definition of a primary data flow that describes the route of data from the sources towards their final destination in the data warehouse, as they pass through the activities of the scenario. Also, due to possible quality problems of the processed data, the designer is obliged to define a data flow for logical exceptions, i.e., a flow for the problematic data, i.e., the rows that violate integrity or business rules. It is the combination of the execution sequence and the data flow that generates the semantics of the ETL workflow: the data flow defines what each activity does and the execution plan defines in which order and combination.

In this paper, we work in the internals of the data flow of ETL scenarios. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Moreover, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity; nevertheless, reusability and ease-of-use dictate that we can do better in aiding the data warehouse designer in his task. In this pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues: thus, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.

Our contributions can be listed as follows:

• First, we define a formal metamodel as an abstraction of ETL processes at the logical level. The data stores, activities and their constituent parts are formally defined. An activity is defined as an entity with possibly more than one input schemata, an output schema and a parameter schema, so that the activity is populated each time with its proper parameter values. The flow of data from producers towards their consumers is achieved through the usage of provider relationships that map the attributes of the former to the respective attributes of the latter. A serializable combination of ETL activities, provider relationships and data stores constitutes an ETL scenario.
• Second, we provide a reusability framework that complements the genericity of the metamodel. Practically, this is achieved from a set of ''built-in'' specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. This palette of template activities will be referred to as the template layer and it is characterized by its extensibility; in fact, due to language considerations, we provide the details of the mechanism that instantiates templates to specific activities.
• Finally, we discuss implementation issues and we present a graphical tool, ARKTOS II, that facilitates the design of ETL scenarios, based on our model.

This paper is organized as follows. In Section 2, we present a generic model of ETL activities. Section 3 describes the mechanism for specifying and materializing template definitions of frequently used ETL activities. Section 4 presents ARKTOS II, a prototype graphical tool. In Section 5, we survey related work.
In Section 6, we make a general discussion on the completeness and general applicability of our approach. Section 7 offers conclusions and presents topics for future research. Short versions of parts of this paper have been presented in [6,7].

2. Generic model of ETL activities

The purpose of this section is to present a formal logical model for the activities of an ETL environment. This model abstracts from the technicalities of monitoring, scheduling and logging while it concentrates on the flow of data from the sources towards the data warehouse through the composition of activities and data stores. The full layout of an ETL scenario, involving activities, recordsets and functions, can be modeled by a graph, which we call the architecture graph. We employ a uniform, graph-modeling framework for both the modeling of the internal structure of activities and the ETL scenario at large, which enables the treatment of the ETL environment from different viewpoints. First, the architecture graph comprises all the activities and data stores of a scenario, along with their components. Second, the architecture graph captures the data flow within the ETL environment. Finally, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, are also covered.

2.1. Graphical notation and motivating example

Being a graph, the architecture graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets, parameters and functions constitute the nodes of the graph. The different kinds of relationships among these entities are modeled as the edges of the graph. In Fig. 2, we give the graphical notation for all the modeling constructs that will be presented in the sequel.

[Fig. 2. Graphical notation for the architecture graph:
• Data Types: black ellipsoids; Function Types: black rectangles; Constants: black circles; RecordSets: cylinders; Functions: gray rectangles; Parameters: white rectangles; Attributes: unshaded ellipsoids; Activities: triangles.
• Part-Of Relationships: simple lines with diamond edges*; Instance-Of Relationships: dotted arrows (from the instance towards the type); Provider Relationships: bold solid arrows (from provider to consumer); Derived Provider Relationships: bold dotted arrows (from provider to consumer); Regulator Relationships: dotted lines.
* We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.]

Motivating example: To motivate our discussion, we will present an example involving the propagation of data from a certain source S1, towards a data warehouse DW through intermediate recordsets. These recordsets belong to a data staging area (DSA)¹ DS. The scenario involves the propagation of data from the table PARTSUPP of source S1 to the data warehouse DW. Table DW.PARTSUPP (PKEY, SOURCE, DATE, QTY, COST) stores information for the available quantity (QTY) and cost (COST) of parts (PKEY) per source (SOURCE). The data source S1.PARTSUPP (PKEY, DATE, QTY, COST) records the supplies from a specific geographical region, e.g., Europe. All the attributes, except for the dates, are instances of the Integer type.

[Fig. 3. Bird's-eye view of the motivating example: S1.PARTSUPP is transferred via FTP_PS1 to DS.PS1_NEW in the DSA; Diff_PS1 (checked against DS.PS1_OLD, with rejections to Diff_PS1_REJ), NotNull1 (with rejections to NotNull1_REJ) and Add_Attr1 populate DS.PS1; SK1 (using the LOOKUP table) populates DW.PARTSUPP in the data warehouse.]

The scenario is graphically depicted in Fig. 3 and involves the following transformations.

1. First, we transfer via FTP_PS1 the snapshot from the source S1.PARTSUPP to the file DS.PS1_NEW of the DSA.²
2. In the DSA, we maintain locally a copy of the snapshot of the source as it was at the previous loading (we assume here the case of the incremental maintenance of the DW, instead of the case of the initial loading of the DW). The recordset DS.PS1_NEW (PKEY, DATE, QTY, COST) stands for the last transferred snapshot of S1.PARTSUPP. By detecting the difference of this snapshot with the respective version of the previous loading, DS.PS1_OLD (PKEY, DATE, QTY, COST), we can derive the newly inserted rows in S1.PARTSUPP. Note that the difference activity that we employ, namely Diff_PS1, checks for differences only on the primary key of the recordsets; thus, we ignore here any possible deletions or updates for the attributes COST, QTY of existing rows. Any row that is not newly inserted is rejected and so, it is propagated to Diff_PS1_REJ, which stores all the rejected rows. The schema of Diff_PS1_REJ is identical to the input schema of the activity Diff_PS1.

¹ In data warehousing terminology, a DSA is an intermediate area of the data warehouse, specifically destined to enable the transformation, cleaning and integration of source data, before being loaded to the warehouse.
² The technical points, like FTP, are mostly employed to show what kind of problems someone has to deal with in a practical situation, rather than to relate this kind of physical operations to a logical model. In terms of logical modelling this is a simple passing of data from one site to another.
3. The rows that pass the activity Diff_PS1 are checked for null values of the attribute COST through the activity NotNull1. Rows having a NULL value for their COST are kept in the NotNull1_REJ recordset for further examination by the data warehouse administrator.
4. Although we consider the data flow for only one source, namely S1, the data warehouse can clearly have more sources for part supplies. In order to keep track of the source of each row entering into the DW, we need to add a 'flag' attribute, namely SOURCE, indicating the respective source. This is achieved through the activity Add_Attr1. We store the rows that stem from this process in the recordset DS.PS1 (PKEY, SOURCE, DATE, QTY, COST).
5. Next, we assign a surrogate key on PKEY. In the data warehouse context, it is common tactics to replace the keys of the production systems with a uniform key, which we call a surrogate key [8]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus, they need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L (PRODKEY, SOURCE, SKEY). The SOURCE column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute PKEY is SK1. It uses the lookup table LOOKUP (PKEY, SOURCE, SKEY). Finally, we populate the data warehouse with the output of the previous activity.

The role of rejected rows depends on the peculiarities of each ETL scenario. If the designer needs to administrate these rows further, then he/she should use intermediate storage recordsets with the burden of an extra I/O cost. If the rejected rows should not have a special treatment, then the best solution is to ignore them; thus, in this case we avoid overloading the scenario with any extra storage recordset. In our case, we annotate only two of the presented activities with a destination for rejected rows. Out of these, while NotNull1_REJ absolutely makes sense as a placeholder for problematic rows having non-acceptable NULL values, Diff_PS1_REJ is presented for demonstration reasons only.
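For illustration, a minimal sketch of what the semantics of the difference check of step 2 could look like in LDL follows, in the style used later in Fig. 6. It is illustrative only, not the actual ARKTOS II template: the predicate names (ds_ps1_new, ds_ps1_old, diffPS1_in1, diffPS1_out, diffPS1_rej, oldKey) and the use of ~ for negation (per LDL++ conventions) are our assumptions.

    % Populate the input schema of Diff_PS1 from the new snapshot.
    diffPS1_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST) <-
        ds_ps1_new(PKEY,DATE,QTY,COST),
        A_IN1_PKEY=PKEY, A_IN1_DATE=DATE,
        A_IN1_QTY=QTY, A_IN1_COST=COST.

    % Keys that were already present at the previous loading.
    oldKey(PKEY) <- ds_ps1_old(PKEY,DATE,QTY,COST).

    % Output semantics: rows whose primary key is newly inserted.
    diffPS1_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST) <-
        diffPS1_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST),
        ~oldKey(A_IN1_PKEY),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE,
        A_OUT_QTY=A_IN1_QTY, A_OUT_COST=A_IN1_COST.

    % Rejection semantics: the complement, i.e., rows already present.
    diffPS1_rej(A_REJ_PKEY,A_REJ_DATE,A_REJ_QTY,A_REJ_COST) <-
        diffPS1_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST),
        oldKey(A_IN1_PKEY),
        A_REJ_PKEY=A_IN1_PKEY, A_REJ_DATE=A_IN1_DATE,
        A_REJ_QTY=A_IN1_QTY, A_REJ_COST=A_IN1_COST.

Note how the projection of the old snapshot on its key (oldKey) keeps the negated subgoal safe, and how the rejection rule is literally the complement of the output rule.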
Finally, before proceeding, we would like to stress that we do not anticipate a manual construction of the graph by the designer; rather, we employ this section to clarify how the graph will look once constructed. To assist a more automatic construction of ETL scenarios, we have implemented the ARKTOS II tool that supports the designing process through a friendly GUI. We present ARKTOS II in Section 4.

2.2. Preliminaries

In this subsection, we will introduce the formal modeling of data types, data stores and functions, before proceeding to the modeling of ETL activities.

Elementary entities: We assume the existence of a countable set of data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values, called dom(T). The values of the domains are also referred to as constants.

We also assume the existence of a countable set of attributes, which constitute the most elementary granules of the infrastructure of the information system. Attributes are characterized by their name and data type. The domain of an attribute is a subset of the domain of its data type. Attributes and constants are uniformly referred to as terms.

A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a structured entity. Moreover, we assume the existence of a special family of schemata, all under the general name of NULL schema, determined to act as placeholders for data which are not to be stored permanently in some data store. We refer to a family instead of a single NULL schema, due to a subtle technicality involving the number of attributes of such a schema (this will become clear in the sequel).

Recordsets: We define a record as the instantiation of a schema to a list of values belonging to the domains of the respective schema attributes. We can treat any data structure as a recordset provided that there are ways to logically restructure it into a flat, typed record schema. Formally, a recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). If we consider a schema S = [A1, …, Ak] for a certain recordset, its extension is a mapping S = [A1, …, Ak] → dom(A1) × … × dom(Ak). Thus, the extension of the recordset is a finite subset of dom(A1) × … × dom(Ak) and a record is the instance of a mapping dom(A1) × … × dom(Ak) → [x1, …, xk], xi ∈ dom(Ai).

In the rest of this paper we will mainly deal with the two most popular types of recordsets, namely relational tables and record files. A database is a finite set of relational tables.

Functions: We assume the existence of a countable set of built-in system function types. A function type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type. Consequently, it is characterized by a name, a list of input parameters and a parameter for its return value. The data types of the parameters of the generating function type also define (a) the data types of the parameters of the function and (b) the legal candidates for the function parameters (i.e., attributes or constants of a suitable data type).
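As a hedged illustration of how a function instance can appear in an activity's semantics (anticipating the function application example of Fig. 12, where subtract is applied over a cost and a price to produce a profit), consider the following sketch; the predicate names (fa_in1, fa_out) and the equational call style for the function are our assumptions, not a commitment of the model:

    % subtract is an instance of a function type with two Integer
    % parameters and an Integer return value; its return parameter
    % is bound here to the output attribute A_OUT_PROFIT.
    fa_out(A_OUT_COST,A_OUT_PRICE,A_OUT_PROFIT) <-
        fa_in1(A_IN1_COST,A_IN1_PRICE),
        A_OUT_PROFIT = subtract(A_IN1_COST,A_IN1_PRICE),
        A_OUT_COST = A_IN1_COST, A_OUT_PRICE = A_IN1_PRICE.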
complement of the output operational seman- The execution of an activity is performed from a tics, except if explicitly defined differently. particular program. Normally, ETL activities will be either performed in a black-box manner by a There are two issues that we would like to dedicated tool, or they will be expressed in some elaborate on, here: language (e.g., PL/SQL, Perl, C). Still, we want to deal with the general case of ETL activities. We NULL schemata: Whenever we do not specify employ an abstraction of the source code of an a data consumer for the output or rejec- activity, in the form of an LDL statement. Using tion schemata, the respective NULL schema
There are two issues that we would like to elaborate on here:

• NULL schemata: Whenever we do not specify a data consumer for the output or rejection schemata, the respective NULL schema (involving the correct number of attributes) is implied. This practically means that the data targeted for this schema will neither be stored to some persistent data store, nor will they be propagated to another activity, but they will simply be ignored.
• Language issues: Initially, we used to specify the semantics of activities with SQL statements. Still, although clear and easy to write and understand, SQL is rather hard to use if one is to perform rewriting and composition of statements. Thus, we have supplemented SQL with LDL [10], a logic programming, declarative language, as the basis of our scenario definition. LDL is a Datalog variant based on a Horn-clause logic that supports recursion, complex objects and negation. In the context of its implementation in an actual deductive database management system, LDL++ [11], the language has been extended to support external functions, choice, aggregation (and even, user-defined aggregation), updates and several other features.

2.4. Relationships in the architecture graph

In this subsection, we will elaborate on the different kinds of relationships that the entities of an ETL scenario have. Whereas these entities are modeled as the nodes of the architecture graph, relationships are modeled as its edges. Due to their diversity, before proceeding, we list these types of relationships along with the related terminology that we will use in this paper. The graphical notation of entities (nodes) and relationships (edges) is presented in Fig. 2.

• Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.
• Instance-of relationships. These relationships are defined among a data/function type and its instances.
• Provider relationships. These are relationships that involve attributes with a provider–consumer relationship.
• Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.
• Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.

In the rest of this subsection, we will detail the notions pertaining to the relationships of the architecture graph; the knowledgeable reader is referred to Section 2.5 where we discuss the issue of scenarios. We will base our discussions on a part of the scenario of the motivating example (presented in Section 2.1), including activity SK1.
Data types and instance-of relationships: To capture typing information on attributes and functions, the architecture graph comprises data and function types. Instantiation relationships are depicted as dotted arrows that stem from the instances and head toward the data/function types. In Fig. 4, we observe the attributes of the two activities of our example and their correspondence to two data types, namely integer and date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded.

[Fig. 4. Instance-of relationships of the architecture graph: the attributes of DS.PS1, SK1 and DW.PARTSUPP are connected to the data types Integer and Date.]

Attributes and part-of relationships: The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR, REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity, respectively). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the ith input schema. We also incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.

Fig. 5 depicts a part of the motivating example. In terms of part-of relationships, we present the decomposition of (a) the recordsets DS.PS1, LOOKUP, DW.PARTSUPP and (b) the activity SK1 and the attributes of its input and output schemata. Note the tagging of the schemata of the involved activity. We do not consider the rejection schemata in order to avoid crowding the picture. Also note how the parameters of the activity are also incorporated in the architecture graph.

[Fig. 5. Part-of, regulator and provider relationships of the architecture graph.]
Activity SK1 has five parameters: (a) PKEY, which stands for the production key to be replaced, (b) SOURCE, which stands for an integer value that characterizes which source's data are processed, (c) LPKEY, which stands for the attribute of the lookup table which contains the production keys, (d) LSOURCE, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned SOURCE parameter), (e) LSKEY, which stands for the attribute of the lookup table which contains the surrogate keys.

Parameters and regulator relationships: Once the part-of and instantiation relationships have been established, it is time to establish the regulator relationships of the scenario. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted edges.

In the example of Fig. 5 we can also observe how the parameters of activity SK1 are populated through regulator relationships. All the parameters of SK1, namely PKEY, SOURCE, LPKEY, LSOURCE and LSKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP. The parameter LSKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LSKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute, through the respective parameter.

Provider relationships: The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store, or another activity. Usually, this applies for the output of an activity, too. We capture the passing of data from providers to consumers by a provider relationship among the attributes of the involved schemata.

Formally, a provider relationship is defined by the following elements:

• Name: A unique identifier for the provider relationship.
• Mapping: An ordered pair. The first part of the pair is a term (i.e., an attribute or constant) acting as a provider and the second part is an attribute acting as the consumer.

The mapping need not necessarily be 1:1 from provider to consumer attributes, since an input attribute can be mapped to more than one consumer attribute. Still, the opposite does not hold. Note that a consumer attribute can also be populated by a constant, in certain cases.

In order to achieve the flow of data from the providers of an activity towards its consumers, we need the following three groups of provider relationships:

1. A mapping between the input schemata of the activity and the output schema of their data providers. In other words, for each attribute of an input schema of an activity, there must exist an attribute of the data provider, or a constant, which is mapped to the former attribute.
2. A mapping between the attributes of the activity input schemata and the activity output (or rejection, respectively) schema.
3. A mapping between the output or rejection schema of the activity and the (input) schema of its data consumer.

The mappings of the second type are internal to the activity. Basically, they can be derived from the LDL statement for each of the output/rejection schemata. As far as the first and the third types of provider relationships are concerned, the mappings must be provided during the construction of the ETL scenario. This means that they are either (a) by default assumed by the order of the attributes of the involved schemata or (b) hard-coded by the user. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute.
Observe Fig. 5. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity SK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues to DW.PARTSUPP. Another interesting thing is that during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 5 where the values of attribute PKEY are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.

In Fig. 6, we depict the LDL definition of this part of the motivating example. The three rules correspond to the three categories of provider relationships, previously discussed: the first rule explains how the data from the DS.PS1 recordset are fed into the input schema of the activity, the second rule explains the semantics of the activity (i.e., how the surrogate key is generated) and, finally, the third rule shows how the DW.PARTSUPP recordset is populated from the output schema of the activity SK1.

Derived provider relationships: As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the respective output attributes.

Formally, assume that (a) source is a term in the architecture graph, (b) target is an attribute of the output schema of an activity A and (c) x, y are parameters in the parameter list of A (not necessarily different). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).

    addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE) <-
        ds_ps1(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY) <-
        addSkey_in1(A_IN1_PKEY,A_IN1_DATE,A_IN1_QTY,A_IN1_COST,A_IN1_SOURCE),
        lookup(A_IN1_SOURCE,A_IN1_PKEY,A_OUT_SKEY),
        A_OUT_PKEY=A_IN1_PKEY, A_OUT_DATE=A_IN1_DATE, A_OUT_QTY=A_IN1_QTY,
        A_OUT_COST=A_IN1_COST, A_OUT_SOURCE=A_IN1_SOURCE.

    dw_partsupp(PKEY,DATE,QTY,COST,SOURCE) <-
        addSkey_out(A_OUT_PKEY,A_OUT_DATE,A_OUT_QTY,A_OUT_COST,A_OUT_SOURCE,A_OUT_SKEY),
        DATE=A_OUT_DATE, QTY=A_OUT_QTY, COST=A_OUT_COST,
        SOURCE=A_OUT_SOURCE, PKEY=A_OUT_SKEY.

Fig. 6. LDL specification of the motivating example. (Note: for reasons of readability we do not replace the A_IN/A_OUT attribute names with the activity name, i.e., A_OUT_PKEY should read addSkey_OUT_PKEY.)
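The 'simple rule' by which derived provider relationships can be deduced lends itself to a declarative phrasing over the architecture graph. The following is a hedged sketch under the assumption that the relevant edges are stored as facts; the predicate names regulator, paramOf and outputAttr are ours:

    % regulator(T,P):  a regulator edge between term T and parameter P.
    % paramOf(P,A):    parameter P belongs to the parameter list of activity A.
    % outputAttr(T,A): T is an attribute of the output schema of activity A.
    derivedProvider(SOURCE,TARGET) <-
        regulator(SOURCE,X), paramOf(X,A),
        regulator(Y,TARGET), paramOf(Y,A),
        outputAttr(TARGET,A).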
[Fig. 7. Derived provider relationships of the architecture graph: the original situation on the left and the derived provider relationships on the right.]

Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship.

Observe Fig. 7, where we depict a small part of our running example. The left side of the figure depicts the situation where only provider relationships exist. The legend in the right side of Fig. 7 depicts how we compute the derived provider relationships between the parameters of the activity and the computed output attribute SKEY. The meaning of these five relationships is that SK1.OUT.SKEY is not computed only from attribute LOOKUP.SKEY, but from the combination of all the attributes that populate the parameters.

One can also assume different variations of derived provider relationships such as (a) relationships that do not involve constants (remember that we have defined source as a term); (b) relationships involving only attributes of the same/different activity (as a measure of internal complexity or external dependencies); (c) relationships relating attributes that populate only the same parameter (e.g., only the attributes LOOKUP.SKEY and SK1.OUT.SKEY).

2.5. Scenarios

A scenario is an enumeration of activities along with their source/target recordsets and the respective provider relationships for each activity. An ETL scenario consists of the following elements:

• Name: A unique identifier for the scenario.
• Activities: A finite list of activities. Note that by employing a list (instead of e.g., a set) of activities, we impose a total ordering on the execution of the scenario.
• Recordsets: A finite set of recordsets.
• Targets: A special-purpose subset of the recordsets of the scenario, which includes the final destinations of the overall process (i.e., the data warehouse tables that must be populated by the activities of the scenario).
• Provider relationships: A finite list of provider relationships among activities and recordsets of the scenario.

In our modeling, a scenario is a set of activities, deployed along a graph in an execution sequence that can be linearly serialized. For the moment, we do not consider the different alternatives for the ordering of the execution; we simply require that a total order for this execution is present (i.e., each activity has a discrete execution priority).

In terms of formal modeling of the architecture graph, we assume the infinitely countable, mutually disjoint sets of names (i.e., the values of which respect the unique name assumption) of column model-specific in Fig. 8. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column scenario-specific in Fig. 8. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user provided).

[Fig. 8. Formal definition of domains and notation:

  Entity                           Model-specific   Scenario-specific
  Data Types                       D^I              D     (built-in)
  Function Types                   F^I              F     (built-in)
  Constants                        C^I              C     (built-in)
  Attributes                       Ω^I              Ω     (user-provided)
  Functions                        Φ^I              Φ     (user-provided)
  Schemata                         S^I              S     (user-provided)
  RecordSets                       RS^I             RS    (user-provided)
  Activities                       A^I              A     (user-provided)
  Provider Relationships           Pr^I             Pr    (user-provided)
  Part-Of Relationships            Po^I             Po    (user-provided)
  Instance-Of Relationships        Io^I             Io    (user-provided)
  Regulator Relationships          Rr^I             Rr    (user-provided)
  Derived Provider Relationships   Dr^I             Dr    (user-provided)]

Formally, the architecture graph of an ETL scenario is a graph G(V, E) defined as follows:

    V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A
    E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr

In the sequel, we treat the terms architecture graph and scenario interchangeably. The reasoning for the term 'architecture graph' goes all the way down to the fundamentals of conceptual modeling. As mentioned in [12], conceptual models are the means by which designers conceive, architect, design, and build software systems. These conceptual models are used in the same way that blueprints are used in other engineering disciplines during the early stages of the lifecycle of artificial systems, which involves the creation of their architecture. The term 'architecture graph' expresses the fact that the graph that we employ for the modeling of the data flow of the ETL scenario is practically acting as a blueprint of the architecture of this software artifact.

Moreover, we assume the following integrity constraints for a scenario:

Static constraints:
• All the weak entities of a scenario (i.e., attributes or parameters) should be defined within a part-of relationship (i.e., they should have a container object).
• All the mappings in provider relationships should be defined among terms (i.e., attributes or constants) of the same data type.

Data flow constraints:
• All the attributes of the input schema(ta) of an activity should have a provider.
• Resulting from the previous requirement, if some attribute is a parameter in an activity A, the container of the attribute (i.e., recordset or activity) should precede A in the scenario.
• All the attributes of the schemata of the target recordsets should have a data provider.
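Such constraints lend themselves to mechanical checking over the architecture graph. As a hedged illustration — the predicates provider and inputAttr, which store graph edges and node roles as facts, are ours — the first data flow constraint could be checked as follows:

    % provider(S,T): a provider edge from term S to attribute T.
    % inputAttr(T):  T belongs to the input schema of some activity.
    hasProvider(T) <- provider(S,T).
    violation(T)   <- inputAttr(T), ~hasProvider(T).

Any binding for violation flags an input attribute that is left without a data provider.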
Summarizing, in this section, we have presented a generic model for the modeling of the data flow for ETL workflows. In the next section, we will proceed to detail how this generic model can be accompanied by a customization mechanism, in order to provide higher flexibility to the designer of the workflow.

3. Templates for ETL activities

In this section, we present the mechanism for exploiting template definitions of frequently used ETL activities. The general framework for the exploitation of these templates is accompanied with the presentation of the language-related issues for template management and appropriate examples.

3.1. General framework

Our philosophy during the construction of our metamodel was based on two pillars: (a) genericity, i.e., the derivation of a simple model, powerful to capture ideally all the cases of ETL activities and (b) extensibility, i.e., the possibility of extending the built-in functionality of the system with new, user-specific templates.

The genericity doctrine was pursued through the definition of a rather simple activity metamodel, as described in Section 2. Still, providing a single metaclass for all the possible activities of an ETL environment is not really enough for the designer of the overall process. A richer ''language'' should be available, in order to describe the structure of the process and facilitate its construction. To this end, we provide a palette of template activities, which are specializations of the generic metamodel class.
Observe Fig. 9 for a further explanation of our framework. The lower layer of Fig. 9, namely the schema layer, involves a specific ETL scenario. All the entities of the schema layer are instances of the classes Data Type, Function Type, Elementary Activity, RecordSet and Relationship. Thus, as one can see on the upper part of Fig. 9, we introduce a meta-class layer, namely the metamodel layer, involving the aforementioned classes. The linkage between the metamodel and the schema layers is achieved through instantiation (InstanceOf) relationships. The metamodel layer implements the aforementioned genericity desideratum: the classes which are involved in the metamodel layer are generic enough to model any ETL scenario, through the appropriate instantiation.

Still, we can do better than the simple provision of a metalayer and an instance layer. In order to make our metamodel truly useful for practical cases of ETL activities, we enrich it with a set of ETL-specific constructs, which constitute a subset of the larger metamodel layer, namely the template layer. The constructs in the template layer are also meta-classes, but they are quite customized for the regular cases of ETL activities. Thus, the classes of the template layer are specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships in Fig. 9). Through this customization mechanism, the designer can pick the instances of the schema layer from a much richer palette of constructs; in this setting, the entities of the schema layer are instantiations, not only of the respective classes of the metamodel layer, but also of their subclasses in the template layer.

[Fig. 9. The metamodel for the logical entities of the ETL environment. Metamodel layer: Data Types, Functions, Elementary Activity, RecordSet, Relationships; template layer (IsA): e.g., Domain Mismatch, NotNull, SK Assignment, Source Table, Provider Rel., Fact Table; schema layer (InstanceOf): S1.PARTSUPP, NN, DM1, SK1, DW.PARTSUPP.]

In the example of Fig. 9 the concept DW.PARTSUPP must be populated from a certain source S1.PARTSUPP. Several operations must intervene during the propagation. For instance, in Fig. 9, we check for null values and domain violations, and we assign a surrogate key. As one can observe, the recordsets that take part in this scenario are instances of class RecordSet (belonging to the metamodel layer) and specifically of its subclasses Source Table and Fact Table. Instances and encompassing classes are related through links of type InstanceOf. The same mechanism applies to all the activities of the scenario, which are (a) instances of class Elementary Activity and (b) instances of one of its subclasses, depicted in Fig. 9. Relationships do not escape this rule either. For instance, observe how the provider links from the concept S1.PS toward the concept DW.PARTSUPP are related to class Provider Relationship through the appropriate InstanceOf links.

As far as the class RecordSet is concerned, in the template layer we can specialize it to several subclasses, based on orthogonal characteristics, such as whether it is a file or RDBMS table, or whether it is a source or target data store (as in Fig. 9). In the case of the class Relationship, there is a clear specialization in terms of the five classes of relationships which have already been mentioned in Section 2 (i.e., Provider, Part-Of, Instance-Of, Regulator and Derived Provider).

[Fig. 10. Template activities, along with their graphical notation symbols, grouped by category:
• Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM).
• Unary operations: Push, Aggregation (γ), Projection (Π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN).
• Binary operations: Union (∪), Join (⋈), Diff (Δ), Update Detection (Δ_UPD).
• File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort).
• Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr).]
Following the same framework, class Elementary Activity is further specialized to an extensible set of reoccurring patterns of ETL activities, depicted in Fig. 10. As one can see on the top side of Fig. 9, we group the template activities in five major logical groups. We do not depict the grouping of activities in subclasses in Fig. 9, in order to avoid overloading the figure; instead, we depict the specialization of class Elementary Activity to three of its subclasses whose instances appear in the employed scenario of the schema layer. We now proceed to present each of the aforementioned groups in more detail.

The first group, named filters, provides checks for the satisfaction (or not) of a certain condition. The semantics of these filters are the obvious ones (starting from a generic selection condition and proceeding to the check for null values, primary or foreign key violation, etc.). The second group of template activities is called unary operations and, except for the most generic push activity (which simply propagates data from the provider to the consumer), consists of the classical aggregation and function application operations along with three data warehouse specific transformations (surrogate key assignment, normalization and denormalization). The third group consists of classical binary operations, such as union, join and difference of recordsets/activities, as well as a special case of difference involving the detection of updates. Except for the aforementioned template activities, which mainly refer to logical transformations, we can also consider the case of physical operators that refer to the application of physical transformations to whole files/tables. In the ETL context, we are mainly interested in operations like transfer operations (ftp, compress/decompress, encrypt/decrypt) and file operations (EBCDIC to ASCII, sort file).

Summarizing, the metamodel layer is a set of generic entities, able to represent any ETL scenario. At the same time, the genericity of the metamodel layer is complemented with the extensibility of the template layer, which is a set of ''built-in'' specializations of the entities of the metamodel layer, specifically tailored for the most frequent elements of ETL scenarios. Moreover, apart from this ''built-in'', ETL-specific extension of the generic metamodel, if the designer decides that several 'patterns', not included in the palette of the template layer, occur repeatedly in his data warehousing projects, he can easily fit them into the customizable template layer through a specialization mechanism.

3.2. Formal definition and usage of template activities

Once the template layer has been introduced, the obvious issue that is raised is its linkage with the employed declarative language of our framework. In general, the broader issue is the usage of the template mechanism from the user; to this end, we will explain the substitution mechanism for templates in this subsection and refer the interested reader to [13] for a presentation of the specific templates that we have constructed.

A template activity is formally defined by the following elements:

• Name: A unique identifier for the template activity.
• Parameter list: A set of names which act as regulators in the expression of the semantics of the template activity. For example, the parameters are used to assign values to constants, create dynamic mapping at instantiation time, etc.
• Expression: A declarative statement describing the operation performed by the instances of the template activity. As with elementary activities, our model supports LDL as the formalism for the expression of this statement.
• Mapping: A set of bindings, mapping input to output attributes, possibly through intermediate placeholders. In general, mappings at the template level try to capture a default way of propagating incoming values from the input towards the output schema. These default bindings are easily refined and possibly rearranged at instantiation time.
The template mechanism we use is a substitution mechanism, based on macros, that facilitates the automatic creation of LDL code. This simple notation and instantiation mechanism permits the easy and fast registration of LDL templates. In the rest of this section, we will elaborate on the notation, instantiation mechanisms and template taxonomy particularities.

3.2.1. Notation

Our template notation is a simple language featuring five main mechanisms for dynamic production of LDL expressions: (a) variables that are replaced by their values at instantiation time; (b) a function that returns the arity of an input, output or parameter schema; (c) loops, where the loop body is repeated at instantiation time as many times as the iterator constraint defines; (d) keywords to simplify the creation of unique predicate and attribute names; and, finally, (e) macros which are used as syntactic sugar to simplify the way we handle complex expressions (especially in the case of variable size schemata).

Variables: We have two kinds of variables in the template mechanism: parameter variables and loop iterators. Parameter variables are marked with a @ symbol at their beginning and they are replaced by user-defined values at instantiation time. A list of an arbitrary length of parameters is denoted by @<parameter name>[ ]. For such lists, the user has to explicitly or implicitly provide their length at instantiation time. Loop iterators, on the other hand, are implicitly defined in the loop constraint. During each loop iteration, all the properly marked appearances of the iterator in the loop body are replaced by its current value (similarly to the way the C preprocessor treats #DEFINE statements). Iterators that appear marked in the loop body are instantiated even when they are a part of another string or of a variable name. We mark such appearances by enclosing them with $. This functionality enables referencing all the values of a parameter list and facilitates the creation of an arbitrary number of pre-formatted strings.

Functions: We employ a built-in function, arityOf(<input/output/parameter schema>), which returns the arity of the respective schema, mainly in order to define upper bounds in loop iterators.

Loops: Loops are a powerful mechanism that enhances the genericity of the templates by allowing the designer to handle templates with an unknown number of variables and with unknown arity for the input/output schemata. The general form of loops is

    [<simple constraint>] { <loop body> }

where simple constraint has the form

    <lower bound> <comparison operator> <iterator> <comparison operator> <upper bound>

We consider only linear increase with step equal to 1, since this covers most possible cases. Upper bound and lower bound can be arithmetic expressions involving arityOf() function calls, variables and constants. Valid arithmetic operators are +, -, /, * and valid comparison operators are <, >, =, all with their usual semantics. If lower bound is omitted, 1 is assumed. During each loop iteration the loop body will be reproduced and at the same time all the marked appearances of the loop iterator will be replaced by its current value, as described before. Loop nesting is permitted.
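For instance, assuming arityOf(a_out) = 3 (a hypothetical arity, fixed only at instantiation time), the expression

    [i<arityOf(a_out)] {A_OUT_$i$,} [i=arityOf(a_out)] {A_OUT_$i$}

expands to the attribute list

    A_OUT_1, A_OUT_2, A_OUT_3

where the first loop emits the comma-terminated elements for i = 1, 2 and the second emits the last element without a trailing comma.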
Keywords: Keywords are used in order to refer to the input and output schemata. They provide two main functionalities: (a) they simplify the reference to the input/output schema by using standard names for the predicates and their attributes, and (b) they allow their renaming at instantiation time. This is done in such a way that no two distinct predicates with the same name appear in the same program, and no two distinct attributes with the same name appear in the same rule. Keywords are recognized even if they are part of another string, without any special notation. This facilitates a homogeneous renaming of multiple distinct input schemata at the template level to multiple distinct schemata at instantiation time, all of them having unique names in the scope of the LDL program. For example, if the template is expressed in terms of two different input schemata a_in1 and a_in2, at instantiation time they will be renamed to dm1_in1 and dm1_in2, so that the produced names are unique throughout the scenario program. Fig. 11 depicts the way the renaming is performed at instantiation time.

Keyword    Usage                                                      Example
a_out      A unique name for the output/input schema of the           difference3_out
a_in       activity. The predicate produced when the template         difference3_in
           is instantiated has the form:
           unique_pred_name_out (or _in, respectively)
A_OUT      Used for constructing the names of the a_out/a_in          DIFFERENCE3_OUT
A_IN       attributes. The names produced have the form:              DIFFERENCE3_IN
           predicate unique name in upper case_OUT
           (or _IN, respectively)

Fig. 11. Keywords for templates.
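As a small illustration of these conventions (our own, following the example names of Fig. 11), a single-attribute template rule written as

    a_out(A_OUT_1) <- a_in(A_IN_1)

would, when instantiated as the third difference activity of a scenario, be renamed to

    difference3_out(DIFFERENCE3_OUT_1) <- difference3_in(DIFFERENCE3_IN_1)

Since keywords are recognized even inside larger strings, the attribute names A_OUT_1 and A_IN_1 are renamed along with the predicates; the _1 suffix is our illustrative assumption for a schema of arity 1.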
Macros: To make the definition of templates easier and to improve their readability, we introduce a macro mechanism that facilitates attribute and variable name expansion. For example, one of the major problems in defining a language for templates is the difficulty of dealing with schemata of arbitrary arity; clearly, at the template level, it is not possible to pin down the number of attributes of the involved schemata to a specific value. For example, in order to create a series of names like

    name_theme_1, name_theme_2, ..., name_theme_k

we need to give the following expression:

    [iterator<maxLimit]{name_theme_$iterator$,}
    [iterator=maxLimit]{name_theme_$iterator$}

Obviously, this makes the writing of templates hard and reduces their readability. To attack this problem, we resort to a simple, reusable macro mechanism that enables the simplification of the employed expressions. For example, observe the definition of a template for a simple relational selection:

    a_out([i<arityOf(a_out)]{A_OUT_$i$,}
          [i=arityOf(a_out)]{A_OUT_$i$}) <-
        a_in1([i<arityOf(a_in1)]{A_IN1_$i$,}
              [i=arityOf(a_in1)]{A_IN1_$i$}),
        expr([i<arityOf(@PARAM)]{@PARAM[$i$],}
             [i=arityOf(@PARAM)]{@PARAM[$i$]}),
        [i<arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$=A_IN1_$i$}

As already mentioned in the syntax for loops, the expression

    [i<arityOf(a_out)]{A_OUT_$i$,}
    [i=arityOf(a_out)]{A_OUT_$i$}

which defines the attributes of the output schema a_out, simply lists a variable number of attributes that will be fixed at instantiation time. Exactly the same tactic applies for the attributes of the predicates a_in1 and expr. Also, the final two lines state that each attribute of the output will be equal to the respective attribute of the input (so that the query is safe), e.g., A_OUT_4 = A_IN1_4.
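For concreteness, the following hedged sketch shows what this template yields after the loops have been expanded for two-attribute schemata (assuming arityOf(a_out) = arityOf(a_in1) = arityOf(@PARAM) = 2); the parameter values have not yet been supplied and the keywords have not yet been renamed:

    a_out(A_OUT_1, A_OUT_2) <-
        a_in1(A_IN1_1, A_IN1_2),
        expr(@PARAM[1], @PARAM[2]),
        A_OUT_1 = A_IN1_1,
        A_OUT_2 = A_IN1_2

At instantiation time, @PARAM[1] and @PARAM[2] would be bound to the actual parameters of the selection condition, and a_out/a_in1 would be renamed to unique predicate names as described for keywords.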
We can simplify the definition of the template by allowing the designer to define certain macros that simplify the management of variable-length attribute lists. We employ the following macros:

    DEFINE INPUT_SCHEMA AS
        [i<arityOf(a_in1)]{A_IN1_$i$,}
        [i=arityOf(a_in1)]{A_IN1_$i$}

    DEFINE OUTPUT_SCHEMA AS
        [i<arityOf(a_out)]{A_OUT_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$}

    DEFINE PARAM_SCHEMA AS
        [i<arityOf(@PARAM)]{@PARAM[$i$],}
        [i=arityOf(@PARAM)]{@PARAM[$i$]}

    DEFINE DEFAULT_MAPPING AS
        [i<arityOf(a_out)]{A_OUT_$i$ = A_IN1_$i$,}
        [i=arityOf(a_out)]{A_OUT_$i$ = A_IN1_$i$}

Then, the template definition becomes:

    a_out(OUTPUT_SCHEMA) <-
        a_in1(INPUT_SCHEMA),
        expr(PARAM_SCHEMA),
        DEFAULT_MAPPING

3.2.2. Instantiation

Template instantiation is the process whereby the user chooses a certain template and creates a concrete activity out of it. This procedure requires that the user specify the schemata of the activity and give concrete values to the template parameters; then, the production of the respective LDL description of the activity is easily automated. Instantiation order is important in our template creation mechanism since, as can easily be seen from the notation definitions, different orders can lead to different results. The instantiation order is as follows:

1. Macro definitions are replaced by their expansions.
2. arityOf() functions and parameter variables appearing in loop boundaries are calculated.
3. Loop productions are performed by instantiating the appearances of the iterators; this leads to intermediate results without any loops.
4. All the remaining parameter variables are instantiated.
5. Keywords are recognized and renamed.

We briefly explain the intuition behind this execution order. Macros are expanded first. Step 2 precedes step 3 because loop boundaries have to be calculated before loop productions are performed. Loops, in turn, have to be expanded before parameter variables are instantiated if we want to be able to reference lists of variables; the only exception concerns the parameter variables that appear in the loop boundaries, which have to be calculated first. Notice, though, that variable list elements cannot appear in the loop constraint. Finally, we have to instantiate variables before keywords, since variables are used to create a dynamic mapping between the input/output schemata and other attributes.

Fig. 12 shows a simple example of template instantiation for the function application activity. To understand the overall process better, first observe its outcome, i.e., the specific activity that is produced, as depicted in the final row of Fig. 12, labeled keyword renaming. The output schema of the activity, fa12_out, is the head of the LDL rule that specifies the activity. The body of the rule says that the output records are specified by the conjunction of the following clauses: (a) the input schema myFunc_in; (b) the application of the function subtract over the attributes COST_IN and PRICE_IN, producing a value PROFIT; and (c) the mapping of the input to the respective output attributes, as specified in the last three conjuncts of the rule.
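Since Fig. 12 itself is not reproduced here, the following is a plausible reconstruction of that final row: the predicate names fa12_out and myFunc_in, the attributes COST_IN and PRICE_IN, the function subtract and the value PROFIT are stated in the text, while the exact names of the output attributes are our assumption:

    fa12_out(COST_OUT, PRICE_OUT, PROFIT_OUT) <-
        myFunc_in(COST_IN, PRICE_IN),
        subtract(COST_IN, PRICE_IN, PROFIT),
        COST_OUT = COST_IN,
        PRICE_OUT = PRICE_IN,
        PROFIT_OUT = PROFIT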
The first row, template, shows the initial template as it has been registered by the designer. @FUNCTION holds the name of the function to be used, subtract in our case, and @PARAM[ ] holds the inputs of the function, which in our case are the two attributes of the input schema. The problem we have to face is that all input, output and function schemata have a variable number of attributes. To abstract from the complexity of this problem, we define four macro definitions, one for each schema (INPUT_SCHEMA, OUTPUT_SCHEMA, FUNCTION_INPUT), along with a macro for the mapping of the input to the output attributes (DEFAULT_MAPPING).
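The text does not spell these macros out, but judging from the expansion visible in the macro expansion row discussed below, they plausibly take a form along the following lines, where OUTFIELD is the name of the extra output attribute that carries the function result (DEFAULT_MAPPING would follow the pattern already shown for the selection template):

    DEFINE INPUT_SCHEMA AS
        [i<arityOf(a_in)]{A_IN_$i$,}
        [i=arityOf(a_in)]{A_IN_$i$}

    DEFINE OUTPUT_SCHEMA AS
        [i<arityOf(a_in)+1]{A_OUT_$i$,}
        OUTFIELD

    DEFINE FUNCTION_INPUT AS
        [i<arityOf(@PARAM)]{@PARAM[$i$],}
        [i=arityOf(@PARAM)]{@PARAM[$i$]}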
Fig. 12. Instantiation procedure.

The second row, macro expansion, shows how the template looks after the macros have been incorporated into the template definition. The mechanics of the expansion are straightforward: observe how the attributes of the output schema are specified by the expression

    [i<arityOf(a_in)+1]{A_OUT_$i$,}OUTFIELD

as an expansion of the macro OUTPUT_SCHEMA. In a similar fashion, the attributes of the input schema and the parameters of the function are also specified; note that the expression for the last attribute in each list is different (to avoid producing an erroneous trailing comma). The mappings between the input and the output attributes are also shown in the last two lines of the template. In the third row, parameter instantiation, we can see how the parameter variables were materialized at instantiation. In the fourth row, loop production, we can see the intermediate results after the loop expansions are done; as can easily be seen, these expansions must be done before the @PARAM[ ] variables are replaced by their values. In the fifth row, variable instantiation, the parameter variables have been instantiated, creating a default mapping between the input, the output and the function attributes. Finally, in the last row, keyword renaming, the output LDL code is presented after the keywords have been renamed. Keyword instantiation